1. Text preprocessing: Before duplicate checking, the submitted paper is pre-processed by removing stop words, punctuation, numbers, and other noise, which reduces the interference of irrelevant information. Long sentences are also split into clauses to improve the accuracy of duplicate checking (see the preprocessing sketch after this list).
2. Feature extraction: the preprocessed text is converted into feature vectors. Commonly used feature extraction methods include bag-of-words (BoW) and TF-IDF (term frequency-inverse document frequency). These methods represent the text as a combination of words or phrases, which makes the subsequent similarity calculation straightforward (illustrated after this list).
3. Similarity calculation: the feature vectors are used to compute the similarity between the text to be checked and the existing documents in the database. Commonly used measures are cosine similarity and Jaccard similarity, which compare two texts at the word or phrase level (see the sketch after this list).
4. Threshold judgment: the similarity between the text to be checked and each existing document in the database is compared against a preset threshold. If the threshold is exceeded, the article is flagged as suspected plagiarism. The threshold can be adjusted to the actual requirements in order to balance precision and recall (an example follows this list).
5. Semantic analysis: In addition to the similarity-based duplicate-checking method, China Knowledge Network also applies semantic analysis techniques such as dependency parsing and sentiment analysis to improve accuracy. These techniques help identify plagiarism that merely replaces words with synonyms (a simplified illustration follows this list).
6. Manual review: China Knowledge Network manually reviews the documents flagged as suspected plagiarism to ensure the accuracy of the duplicate-check results. Manual review can identify complex and well-hidden forms of plagiarism and further improves accuracy.
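
The exact preprocessing pipeline used by China Knowledge Network is not public, so the following Python sketch only illustrates the idea in step 1. The tiny stop-word list and the clause-splitting rule are assumptions made for this example.

```python
import re

# Tiny illustrative stop-word list; a production system would use a much
# larger, language-specific one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it"}

def preprocess(text: str) -> list[list[str]]:
    """Split text into clauses, then drop punctuation, digits and stop words."""
    # Break long sentences into clauses at common clause boundaries.
    clauses = re.split(r"[.;:,!?]+", text)
    cleaned = []
    for clause in clauses:
        # Keep only alphabetic tokens and filter out stop words.
        tokens = [t for t in re.findall(r"[a-z]+", clause.lower())
                  if t not in STOP_WORDS]
        if tokens:
            cleaned.append(tokens)
    return cleaned

print(preprocess("The method is fast; it removes 3 kinds of noise."))
# [['method', 'fast'], ['removes', 'kinds', 'noise']]
```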
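For step 2, a minimal feature-extraction sketch is shown below, assuming scikit-learn is available. The toy documents and default vectorizer settings are illustrative, not China Knowledge Network's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "plagiarism detection compares submitted documents",
    "the system compares submitted documents against a database",
]

# Bag-of-words: each document becomes a vector of raw term counts.
bow = CountVectorizer().fit_transform(documents)

# TF-IDF: term counts are down-weighted for words common across the corpus.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```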
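Step 3 can be illustrated with the two measures named above. The Jaccard helper and the example texts below are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two documents' word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

submitted = "the experiment measures similarity between two texts"
existing = "this experiment measures the similarity of two short texts"

# Cosine similarity over the TF-IDF vectors from the previous step.
tfidf = TfidfVectorizer().fit_transform([submitted, existing])
cosine = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"cosine:  {cosine:.3f}")
print(f"jaccard: {jaccard_similarity(submitted, existing):.3f}")
```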
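Step 4 then reduces to comparing each similarity score against a cut-off. The 0.5 default, the document IDs, and the scores below are made up for illustration; the real threshold is tuned to balance precision and recall.

```python
def flag_suspected_plagiarism(similarities: dict[str, float],
                              threshold: float = 0.5) -> list[str]:
    """Return IDs of database documents whose similarity exceeds the threshold."""
    return [doc_id for doc_id, score in similarities.items() if score > threshold]

# Fabricated scores purely for illustration.
scores = {"paper_001": 0.82, "paper_002": 0.31, "paper_003": 0.57}

print(flag_suspected_plagiarism(scores))        # ['paper_001', 'paper_003']
print(flag_suspected_plagiarism(scores, 0.7))   # ['paper_001'] -- stricter, higher precision
```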
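The semantic techniques in step 5 are proprietary; the sketch below only shows, with a tiny hand-written synonym table, why normalising synonyms before the step-3 comparison helps catch word-swap plagiarism. The synonym groups and sentences are assumptions, not part of the real system.

```python
# Tiny hand-written synonym groups, purely illustrative; a real system would
# draw on a large thesaurus, word embeddings, or dependency-parse structure.
SYNONYM_GROUPS = [
    {"quick", "fast", "rapid"},
    {"car", "automobile", "vehicle"},
    {"method", "approach", "technique"},
]

def canonical(word: str) -> str:
    """Replace a word with a fixed representative of its synonym group."""
    for group in SYNONYM_GROUPS:
        if word in group:
            return min(group)  # any deterministic representative works
    return word

def normalise(tokens: list[str]) -> list[str]:
    return [canonical(t) for t in tokens]

original = "a fast method for comparing documents".split()
paraphrase = "a rapid technique for comparing documents".split()

# After normalisation the two token lists coincide, so the word swap no longer
# lowers the lexical similarity computed in step 3.
print(normalise(original) == normalise(paraphrase))  # True
```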
In summary, China Knowledge Network's duplicate-checking algorithm integrates multiple techniques, including text preprocessing, feature extraction, similarity calculation, threshold judgment, semantic analysis, and manual review, with the aim of providing users with accurate and reliable duplicate-checking services.