Automatic duplicate-checking systems are generally based on text similarity algorithms, which calculate the similarity between papers through comparative analysis. Within that analysis, the word count is an important reference metric. Generally speaking, there are two main ways to count repeated words.
The first method is based on the character count: extract all characters from the paper and count them. This method is simple and transparent, but it is easily affected by typesetting. For example, some characters may appear as special symbols or line breaks, which need to be normalized before counting.
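As a rough illustration of character counting with normalization, here is a minimal Python sketch. The function name `count_characters` is hypothetical, and the normalization rules (Unicode NFKC folding plus dropping all whitespace) are assumptions; a real duplicate-checking system may apply different rules, for example also excluding punctuation.

```python
import unicodedata


def count_characters(text: str) -> int:
    """Count characters after a simple normalization pass (hypothetical helper)."""
    # Assumption: NFKC folds full-width and other compatibility forms
    # into their canonical equivalents, so layout variants count the same.
    normalized = unicodedata.normalize("NFKC", text)
    # Assumption: whitespace (spaces, tabs, line breaks) comes from
    # typesetting and should not contribute to the count.
    return sum(1 for ch in normalized if not ch.isspace())
```

With these assumptions, the same sentence yields the same count whether it is stored on one line or wrapped across several.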
The second method is based on the word count: segment the paper's content into words and count the resulting tokens. This method is commonly used because it better reflects the semantic information of the paper. Compared with character counting, however, word counting can run into difficulties with word sense disambiguation and new word recognition.
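Word-based counting might be sketched as follows. This is only an illustration under stated assumptions: the regex tokenizer works for whitespace-delimited languages such as English, while languages without spaces between words (Chinese, for example) would need a dedicated word segmenter instead. The helper names `tokenize`, `count_words`, and `repeated_word_count` are hypothetical, and treating "repeated words" as the multiset overlap between two texts is one possible definition, not necessarily the one a given system uses.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits, or underscores.
WORD_RE = re.compile(r"\w+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return WORD_RE.findall(text.lower())


def count_words(text: str) -> int:
    """Total number of word tokens in the text."""
    return len(tokenize(text))


def repeated_word_count(paper: str, reference: str) -> int:
    """Number of word occurrences in `paper` that also occur in `reference`.

    Counter intersection keeps, for each word, the smaller of the two
    counts, so a word is never credited more often than it appears in
    either text.
    """
    overlap = Counter(tokenize(paper)) & Counter(tokenize(reference))
    return sum(overlap.values())
```

Dividing `repeated_word_count(paper, reference)` by `count_words(paper)` would give one simple repetition ratio, although this ignores word order and context.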
Besides the word count, other indicators can also be used for duplicate checking, such as sentence similarity and paragraph similarity. These indicators take the different levels and structure of the paper into account and can improve the accuracy of the duplicate-checking system.
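To make sentence-level and paragraph-level similarity concrete, the sketch below uses Jaccard similarity over word sets. This is an assumed stand-in: the text does not say which similarity measure a given system uses, and the function names and the punctuation-based sentence splitter are hypothetical simplifications.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two sentences' word sets, in [0, 1]."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def paragraph_similarity(paragraph: str, reference: str) -> float:
    """Average, over sentences in `paragraph`, of the best match in `reference`."""
    sentences = split_sentences(paragraph)
    ref_sentences = split_sentences(reference)
    if not sentences or not ref_sentences:
        return 0.0
    best = [
        max(sentence_similarity(s, r) for r in ref_sentences)
        for s in sentences
    ]
    return sum(best) / len(best)
```

Scoring each sentence against its best match in the reference means a paragraph is flagged even when the copied sentences have been reordered, which a plain word count would not distinguish from coincidental vocabulary overlap.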