First, the duplicate-checking software preprocesses the code: it deletes comments, whitespace, and other characters that do not affect behavior, and replaces all variable names and function names with placeholders. The purpose is to make the comparison depend only on the code's logical structure and algorithm, not on the specific variable and function names used.
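A minimal sketch of this preprocessing step, assuming the submitted code is Python and using the standard library's tokenize module. The function name normalize and the placeholder string "ID" are hypothetical choices for illustration, not what any particular tool uses.

```python
import io
import keyword
import tokenize

# Token types that carry only comments or layout, not program logic.
SKIPPED = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def normalize(source: str) -> list[str]:
    """Return the token strings of `source` with comments and layout
    dropped and every non-keyword name replaced by the placeholder "ID"."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # hypothetical placeholder for any identifier
        else:
            tokens.append(tok.string)
    return tokens
```

Under these assumptions, two functions that differ only in their names and comments normalize to the same token list, so renaming alone does not hide a copy.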
Next, the duplicate-checking software transforms the preprocessed code into a data structure called an "abstract syntax tree" (AST). An AST is a tree that represents the structure of the code: each node represents an element of the code (such as a variable, function, or expression), and the relationships between nodes represent the logical relationships between those elements.
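To make the idea of an AST concrete, here is a small sketch that assumes Python source and uses the standard library's ast module. The helper name node_sequence is hypothetical. It walks the tree in pre-order and records only the type of each node, which also shows one possible way to flatten a tree into a sequence.

```python
import ast


def node_sequence(source: str) -> list[str]:
    """Parse `source` and list the node type names in pre-order
    (a parent first, then its children from left to right)."""

    def preorder(node: ast.AST):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from preorder(child)

    return list(preorder(ast.parse(source)))


# Both statements have the same shape: an assignment of a binary operation.
print(node_sequence("x = 1 + 2"))
print(node_sequence("total = 3 + 4"))
```

Because the sequence records node types rather than names or literal values, the two statements in the usage example yield identical sequences.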
Then, the duplicate-checking software uses a technique called "sequence alignment" to measure the similarity between the two ASTs. The basic idea is to treat each AST as a sequence, for example by listing its nodes in traversal order, and then calculate the edit distance between the two sequences (that is, the minimum number of insertions, deletions, or replacements needed to turn one sequence into the other). If the edit distance is small, the two ASTs are very similar, so the corresponding pieces of code are very similar as well.
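The edit distance described here can be sketched with the classic dynamic-programming (Levenshtein) algorithm. The function name edit_distance is hypothetical, and the inputs are assumed to be sequences of comparable items, such as lists of AST node type names. Real tools may use more specialized tree or alignment algorithms.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions, and replacements
    needed to turn sequence `a` into sequence `b`."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b; start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # turning i items into an empty sequence takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # replace, or keep if the items match
            ))
        prev = curr
    return prev[-1]


# Example with two made-up node-type sequences that differ in one node.
left = ["Module", "Assign", "Name", "BinOp", "Constant", "Add", "Constant"]
right = ["Module", "Assign", "Name", "BinOp", "Constant", "Sub", "Constant"]
print(edit_distance(left, right))  # prints 1
```

This version keeps only two rows of the table at a time, so memory grows with the length of the shorter-lived row rather than with the product of both lengths.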
Finally, the duplicate-checking software judges whether the two pieces of code are duplicates according to the calculated similarity. Generally speaking, if the similarity between two pieces of code exceeds a certain threshold, they are considered duplicates. This threshold is generally set by the school or tutor, and the specific value may vary across disciplines and majors.
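One possible way to turn an edit distance into a similarity score and a verdict is sketched below. The normalization by the longer sequence length, the function names, and the default threshold of 0.8 are all assumptions made for illustration; the text above does not specify how any given tool computes the score or what threshold an institution chooses.

```python
def similarity(distance: int, len_a: int, len_b: int) -> float:
    """Map an edit distance to a score between 0.0 and 1.0,
    where 1.0 means the two sequences are identical."""
    longest = max(len_a, len_b)
    if longest == 0:
        return 1.0  # two empty sequences are trivially identical
    return 1.0 - distance / longest


def is_duplicate(score: float, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicated when the score exceeds the threshold.
    The default of 0.8 is a made-up value, not a recommended setting."""
    return score > threshold


# Two sequences of length 10 that are 1 edit apart score 0.9.
score = similarity(distance=1, len_a=10, len_b=10)
print(score, is_duplicate(score))
```

Dividing by the longer length keeps the score within 0 and 1, because the edit distance between two sequences can never exceed the length of the longer one.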