How does simhash check duplicate text?
Given billions of distinct 64-bit 0-1 strings, and an arbitrary 64-bit 0-1 string f, how do we quickly find every stored string whose Hamming distance from f is at most 3?

Approximate duplicate checking of large-scale web pages

This write-up on detecting near-duplicate web pages during crawling is mainly translated from the WWW '07 paper "Detecting Near-Duplicates for Web Crawling".

The WWW contains a huge number of pages with near-identical content. For search engines, removing near-duplicate pages improves retrieval efficiency and reduces storage overhead.

A crawler must be able to decide quickly, while crawling, whether a new page duplicates one already in a massive text collection.

This paper has two main contributions:

1. It shows that simhash can be used for duplicate detection over massive text collections.

2. It proposes an algorithm that is feasible in practical applications.

Simhash algorithm

After extracting the content from the text and applying basic preprocessing such as stop-word removal, stemming, and possibly chunking, the text is reduced to a vector of weighted terms.

Each term is hashed to an f-bit code, and each bit's 0/1 value is turned into a signed weight: where bit i of the term's hash is 1, the term contributes +weight on dimension i; where it is 0, it contributes -weight.

The weight vectors of all terms in the text are summed position by position over the f dimensions, yielding an f-dimensional weight array. Each position with a positive sum becomes 1 and each with a negative sum becomes 0, turning the text into a new f-bit 0-1 string: its simhash fingerprint.
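A minimal sketch of this construction in Python (the use of MD5 as the per-term f-bit hash and the {term: weight} dictionary are illustrative assumptions; any f-bit hash and weighting scheme fit the description above):

```python
import hashlib

def simhash(weighted_terms, f=64):
    """Build an f-bit simhash fingerprint from {term: weight} pairs."""
    v = [0] * f  # one signed accumulator per bit position
    for term, weight in weighted_terms.items():
        # Hash the term to an f-bit integer (MD5 truncated to f bits; an assumption).
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest(), "big")
        h &= (1 << f) - 1
        for i in range(f):
            # A 1 bit contributes +weight on dimension i, a 0 bit -weight.
            v[i] += weight if (h >> i) & 1 else -weight
    # Positive sums become 1 bits, the rest 0 bits.
    return sum(1 << i for i in range(f) if v[i] > 0)

print(hex(simhash({"simhash": 2, "duplicate": 1, "text": 1})))
```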

Simhash has two seemingly conflicting properties:

1. It is a hash method: different texts generally map to different fingerprints.

2. Similar texts have similar hash values: the closer two texts' simhashes are, i.e. the smaller their Hamming distance, the more similar the texts.
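Property 2 is what the whole scheme relies on; comparing two fingerprints is just an XOR plus a popcount. A one-line sketch:

```python
def hamming_distance(a: int, b: int) -> int:
    # Bits where a and b differ become 1s in the XOR; count them.
    return bin(a ^ b).count("1")
```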

Duplicate checking over massive text collections thus reduces to this task: given a fingerprint, quickly determine whether a massive set of simhash fingerprints contains one within a small Hamming distance of it.

That is: among n f-bit fingerprints, find those whose Hamming distance from a query fingerprint is at most k.

In the paper's experiments (see the end), simhash uses a 64-bit hash function, and a Hamming distance threshold of 3 turns out to be just right at the scale of 8 billion web pages.

So the task has f = 64, k = 3, n = 8 × 10^9.

With the task clear, first consider two very intuitive methods:

1. Enumerate all simhash fingerprints within Hamming distance 3 of the query and look each one up among the 8 billion sorted fingerprints.

(This method must generate C(64,1) + C(64,2) + C(64,3) = 64 + 2016 + 41664 = 43,744 variant fingerprints per query and look up each one; see the quick check after this list.)

2. Alternatively, store every fingerprint together with all of its close fingerprints in the sorted table: up to C(64,3) = 41,664 extra entries per fingerprint at distance 3 alone, which requires enormous space.
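A quick check of the numbers behind both methods, using the running parameters f = 64, k = 3 (a sketch; `math.comb` just evaluates the binomial coefficients above):

```python
from math import comb

f, k = 64, 3

# Method 1: per query, probe every fingerprint within Hamming distance 1..k.
probes = sum(comb(f, d) for d in range(1, k + 1))
print(probes)        # 64 + 2016 + 41664 = 43744 lookups per query

# Method 2: per stored fingerprint, the distance-3 neighbours alone
# add C(64,3) extra table entries.
print(comb(f, k))    # 41664
```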

The proposed method sits between these two extremes: a reasonable compromise between space and time.

Suppose we have a sorted table of 2^d f-bit fingerprints, and consider the most significant d bits of each fingerprint. These top bits have the following property: although 2^d bit combinations are possible, each particular d-bit combination is repeated only a few times among the top d bits.

Now pick a number d' slightly smaller than d. Because the whole table is sorted, a single search locates the set F' of fingerprints whose top d' bits equal those of the target fingerprint F; since d' is close to d, the set F' will not be very large.

Finally, within the set F', it is very fast to find the fingerprints whose Hamming distance from F is at most k.

In short: first narrow the candidate set by an exact match on the top d' bits, then check the Hamming distance on the remaining f − d' bits within that small set.
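A sketch of this narrowing step, assuming the table is a plain sorted Python list of f-bit fingerprints stored as integers (so sorting by value is the same as sorting by most significant bits):

```python
from bisect import bisect_left, bisect_right

def query_prefix(table, f_query, d_prime, k=3, f=64):
    """Return fingerprints within Hamming distance k of f_query,
    narrowing first by an exact match on the top d_prime bits."""
    shift = f - d_prime
    lo_key = (f_query >> shift) << shift     # smallest value with this prefix
    hi_key = lo_key | ((1 << shift) - 1)     # largest value with this prefix
    lo = bisect_left(table, lo_key)
    hi = bisect_right(table, hi_key)
    # The candidate set F' is small; scan it on the remaining bits.
    return [g for g in table[lo:hi] if bin(g ^ f_query).count("1") <= k]
```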

Following the running example: 8 billion pages is below 2^34, so in theory 34 bits suffice to represent 8 billion unique fingerprints.

Assume the top 34 bits identify the 8 billion fingerprints. If we exact-match on the first 30 bits, the remaining 4 of those 34 bits can still vary, so on average about 2^4 = 16 fingerprints share a given 30-bit prefix, and we only need to check whether any of those 16 is within Hamming distance 3.

Suppose further that we can arrange to do this exact match on any chosen 30 of the bits, not just the leading ones.

A complete search then works as follows: exact-match the first q bits (since the fingerprints are sorted by those q bits, binary search applies; if the fingerprints are numerous and uniformly distributed, even interpolation search works), then compare the remaining 64 − q bits of the roughly 2^(d−q) fingerprints sharing that prefix, checking for Hamming distance at most 3.

So the question becomes how to choose which q of the 64 bits to match exactly.

Divide the 64 bits into several blocks, say four blocks A, B, C, D of 16 bits each.

Assume the fingerprints are sorted by block A. First, exact-match an interval using A's 16 bits, then check whether each fingerprint in that interval is within Hamming distance 3 on the remaining B, C, D bits.

Likewise, in a second copy sorted by block B, exact-match an interval using B's 16 bits; every fingerprint in that interval is compared on the A, C, D bits for Hamming distance at most 3.

Similarly for C and D.

So we keep four copies of the fingerprint table T: T1, T2, T3, T4, where T1 is sorted by A, T2 by B, T3 by C, T4 by D. The four copies can be queried in parallel and the results merged. Because only 3 bits differ, even in the worst case those bits fall into at most three of the four blocks (BCD, ACD, ABD, or ABC), so at least one block matches exactly and no near-duplicate is missed.
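A sketch of this four-table scheme; hash maps keyed by each 16-bit block stand in for the four sorted copies, and the block layout (A = bits 63..48 down to D = bits 15..0) is an assumption for illustration:

```python
from collections import defaultdict

BLOCK_SHIFTS = [48, 32, 16, 0]  # blocks A, B, C, D, 16 bits each

def build_tables(fingerprints):
    # tables[i] maps a block value to all fingerprints carrying it.
    tables = [defaultdict(list) for _ in BLOCK_SHIFTS]
    for fp in fingerprints:
        for i, shift in enumerate(BLOCK_SHIFTS):
            tables[i][(fp >> shift) & 0xFFFF].append(fp)
    return tables

def near_duplicates(tables, f_query, k=3):
    # With at most 3 differing bits spread over 4 blocks, at least one
    # block matches exactly, so probing all four tables misses nothing.
    found = set()
    for i, shift in enumerate(BLOCK_SHIFTS):
        for cand in tables[i].get((f_query >> shift) & 0xFFFF, []):
            if bin(cand ^ f_query).count("1") <= k:
                found.add(cand)
    return found
```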

Exact-matching only 16 bits still leaves a huge number of fingerprints to compare one by one, possibly about 2^(d−16), so we can match more bits exactly.

For example, divide the 64 bits into four 16-bit blocks A, B, C, D as before, and further divide the 48 bits of B, C, D into four 12-bit sub-blocks W, X, Y, Z. The 3 differing bits can fall into at most three of those four sub-blocks, so at least one of W, X, Y, Z matches exactly; combining A with any one of W, X, Y, Z gives an exact 28-bit match, and the remaining three sub-blocks are checked for Hamming distance. The same construction works with B, C, or D as the 16-bit block. T then has to be copied 16 times, one copy per A/B/C/D and W/X/Y/Z combination, and after each exact 28-bit match the number of candidates drops to about 2^(d−28). Different combinations are a trade-off between time and space.
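The arithmetic of this trade-off for the running example d = 34, assuming uniformly distributed fingerprints (expected candidate counts per probed table):

```python
d = 34  # the table holds about 2**34 fingerprints

for tables, q in [(4, 16), (16, 28)]:
    candidates = 2 ** (d - q)  # fingerprints expected to share a q-bit prefix
    print(f"{tables:2d} tables, {q}-bit exact match -> ~{candidates} candidates each")
# 4 tables, 16-bit exact match -> ~262144 candidates each
# 16 tables, 28-bit exact match -> ~64 candidates each
```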

In the worst case, the three differing bits are spread over three different sub-blocks, one bit in each, so each of those sub-blocks differs by Hamming distance 1.

The algorithm is described as follows:

1) First, make t copies of the original table T: T1, T2, ..., Tt.

2) Each Ti is associated with an integer pi and a permutation function πi, where πi moves a chosen set of pi bits into the most significant positions.

3) Apply the permutation function πi to every fingerprint in the corresponding table Ti, then sort Ti.

4) Then, for a fingerprint f to be matched and Hamming distance k, perform the following on each Ti:

a) Compute f' = πi(f) and search Ti using the top pi bits of f', finding the group of fingerprints whose top pi bits match.

b) For each fingerprint in the retrieved set, compare the remaining f − pi bits and keep those within Hamming distance at most k of f'.

5) Finally, merge the retrieval results from all the Ti.
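A compact end-to-end sketch of steps 1)-5). The permutation is realized as a list of source bit positions to pull to the top; the class and function names are illustrative, not the paper's code:

```python
from bisect import bisect_left

def make_permuter(top_bits, f=64):
    """Return a function moving the listed source bits (0 = least significant)
    to the most significant positions, keeping the rest in original order."""
    rest = [b for b in range(f - 1, -1, -1) if b not in set(top_bits)]
    order = list(top_bits) + rest
    def permute(x):
        y = 0
        for src in order:
            y = (y << 1) | ((x >> src) & 1)
        return y
    return permute

class SimhashIndex:
    def __init__(self, fingerprints, designs, f=64):
        # Steps 1)-3): one permuted, sorted copy of T per (top_bits, p_i) design.
        self.f = f
        self.tables = []
        for top_bits, p in designs:
            perm = make_permuter(top_bits, f)
            entries = sorted((perm(fp), fp) for fp in fingerprints)
            self.tables.append((perm, p, entries))

    def query(self, fp, k=3):
        # Step 4: probe each T_i; step 5: merge all results.
        results = set()
        for perm, p, entries in self.tables:
            fq = perm(fp)
            shift = self.f - p
            key = (fq >> shift) << shift              # step 4a: top p_i bits
            lo = bisect_left(entries, (key,))
            hi = bisect_left(entries, (key + (1 << shift),))
            for _, orig in entries[lo:hi]:            # step 4b: remaining bits
                if bin(orig ^ fp).count("1") <= k:
                    results.add(orig)
        return results
```

For instance, the four-block scheme above corresponds to four designs whose `top_bits` are the bit ranges 63..48, 47..32, 31..16, and 15..0, each with pi = 16.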