BM25 algorithm is a common formula for scoring correlation, and its thinking is relatively simple. It mainly calculates the relevance of all $ term and documents in a query, and then accumulates the scores. The relevance score of each word is mainly influenced by tf/idf. The formula is as follows:
Where: is the correlation value between each word and the document; Represents $ term in the query; Representing relevant documents; It is the weight of words.
Can be set externally, and the default value is value. The basic idea of idf formula is that the importance of a word is inversely proportional to its frequency of occurrence in the whole document collection. The formula is as follows:
Where: is the total number of documents; Number of documents containing this word; 0.5 is the adjustment coefficient to avoid the situation of 0. The purpose of taking logarithm is to make idf value smoother under the influence of n and.
As can be seen from this formula, the larger the smaller, the larger the value.
The following is the formula,
Where:,, is the regulatory factor, generally,,; Is the number of words in the document, representing the number of words in the query sentence; Is the length of the document and the average length of the document;
Visible, other factors being the same, the greater the correlation, the lower the correlation; As for dividing by one, I think it is to compare the length of this document with that of the whole document to avoid being too large when taking the value alone.
The left factor of the product represents the frequency relationship of words in the document, and the right factor of the product represents the frequency relationship of words in the query statement. In most cases, the query word appears once in the query statement, so it can be regarded as 1. Since it is 1, the right factor is actually equal to 1, so the formula can be simplified as:
After simplifying the formula, we can get:
The factors influencing the formula of BM25 are as follows
1: The higher the score, the higher the score.
2. The higher the score, the higher the score.
3: If the document length in the document level is high, the score will be low.
4 is the adjustment factor of the score.
Generally speaking, an article is divided into several parts, such as title, content, description, anchor point and so on. In the BM25F formula, these parts are called fields. There are two articles. The correlation score between the title part of an article and the BM25 of the query is A, and the correlation score between the content part of another article and the BM25 of the query is also A. Assuming that the relevance of other parts of these two articles to the query is not considered, according to experience, the first article should generally be more relevant than the first one. BM25F introduces the information of each domain in Part D, which deals with the correlation of each $ term in each domain in Part D, and the formula is as follows:
BM25 is also called okapi BM25, and okatp is the formula for calculating the correlation between BM25 and $ term proximity fusion.
The weight of the first $ term in the query is defined as:
The definition that can be found is more like the product of query part and IDF in BM25. The difference lies in the setting of constant parameters. In the newspaper.
$ term proximity is defined as follows
Is the distance in words between the search term and.
The following formula reflects the fusion of BM25 and $ term, and tf in BM25 is replaced by
The definition of is the same as BM25.
OkaTP is finally defined as
Where s is the set of all pairs of $ term combinations in the query.
The algorithm is very similar to OkaTP
Among them, TP is the $ term proximity, and the proximity information is introduced into the algorithm to optimize the calculation effect of correlation.
Suppose a query q contains n $ term, representing an article, and the distance between any two different $ term positions in the article is expressed as. These two $ term
Note: the first $ term in the query may appear in the document many times, and it is used every time.
Refer to the $ TERM proximity score, and make a special search for the very large text set of the original paper
You can understand the above two formulas with the article "Selective $ TERM Neighborhood Scoring by BP-ANN".
This paper holds that there are two problems with OkaTP: the second half of the formula 1. OkaTP (which can be regarded as the score for prase) and the first part of BM25 overlap, that is, a $ term will appear in both parts at the same time; 2. The linear combination of the scores of single words and loose phrases can break the nonlinear characteristics of word frequency.
Based on these two points, newTP algorithm is proposed. The concept of span is introduced into newTP. Span is to divide the whole hit list into multiple segments according to the hit position of query words in the document, and each segment is called an extended span. The rules for determining the span are as follows.
Where MAX_DIS is considered set.
According to the density and quantity of query words in span, the contribution of $ term to relevance is determined. Thereby replacing the tpi and tf parts in OkaTP.
The contribution of a $ term t to correlation is expressed as:
These include:
The contribution of a $ term test to the whole document relevance is:
It can be seen that rc contains the information of proximity and tf.
The tf in BM25 is directly replaced by rc, and a new related formula is obtained:
Where TOP is the $ term sequence proximity. The algorithm is optimized on the basis of BM25TP. The algorithm introduces $ term order information. If two $ term appear in the query in the reverse order of documents, they will be punished, and if the order is the same, they will be rewarded.
It is used to calculate the proximity in BM25TP algorithm, but this formula is insensitive to $ term order. You will think that John is faster than Mary, and Mary is faster than John.
To illustrate the BM25TOP algorithm, the following two definitions are introduced.
: position in query q
: position in document d
The new formula is used to replace the. This formula should satisfy the following three conditions.
Satisfying the first two items can also be achieved by dist function in BM25TP algorithm, and the third item is to increase the consideration of $ term order.
There are many formulas that satisfy the above three conditions, and the following formula is selected in this paper.
In ...
The size of represents the relative distance between and in document d. The symbol indicates whether the two $ term are in the same order in the query and document, the plus sign (+) indicates the same order, and the symbol (-) indicates that the two $ term are in the opposite order in the query and document.
The image changes with the function as follows:
[The picture is being uploaded ... (picture-8D6891-1599898211594-0)]
Note: the proximity is the reciprocal, and the greater the value, the smaller the proximity.
As can be seen from the above figure, the farther the distance between two $ term, the greater the proximity and the smaller the proximity;
When two pairs of $ term are equal in size but opposite in sign, a pair of $ term with negative sign (-) will be less close.
The following is a reasonable arrangement for the calculation of BM25TOP:
These include: