In the previous (engineering) article, we introduced the basic framework of our search engine. The search engine consists of three parts: first, a Hadoop cluster used to generate large-scale and real-time indexes; second, an ElasticSearch cluster that provides a distributed search solution; third, an advanced search (AS) cluster that provides the special functions required by commercial search.
Given the particularities of commercial e-commerce search, an independent ElasticSearch cluster cannot meet the diverse algorithmic requirements, so we attach algorithm plug-ins at every stage of the search pipeline to build the algorithm system of a commercial e-commerce search engine.
The process of creating an index is the process of building an inverted index from the original data. During this process we analyze each product (doc), compute its static score, and compute product-to-product similarity. The static score of a product plays a vital role in search quality; it is the equivalent of PageRank in web search. Imagine how poor web search quality would be without PageRank. In e-commerce search, the most common problem is an excess of near-identical products, so the similarity between products must be computed in advance, while the index is being built, so that duplicates can be removed efficiently at retrieval time.
The process of creating an index is as follows.
Step 1. Calculate the static score of each document.
Step 2. Calculate the similarity of each document.
Step 3. Classify the data according to similarity and other information.
Step 4. Build the ES index.
The retrieval process is the process in which a search engine receives a user's query, processes it, and returns relevant results. A commercial search engine must consider two factors during retrieval: 1) relevance and 2) importance.
Relevance refers to whether the returned results are related to the input query, and it is one of the basic problems of search engines. The commonly used algorithms are BM25 and the vector space model. Both are supported by ElasticSearch, and commercial search engines generally use BM25. The BM25 algorithm computes a relevance score between each document and the query, denoted Dscore.
Importance refers to the degree to which a product can be trusted. We should return the most trustworthy products to consumers rather than leave consumers to judge for themselves. Especially in e-commerce search, where products compete intensely, a reasonable importance score must be assigned to guarantee the quality of the search results. The importance score, also called the static score, is denoted Tscore.
The final ranking of search engines is based on:
Score = Dscore * Tscore
That is, by considering the static and dynamic scores together, we can return products to users that are both relevant and important.
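As an illustration, one way to express this multiplication directly in ElasticSearch's query DSL is a function_score query whose field_value_factor multiplies the BM25 score by a stored static score. A minimal sketch in Python-dict form, assuming a hypothetical numeric field named tscore:

```python
# Hypothetical index where the static score is stored in a numeric "tscore" field.
es_query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "banana"}},  # BM25 relevance -> Dscore
            "field_value_factor": {
                "field": "tscore",                    # static score -> Tscore
                "missing": 1.0                        # fallback if the field is absent
            },
            "boost_mode": "multiply"                  # Score = Dscore * Tscore
        }
    }
}
```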
The retrieval process can be roughly summarized in the following steps.
Step 1. Analyze the original query.
Step 2. Rewrite the query in AS according to the results of the query analysis.
Step 3. Retrieve with the rewritten query.
Step 4. Rank comprehensively inside ES according to the static and dynamic scores.
Step 5. Re-rank the results returned by ES in AS.
Step 6. Return the results.
The following sections describe several key technologies.
In an e-commerce search engine, the static score of a product has the same value and importance as PageRank in web search: both are query-independent measures of a doc's intrinsic value. PageRank works through the voting relationships between docs, whereas the static score of a product involves rather more factors. Like PageRank, the computation of a product's static score must solve the following two problems:
1. Stability. PageRank guarantees that a website cannot linearly improve its ranking simply by stacking links. Likewise, the static score must not let a product raise its score linearly by inflating a single metric (for example, the effect that brushed, i.e. faked, orders would otherwise have on search quality).
2. Discrimination. On the basis of stability, the static score must also be discriminating enough to guarantee that, under the same search conditions, products ranked ahead are of higher quality than those ranked behind.
We assume that a product's static score is determined by three factors: 1. order count, 2. positive-review rate, and 3. delivery speed.
For the static score we use Tscore, which can be written as:
Tscore = a * f(order count) + b * g(positive-review rate) + c * h(delivery speed)
where a, b and c are weight parameters that balance the influence of each metric, and f, g and h are representative functions that transform the raw metrics into reasonable measures.
First, we need to find a reasonable representative function.
A simple min-max normalization is highly unstable: if one outlier is 1,000 times the second-largest value, most of the values are squeezed into the range 0 to 0.01, which defeats the purpose of normalization. The z-score standardization method, (x - mean) / stddev, is more robust to such outliers, and applying it after a log transform (log-zscore) further tames the heavy-tailed distributions typical of metrics such as order count.
(Figure 3: log-zscore standardization)
Finally, appropriate weights are chosen and the metrics are normalized with log-zscore, which settles the representative functions f, g and h in Tscore = a * f(order count) + b * g(positive-review rate) + c * h(delivery speed). What remains is to determine the parameters a, b and c (a small numerical sketch of the combination follows the list below). Generally there are two methods:
a) The expert method: dynamically adjust the weight parameters according to our daily experience.
b) The experimental method: first assign initial values with expert help, then vary one parameter at a time and adjust dynamically according to the results of A/B tests.
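To make the pieces above concrete, here is a toy sketch of log-zscore normalization and the weighted Tscore combination; the data and the weights a, b, c are made up for illustration:

```python
import numpy as np

def log_zscore(x):
    """Log first to tame heavy-tailed metrics, then standardize to zero mean, unit std."""
    logged = np.log1p(np.asarray(x, dtype=float))  # log(1 + x) keeps zero counts defined
    return (logged - logged.mean()) / logged.std()

order_count   = [3, 50, 1200, 7, 90000]            # hypothetical raw metrics
positive_rate = [0.97, 0.99, 0.90, 0.95, 0.98]
ship_hours    = [12, 4, 48, 24, 2]                 # smaller is better

a, b, c = 0.6, 0.3, 0.1                            # illustrative weights, to be tuned
tscore = (a * log_zscore(order_count)
          + b * log_zscore(positive_rate)
          - c * log_zscore(ship_hours))            # minus: faster shipping scores higher
print(tscore)
```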
De-duplication of product titles plays an important role in e-commerce search. The data show that 80% of purchases made through search come from the first four pages of results; repeated product titles strip these important pages of genuine content and sharply reduce the purchase rate of search.
For example:
Title 1: delicious/banana/postage/Guangdong/Gaozhou/banana//none/ripening agent/
Title 2: delicious/banana/Guangdong/Gaozhou/banana//none/banana/postage/
Firstly, feature vectorization is carried out.
Here the "bag of words" technique is used: each word is a dimension of the vector space, and the frequency of each term in the title is the value of that feature. In this example the dimensions are delicious (0), banana (1), postage (2), Guangdong (3), Gaozhou (4), none (5) and ripening agent (6).
Title 1: [1, 2, 1, 1, 1, 1, 1]
Title 2: [1, 3, 1, 1, 1, 1, 0]
Second, each title is represented by a vector of fixed length.
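A minimal sketch of this bag-of-words vectorization, using the two tokenized titles above; the union of their terms defines the dimensions:

```python
from collections import Counter

title1 = ["delicious", "banana", "postage", "Guangdong", "Gaozhou",
          "banana", "none", "ripening agent"]
title2 = ["delicious", "banana", "Guangdong", "Gaozhou", "banana",
          "none", "banana", "postage"]

vocab = sorted(set(title1) | set(title2))      # fixed dimension order

def vectorize(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocab]    # term frequency per dimension

v1, v2 = vectorize(title1), vectorize(title2)
print(vocab, v1, v2, sep="\n")
```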
Third, calculate the pairwise similarity.
Without loss of generality, similarity is usually obtained by computing the distance between two vectors. Here we use 1 - cosine(x, y) as the distance between vectors x and y. This is an "all-to-all similarity" problem: every pair of documents must be compared, so the complexity is O(n²). When the number of products is huge, a single machine cannot cope. We give two implementation methods.
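Before the two distributed implementations, a single-machine sketch of the 1 - cosine(x, y) distance, reusing v1 and v2 from the vectorization sketch above:

```python
import math

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return 1.0 - dot / norm

print(cosine_distance(v1, v2))   # close to 0 => near-duplicate titles
```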
Method 1: Spark's matrix operations.
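One concrete way to realize this with Spark's MLlib (a sketch, not our production pipeline) is to make each document a column of a RowMatrix and let columnSimilarities() compute all-pairs cosine similarity (an optional threshold argument switches on DIMSUM sampling for scale):

```python
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext(appName="title-similarity")
rows = sc.parallelize([
    Vectors.dense([1.0, 1.0]),   # row i = term i's frequency in (doc1, doc2)
    Vectors.dense([2.0, 3.0]),
    Vectors.dense([1.0, 1.0]),
])
sims = RowMatrix(rows).columnSimilarities()   # CoordinateMatrix of doc pairs
for e in sims.entries.collect():
    print(e.i, e.j, e.value)                  # cosine similarity of doc i and doc j
sc.stop()
```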
Method 2: the linear map-reduce method. This method follows the paper "Pairwise Document Similarity in Large Collections with MapReduce". It achieves near-linear time complexity and, compared with the matrix approach, handles far larger collections, which is precisely the advantage of this pairwise similarity computation. Briefly: first, following the way an inverted index is built, compute the mapping from each term to its docs. For example, for three documents:
Convert them into the inverted format; this conversion needs only a single map-reduce pass.
Then filter out the terms whose posting list contains only one doc; for the terms whose posting list contains two or more docs, generate the pairwise combinations of those docs:
Finally, aggregate the output; the value for each pair of docs is the number of terms they share divided by the square root of the product of the two docs' lengths.
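An in-memory sketch of these three stages, with hypothetical tokenized docs standing in for the real map-reduce pipeline:

```python
from collections import defaultdict
from itertools import combinations
import math

docs = {
    "d1": ["banana", "postage", "Guangdong"],   # hypothetical tokenized docs
    "d2": ["banana", "Guangdong", "Gaozhou"],
    "d3": ["apple", "postage"],
}

# Stage 1: invert term -> docs (one map-reduce pass in the real pipeline).
inverted = defaultdict(list)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].append(doc_id)

# Stage 2: drop single-doc terms; emit all doc pairs for the rest.
pair_counts = defaultdict(int)
for posting in inverted.values():
    if len(posting) < 2:
        continue
    for d1, d2 in combinations(sorted(posting), 2):
        pair_counts[(d1, d2)] += 1

# Stage 3: aggregate; value = co-occurrences / sqrt(len(doc1) * len(doc2)).
for (d1, d2), co in sorted(pair_counts.items()):
    print(d1, d2, co / math.sqrt(len(docs[d1]) * len(docs[d2])))
```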
For two titles title1 and title2, if similarity(title1, title2) > 0.7, we consider title1 similar to title2. Of two similar documents, the one with the higher static score is defined as the primary document and the other as the auxiliary document, and primary and auxiliary documents are indexed in separate databases.
Unlike web search, which simply deletes the auxiliary documents, we index primary and auxiliary documents separately. For each search we retrieve from the primary and auxiliary indexes in proportion and fuse the results, which preserves the diversity of the results.
Store de-duplication is slightly different from product-title de-duplication. Given our specific e-commerce scenario, we do not want the search results monopolized by a few stores, which would create a strong Matthew effect. Store de-duplication cannot reuse the method above, because that method rests on textual similarity and makes a selection on the premise that all the results are relevant; store de-duplication has no such property.
Imagine dividing the goods into a primary index and an auxiliary index according to whether or not they belong to the same store, as shown in the figure below.
A and B stand for different stores.
When searching for "banana" this indeed controls the number of results from store A, but when searching for "pear" it is wrong for store B's pear to rank first (assuming A:pear has a higher static score than B:pear).
During the search, each bucket takes an equal share of the search task (25% each, i.e. four buckets), and the results are merged into one page by static score. This keeps the relative order of the results unchanged while achieving in-store de-duplication.
As the figure above shows, although ten results satisfy a search for "banana", at most five of them can be displayed on any single page.
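A toy sketch of this bucketing scheme, with hypothetical goods and static scores; the four buckets mirror the 25% share above:

```python
from itertools import cycle

N = 4
buckets = [[] for _ in range(N)]
goods_by_store = {                  # (item, static score), hypothetical
    "A": [("A:banana1", 0.9), ("A:banana2", 0.8), ("A:banana3", 0.7)],
    "B": [("B:banana1", 0.85)],
}
# Deal each store's goods round-robin across the buckets.
for goods in goods_by_store.values():
    for bucket, item in zip(cycle(buckets), goods):
        bucket.append(item)

# Each bucket serves an equal share of the search and returns its local top
# hits; the merged page is ordered by static score.
per_bucket = 2
candidates = []
for b in buckets:
    candidates += sorted(b, key=lambda hit: hit[1], reverse=True)[:per_bucket]
page = sorted(candidates, key=lambda hit: hit[1], reverse=True)
print(page)
```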
The above sections introduced several technologies used in the indexing process; the retrieval process also involves many key technologies, the best known of which is query analysis. The query-analysis techniques we use include core-term recognition, synonym expansion, brand-term recognition, and so on. Most query-analysis techniques fall within the scope of NLP research, and we will not detail the theory here; instead we focus on synonym expansion. Unlike word segmentation or brand-term recognition, synonym expansion cannot rely on a standard library; it generally has to be trained on one's own product data and user logs.
Synonym expansion is generally obtained by analyzing users' session logs. If a user enters "Apple Phone", fails to get the desired result, and then enters "iphone", we create a transfer relationship from "Apple Phone" to "iphone". Accumulating such statistics over user queries yields a weighted graph of related queries.
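A toy sketch of mining such transfer weights from session logs, treating consecutive queries in a session as a rewrite and normalizing the counts into edge weights (the sessions here are made up):

```python
from collections import defaultdict

sessions = [                        # hypothetical session logs
    ["Apple Phone", "iphone"],
    ["Apple Phone", "iphone"],
    ["Apple Phone", "iphone 6"],
]
transitions, totals = defaultdict(int), defaultdict(int)
for session in sessions:
    for q1, q2 in zip(session, session[1:]):
        transitions[(q1, q2)] += 1  # count each observed rewrite q1 -> q2
        totals[q1] += 1

weights = {pair: n / totals[pair[0]] for pair, n in transitions.items()}
print(weights)   # e.g. {('Apple Phone', 'iphone'): 0.67, ('Apple Phone', 'iphone 6'): 0.33}
```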
Suppose the user enters the query "Apple Phone". Query analysis finds two synonyms for it: "iphone" with weight 0.8 and "iphone 6" with weight 0.5, where the weights indicate the degree of synonymy. We then want to issue all three queries, "Apple Phone", "iphone" and "iphone 6", at once, each weighted according to its degree of synonymy. The query boosting that ElasticSearch provides can support this requirement.
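A sketch of the expanded query in Python-dict form; a bool/should with per-clause boosts carrying the synonym weights is one common way to express this kind of weighted expansion in the ES query DSL:

```python
# The boost on each clause carries the synonym weight from the query graph.
expanded_query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "Apple Phone", "boost": 1.0}}},
                {"match": {"title": {"query": "iphone",      "boost": 0.8}}},
                {"match": {"title": {"query": "iphone 6",    "boost": 0.5}}},
            ]
        }
    }
}
```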