Before introducing search, two concepts need to be introduced first: information and information quantity.
(Both are explained in a way I find easy to understand; if you are interested, you can read the related books.)
1.1 Information is something that reduces uncertainty; information is also something that increases certainty.
The first half of the sentence is Shannon's definition of information, and the second half is the inverse of Shannon's definition. For example, think back to the process of getting to know someone you are attracted to. Before you met them, you did not even know this person existed. Later you saw what they looked like; later still, you learned their personality, their pet phrases, and their past. Step by step, you went from knowing nothing about them to being familiar with them. Throughout this period you kept receiving information about them, and it is this information that took you from being completely unsure who they were to being completely sure they were right for you.
1.2 Information quantity is a measure of how much a piece of information reduces uncertainty; it is also a measure of how much a piece of information increases certainty.
There are many mathematical definitions of information quantity, but in general it can be understood quite simply. For example, suppose four witnesses describe a suspect. Witness A's statement is "He is a man." Witness B's statement is "He is a high school boy." Witness C's statement is "He is a high school boy with long hair, about 170 cm tall." Witness D's statement is "I know him: he is Chen Haonan, the big shot of the school." Intuitively, we feel the information quantities are related as A < B < C < D.
Suppose the proportion of men in the local population is 50%, the proportion of local high school boys is 8%, the proportion of local high school boys with long hair of about 170 cm is 4%, and the proportion of local school big shots named Chen Haonan is 0.0001%. Because P(A) > P(B) > P(C) > P(D), the information quantities are related as A < B < C < D: the less probable a statement, the more information it carries.
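Shannon's measure of self-information makes this intuition precise: a statement with probability p carries -log2(p) bits of information. A quick check with the made-up proportions above:

```python
import math

def self_information(p):
    """Shannon self-information in bits: rarer statements carry more information."""
    return -math.log2(p)

# probability that each witness's statement is true of a random local person
p = {"A": 0.50, "B": 0.08, "C": 0.04, "D": 0.000001}

bits = {w: self_information(prob) for w, prob in p.items()}
# P(A) > P(B) > P(C) > P(D)  =>  I(A) < I(B) < I(C) < I(D)
assert bits["A"] < bits["B"] < bits["C"] < bits["D"]
print({w: round(b, 2) for w, b in bits.items()})
# -> {'A': 1.0, 'B': 3.64, 'C': 4.64, 'D': 19.93}
```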
2. Product logic of search
Search meets users' need to find content of interest quickly. When a user inputs a query, the search system filters out the content it believes the user is interested in, based on the information the user has provided, and displays it in order of the importance the system assigns. Pay attention to this description: in short, search can be divided into three steps.
Step 1: Interpretation of user input information
Step 2: Filter the content according to the user input information.
Step 3: Sort the filtered results.
To understand how these three steps are completed in the search system, we must first know how the search server stores information.
3. The storage principle of search data
Suppose we build a news website; its structure is shown in Figure 1-1. The content is simplified: assume a news item has only a title, a lead, and a body text, and the only data are the read count, comment count, and share count.
Figure 1-1
Figure 1-1 shows roughly this structure. On the right is the storage of the news content itself: like books in a library, items are arranged neatly in an order that makes them easy to find (this storage structure is called an index, a term borrowed from libraries). On the left is the thesaurus: as long as an input search term can be matched to an entry in the thesaurus, the content corresponding to that entry can be found quickly.
Every search system has its own thesaurus, and a search whose input cannot be matched to any entry after word segmentation will return no results. Each search system builds its thesaurus for its own target users, just as dictionaries do: the entries in a metallurgical dictionary and a biological dictionary differ, and so do the thesauri of Zhihu and Taobao. Many search optimizations focus on optimizing the thesaurus.
To sum up briefly, the storage principle of search is: a systematic thesaurus, a neatly arranged content index, and a fast association between the two.
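The storage principle above can be sketched as a tiny inverted index in Python (the documents and terms here are made up for illustration; a real system would index the title, lead, and body separately):

```python
from collections import defaultdict

# toy news corpus: doc id -> title
DOCS = {
    1: "china approves genetically modified food standard",
    2: "library opens new reading room",
    3: "genetically modified crops debated in china",
}

# the "thesaurus + index" structure: term -> ids of documents containing it
index = defaultdict(set)
for doc_id, title in DOCS.items():
    for term in title.split():
        index[term].add(doc_id)

# a matched thesaurus entry leads straight to the content
print(sorted(index["china"]))  # -> [1, 3]
```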
On the basis of the storage structure of this search system, the three search steps we mentioned will be carried out in turn.
4. Step 1: Interpretation of user input information
As mentioned earlier, the search thesaurus is finite, but user input is infinite. So how do we map infinite input onto a finite thesaurus and match the corresponding results? Here we need to introduce a new concept: word segmentation, which simply means splitting the input string.
Take the news search system in Figure 1-1 as an example. If the user's query is "genetically modified food in China" and the thesaurus has no such entry, then without word segmentation the search would end immediately, even if the system actually contains matching content. Word segmentation works by further splitting the user's input when it cannot be matched exactly. So we get the following result.
Genetically modified food in China —— "China", "de" (的), "genetically modified food".
Not every word carries information. Imagine recalling results for the particle "de" (的): it appears in almost all news content, and recalling that many results is obviously wrong. In practice the word segmentation system will simply ignore it. Words appear in content with different probabilities: the more news items a word appears in, the less information it carries, and words carrying too little information are ignored; these are called stop words. Conversely, content in which the remaining, informative words appear is more important. After stop words are removed, the result is further simplified.
Genetically modified food in China —— "China", "genetically modified food".
After this processing, the user's non-standard query has been transformed into standard thesaurus entries, and the corresponding content can be found quickly, as shown in Figure 1-1.
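Step 1 can be sketched as a greedy longest-match segmentation against the thesaurus, with stop words dropped. The thesaurus and stop-word list below are made up (using "in" to stand in for 的), and real Chinese word segmentation is considerably more involved:

```python
# hypothetical system thesaurus and stop-word list
VOCAB = ["genetically modified food", "china", "in"]
STOPWORDS = {"in"}

def segment(query):
    """Greedy longest-match segmentation against the thesaurus, dropping stop words."""
    query = query.lower()
    terms = []
    i = 0
    while i < len(query):
        for term in sorted(VOCAB, key=len, reverse=True):
            if query.startswith(term, i):
                if term not in STOPWORDS:
                    terms.append(term)
                i += len(term)
                break
        else:
            i += 1  # character not covered by the thesaurus: skip it
    return terms

print(segment("Genetically modified food in China"))
# -> ['genetically modified food', 'china']
```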
5. Step 2: Filter the content according to the user input information.
After the user's query has been interpreted, we have a set of standardized words, and those words correspond to some candidate content; the next step is to filter that content.
When a user performs a search, some results are retrieved. All content can then be divided into four parts along two dimensions: whether the content is relevant, and whether the content was recalled.
Recalled, relevant content: the part of the retrieved content that is related to the user's search.
Recalled, irrelevant content: the part of the retrieved content that is unrelated to the user's search.
Not recalled, relevant content: the part of the content that was not retrieved but is related to the user's search.
Not recalled, irrelevant content: the part of the content that was not retrieved and is unrelated to the user's search.
Generally speaking, the decision of what to filter out is measured from two angles: precision and recall.
Precision is the proportion of recalled content that is relevant:
Precision = (recalled and relevant) / (all recalled)
Recall is the proportion of all relevant content that was recalled:
Recall = (recalled and relevant) / (all relevant)
Precision and recall are conflicting indicators, so a trade-off is needed. The final measure takes the harmonic mean of the two as the objective function, the F1 value:
F1 = 2 × Precision × Recall / (Precision + Recall)
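The three measures can be computed directly from two sets; here is a small sketch with made-up recall results:

```python
def precision_recall_f1(recalled, relevant):
    """Compute precision, recall, and their harmonic mean (F1) from two sets of doc ids."""
    hit = len(recalled & relevant)            # recalled AND relevant
    precision = hit / len(recalled)
    recall = hit / len(relevant)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

recalled = {1, 2, 3, 4, 11, 12}   # what the search returned (11, 12 are irrelevant)
relevant = set(range(1, 11))      # the 10 items that should have been found
p, r, f1 = precision_recall_f1(recalled, relevant)
print(round(p, 3), round(r, 3), round(f1, 3))  # -> 0.667 0.4 0.5
```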
These three concepts are key indicators of search optimization, involving manual relevance scoring and more advanced optimization that we will not go into here. We just need to remember one thing: not all results containing the user's query keywords should be recalled.
6. Step 3: Sort the filtered results.
Ranking affects the quality of search results: the higher a result is placed, the easier it is for users to click. A good search system not only retrieves as much of the content that should be retrieved as possible, but also ensures that the content most likely to attract users is displayed first.
The basic logic of search ranking is universal:
The text input by the user is converted into words from the standard thesaurus; the search system decides whether to display each piece of content according to whether it contains these words; at the same time, the system scores the content to be displayed according to its textual relevance; the final ranking is based on each item's score.
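The three steps above can be put together in a toy end-to-end sketch (the corpus, thesaurus words, and scoring rule are all made up; real systems use Lucene-style relevance scoring rather than a simple match count):

```python
# toy corpus: doc id -> set of thesaurus words it contains
DOCS = {
    1: {"china", "genetically modified food"},
    2: {"china"},
    3: {"weather"},
}

def search(query_terms):
    """Step 2: keep docs containing any query term; step 3: sort by matched-term count."""
    scored = []
    for doc_id, words in DOCS.items():
        matched = len(words & set(query_terms))
        if matched:                      # filter
            scored.append((matched, doc_id))
    scored.sort(reverse=True)            # sort by score, best first
    return [doc_id for _, doc_id in scored]

print(search(["china", "genetically modified food"]))  # -> [1, 2]
```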
The principle of Lucene's core scoring formula is documented online, but the actual situation is more complicated. Let's take the news search system mentioned earlier as an example (see Figure 1-1 again).
If a user searches for "genetically modified", the score should differ depending on whether the term appears in the title, the lead, or the body: an occurrence in the title should obviously score higher. Business data should also be considered: even if a post with only 3 reads is slightly more textually relevant than a post with 100,000+ reads, the post with 100,000+ reads should obviously rank first.
In fact, all the material can be divided into two categories: text and data. Text is used to compute the relevance of content, and this part of the score is handled by Lucene's mature algorithms; there are also ready-made open-source solutions on the market. How to combine the text score with the data is the core of search system design.
Taking the Solr system (built on Lucene) as an example, configuring the weights of text and data is actually very simple:
<str name="qf"> gives weights to the text fields.
<str name="bf"> gives weight to (boosts by) the business data.
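As a rough sketch, the two settings might sit in an edismax request handler in solrconfig.xml like this (the handler name, field names, and values are hypothetical, chosen to match the example weights used in this article):

```xml
<!-- hypothetical edismax handler in solrconfig.xml -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- qf: text fields and their relevance weights -->
    <str name="qf">title^10 lead^5 body^1</str>
    <!-- bf: additive boost computed from business data -->
    <str name="bf">sum(product(comments,100),product(shares,200),reads)</str>
  </lst>
</requestHandler>
```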
After studying the mechanism of the Solr system, we can rearrange Solr's core formula into roughly the following form:

Score = S × Σ(w_i × t_i) + norm × Σ(b_j × d_j)

w_i represents the weight we give to each kind of text. For example, this system has three kinds of text: title, lead, and body. By importance, the title weight is 10, the lead weight is 5, and the body weight is 1.
t_i represents the text relevance score given by the Lucene algorithm for that text. It comprehensively considers the number of occurrences of the word in the text, the probability of the search word appearing across all texts, and other factors (readers who want to understand the principle can look up TF-IDF and cosine similarity).
b_j represents the weight we give to each kind of data. For example, this system has three kinds of data: comment count, share count, and read count. By importance, the comment weight is 100, the share weight is 200, and the read weight is 1. (Usually time decay is also introduced, which we will not discuss here.)
d_j represents the specific value of each kind of data, i.e. the comment count, share count, and read count themselves.
norm represents the normalization coefficient: however large the weights given, it keeps the final total score within a reasonable range.
S represents a score judged by the algorithm from the query itself: the amount of information the user's query provides. The more information the input query provides, the greater S is.
When S increases and everything else stays unchanged, the contribution of the text part to the overall score grows, i.e. text weighs more relative to business data. For example, the query "Beijing National Day traffic congestion" provides more information to the system than a shorter query such as "Beijing congestion", so its S value is greater and the text score takes a larger share of the total.
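The scoring idea described here can be sketched numerically. The weights are the example ones from this article; the relevance scores, data values, S, and norm are arbitrary made-up numbers:

```python
def total_score(text_scores, data_values, s, norm):
    """Score = S * sum(w_i * t_i) + norm * sum(b_j * d_j), using the example weights."""
    text_weights = {"title": 10, "lead": 5, "body": 1}
    data_weights = {"comments": 100, "shares": 200, "reads": 1}
    text_part = sum(text_weights[f] * text_scores.get(f, 0.0) for f in text_weights)
    data_part = sum(data_weights[f] * data_values.get(f, 0) for f in data_weights)
    return s * text_part + norm * data_part

# a document whose title matches well, with some engagement
score = total_score(
    {"title": 0.9, "lead": 0.2, "body": 0.1},
    {"comments": 3, "shares": 1, "reads": 500},
    s=1.5,       # query information score (hypothetical)
    norm=0.001,  # normalization coefficient (hypothetical)
)
print(round(score, 3))  # -> 16.15
```

Note how a larger S inflates only the text part, shifting the balance away from business data, exactly as described above.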
So we can see that what ultimately affects the ranking is the weight we give to the text and the weight we give to the business data.
These two sets of weights determine the final ranking of the search, and assigning them well is where a search system's understanding of the business shows.