TermWeight word weight algorithm
In a user's search query, some words dominate the core semantic intent and must be emphasized during recall and ranking, while others matter little: even if they are dropped, the user's core meaning is unaffected.

TermWeight is a module that automatically computes the relative importance of each term in a user query. By distinguishing the importance of the different terms and assigning each a score, we can recall the results most relevant to the user's intent and thus improve the search experience.

Baseline method: over the collection of queries and documents, compute each term's TF-IDF value as its word weight, then normalize.

Advantages: simple and easy to implement.

Disadvantages: each term's weight is static and cannot change with context, so the effect is limited.
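A minimal sketch of this TF-IDF baseline (the function, smoothing, and normalization choices here are illustrative, not prescribed above):

```python
import math
from collections import Counter

def tfidf_term_weights(query_terms, doc_collection):
    """Normalized TF-IDF weight for each term in a query.

    query_terms: list of tokens in the query.
    doc_collection: list of documents, each a list of tokens.
    """
    n_docs = len(doc_collection)
    tf = Counter(query_terms)
    weights = {}
    for term, freq in tf.items():
        # Document frequency: how many documents contain the term.
        df = sum(1 for doc in doc_collection if term in doc)
        idf = math.log((n_docs + 1) / (df + 1)) + 1  # smoothed IDF
        weights[term] = freq * idf
    # Normalize so the weights sum to 1.
    total = sum(weights.values()) or 1.0
    return {t: w / total for t, w in weights.items()}

print(tfidf_term_weights(
    ["beijing", "weather"],
    [["weather", "in", "beijing"], ["beijing", "travel", "guide"]],
))
```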

Term weight training methods fall into two main categories: (1) regression based on click data and (2) ranking based on partial order relations.

The click-based regression approach treats term weighting as a regression task and uses the term recall score to express the importance of each term in the query.

Implementation: build a training set from query-title click data in uclog, i.e., compute word weights using the term recall rate as the target metric.

The term recall rate is calculated as follows:
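A common definition consistent with the click-based setup (assuming the clicked titles for a query form its relevant set):

$$ \mathrm{TermRecall}(t, q) = \frac{|\{\, d \in D_q : t \in d \,\}|}{|D_q|} $$

where $D_q$ is the set of clicked titles for query $q$ and $t$ is a term of $q$.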

In the cold-start stage, or when the click data is not accurate enough, data can be labeled manually with tiered regression scores. For example:
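An illustrative grading scheme (the levels and scores below are hypothetical placeholders):

- Core term (must be matched): 1.0
- Important term: 0.7
- Ordinary term: 0.4
- Dispensable term (stopwords, modal particles, etc.): 0.1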

Note: The number of levels and scores can be adjusted according to specific business scenarios.

The partial-order-based method treats term weighting as a ranking task: when labeling data, partial order relations express the relative importance of the terms in a query, for example:
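For illustration (a hypothetical annotation): given the query "download harry potter ebook", an annotator might label "harry potter" > "ebook" > "download", meaning "harry potter" is the most important term for matching the user's intent.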

This method suits scenarios with short queries, where most clicked docs contain every term in the query; there the click-regression method breaks down, since nearly every term gets a recall rate of 1.

Different application scenarios call for different schemes:

(1) DeepCT/DeepRT

Word weighting based on deep contextual semantics: (a) generate contextual word embeddings with a deep model; (b) predict word weights via linear regression.

Advantages: exploits contextual semantics; good effectiveness.

Disadvantages: the model is relatively complex, and its complexity must be controlled to meet real-time inference requirements.

(2) Feature+ML

Word weight prediction based on hand-designed features and a machine learning regression model.

Advantages: efficient computation; supports real-time inference.

Disadvantages: features must be carefully designed by hand.

The two schemes are detailed below.

The overall framework of DeepCT/DeepRT is: (a) generate contextual word embeddings with a deep model; (b) predict word weights via linear regression.

If the training set is built from click-based regression labels, a loss such as MSE can be used directly; if it is built from partial order relations, a pairwise hinge loss can be used.
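A minimal sketch of both losses in PyTorch (assuming the model outputs one scalar score per term; the function names are illustrative):

```python
import torch.nn.functional as F

def regression_loss(pred_weights, target_weights):
    # Click-regression labels: plain MSE against term recall rates.
    return F.mse_loss(pred_weights, target_weights)

def pairwise_hinge_loss(score_important, score_less_important, margin=1.0):
    # Partial-order labels: for each annotated pair (a > b), require
    # score(a) to exceed score(b) by at least `margin`.
    return F.relu(margin - (score_important - score_less_important)).mean()
```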

The original paper uses BERT to extract contextual semantics; in my own practice I used BiLSTM+Attention. Either way, the core idea is the same: judge each term's importance dynamically from its current context.
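A minimal PyTorch sketch of the BiLSTM+Attention variant (the dimensions, attention form, and sigmoid output are my own assumptions, not the exact architecture used in practice):

```python
import torch
import torch.nn as nn

class BiLSTMTermWeight(nn.Module):
    """Contextual embeddings via BiLSTM, then per-token regression."""

    def __init__(self, vocab_size, emb_dim=128, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)  # token-level attention score
        self.reg = nn.Linear(2 * hidden, 1)   # per-token weight regressor

    def forward(self, token_ids):               # token_ids: (B, T)
        h, _ = self.lstm(self.emb(token_ids))   # (B, T, 2H) contextual states
        a = torch.softmax(self.attn(h), dim=1)  # attention over the T tokens
        scores = self.reg(h * a).squeeze(-1)    # (B, T) raw scores
        return torch.sigmoid(scores)            # term weights in [0, 1]

model = BiLSTMTermWeight(vocab_size=30000)
print(model(torch.randint(0, 30000, (2, 5))).shape)  # torch.Size([2, 5])
```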


If the system has strict RT (response time) requirements, BiLSTM is recommended; if effectiveness is the priority, a pre-trained language model has the advantage. Choose according to your own business scenario.

The overall idea of Feature+ML is to design effective features by hand and then use a GBDT/LR-style machine learning model for regression or ranking. Commonly used models include XGBoost, LightGBM, and so on.

Obviously, the effectiveness of this method depends on the quality of the feature design, and the specific features will differ across business scenarios. Some common features follow.

Static features of a term: IDF value, term frequency, term length, part of speech, position in the query, whether it is a stopword, whether it is a modal particle, whether it is a proper noun (person/place name), and so on.

Term-query interaction features: the ratio of term length to query length, TextRank score, the relative position of the term in the query, the term's contribution to the query, and so on.

N-gram features: statistical features of the term itself, of n-grams beginning with the term, and of n-grams ending with the term, and so on (usually bigram and trigram models).

After feature design, an ML model can be used for regression prediction or ranking; this part is straightforward, so only a minimal sketch is given below.
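A minimal sketch of the Feature+ML route using LightGBM (the feature matrix and labels below are random placeholders, just to show the wiring):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Placeholder rows: one row per (query, term) pair, e.g.
# [idf, term_freq, term_len, is_stopword, len_ratio, rel_position].
X_train = rng.random((1000, 6))
y_train = rng.random(1000)     # e.g. term recall rates as labels

model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

# Score each term of a new query, then normalize into weights.
X_query = rng.random((4, 6))   # a query with 4 terms
raw = model.predict(X_query)
weights = raw / raw.sum()
print(weights)
```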