The recall phase corresponds to the various recommendation algorithms covered in the previous article. For example, according to public reports, Spotify used at least three algorithms to generate its widely acclaimed Discover Weekly playlist, including collaborative filtering based on matrix factorization, NLP applied to text such as music reviews, and audio analysis of the raw tracks.
Each algorithm has its own strengths: audio analysis naturally helps with the cold-start problem, and NLP applied to music reviews can absorb the domain knowledge of professional critics. The algorithms run independently and each produces its own results, so the number of algorithms can be increased or decreased, and each one can be iterated on separately.
The recall phase filters tens of millions of tracks down to a candidate set of a few hundred, and the ranking stage then selects 30 songs for each user. Ranking can be understood as a function: the inputs are the user, the item, and the context; the output is a score between 0 and 1; the songs with the highest scores are chosen. This process is often called CTR (click-through rate) estimation.
This article looks at the common forms this "function" takes and how it works in practice.
The simplest is logistic regression, a generalized linear model.
Take a user's profile vector, say [3, 1], concatenate the item's profile vector, say [4, 0], and then the vector representing the context, say [0, 1, 1], to get the feature vector [3, 1, 4, 0, 0, 1, 1]. If the user has interacted with the item, the label is 1, which gives a positive sample; items the user "skipped", or popular items the user was shown but never touched, can serve as negative samples with label 0. We can then fit the following equation:
$$\hat{y} = \frac{1}{1 + e^{-(w \cdot x + b)}}$$

where $x$ is the feature vector above, $w$ holds a weight for each element of $x$, and $b$ is the intercept. The loss function is the cross-entropy:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$

where $y_i$ is the sample label (0 or 1) and $\hat{y}_i$ is the probability between 0 and 1 predicted by the model.
Minimizing this loss function over the training samples completes the training of the model; using the trained model to predict on new data completes the scoring. With sklearn's LogisticRegression, the training process is easy to set up.
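Below is a minimal sketch of this workflow with sklearn's LogisticRegression; the feature vectors and labels are made-up illustrations following the example above, not real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: user profile + item profile + context, concatenated as described above.
X = np.array([
    [3, 1, 4, 0, 0, 1, 1],   # positive sample: the user interacted with the item
    [3, 1, 2, 5, 0, 1, 1],   # negative sample: the user skipped the item
    [0, 2, 4, 0, 1, 0, 1],
    [0, 2, 2, 5, 1, 0, 1],
])
y = np.array([1, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# Scoring a new candidate: the predicted probability, a number between 0 and 1.
candidate = np.array([[0, 2, 4, 0, 0, 1, 1]])
print(model.predict_proba(candidate)[:, 1])
```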
Traditional LR can only be trained offline in large batches and cannot efficiently handle large-scale online data streams; updating the model may take a day or more, which is not timely enough. In 2013, Google proposed Follow The Regularized Leader (FTRL), an online learning algorithm for logistic regression. It modifies the objective function of logistic regression and, combined with various systems-engineering optimizations, lets the model parameters be updated dynamically on each incoming data point.
You can find many open source implementations of FTRL on the Internet, such as libftrl-python.
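For a sense of how the online update works, here is a minimal sketch of the per-coordinate FTRL-Proximal update for logistic regression in the form described in Google's paper; the hyperparameter values are illustrative defaults, not tuned settings.

```python
import math
from collections import defaultdict

class FTRLProximal:
    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)   # per-coordinate adjusted gradient sums
        self.n = defaultdict(float)   # per-coordinate sums of squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0                # L1 regularization keeps the model sparse
        sign = -1.0 if z < 0 else 1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, x):
        """x: dict of feature index -> value. Returns a probability in (0, 1)."""
        wx = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))

    def update(self, x, y):
        """One online step on a single (features, label) pair, y in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v            # gradient of the log loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g
```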
FM and FFM stand for Factorization Machine and Field-aware Factorization Machine, respectively.
As a generalized linear model, LR has difficulty capturing nonlinear relationships between the feature vector and the label. This is where feature combinations come in. Suppose you use a linear model to predict the area of various roughly rectangular shapes from two features, length and width: clearly no good model can be learned. But adding the cross feature length × width immediately gives good results, as the small example below shows.
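A tiny illustration of this claim, on randomly generated data (the numbers are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
length = rng.uniform(1, 10, size=200)
width = rng.uniform(1, 10, size=200)
area = length * width                        # the target we want to predict

X_plain = np.column_stack([length, width])                    # original features only
X_crossed = np.column_stack([length, width, length * width])  # plus the cross feature

print(LinearRegression().fit(X_plain, area).score(X_plain, area))      # clearly below 1
print(LinearRegression().fit(X_crossed, area).score(X_crossed, area))  # essentially 1.0
```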
In practice, feature vectors are very high-dimensional, so meaningful combinations like the one above cannot simply be spotted by hand. Taking all pairwise feature combinations into account, the linear equation becomes:

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{ij}\, x_i x_j$$
Besides the weights of the original features, we now also have to learn a weight $w_{ij}$ for every feature combination. Training each $w_{ij}$ requires plenty of samples in which both $x_i$ and $x_j$ are non-zero, but because of the sparsity introduced by one-hot encoding and other factors this requirement is rarely met; the shortage of training samples makes the learned weights inaccurate, which hurts the quality of the model.
The solution borrows from matrix factorization. In recommendation systems, the user-item matrix is factorized so that each user and each item is represented by a low-dimensional latent vector. The situation here is analogous: collect all the combination weights into an $n \times n$ matrix $W$, whose entry in row $i$ and column $j$ is $w_{ij}$. This matrix can be factorized as $W \approx VV^{\top}$, where $V$ has shape $n \times k$, so every feature $i$ gets a latent vector $v_i$ and the combination weight is obtained by a dot product, $w_{ij} = \langle v_i, v_j \rangle$. The equation then becomes:

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$
The model above is the Factorization Machine (FM). After some mathematical rearrangement, both training and prediction can be done in $O(kn)$ time, making it a fairly efficient model.
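As a concrete illustration, here is a minimal sketch of the FM scoring function using the $O(kn)$ identity $\sum_{i<j}\langle v_i, v_j\rangle x_i x_j = \tfrac{1}{2}\sum_f\big[(\sum_i v_{i,f}\, x_i)^2 - \sum_i v_{i,f}^2\, x_i^2\big]$; the parameters are randomly initialized placeholders rather than a trained model.

```python
import numpy as np

n_features, k = 7, 4
rng = np.random.RandomState(0)
w0 = 0.0
w = rng.normal(scale=0.1, size=n_features)        # linear weights
V = rng.normal(scale=0.1, size=(n_features, k))   # one latent vector per feature

def fm_score(x):
    """Raw FM output for one feature vector x of length n_features."""
    linear = w0 + w @ x
    s = V.T @ x                  # shape (k,): sum_i v_{i,f} * x_i for each factor f
    s2 = (V ** 2).T @ (x ** 2)   # shape (k,): sum_i v_{i,f}^2 * x_i^2
    pairwise = 0.5 * np.sum(s ** 2 - s2)
    return linear + pairwise

x = np.array([3, 1, 4, 0, 0, 1, 1], dtype=float)
print(fm_score(x))   # pass through a sigmoid to turn this into a 0-1 CTR score
```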
Building on FM, the Field-aware Factorization Machine (FFM) was proposed. For example, a feature vector may contain more than 200 dimensions describing the user's country, such as country.uk and country.us; these 200-plus features can be regarded as belonging to one field. The difference from FM is that each feature learns not a single latent vector but one latent vector per field, and the combination weight for features $i$ and $j$ is the dot product of feature $i$'s latent vector for feature $j$'s field with feature $j$'s latent vector for feature $i$'s field. The equation becomes:

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_{i,f_j}, v_{j,f_i} \rangle\, x_i x_j$$
This method is more expressive, but the prediction time complexity rises to $O(kn^2)$. The open-source library libffm provides an implementation.
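To make the field-aware lookup concrete, here is a minimal sketch of the FFM pairwise term; the feature count, field assignment, and parameters are made up for illustration.

```python
import numpy as np

n_features, n_fields, k = 7, 3, 4
field_of = [0, 0, 1, 1, 2, 2, 2]   # which field each feature belongs to (illustrative)
rng = np.random.RandomState(0)
V = rng.normal(scale=0.1, size=(n_features, n_fields, k))  # one vector per (feature, field)

def ffm_pairwise(x):
    """O(k * n^2) pairwise interaction term of FFM for one feature vector x."""
    total = 0.0
    for i in range(n_features):
        if x[i] == 0:
            continue
        for j in range(i + 1, n_features):
            if x[j] == 0:
                continue
            # feature i's vector for j's field, dotted with feature j's vector for i's field
            w_ij = V[i, field_of[j]] @ V[j, field_of[i]]
            total += w_ij * x[i] * x[j]
    return total

x = np.array([3, 1, 4, 0, 0, 1, 1], dtype=float)
print(ffm_pairwise(x))
```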
Facebook's approach to estimating ad click-through rates is the GBDT (gradient boosting decision tree) & LR scheme.
The idea is to use GBDT to filter and combine the feature vector that would otherwise go straight into LR, generating a new feature vector that is then fed into LR. As shown in the figure:
As an ensemble model, GBDT uses multiple decision trees, each tree fitting the residual of the previous one to achieve a good fit. When a sample is fed into a tree, it descends according to the condition at each node until it reaches a leaf; that leaf is set to 1 and the remaining leaves to 0. For example, suppose three decision trees are used, each with five leaf nodes, and a sample lands on the 1st, 2nd, and 3rd leaf of the three trees respectively. This yields three one-hot codes, [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], and [0, 0, 1, 0, 0], which are concatenated into the transformed feature vector [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0] and fed into the LR model to produce a score.
This model brought a clear improvement to Facebook's ad performance. The published paper also discusses various engineering practices and details, including how frequently to update the GBDT and LR parts and how to choose the downsampling rate, and is worth a read. The open-source XGBoost package can be used to implement the GBDT part.
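Here is a minimal sketch of that pipeline, using XGBoost for the GBDT part and sklearn for the leaf encoding and LR; the data is randomly generated and the tree settings are illustrative only.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.rand(1000, 10)
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0.8).astype(int)   # made-up labels

# Step 1: train the GBDT (here: 3 trees, mirroring the example above).
gbdt = XGBClassifier(n_estimators=3, max_depth=3)
gbdt.fit(X, y)

# Step 2: map each sample to the leaf it reaches in every tree,
# then one-hot encode the leaf indices to get the transformed feature vector.
leaves = gbdt.apply(X)                 # shape: (n_samples, n_trees)
encoder = OneHotEncoder()
X_transformed = encoder.fit_transform(leaves)

# Step 3: train LR on the transformed features to produce the final score.
lr = LogisticRegression()
lr.fit(X_transformed, y)
scores = lr.predict_proba(X_transformed)[:, 1]
```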
Google uses a model called Wide & Deep. As shown in the figure below:
The wide part is a generalized linear model, with some suitable feature combinations added on top of the original features. The deep part is a feedforward neural network that learns low-dimensional dense vectors from sparse features. The outputs of the wide and deep parts are added together and passed through a sigmoid to make the prediction, which is expressed as:

$$P(Y = 1 \mid x) = \sigma\!\left(w_{wide}^{\top}\,[x, \phi(x)] + w_{deep}^{\top}\, a^{(l_f)} + b\right)$$

where $\sigma$ is the sigmoid function, $w_{wide}$ are the weights of the wide part, $[x, \phi(x)]$ are the original features together with the wide part's combination features, $a^{(l_f)}$ is the output of the last layer of the deep network with weights $w_{deep}$, and $b$ is the bias of the linear model.
The two parts are trained jointly (unlike ensemble learning, where each model is trained separately and their results are then combined), so that each makes up for the other's shortcomings (laborious feature engineering on one side, weaker interpretability on the other). Compared with a wide-only model, Wide & Deep brought a 3.9% increase in app acquisitions on Google Play. For an implementation, refer to the tensorflow/models project.
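For orientation, here is a minimal Keras sketch of the structure; the input dimensions and layer sizes are made-up placeholders, and a single optimizer is used for simplicity, whereas the paper trains the wide part with FTRL and the deep part with AdaGrad.

```python
import tensorflow as tf

WIDE_DIM = 1000    # e.g. one-hot and crossed features (placeholder size)
DEEP_DIM = 50      # e.g. dense / embedded features (placeholder size)

wide_in = tf.keras.Input(shape=(WIDE_DIM,), name="wide_features")
deep_in = tf.keras.Input(shape=(DEEP_DIM,), name="deep_features")

# Deep part: a feedforward network over the dense features.
deep = tf.keras.layers.Dense(256, activation="relu")(deep_in)
deep = tf.keras.layers.Dense(128, activation="relu")(deep)

# Combine the wide (linear) part and the deep part, then apply a sigmoid:
# a single Dense layer over the concatenation gives w_wide.x + w_deep.a + b.
both = tf.keras.layers.concatenate([wide_in, deep])
out = tf.keras.layers.Dense(1, activation="sigmoid")(both)

model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit([X_wide, X_deep], y, epochs=5)   # joint training of both parts
```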