Paper title: Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction.
Link: https://arxiv.org/pdf/2006.05639.pdf.
This is another excellent piece of work from Alimama (Alibaba's advertising team), published at CIKM 2020. Let's take a look at this paper.
In CTR/CVR prediction, a user's historical behavior is highly informative for the model. A user's many "interest points" are hidden in the historical behavior sequence, and every behavior is a response to some interest. For example, I may like all kinds of lipsticks and facial cleansers without being loyal to any particular brand; driven by these interests, I have browsed and clicked many related items. Do these historical behaviors help predict my future behavior? The answer is yes, and it is exactly this behavior pattern that motivates modeling users' histories. The longer the behavior sequence, the richer the interests it reveals, but also the greater the challenge it brings. In practice, user interests are divergent and diverse, so finding, among all of them, the interests that actually help the current task is critical.
Before introducing this paper, I suggest reading another Alibaba paper, MIMN, which also tackles CTR prediction with long user behavior sequences. However, MIMN has several problems. First, when the behavior sequence grows further (for example, by a factor of ten or more), MIMN cannot accurately capture user interest with respect to a specific candidate item. Second, MIMN does not fully solve the latency and storage bottleneck; that is, how can online serving keep latency comparable to other lightweight models?
On Taobao, a user's browsing sequence can reach thousands or even tens of thousands of items. How can such long sequences be used efficiently and effectively? Alimama proposed the SIM model, which digs valuable interest points out of the long-term behavior sequence and provides a feasible scheme for serving long behavior sequences online. Let's take a look at the paper.
Model overview:
SIM is divided into two stages, each with its own core component. The paper splits long-sequence user behavior modeling into two modules, the General Search Unit (GSU) and the Exact Search Unit (ESU), which are the core modules of the two stages. Briefly, the GSU, as shown in the figure, selects the K items most similar to the candidate item from a long user sequence that may contain thousands of behaviors; it is analogous to the recall module of a recommendation system, shortening the long sequence first before the subsequent task. The ESU then models the K-item sequence extracted by the GSU, produces a vector that represents the user's long-term interest, and uses that vector in the subsequent ranking.
The main task of the GSU is to extract the K most relevant items from a sequence of length T. The GSU has two ways to select the top K: hard search and soft search. As mentioned above, the GSU is similar to the recall stage of a recommendation system; in multi-channel recall there are generally embedding-based recall and rule/policy-based recall, and here hard search corresponds to rule-based recall while soft search corresponds to embedding-based recall. Let's go through the two methods in detail.
Hard search is intuitive and simple to implement: we filter the behavior sequence down to the behaviors related to the current target item according to a given rule. For example, on Taobao I may have browsed very different kinds of goods (electronics, lipsticks, men's shoes, etc.). When the candidate advertisement is an iPhone 12, hard search selects only the electronics-related behaviors from my history for the pCTR prediction, while the lipstick and men's shoes behaviors have no influence on this prediction. The paper uses the item category as the filtering criterion for hard search.
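As a concrete illustration (my own sketch, not the production code), hard search amounts to a category filter over the behavior sequence; the tuple layout and the helper name `hard_search` below are assumptions made for the example:

```python
def hard_search(candidate_category, behavior_sequence, k):
    """Rule-based GSU sketch: keep only behaviors whose category matches the
    candidate ad's category, then keep the most recent k of them.

    behavior_sequence is assumed to be a list of (item_id, category_id)
    tuples ordered from oldest to newest -- a simplification for illustration.
    """
    same_category = [b for b in behavior_sequence if b[1] == candidate_category]
    return same_category[-k:]

# Usage: with an "electronics" candidate, lipstick and shoe behaviors are dropped.
history = [("lipstick_A", "beauty"), ("iphone_case", "electronics"),
           ("sneaker_B", "shoes"), ("airpods", "electronics")]
print(hard_search("electronics", history, k=2))
# -> [('iphone_case', 'electronics'), ('airpods', 'electronics')]
```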
Soft search is embedding-based, and its overall structure can be seen on the left side of the model diagram above. This part is itself a sub-model: its inputs are the candidate item and the long behavior sequence, and its training target is CTR estimation, so the embeddings of the candidate and of the long-sequence items are learned jointly. With these embeddings we can compute the inner-product similarity between the candidate advertisement embedding and each historical behavior embedding, and use an approximate nearest-neighbor retrieval method (ALSH in this paper) to fetch the top-K related behaviors.
In this sub-model, the input to the DNN is the candidate ad embedding e_a together with the behavior representation U_r, where U_r is a weighted sum of the long-sequence behavior embeddings, each weight being the inner-product relevance between e_a and the behavior embedding e_i: U_r = Σ_i (e_a · e_i) e_i.
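A minimal numpy sketch of this idea, assuming `e_a` is the candidate ad embedding and `E` the matrix of long-sequence behavior embeddings; the exact-sort top-K below stands in for the ALSH-based approximate retrieval used in the paper:

```python
import numpy as np

def soft_search(e_a, E, k):
    """Embedding-based GSU sketch (not the paper's implementation).

    e_a: (d,)   candidate ad embedding.
    E:   (T, d) embeddings of the long behavior sequence.
    Returns the indices of the top-k behaviors and the aggregated
    representation U_r used by the auxiliary CTR sub-model.
    """
    scores = E @ e_a                         # inner-product relevance of each behavior
    u_r = (scores[:, None] * E).sum(axis=0)  # U_r = sum_i (e_a . e_i) * e_i
    top_k = np.argsort(-scores)[:k]          # exact top-k; ALSH is used online instead
    return top_k, u_r

# Usage with random embeddings:
rng = np.random.default_rng(0)
e_a, E = rng.normal(size=8), rng.normal(size=(1000, 8))
idx, u_r = soft_search(e_a, E, k=50)
```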
Note that once the user's behavior grows past a certain length, it is infeasible to feed the entire sequence into this sub-model directly. In that case we can randomly sample a subset of the long behavior sequence, as long as the sampled behaviors follow the same distribution as the original sequence, as sketched below.
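For example, a uniform random subsample keeps the empirical distribution of the original sequence (an illustrative sketch with an assumed list-of-behaviors input):

```python
import random

def subsample_behaviors(behaviors, max_len, seed=None):
    # Uniform sampling keeps the subsample distributed like the full sequence;
    # indices are sorted afterwards to preserve the original time order.
    if len(behaviors) <= max_len:
        return behaviors
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(behaviors)), max_len))
    return [behaviors[i] for i in idx]
```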
The disadvantage of soft search is its relatively high computational cost; it is not as convenient as rule-based hard search. Its advantage is that the effect should be better. However, the paper also mentions that the difference in effect between the two methods is not particularly large, so in the end, as a trade-off between performance and effect, the simpler hard search is adopted.
Overall, this part (the ESU) uses the K items extracted by the GSU to obtain a vector representing the user's long-term interest, and feeds it into a DNN together with the other features for the overall CTR prediction task.
In the paper, the K items returned by the GSU are modeled with (multi-head) attention against the candidate ad. Each behavior is first represented as a concatenation of two parts: the first is the original item embedding, and the second is a time embedding encoding how long ago the behavior occurred relative to the target. Through this attention we obtain a vector h(K) summarizing the user's long-term interest, as sketched below.
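A rough numpy sketch of this step, with randomly initialised matrices standing in for learned projections and all names (`esu_long_term_interest`, `E_t`, head sizes) chosen only for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def esu_long_term_interest(e_a, E_b, E_t, heads=4, d_head=16, seed=0):
    """Illustrative ESU sketch, not the paper's exact implementation.

    e_a: (d_a,)   candidate ad embedding, used as the attention query.
    E_b: (K, d_b) embeddings of the K behaviors retrieved by the GSU.
    E_t: (K, d_t) time-interval embeddings for those behaviors.
    Returns h(K), the user's long-term interest vector.
    """
    rng = np.random.default_rng(seed)
    Z = np.concatenate([E_b, E_t], axis=1)             # concat(item emb, time emb)

    head_outputs = []
    for _ in range(heads):
        W_q = rng.normal(size=(e_a.shape[0], d_head))  # stand-ins for learned weights
        W_k = rng.normal(size=(Z.shape[1], d_head))
        att = softmax((Z @ W_k) @ (e_a @ W_q))         # attention of the ad over K behaviors
        head_outputs.append(att @ Z)                   # weighted sum of behaviors
    return np.concatenate(head_outputs)                # h(K): concat of all heads

# Usage with random inputs:
rng = np.random.default_rng(1)
h_k = esu_long_term_interest(rng.normal(size=8), rng.normal(size=(50, 8)),
                             rng.normal(size=(50, 4)))
```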
The second sub-model here also estimates CTR; its input features are the ones drawn on the model diagram, and the structure follows DIN, which earlier posts have already introduced, so I won't go into detail here.
The final loss is:
Loss = α · Loss_GSU + β · Loss_ESU
where α and β are hyperparameters that control the loss weights. In the experiments, if the GSU uses the soft-search model, both α and β are set to 1; with hard search, the GSU is nonparametric and α is set to 0.
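In code form this weighting is simply the following (illustrative only, with hypothetical names):

```python
def sim_total_loss(loss_gsu, loss_esu, use_soft_search):
    # Soft search: the auxiliary (GSU) and main (ESU) losses count equally.
    # Hard search: the GSU is nonparametric, so its auxiliary loss is dropped.
    alpha, beta = (1.0, 1.0) if use_soft_search else (0.0, 1.0)
    return alpha * loss_gsu + beta * loss_esu
```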
An advertising system has strict latency requirements for online computation, since it must guarantee the most basic user experience. As user behavior sequences grow further, computing over the long sequence directly with traditional methods is too costly in both time and memory, so the online system needs to be upgraded. The paper mentions that, in the choice between hard search and soft search, a large number of offline experiments led to adopting the convenient, fast, and effective hard search, whose information loss is acceptable.
The overall system architecture of online deployment is as follows:
To give users a low-latency experience with SIM, Alibaba built a dedicated online serving architecture for it:
As shown, for each user's behavior sequence the paper adopts a two-level index: key-key-value. The first key is the user_id, the second key is the category id, and the value is the list of behaviors belonging to that category in the user's sequence. With this index tree, the items of a given category can be found quickly.
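A toy version of this key-key-value index, assuming a simple in-memory dict in place of the distributed storage actually used online:

```python
from collections import defaultdict

class UserBehaviorTree:
    """Two-level index: user_id -> category_id -> behaviors of that category."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(list))

    def add(self, user_id, category_id, behavior):
        self._index[user_id][category_id].append(behavior)

    def hard_search(self, user_id, category_id):
        # Online hard search becomes a direct two-level lookup instead of
        # scanning the user's whole lifelong behavior sequence.
        return self._index[user_id][category_id]

# Usage:
tree = UserBehaviorTree()
tree.add(42, "electronics", "iphone_case")
tree.add(42, "beauty", "lipstick_A")
print(tree.hard_search(42, "electronics"))  # -> ['iphone_case']
```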
Experimental results of the online A/B test:
User historical behavior is becoming more and more important for the whole CTR/CVR estimation task. If time and storage were not a concern, feeding the entire sequence into the model as the source of long-term interest would locate the user's long-term interest most accurately. For performance reasons, however, a dedicated method is needed to filter this long sequence: since the K selected items are all similar to the candidate, the truncation brings little loss to the CTR estimate. Of the two filtering methods, online deployment favors the fastest one, hard search, to pick the top K; its effect is close to embedding-based filtering but it is much cheaper, which is why it is adopted.
There should be more work on long-sequence recommendation in the future. Retrieving only part of the long sequence inevitably introduces some interest bias, so how to effectively mine the commercial value behind users' richer behavioral features still needs to be considered.