Alibaba's CAN: A New Approach to Feature Interaction
This article introduces recent work from Alibaba's Alimama targeted-advertising team: the Co-Action Network (hereafter CAN). CAN proposes a new approach to feature interaction: the two features to be interacted (a user-side feature and an item-side feature) serve, respectively, as the input and the weights of a small DNN, and the DNN's output is the result of the interaction. CAN not only improves the expressive power of feature interaction but also avoids the parameter and computation blow-up of the traditional Cartesian-product cross.

Paper address: /p/287898562

In CTR prediction, interaction between features has long been a hot topic in industry. Because a DNN learns from its input features only implicitly, it is hard for a DNN alone to learn effective feature interactions from a huge, sparse feature set, so many works rely on hand-crafted feature crosses in feature engineering. FM, FNN, PNN, DCN and DeepFM all address this point from different angles. Interested readers can refer to the author's earlier article: From FM to DeepFM: On Model Evolution in Recommender Systems.

Feature engineering plays a very important role in building recommender-system models. Among the massive input features, the interaction between user behaviors and candidate items is what models user interest most accurately. The paper names this kind of interactive feature co-action, as shown in Figure 1: A and B are inputs to the model, and the target can be the CTR label to be estimated. In principle a DNN can learn the relationship between A and B on its own, but if we explicitly cross A and B at the input, the learning difficulty drops greatly.

The most basic way to interact two features is the Cartesian product; in TensorFlow this is the crossed column [1]. For features A and B, the Cartesian product combines them into a new feature (A, B); changing the value of either A or B yields a brand-new feature, so every combination of A and B can be represented. When training samples are plentiful and performance is not a concern, the Cartesian product is the strongest feature-interaction method. But it has two drawbacks: the embedding table grows multiplicatively with the two vocabularies, and a combination that never appears in training receives no learning signal, so the cross generalizes poorly.
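To make the Cartesian product concrete, here is a minimal sketch in plain Python, similar in spirit to TensorFlow's crossed column: each (A, B) pair is mapped to its own feature id, so every combination gets an independent embedding row. The vocabulary sizes are illustrative, not from the paper.

```python
def cross_feature(a_id: int, b_id: int, num_b: int) -> int:
    """Map the pair (a_id, b_id) to a unique combined feature id."""
    return a_id * num_b + b_id

NUM_A, NUM_B = 1000, 500          # unique ids of features A and B (assumed)
# The crossed vocabulary has NUM_A * NUM_B entries -- this multiplicative
# growth is exactly the first drawback discussed above.
crossed_vocab_size = NUM_A * NUM_B

print(cross_feature(3, 7, NUM_B))   # -> 1507
print(crossed_vocab_size)           # -> 500000
```

Because id 1507 is unrelated to every other id, nothing learned for (3, 7) transfers to (3, 8): that is the generalization drawback.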

As the name implies, the purpose of CAN is to model the co-action between different features; it can be understood as a new form of feature interaction. In short, the paper implements a pluggable CAN unit that takes the two features to be interacted as, respectively, the input and the weights of a small network, with the network's output serving as the interaction result.

Figure 2 shows the basic structure of CAN. The input features are divided into the user behavior sequence, the target item, user-profile features (age, etc.) and other features. The user sequence, target item and other features pass through the embedding layer and then feed a DIEN-style network. In the CAN branch, the embeddings of the user sequence and the target item serve as the input and the weight parameters of the co-action unit; its outputs are sum-pooled and concatenated with the DIEN output before the final DNN. The following focuses on the key component of the whole structure: the co-action unit.

The overall logic of CAN is simple. Denote the fully connected network inside the co-action unit as $\text{MLP}_{can}$: the candidate item's embedding supplies the weights and biases of this network, and the user behavior features are its input. Writing the item embedding as $\mathbf{P}_{item} \in \mathbb{R}^{D'}$ and a user feature embedding as $\mathbf{P}_{user} \in \mathbb{R}^{D}$, $\mathbf{P}_{item}$ is sliced and reshaped into the layer weights $W_i$ and biases $b_i$, and the unit computes $\mathbf{h}_i = \sigma(W_i\mathbf{h}_{i-1} + b_i)$ with $\mathbf{h}_0 = \mathbf{P}_{user}$. Here $N$ is the number of unique item IDs, i.e. the parameter space of the item ID, while $D$ and $D'$ are the embedding sizes, with $D' > D$ since the item embedding must hold all of the unit's MLP parameters.

Compared with other feature interaction methods, CAN has the following advantages:

The CAN structure described so far only models first-order feature co-action explicitly. Higher-order co-action can be realized with higher-order inputs, i.e.

$\mathbf{H}_{high} = \sum_{c=1}^{C} \text{MLP}_{can}(\mathbf{P}_{user}^{\,c})$

where $c$ is the order of the feature interaction and the power is taken element-wise.
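The high-order extension amounts to one extra loop over element-wise powers of the input. A minimal sketch, with a fixed linear map standing in for the item-parameterized micro-MLP (all names and sizes here are illustrative):

```python
import numpy as np

def high_order_co_action(p_user, mlp, order):
    """Sum the unit's output over element-wise powers of the input:
    H = sum_{c=1..order} mlp(p_user ** c)."""
    return sum(mlp(p_user ** c) for c in range(1, order + 1))

# A fixed linear map + tanh stands in for the item-parameterized MLP.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 2))
mlp = lambda x: np.tanh(x @ W)

p_user = rng.normal(size=4)
print(high_order_co_action(p_user, mlp, order=3).shape)   # -> (2,)
```

With `order=1` this reduces exactly to the first-order unit, so the extension is strictly additive.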

The paper further introduces three measures to ensure the independence of what CAN learns.

As can be seen from Table 2, CAN's AUC beats PNN, NCF [2] and DeepFM on both experimental datasets. Moreover, the Cartesian product, despite being the most basic means of feature interaction, already outperforms PNN, NCF and DeepFM, yet CAN does better still than the Cartesian product. I see two reasons for this:

To verify CAN's generalization ability, the author removed from the test set every feature combination that appears in the training set, constructing a cold-start test set: every feature combination in it is one the model has never seen. The results are shown in Table 5:

It can be seen that NCF and DeepFM now outperform the Cartesian product; set against the conclusions of Table 2, this confirms that the Cartesian product really does have a generalization problem. At the same time CAN attains the highest AUC, showing that the generalization ability of the CAN structure is clearly stronger than that of the Cartesian product and the other feature-interaction methods.

I think this is one of the most important parts of the paper. In this section the author discusses in detail the difficulties, and their solutions, encountered when deploying CAN to Alibaba's display-advertising system; it offers important guidance for putting complex CTR models into production.

Feature interaction adds feature-pair combinations on top of the original features, which inevitably introduces extra storage and computation. Most of a CTR model's parameters are concentrated in the embedding part, and the Cartesian product inflates the embedding table multiplicatively: for two features of M and N dimensions (dimension here meaning the number of unique IDs), the Cartesian product needs an embedding matrix of size (M×N, embedding_size). The new embeddings also bring many more lookup operations, seriously hurting the model's response time. The author notes that even ID frequency filtering (my understanding: dropping low-frequency IDs, based on how often each ID appears, to cut the parameter count) cannot relieve this.
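A back-of-the-envelope comparison makes the gap vivid. The vocabulary sizes and embedding widths below are assumed for illustration, not taken from the paper:

```python
# Illustrative parameter-count comparison: Cartesian product stores one
# embedding row per (item, user) combination, while CAN keeps the two
# vocabularies separate and only enlarges the item-side embedding.

M, N = 1_000_000, 100_000   # unique ids on each side (assumed)
D = 16                      # plain embedding size (assumed)
D_PRIME = 25                # enlarged CAN embedding holding the MLP params

cartesian_params = M * N * D            # one row per combination
can_params = M * D_PRIME + N * D        # item side enlarged, user side plain

print(f"cartesian: {cartesian_params:,}")   # -> 1,600,000,000,000
print(f"can:       {can_params:,}")         # -> 26,600,000
```

Under these assumptions the Cartesian cross needs roughly 60,000× more embedding parameters than CAN, which is why frequency filtering alone cannot save it.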

For the CAN model, although the parameter count is greatly reduced, the problems above still affect deployment. The paper uses 6 ad-side features and 15 user-side features for interaction, which in theory introduces 6×15 = 90 feature combinations. Moreover, the user-side features are mostly behavior sequences, generally longer than 100, which adds a heavy computational burden of its own.
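The sequence-length cost comes from applying the co-action unit once per behavior before sum-pooling, so compute grows linearly with sequence length and with the number of feature combinations. A toy sketch with a linear stand-in for the unit (all sizes assumed):

```python
import numpy as np

# The unit runs once per behavior in the sequence; outputs are sum-pooled.
rng = np.random.default_rng(2)
SEQ_LEN, D = 100, 4                  # sequence length and embedding size (toy)
W = rng.normal(size=(D, 2))          # stand-in for the item-derived weights

sequence = rng.normal(size=(SEQ_LEN, D))   # embedded user behavior sequence
outputs = np.tanh(sequence @ W)            # unit applied to each behavior
pooled = outputs.sum(axis=0)               # sum pooling over the sequence
print(pooled.shape)   # -> (2,)
```

With 90 feature combinations, each serving request performs on the order of 90 × SEQ_LEN such unit evaluations, which is the burden the deployment section sets out to tame.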

To solve these problems, the paper adopts the following measures:

Feature interaction is of great significance to CTR prediction models. This paper analyzes the shortcomings of the Cartesian product and of several common model structures (FM, PNN, DeepFM, etc.), and proposes a new network, CAN, to model feature co-action. CAN uses the input and the weights of a DNN to model feature interaction, which both resolves the space-complexity and generalization problems of the Cartesian product and achieves a better interaction effect (reflected in the model's AUC). Multi-order enhancement and independence measures between modules are introduced to round out CAN's capability. Finally, the difficulties met in serving the model online, and their solutions, are described, an important reference for deploying large-scale CTR prediction models.
