07. Recommendation System Algorithms in Detail

This section covers demographic-based recommendation and user profiling, content-based recommendation, and collaborative-filtering-based recommendation.

1. The demographic-based recommendation mechanism is one of the easiest recommendation methods to implement. It measures the relevance between users using only their basic profile information, and then recommends to the current user other items that similar users like.

2. For user information without explicit semantics (such as login time, region, and other contextual information), users can be labeled by clustering.

3. For users carrying specific tags, corresponding items can be recommended according to preset rules (knowledge) or models.

4. The process of tagging user information is usually called building a user profile.

(1) A user profile is a basic way for enterprises to apply big-data technology: by collecting and analyzing data on consumers' social attributes, living habits, consumption behavior, and other key information, it abstracts a complete picture of a user's business characteristics.

(2) User profiles provide enterprises with a sufficient information base, helping them quickly identify precise user groups, user needs, and other broader feedback.

(3) As a foundation of big-data applications, the user profile abstracts the full picture of a user's information, providing a sufficient data basis for further accurate and rapid analysis of important information such as the user's behavioral and consumption habits.
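As a rough illustration of this mechanism, here is a minimal sketch (all names, items, and profile features are hypothetical): users are encoded as one-hot demographic vectors, the most similar user is found by cosine similarity, and that neighbor's liked items are recommended.

```python
# Demographic-based recommendation sketch (hypothetical data).
import math

def cosine(a, b):
    # cosine similarity between two feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# feature order: [male, female, under_30, over_30]
profiles = {
    "alice": [0, 1, 1, 0],
    "bob":   [1, 0, 1, 0],
    "carol": [1, 0, 0, 1],
}
liked = {
    "alice": {"item1", "item2"},
    "bob":   {"item2", "item3"},
    "carol": {"item4"},
}

def recommend(user, k=1):
    # rank the other users by demographic similarity, then collect their
    # liked items that the target user has not interacted with yet
    neighbors = sorted(
        (u for u in profiles if u != user),
        key=lambda u: cosine(profiles[user], profiles[u]),
        reverse=True,
    )[:k]
    recs = set()
    for n in neighbors:
        recs |= liked[n] - liked[user]
    return recs
```

Real systems would use far richer profiles and precomputed neighbor lists, but the match-then-recommend structure is the same.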

1. Content-based recommendation (CB) determines the relevance of items from the metadata of the recommended items or content, and then recommends similar items to a user based on the user's past preference records.

2. Similarity is calculated from intrinsic or extrinsic feature values extracted from the items. For example, a movie's director, actors, user-generated tags (UGC), user comments, duration, genre, and so on can all be treated as features.

3. By matching the features in a user's personal profile (based on preference records or preset interest tags) with item features, the user's interest in each item can be estimated. CB has been applied successfully on some social sites for movies, music, and books; some sites also invite professionals to "genetically" encode/tag the items (PGC).

4. Similarity calculation:

5. Feature extraction from item tagging:

- Expert tags (PGC)

- User-generated tags (UGC)

- Dimensionality-reduction analysis of the data to extract latent semantic tags (LFM)

Feature extraction from text: keywords

- Word segmentation, semantic processing, and sentiment analysis (NLP)

- Latent Semantic Analysis (LSA)
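To make item-to-item similarity concrete, here is a minimal content-based sketch (hypothetical movie data), using the Jaccard index over tag sets as the similarity measure:

```python
# Content-based similarity sketch: each item is a set of metadata tags,
# and similarity between items is the Jaccard index of their tag sets.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

items = {
    "movie_a": {"action", "nolan", "thriller"},
    "movie_b": {"action", "thriller"},
    "movie_c": {"romance", "drama"},
}

def similar_items(item, k=1):
    # rank the other items by tag overlap with the given item
    others = [i for i in items if i != item]
    others.sort(key=lambda i: jaccard(items[item], items[i]), reverse=True)
    return others[:k]
```

A user who liked `movie_a` would then be shown the items most similar to it by content.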

6. High-level structure of a content-based recommendation system

7. Feature engineering

(1) Feature: information extracted from the data that is useful for predicting the result.

The number of features is the dimensionality at which the data is observed.

Feature engineering is the process of using domain knowledge and skills to process the data so that the features work better in machine-learning algorithms.

Feature engineering generally includes feature cleaning (sampling; removing abnormal samples), feature processing, and feature selection.

Features are classified by data type, and each type has its own processing methods: numerical, categorical, temporal, and statistical.

(2) Numerical feature processing

When a feature dimension is represented by continuous numerical values, the values are usually processed mathematically; the main methods are normalization and discretization.

* Amplitude adjustment (normalization):

Features should be on an equal footing, with differences expressed within each feature.

For example, house price and floor area have different ranges: the price might fall between 300 and 1,500 (ten thousand yuan), while the area falls between 40 and 300 (square meters). These two features should carry equal weight, so feeding the raw values into the same model is unreasonable: the differing magnitudes produce differing effects.
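A minimal min-max normalization sketch for the example above (illustrative numbers), rescaling each feature to [0, 1] so that the two magnitudes become comparable:

```python
# Min-max normalization: rescale a list of values to the range [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

prices = [300, 800, 1500]  # house prices, in units of ten thousand yuan
areas = [40, 170, 300]     # floor areas, in square meters
```

After scaling, both features vary in [0, 1] and can be fed into the same model without one magnitude drowning out the other.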


* Numerical feature processing: discretization

Two discretization methods: equal step size (equal width), which is simple but not necessarily effective; and equal frequency, with cut points at minimum -> 25% -> 75% -> maximum.

Comparison of the two methods:

Equal-frequency discretization is accurate, but the data distribution must be recomputed every time: the price distribution of goods users bought on Taobao yesterday is not necessarily the same as today's, so yesterday's equal-frequency cut points may no longer apply. What online systems most need to avoid is exactly this: the cut points are not fixed and require on-the-spot computation, so a model trained yesterday may be unusable today.

Equal frequency is not fixed but very accurate; equal step size is fixed and very simple. Both are used in industry.
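The two schemes can be sketched as follows (toy data; the equal-frequency version uses the min -> 25% -> 75% -> max cut points mentioned above):

```python
# Two discretization schemes on the same data.

def equal_width_bin(value, lo, hi, n_bins):
    # Equal step size: bin edges are fixed in advance, simple but
    # blind to the data distribution.
    step = (hi - lo) / n_bins
    return min(int((value - lo) / step), n_bins - 1)

def equal_freq_edges(values, quantiles=(0, 0.25, 0.75, 1)):
    # Equal frequency: edges follow the data distribution and must be
    # recomputed whenever the distribution shifts.
    s = sorted(values)
    return [s[int(q * (len(s) - 1))] for q in quantiles]
```

The equal-width bins never change between days, which is exactly why they are easier to serve online.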

(3) Categorical feature processing

Categorical data has no inherent ordering, yet it must be encoded as numbers, and the encoding must not impose an artificial ordering. To treat categories fairly while still distinguishing them, separate dimensions are opened up directly.

One-hot encoding / dummy variables: one-hot encoding expands categorical data "in parallel", i.e., the feature's space is widened after one-hot/dummy encoding.

(4) Time feature processing

Time-based features can be treated either as continuous values or as discrete values.

Continuous values: duration (e.g., web-page browsing time); time interval (e.g., time since the last purchase/click).

Discrete values: hour of the day; day of the week; week/month of the year; weekday vs. weekend.
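The discrete time features above can be derived with the standard library, for example:

```python
# Deriving discrete time features from a timestamp.
from datetime import datetime

def time_features(ts: datetime):
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),   # 0 = Monday ... 6 = Sunday
        "month": ts.month,
        "is_weekend": ts.weekday() >= 5,
    }
```

These categorical outputs would then typically be one-hot encoded like any other categorical feature.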

(5) Statistical feature processing

Deviation from the mean: how far a product's price is above the average price; how much a user spends within a category relative to the average.

Quantile: which quantile of the prices of goods sold a product's price falls into.

Rank: where a product sits in the ranking of popular items.

Proportion: the proportion of positive/neutral/negative reviews a product has on an e-commerce site.

8. Feedback data commonly used by recommendation systems:

9. Recommendation based on UGC

Users describe their views on items with tags, so user-generated tags (UGC) are the link between users and items and an important data source reflecting users' interests.

The dataset of user tagging behavior is usually represented as a set of triples (user, item, tag), where a record (u, i, b) indicates that user u tagged item i with tag b.

One of the simplest algorithms:

- Count each user's most commonly used tags.

- For each tag, count the items that have been tagged with it the most times.

- For a user, first find their commonly used tags, then find the hottest items carrying those tags and recommend them to the user.

Thus the formula for user u's interest in item i is p(u, i) = sum over tags b of n(u, b) * n(b, i), where n(u, b) is the number of times user u used tag b, and n(b, i) is the number of times item i was tagged with b.

In this simple algorithm, the user's interest in a given feature of an item is expressed as the direct product of the number of times the user applied the tag and the number of times the item received the tag.
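The simple algorithm can be sketched directly (toy tag counts):

```python
# Simple tag-based interest score: p(u, i) = sum_b n(u, b) * n(b, i),
# where n(u, b) = times user u used tag b,
# and   n(b, i) = times item i was tagged with b.
from collections import Counter

def score(user_tags, item_tags):
    return sum(n_ub * item_tags[b] for b, n_ub in user_tags.items())

u_tags = Counter({"funny": 3, "classic": 1})  # user's tagging history
i_tags = Counter({"funny": 5, "drama": 2})    # tags the item has received
```

Items are then ranked by this score for each user.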

This method tends to give more weight to popular tags (tags that anyone might apply, such as "blockbuster" and "funny") and popular items (those with the most tags). If a popular item also carries a popular tag, it will dominate the list, and the personalization and novelty of the recommendations decrease.

A similar problem appears in keyword extraction from news content. For a given news article, which keywords should receive higher weight?

10. TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique in information retrieval and text mining.

TF-IDF is a statistical method for evaluating how important a word is to a document within a document set or corpus. A word's importance is proportional to the number of times it appears in the document, but inversely proportional to its frequency across the corpus.

TF-IDF = TF × IDF

The main idea of TF-IDF: if a word or phrase appears frequently in one article but rarely in other articles, it has good discriminating power and is suitable for classification.

Search engines often use variants of TF-IDF weighting to measure or rank the relevance between documents and user queries.

Term Frequency (TF): the frequency with which a given word appears in a document. The raw count is normalized to prevent bias toward longer documents (the same word may occur more often in a long document than a short one, regardless of its importance): TF(i, j) = n(i, j) / sum over k of n(k, j), where n(i, j) is the number of times word i appears in document j and the denominator is the total number of words in document j.

Inverse Document Frequency (IDF): a measure of a word's general importance. The IDF of a word is obtained by dividing the total number of documents by the number of documents containing that word, then taking the logarithm of the quotient: IDF(i) = log(N / N(i)), where N is the total number of documents in the corpus and N(i) is the number of documents containing word i (in practice N(i) + 1 is often used in the denominator to avoid division by zero).
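A minimal TF-IDF sketch matching these definitions (the +1 smoothing in the IDF denominator is a common implementation choice, as noted above):

```python
# TF-IDF: TF(i, j) = count of term i in doc j / total terms in doc j,
#         IDF(i)   = log(N / (1 + number of docs containing term i)).
import math
from collections import Counter

def tf_idf(term, doc, docs):
    counts = Counter(doc)
    tf = counts[term] / len(doc)
    n_containing = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / (1 + n_containing))
    return tf * idf

docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "durian"],
]
```

Note how "banana", which appears in two of the three documents, scores lower than the more distinctive "apple".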

11. TF-IDF's improvement to UGC-based recommendation: to prevent popular tags and popular items from gaining too much weight, "popularity" must be penalized.

Borrowing the idea of TF-IDF, treat all the tags on an item as a "document" and each tag as a "word", and compute the "term frequency" (a tag's frequency among all the tags on the item) and the "inverse document frequency" (how commonly the tag appears on other items).

Since "the total number of tags on item i" has no effect on the relative tag weights, and the "total number of all tags" N is the same for every tag, these two terms can be omitted. Adding penalty terms for popular tags and popular items directly to the simple algorithm gives p(u, i) = sum over tags b of [n(u, b) / log(1 + B(b))] * [n(b, i) / log(1 + U(i))], where B(b) records how many distinct users have used tag b, and U(i) how many distinct users have tagged item i.
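A sketch of the penalized score (the variable names are illustrative):

```python
# TF-IDF-style penalized tag score: popular tags and popular items are
# damped by logarithmic factors.
import math

def penalized_score(user_tags, item_tags, tag_user_counts, item_user_count):
    # tag_user_counts[b]: distinct users who have ever used tag b
    # item_user_count:    distinct users who have tagged this item
    total = 0.0
    for b, n_ub in user_tags.items():
        n_bi = item_tags.get(b, 0)
        if n_bi == 0:
            continue
        total += (n_ub / math.log(1 + tag_user_counts[b])) * \
                 (n_bi / math.log(1 + item_user_count))
    return total
```

Compared with the simple product, a tag used by thousands of users now contributes far less to the final score.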

(a) Collaborative filtering (CF)

1. Recommendation based on collaborative filtering (CF): content-based methods (CB) mainly use content features of the items a user has rated, while CF can additionally use the ratings other users have given those items.

CF can overcome some limitations of CB:

- When an item's content is incomplete or hard to obtain, recommendations can still be made from other users' feedback.

- CF is based on users' evaluations of item quality, avoiding the interference CB may suffer from judging item quality by content alone.

- CF is not limited by content: as long as like-minded users show interest in different items, CF can recommend items whose content is very different (but internally related in some way).

CF divides into two categories: neighborhood-based and model-based.

2. Neighborhood-based recommendation systems follow the "word of mouth" principle. Should Titanic be recommended to Cary?

(b) Collaborative filtering based on nearest neighbors

1. User-CF: the basic principle of user-based collaborative filtering is to find, from all users' preferences for items, a "neighbor" group of users with tastes similar to the current user's, and to recommend the items those neighbors prefer.

In typical applications, a "K-nearest-neighbors" algorithm is used: recommendations for the current user are computed from the historical preference information of these K neighbors.

User-CF versus the demographic-based recommendation mechanism:

- Both compute user similarity and recommend based on a group of similar "neighbor" users.

The difference lies in how user similarity is calculated: the demographic-based mechanism considers only the users' own profile features, while user-based collaborative filtering computes similarity from users' historical preference data. Its basic assumption is that users who have liked similar items probably share the same or similar tastes.
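A minimal User-CF sketch (hypothetical ratings): cosine similarity between users' rating vectors, then a similarity-weighted average of neighbors' ratings as the prediction.

```python
# User-based collaborative filtering sketch.
import math

ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 3, "d": 5},
    "u3": {"b": 1, "c": 2, "d": 4},
}

def user_sim(x, y):
    # cosine similarity over the items both users rated
    common = set(ratings[x]) & set(ratings[y])
    if not common:
        return 0.0
    dot = sum(ratings[x][i] * ratings[y][i] for i in common)
    nx = math.sqrt(sum(v * v for v in ratings[x].values()))
    ny = math.sqrt(sum(v * v for v in ratings[y].values()))
    return dot / (nx * ny)

def predict(user, item, k=2):
    # weighted average of the top-k neighbors' ratings for the item
    neighbors = [(user_sim(user, v), v) for v in ratings
                 if v != user and item in ratings[v]]
    neighbors.sort(reverse=True)
    top = neighbors[:k]
    num = sum(s * ratings[v][item] for s, v in top)
    den = sum(s for s, _ in top)
    return num / den if den else 0.0
```

Here u1 has not rated item "d"; its predicted score is pulled toward u2's rating of 5, since u2 is the more similar neighbor.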

2. Item-CF: the basic principle of item-based collaborative filtering resembles the user-based version, except that it uses all users' preferences to find similarities between items, and then recommends similar items to a user based on that user's own historical preferences.

Item-CF versus content-based (CB) recommendation:

- Both predict and recommend from item similarity; they differ in how similarity is computed: the former infers it from users' historical preferences, the latter from the items' own attribute information.
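A minimal Item-CF sketch (hypothetical ratings), where item similarity comes from co-occurring user ratings rather than from item content:

```python
# Item-based collaborative filtering: similarity between two items is the
# cosine over the ratings of the users they have in common.
import math

ratings = {
    "u1": {"a": 5, "b": 4},
    "u2": {"a": 4, "b": 5, "c": 2},
    "u3": {"b": 4, "c": 5},
}

def item_sim(i, j):
    users_i = {u for u in ratings if i in ratings[u]}
    users_j = {u for u in ratings if j in ratings[u]}
    common = users_i & users_j
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    ni = math.sqrt(sum(ratings[u][i] ** 2 for u in users_i))
    nj = math.sqrt(sum(ratings[u][j] ** 2 for u in users_j))
    return dot / (ni * nj)
```

Items "a" and "b" come out more similar than "a" and "c" because the same users rated both highly, with no reference to what the items actually are.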

Both strategies are collaborative filtering, so how to choose between the user-based and the item-based one?

- On e-commerce, movie, and music sites, the number of users is larger than the number of items.

- On news sites, the number of items (news articles) may be larger than the number of users.

3. Comparing User-CF and Item-CF

Both are collaborative filtering: how should one choose between User-CF and Item-CF?

Item-CF application scenarios:

- The Item-CF mechanism is a strategy Amazon developed as an improvement on the user-based mechanism. On most websites the number of items is far smaller than the number of users, the items and their similarities are relatively stable, and Item-CF offers better real-time performance than the user-based approach, so Item-CF has become the mainstream recommendation strategy.

User-CF application scenarios:

- In some news recommendation systems, the number of items (news articles) may exceed the number of users, news updates quickly, and item similarities are unstable; User-CF may work better there.

So the choice of recommendation strategy depends heavily on the specific application scenario.

4. Advantages and disadvantages of recommendation based on collaborative filtering.

(1) Advantages of the collaborative-filtering mechanism:

It requires no rigorous modeling of items or users and no machine-understandable description of item features, so the method is domain-independent.

The recommendations it computes are open: they draw on other people's experience, helping users discover latent interests.

(2) Existing problems

The core of the method is historical data, so there is a "cold start" problem for both new items and new users.

The quality of recommendations depends on the quantity and accuracy of users' historical data.

In most implementations, users' historical preferences are stored in a sparse matrix, and computation on a sparse matrix has some obvious problems, including that the erroneous preferences of a few users can significantly affect recommendation accuracy.

Good recommendations cannot be made for users with unusual tastes.

(c) Model-based collaborative filtering

1. Basic idea

(1) Users have certain characteristics that determine their preferences.

(2) Items have certain characteristics that affect whether users choose them.

(3) A user chooses an item because the user's and the item's characteristics match.

Based on this idea, building the model amounts to extracting features from behavior data, labeling users and items at the same time; this is essentially the same as demographic-based user tags and content-based item tags: both are feature extraction and matching.

When explicit features are available (such as user tags and item category labels), we can match and recommend directly; when they are not, latent features can be derived from the existing preference data, which requires a latent factor model (LFM).

2. Model-based collaborative filtering trains a recommendation model from sample user preference data, and then predicts scores for new items and computes recommendations from real-time user preferences.

Neighborhood-based versus model-based recommendation:

- Neighborhood-based methods use the existing user preference data directly at prediction time, predicting preferences for new items from neighbor data (akin to classification).

- Model-based methods use the preference data to train a model that captures the underlying regularities, then use the model to predict (akin to regression).

When training the model, item features can be extracted from tag content, or the model can be left to discover the items' latent features; such a model is called a latent factor model (LFM).

(1) Latent factor model (LFM): the goals of LFM-based collaborative filtering:

- Reveal latent features that can explain why the corresponding predicted scores are given.

- These features may not be directly explainable or describable in language; in fact we do not need to know what they mean, somewhat like "metaphysics".

Dimensionality reduction via matrix factorization:

- Collaborative filtering relies heavily on historical data, and in typical recommendation systems preference data is sparse, so the original data needs dimensionality reduction.

- The factor matrices obtained from the decomposition represent the latent features of users and items.

Examples of latent factor models: probabilistic latent semantic analysis (pLSA), latent Dirichlet allocation (LDA), and the matrix factorization model (SVD).

(2) The LFM dimensionality-reduction method: matrix factorization

(3) Further understanding of LFM

We can assume there are intrinsic reasons why users rate movies as they do: we can mine the latent factors that influence users' ratings, then predict ratings for unrated movies from the correlation between those movies and the latent factors.

There should be some latent factors influencing users' ratings, e.g., for movies: actors, subject, era... and even latent factors that people cannot directly articulate.

Once the latent factors are found, users can be linked to items (finding out what makes a user like/dislike an item, and what determines whether a user likes it), and we can then guess whether a user will like a movie they have never seen.

(4) Matrix decomposition

(5) Solving the model: the loss function
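The loss function in question is, in the standard matrix-factorization formulation (reconstructed here from the surrounding ALS description, with R ≈ P Qᵀ):

```latex
\min_{P,Q} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2
         + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
```

where K is the set of observed (user, item) rating pairs, p_u and q_i are the factor vectors (rows of P and Q) for user u and item i, and lambda is the regularization coefficient.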

(6) ALS, an algorithm for solving the model.

The matrix-factorization problem has now been turned into a standard optimization problem: solve for the P and Q that minimize the objective loss function.

The minimization is usually carried out with stochastic gradient descent or alternating least squares (ALS).

The idea of ALS: the two matrices P and Q are both unknown and are coupled through matrix multiplication. To decouple them, first fix Q and treat P as the variable, obtaining P by minimizing the loss function (a classic least-squares problem); then, in turn, fix the P just obtained and treat Q as the variable to solve for Q. Alternate in this way until the error satisfies the convergence threshold or the iteration limit is reached.
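A minimal ALS sketch (toy, fully observed matrix for simplicity; real implementations sum only over the observed entries):

```python
# Alternating least squares for R ~= P @ Q.T: fix one factor matrix and
# solve a regularized least-squares problem for the other, then swap.
import numpy as np

def als(R, k=2, n_iters=20, reg=0.1):
    m, n = R.shape
    rng = np.random.default_rng(0)
    P = rng.standard_normal((m, k))
    Q = rng.standard_normal((n, k))
    I = np.eye(k)
    for _ in range(n_iters):
        # fix Q, solve the regularized least-squares problem for P
        P = R @ Q @ np.linalg.inv(Q.T @ Q + reg * I)
        # fix P, solve for Q symmetrically
        Q = R.T @ P @ np.linalg.inv(P.T @ P + reg * I)
    return P, Q

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
P, Q = als(R)
```

Each half-step minimizes the regularized loss exactly for one factor, which is why the alternation converges.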

(7) Gradient descent algorithm
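As a counterpart to ALS, a stochastic-gradient-descent sketch for the same factorization (toy ratings given as (user, item, rating) triples; all data illustrative): for each observed rating r_ui, nudge p_u and q_i along the negative gradient of the regularized squared error.

```python
# SGD for matrix factorization: per-rating updates of the user and item
# factor vectors.
import random

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, n_epochs=1000):
    rng = random.Random(0)
    # positive init keeps this toy example away from sign-flip saddle points
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            e = r - pred  # prediction error for this rating
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)
                Q[i][f] += lr * (e * pu - reg * qi)
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
P, Q = sgd_mf(ratings, n_users=2, n_items=3)
```

Unlike ALS, each update touches only the vectors of one (user, item) pair, which scales well to sparse rating data.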