Data mining: are online shopping reviews real or fake?
Source | 36 Big Data


During the recent Double 11 and Double 12 online shopping festivals, countless netizens went into "buy, buy, buy" mode under the banners of the various e-commerce websites. But when you shop online, there are thousands of similar products to choose from. What makes you pick one of them? Product reviews are certainly an important reference: most of us look at the historical sales and the user reviews before placing an order.

However, as the saying goes, the buyer is never as shrewd as the seller, and you have surely heard that paid posters (the "water army") are rampant on the Internet. The reviews you see may all have been "brushed", that is, faked by the seller himself. In fact, many savvy Taobao sellers use peak periods such as Double 11 to push a "hot item", make a quick profit and get out, which makes these periods a hotbed of fake reviews. When shopping we often come across reviews that seem exaggerated, such as these for a pair of women's shoes:

"Super beautiful shoes. Wear them casually and you feel like a goddess, and your feet are not tired after standing all day. I will come back next time and buy a new pair!"

"The most satisfying shoes ever. My mother said they are genuine leather, and the seller had a good attitude. Delivery was super fast, the seller is very honest, and it was a particularly satisfying purchase!"

Hundreds or thousands of such "heartfelt" raves get brushed in this way, and many customers are probably brainwashed: this product has high sales and good reviews, so buy this one! As a result, the online hot item turns out to be junk once it arrives home. We buyers are at a clear information disadvantage: we cannot tell whether the seller's description is true, and it is hard to stop people from brushing praise. So how can we identify brushed reviews? This article introduces a way to crack the problem with a text mining model.

The first problem to solve is the data source. The reviews can be downloaded from the website in batches, that is, with a crawler. There are currently two approaches. One is programming: you can write a crawler in Python, Java or another programming language. The other is to use mature crawler software with a graphical interface. I decided to use the free GooSeeker software. It is a Firefox plug-in, which sidesteps the problem that many websites' dynamically generated pages are hard to parse: because it works through the browser, any element you can see in the browser can be downloaded conveniently. The software comes with a detailed tutorial and a user community that guide users step by step through setting up the content to capture, the crawl route, continuous actions and repeated crawling of pages of the same type, so anyone can learn to use it on their own.
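For readers who prefer the programming route, the parsing half of a crawler might look roughly like the Python sketch below. It says nothing about how GooSeeker works internally. The HTML fragment and the class name `review-text` are hypothetical stand-ins, a real shop page will use different markup, and the download step is left out entirely.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; ignored so they do not upset the depth counter.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ReviewParser(HTMLParser):
    """Collect the text of every element whose class attribute is exactly
    'review-text' (a hypothetical class name used only for this sketch)."""

    def __init__(self):
        super().__init__()
        self.reviews = []   # finished review strings
        self._depth = 0     # > 0 while we are inside a review element
        self._chunks = []   # text pieces of the review being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth += 1            # nested tag inside a review
        elif ("class", "review-text") in attrs:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:            # the review element just closed
            self.reviews.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


# Made-up page fragment standing in for a downloaded review page.
SAMPLE_HTML = """
<ul>
  <li><span class="user">a***1</span><p class="review-text">Very warm, fits well.</p></li>
  <li><span class="user">b***2</span><p class="review-text">Fine <b>workmanship</b>, will buy again!</p></li>
</ul>
"""

parser = ReviewParser()
parser.feed(SAMPLE_HTML)
print(parser.reviews)
```

A real crawler would also have to capture the other fields mentioned in this article (member name, dates, model) and cope with pages that load reviews dynamically, which is exactly the difficulty a browser-based tool avoids.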

The author finally captured the review data for four shoe products of the same type, including member name, product description, purchase date, purchased model, review date, review text and so on, more than 5,000 records in total. We deliberately chose products that showed signs of brushing: many of their reviews have consecutive dates, similar member names and low-level buyer accounts. After manual labeling, brushed reviews accounted for about 30%. We intend to use these data to build a model that identifies brushed reviews, and then apply the rules obtained here to identify brushed reviews of other footwear products.

SAS Enterprise Miner 13.2 is a well-known data mining tool that can analyze large-scale data and build accurate predictive and descriptive models from the results, so we chose it; the same analysis approach can be followed in other software.

We split the 5,000-plus reviews collected earlier into two parts: 70% as the training sample and 30% as the validation sample. First, a text parsing step breaks the reviews in the training sample into words. During word segmentation you can choose to ignore parts of speech that carry little meaning, such as pronouns, interjections, prepositions and conjunctions, and to ignore numbers and punctuation. Word segmentation in effect turns unstructured data into structured data: a paragraph of text can now be represented as a set of columns, one per word. If the word appears in the text, the value of its column is 1; otherwise it is 0.
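The split and the 0/1 word columns can be illustrated with a small Python sketch. This is not what SAS Enterprise Miner does internally. The stop-word list and the three sample reviews are invented, and the regular-expression tokenizer only works for space-delimited English; Chinese reviews would need a proper word segmenter.

```python
import random
import re

# Tiny made-up stop-word list; a real one depends on the language and the tool.
STOP_WORDS = {"the", "a", "and", "i", "it", "is", "are", "in", "of", "to", "very"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens only (this drops numbers
    and punctuation), and remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def split_train_validation(rows, train_share=0.7, seed=42):
    """Shuffle a copy of rows and cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_share))
    return shuffled[:cut], shuffled[cut:]


def binary_term_matrix(texts):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if word j occurs
    in text i and 0 otherwise."""
    token_sets = [set(tokenize(t)) for t in texts]
    vocabulary = sorted(set().union(*token_sets))
    matrix = [[1 if w in tokens else 0 for w in vocabulary] for tokens in token_sets]
    return vocabulary, matrix


# Invented sample reviews, for illustration only.
reviews = [
    "The shoes are very warm.",
    "Fine workmanship, fashionable shoes!",
    "Warm and comfortable, size 38 fits.",
]
vocabulary, matrix = binary_term_matrix(reviews)
print(vocabulary)
for row in matrix:
    print(row)

# A 70/30 split of ten row indices gives seven training and three validation rows.
train, validation = split_train_validation(list(range(10)))
print(len(train), len(validation))
```

Each row of `matrix` is one review and each column is one word, which is the structured form the modeling steps work on.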

We cannot model directly yet. As the figure above shows, many words appear in only a few reviews, and a text filter node can be used to remove such low-frequency words.

In the text filter you can set a minimum number of documents, so that terms appearing in fewer documents are excluded, and you can also exclude words that are frequent but carry little meaning, such as "Jiu", "This", "Yes" and "You". Synonyms can be handled as well, either by adding them manually or by importing an external thesaurus. For example, "warmth" and "tenderness" are synonyms, and "good-looking" and "beautiful" are interchangeable. ...
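The three filtering ideas (a minimum document count, a stop list and a synonym table) can be sketched in a few lines of Python. The function name `filter_terms`, its parameters and the sample data are all assumptions made for illustration; they do not correspond to SAS node settings.

```python
from collections import Counter


def filter_terms(token_lists, min_docs=2, stop_words=frozenset(), synonyms=None):
    """Map synonyms to one canonical term, drop stop words, then drop terms
    that occur in fewer than min_docs reviews."""
    synonyms = synonyms or {}
    normalized = [
        [s for s in (synonyms.get(t, t) for t in tokens) if s not in stop_words]
        for tokens in token_lists
    ]
    # Document frequency: each term is counted at most once per review.
    doc_freq = Counter(term for tokens in normalized for term in set(tokens))
    keep = {term for term, n in doc_freq.items() if n >= min_docs}
    return [[t for t in tokens if t in keep] for tokens in normalized]


# Invented tokenized reviews.
docs = [
    ["shoes", "beautiful", "warm"],
    ["shoes", "good-looking", "just"],
    ["warm", "zipper", "just"],
]
filtered = filter_terms(
    docs,
    min_docs=2,
    stop_words={"just"},                      # frequent but uninformative
    synonyms={"beautiful": "good-looking"},   # merge interchangeable words
)
print(filtered)
```

In this toy run "zipper" is removed because it occurs in only one review, "just" is removed as a stop word, and "beautiful" is counted together with "good-looking".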

You can also view the link relationship between words in the software:

Next, we can use the Text Rule Builder node to build a model and find out which word combinations are directly related to brushing:

We label the real reviews in the training sample 0 (blue) and the fake positive reviews 1 (red). As the figure above shows, a review that mentions the word "warm" (or one of its synonyms) is probably real, while reviews that say "the shoes are fashionable" or "the workmanship is very fine and I will buy again" without mentioning whether the shoes are warm are mostly fake praise.
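To get a feel for how a word can point toward real or fake reviews, one can count, for each word, how many labeled reviews contain it and what share of those are fake. The Python sketch below does only that. It is a much simpler technique than the rule generation in SAS, it looks at single words rather than combinations, and the five labeled reviews are invented.

```python
def term_fake_rates(labeled_reviews, min_support=2):
    """labeled_reviews is a list of (set_of_terms, label) pairs, where label
    is 1 for a fake review and 0 for a real one.
    Returns {term: (support, fake_rate)} for every term that occurs in at
    least min_support reviews."""
    support = {}
    fake = {}
    for terms, label in labeled_reviews:
        for term in terms:
            support[term] = support.get(term, 0) + 1
            fake[term] = fake.get(term, 0) + label
    return {t: (n, fake[t] / n) for t, n in support.items() if n >= min_support}


# Invented labeled reviews: 0 = real, 1 = fake.
data = [
    ({"warm", "shoes"}, 0),
    ({"warm", "comfortable"}, 0),
    ({"fashionable", "workmanship"}, 1),
    ({"workmanship", "shoes"}, 1),
    ({"warm", "workmanship"}, 0),
]
rates = term_fake_rates(data)
for term, (n, rate) in sorted(rates.items(), key=lambda item: item[1][1]):
    print(f"{term}: in {n} reviews, {rate:.0%} of them fake")
```

In this toy data "warm" never appears in a fake review while "workmanship" mostly does, which mirrors the pattern described above. Words seen in only one review are left out because a single occurrence says little.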

Having read this far, you may be curious: why would an ordinary word like "warm" become the touchstone for telling real reviews from fake ones?

Think back to our own experience as ordinary buyers: after receiving the goods and trying them, we usually only describe our impressions briefly; that much is certain. The water army, by contrast, has never actually received the goods, let alone tried them on. To meet their quota, they have to work from the seller's product description and praise the product's attributes from the angles of quality, logistics, service attitude and even styling. In our case, "warm" is naturally a matter of personal experience, whereas "leather" and "workmanship" are probably not what ordinary buyers most want to give feedback on.

So how well does the model perform overall? We can evaluate it with the cumulative lift:

We also set aside 30% of the data as a validation sample, and it now comes into play to verify the results. Look at the pink curve in the figure above: if the model is used to score reviews and the reviews are ranked by their predicted probability of being fake ("1"), the cumulative lift for the top 5% of reviews is 3. Since fake reviews make up about 30% of the total, a lift of 3 means that about 90% (3 × 30%) of the top 5% of reviews are brushed, which shows that the model captures brushed reviews quite accurately.
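Cumulative lift is the share of fake reviews among the top-ranked reviews divided by the share of fake reviews overall, so a lift of 3 on a 30% base rate corresponds to 3 × 30% = 90%. The Python sketch below computes it on invented scores and labels. The data were constructed so that the top 10% of 100 reviews contain 9 fakes; they are not the article's validation sample.

```python
def cumulative_lift(scores, labels, top_share):
    """Lift of the top `top_share` fraction of reviews when they are ranked
    by score, highest first. Labels are 1 for fake and 0 for real."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * top_share)))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Invented example: 100 reviews, 30 of them fake (base rate 30%).
# Scores fall from 1.00 to 0.01, so the list order is already the ranking.
scores = [1 - i / 100 for i in range(100)]
labels = [1] * 9 + [0] + [1] * 21 + [0] * 69   # 9 fakes among the top 10

lift_top10 = cumulative_lift(scores, labels, top_share=0.10)
print(round(lift_top10, 2))                     # close to 3
print(round(lift_top10 * sum(labels) / len(labels), 2))  # share of fakes in the top 10%
```

A lift of 1 would mean the ranking is no better than picking reviews at random, so the further the curve sits above 1 at the top of the ranking, the better the model separates brushed reviews from real ones.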

Finally, to be fair to sellers: competition on Taobao is vicious, and there are probably few shops that never brush praise at all. We cannot say that a shop that brushes reviews should be avoided altogether. A product with 90% brushed reviews is truly shocking, whereas the quality of a shop with 10% brushed reviews is probably acceptable. This further illustrates what the model is good for: estimating the proportion of a product's reviews that are brushed is more practical than judging reviews one by one.

Nowadays the Internet water army keeps evolving, and its reviews read as more and more sincere and misleading. Telling them apart with the naked eye is time-consuming and confusing. Fake reviews may keep reinventing themselves, but our model can keep "learning" to follow them. If the method in this article is extended, it could form a standard process of capturing reviews, text analysis, modeling and estimating the proportion of fake reviews, which would be quite practical.
