Not only the ten algorithms finally selected, but in fact any of the eighteen algorithms that were nominated could be called a classic, each having had a far-reaching influence on the data mining field.
1. C4.5
C4.5 is a classification decision-tree algorithm in machine learning; its core is the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on it in the following ways:
1) It selects attributes by information gain ratio, overcoming the bias of plain information gain toward attributes with many values;
2) It prunes the tree during construction;
3) It can discretize continuous attributes;
4) It can handle incomplete data.
C4.5 has the advantage that the classification rules it generates are easy to understand and reasonably accurate. Its disadvantage is that the data set must be scanned and sorted repeatedly while the tree is constructed, which makes the algorithm inefficient on large data sets.
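As a minimal sketch of point 1) above (only the attribute-selection criterion, not the full C4.5 tree builder), the gain ratio of a categorical attribute can be computed like this; the toy dataset and attribute names are invented for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on `attr`, divided by the split's
    own entropy (the 'split information'), as C4.5 does."""
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0
```

Here attribute "a" perfectly predicts the label while "b" is uninformative, so C4.5 would split on "a":

```python
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "no", "yes", "yes"]
gain_ratio(rows, labels, "a")   # 1.0 — perfect predictor
gain_ratio(rows, labels, "b")   # 0.0 — uninformative
```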
2. K-means algorithm
The k-means algorithm is a clustering algorithm: it partitions n objects into k clusters (k < n) according to their attributes, so that objects within a cluster are more similar to each other than to objects in other clusters.
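A minimal sketch of the standard Lloyd iteration for k-means (alternating assignment and centroid update); the data and the fixed iteration count are illustrative choices, not part of the algorithm's definition:

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return centroids, clusters
```

With two well-separated blobs, the two clusters recover the blobs regardless of which points are sampled as initial centroids:

```python
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, 2)
```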
3. Support vector machine
Support vector machine (SVM) is a supervised learning method widely used in statistical classification and regression analysis. An SVM maps input vectors into a higher-dimensional space and constructs a separating hyperplane with the largest margin in that space. Two parallel hyperplanes are placed on either side of the separating hyperplane, each touching the nearest data points; the separating hyperplane is chosen to maximize the distance between these two parallel hyperplanes. The assumption is that the larger this distance (margin) between the parallel hyperplanes, the smaller the total error of the classifier. An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition"; van der Walt and Barnard compared support vector machines with other classifiers.
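As a rough sketch of the max-margin idea (a linear SVM trained by stochastic subgradient descent on the hinge loss, in the style of Pegasos, rather than the usual quadratic-programming solver), with invented data and hyperparameters:

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by stochastic
    subgradient descent. Labels y must be +1 or -1."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, b, t = [0.0] * d, 0.0, 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # shrink w (gradient of the regulariser) ...
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:  # ... and step along the hinge subgradient
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def predict(w, b, x):
    """Side of the separating hyperplane the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

On linearly separable data the learned hyperplane approximates the max-margin separator between the two classes:

```python
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
```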
4. Apriori algorithm
Apriori is the most influential algorithm for mining the frequent itemsets needed for Boolean association rules. Its core is a recursive, level-wise method based on the two-stage generation of frequent itemsets. The association rules involved here are single-dimensional, single-level, Boolean association rules. All itemsets whose support is at least the minimum support are called frequent itemsets (frequent sets).
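A minimal sketch of the level-wise search (the transactions and the minimum-support value are invented for illustration). The key property exploited is that every subset of a frequent itemset must itself be frequent, which lets candidates be joined and pruned level by level:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: build size-(k+1) candidates
    from frequent size-k itemsets, prune, then support-count."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    frequent = {}
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        # join step: union pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k + 1}
        # prune step: keep candidates whose k-subsets are all frequent
        current = [c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, k))
                   and support(c) >= min_support]
        k += 1
    return frequent
```

For example, with minimum support 0.5, {milk, bread} survives but {bread, butter} is never even counted at level 2 because it falls below the threshold:

```python
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread"}, {"milk", "butter"}]
result = apriori(baskets, 0.5)
```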
5. Maximum expectation (EM) algorithm
In statistical computing, the expectation-maximization (EM) algorithm finds maximum-likelihood estimates of the parameters of a probability model in which the model depends on unobservable latent variables. EM is often used for data clustering in machine learning and computer vision.
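A minimal sketch of EM for the simplest clustering case, a mixture of two one-dimensional Gaussians (the latent variable is which component generated each point); initialisation and iteration count are illustrative choices:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_two_gaussians(xs, iters=50):
    """E-step: posterior responsibility of each component for each point
    under current parameters. M-step: re-estimate weights, means and
    variances from those responsibilities."""
    mu = [min(xs), max(xs)]   # crude but serviceable initialisation
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step
        resp = []
        for x in xs:
            p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return w, mu, var
```

On data drawn around 0 and around 5, the estimated means converge to the two cluster centres with roughly equal mixture weights:

```python
xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
w, mu, var = em_two_gaussians(xs)
```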
6. PageRank
Google's PageRank algorithm was granted a US patent in September 2001. The patent belongs to Larry Page, one of Google's founders. The "Page" in PageRank therefore refers not to web pages but to Page himself: the ranking method is named after him.
PageRank measures a website's value according to the quantity and quality of its external and internal links. The idea behind PageRank is that each link to a page counts as a vote for that page: the more links, the more votes cast by other sites. This is the so-called "link popularity", a measure of how many other sites are willing to link to yours. The concept is borrowed from the practice of counting citations of academic papers: the more often a paper is cited, the more authoritative it is generally judged to be.
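This voting idea can be sketched with the standard power-iteration form of PageRank (the tiny three-page link graph and the damping factor 0.85 are illustrative):

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration on PR(p) = (1-d)/N + d * sum over pages q
    linking to p of PR(q)/outdegree(q).
    `links` maps each page to the list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * pr[p] / n
            else:
                for q in outs:
                    new[q] += damping * pr[p] / len(outs)
        pr = new
    return pr
```

In this graph page "a" receives the most votes (a full link from "c" and half of "b"'s), so it ends up with the highest rank:

```python
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```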
7. AdaBoost
AdaBoost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: the weight of each sample is adjusted according to whether it was classified correctly in the previous round and the overall accuracy so far. The reweighted data set is passed to the next weak classifier for training, and the classifiers obtained in each round are finally fused by weighted voting into the final decision classifier.
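A minimal sketch of this loop using decision stumps as the weak classifiers (a common but not mandatory choice); the exhaustive stump search and the toy 1-D data are for illustration only:

```python
from math import exp, log

def stump_predict(x, feature, threshold, sign):
    """Weak classifier: +sign above the threshold, -sign below."""
    return sign if x[feature] >= threshold else -sign

def adaboost(X, y, rounds=10):
    """Each round: fit the stump with lowest weighted error, weight it
    by alpha, then up-weight the samples it misclassified so the next
    stump focuses on them. Labels y must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for f in range(len(X[0])):
            for t in sorted({x[f] for x in X}):
                for sign in (1, -1):
                    err = sum(wi for wi, xi, yi in zip(w, X, y)
                              if stump_predict(xi, f, t, sign) != yi)
                    if best is None or err < best[0]:
                        best = (err, f, t, sign)
        err, f, t, sign = best
        err = max(min(err, 1 - 1e-10), 1e-10)  # keep log well-defined
        alpha = 0.5 * log((1 - err) / err)
        ensemble.append((alpha, f, t, sign))
        # re-weight: misclassified samples grow, correct ones shrink
        w = [wi * exp(-alpha * yi * stump_predict(xi, f, t, sign))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of all weak classifiers."""
    score = sum(a * stump_predict(x, f, t, sign) for a, f, t, sign in ensemble)
    return 1 if score >= 0 else -1
```

No single stump can fit the pattern -1, +1, +1, -1 on the line, but three boosted stumps classify it perfectly:

```python
X = [(0,), (1,), (2,), (3,)]
y = [-1, 1, 1, -1]
model = adaboost(X, y, rounds=3)
```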
8. kNN: k-nearest neighbor classification
The k-nearest neighbor (kNN) classification algorithm is theoretically mature and conceptually one of the simplest machine learning algorithms. The idea: if most of the k samples most similar to a sample in feature space (that is, its k nearest neighbors) belong to a certain category, then the sample belongs to that category as well.
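The kNN rule just described fits in a few lines; Euclidean distance and the toy labeled points are illustrative assumptions:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_classify(train, query, k=3):
    """Majority vote among the k training points closest to `query`.
    `train` is a list of (point, label) pairs."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

A query near either labeled cluster takes that cluster's label:

```python
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
knn_classify(train, (0.5, 0.5))  # "a"
```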
9. Naive Bayes
Among the many classification models, the two most widely used are the decision tree model and the Naive Bayes classification model (NBC). The Naive Bayes model originates in classical mathematical theory, has a solid mathematical foundation, and offers stable classification efficiency. At the same time, the NBC model needs few parameters to be estimated, is insensitive to missing data, and the algorithm itself is relatively simple. In theory, the NBC model has the smallest error rate compared with other classification methods. In practice this is not always the case, because the NBC model assumes that attributes are independent of one another given the class; this assumption often fails in real applications, which hurts the model's classification accuracy. When the number of attributes is large or the correlation between attributes is strong, the NBC model's classification efficiency falls behind that of the decision tree model; when attribute correlation is small, the NBC model performs best.
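The conditional-independence assumption makes training a matter of counting. A minimal sketch for categorical attributes with Laplace (add-one) smoothing follows; the fruit dataset and attribute names are invented:

```python
from collections import Counter, defaultdict
from math import log

def train_nb(rows, labels):
    """Estimate P(class) and P(attribute value | class) by counting,
    for rows given as dicts of categorical attributes."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attr) -> value counts
    values = defaultdict(set)            # attr -> all observed values
    for row, c in zip(rows, labels):
        for attr, v in row.items():
            value_counts[(c, attr)][v] += 1
            values[attr].add(v)
    return class_counts, value_counts, values

def predict_nb(model, row):
    """Pick the class maximising log P(c) + sum log P(v | c), with
    add-one smoothing so unseen values never zero out a class."""
    class_counts, value_counts, values = model
    n = sum(class_counts.values())
    best, best_score = None, None
    for c, cc in class_counts.items():
        score = log(cc / n)
        for attr, v in row.items():
            num = value_counts[(c, attr)][v] + 1
            den = cc + len(values[attr])
            score += log(num / den)
        if best_score is None or score > best_score:
            best, best_score = c, score
    return best
```

Usage on a toy table where color determines the class:

```python
rows = [{"color": "red", "size": "big"}, {"color": "red", "size": "small"},
        {"color": "green", "size": "big"}, {"color": "green", "size": "small"}]
labels = ["apple", "apple", "pear", "pear"]
model = train_nb(rows, labels)
predict_nb(model, {"color": "red", "size": "big"})  # "apple"
```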
10. CART: classification and regression trees
CART stands for classification and regression trees. Two key ideas underlie the classification tree: the first is to recursively partition the space of the independent variables; the second is to prune the tree using validation data.
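The first idea, recursive partitioning, can be sketched as follows (Gini impurity as the split criterion and a depth cap instead of validation-based pruning are simplifying assumptions; the XOR-style data is invented):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively pick the binary split (feature, threshold) that
    minimises weighted Gini impurity; stop at pure nodes or max_depth.
    A leaf is the majority class label; an internal node is a tuple
    (feature, threshold, left_subtree, right_subtree)."""
    if len(set(y)) == 1 or depth == max_depth:
        return max(set(y), key=y.count)
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [yi for xi, yi in zip(X, y) if xi[f] < t]
            right = [yi for xi, yi in zip(X, y) if xi[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return max(set(y), key=y.count)
    _, f, t = best
    li = [i for i, xi in enumerate(X) if xi[f] < t]
    ri = [i for i, xi in enumerate(X) if xi[f] >= t]
    return (f, t,
            build_tree([X[i] for i in li], [y[i] for i in li], depth + 1, max_depth),
            build_tree([X[i] for i in ri], [y[i] for i in ri], depth + 1, max_depth))

def tree_predict(node, x):
    """Walk from the root to a leaf and return its label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] < t else right
    return node
```

Two levels of recursive splits are enough to separate an XOR-style pattern that no single split can handle:

```python
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = ["a", "b", "b", "a"]
tree = build_tree(X, y)
```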