What are the algorithms of machine learning?
The Naive Bayes classifier is one of the most popular learning methods. It builds a machine learning model on Bayes' theorem of probability and is particularly well suited to disease prediction and document classification. In text applications it is a simple classifier that uses Bayes' theorem on the words of a document to analyze content, including subjective content such as opinions.
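
To make the idea concrete, here is a minimal sketch of a word-based Naive Bayes classifier in plain Python. The function names (train_naive_bayes, predict), the four toy messages and the use of add-one (Laplace) smoothing are assumptions made for illustration; they do not come from any particular library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs is a list of (text, label) pairs. Returns the counts the model needs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)  # "naive" step: words treated as independent given the class
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
training = [
    ("win money now", "spam"),
    ("free prize win", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train_naive_bayes(training)
print(predict(model, "win a free prize"))
print(predict(model, "meeting tomorrow"))
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on longer documents.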

When should you use the Naive Bayes classifier machine learning algorithm?

(1) If you have a medium or large training data set.

(2) If the instances have several attributes.

(3) If, given the class, the attributes that describe an instance are conditionally independent.

A. Applications of the Naive Bayes classifier

(1) Sentiment analysis: used to analyze Facebook status updates that express positive or negative emotions.

(2) Document classification: Google uses document classification to index documents and find relevance scores, that is, PageRank. The PageRank mechanism considers the pages marked as important in the database, which are parsed and classified with a document classification technique.

(3) The Naive Bayes algorithm is also used to classify news articles into categories such as technology, entertainment, sports and politics.

(4) E-mail spam filtering: Google Mail uses the Naive Bayes algorithm to classify your email as spam or not spam.

B. Advantages of the Naive Bayes classifier machine learning algorithm

(1) The Naive Bayes classifier performs well when the input variables are categorical.

(2) When the conditional independence assumption holds, a Naive Bayes classifier converges faster and needs relatively less training data than discriminative models such as logistic regression.

(3) With the Naive Bayes classifier it is easy to predict the class of a test data set. It is a good bet for multi-class prediction.

(4) Although it requires the conditional independence assumption, the Naive Bayes classifier performs well in a wide range of application domains.

Data science library in Python that implements Naive Bayes: scikit-learn.

Data science library in R that implements Naive Bayes: e1071.

3.2 K-means clustering algorithm

K-means is an unsupervised machine learning algorithm widely used for cluster analysis. It is a non-deterministic, iterative method. The algorithm operates on a given data set with a predefined number of clusters, k. Its output is k clusters, with the input data partitioned among them.

For example, consider K-means clustering of Wikipedia search results. The search term "Jaguar" on Wikipedia returns all pages containing the word Jaguar, which may refer to Jaguar the car, Jaguar the Mac OS version, or jaguar the animal. K-means clustering can be used to group the web pages that describe similar concepts. The algorithm will therefore group all pages about the jaguar as an animal into one cluster, pages about Jaguar as a car into another cluster, and so on.
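
The assign-then-update loop behind K-means can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name kmeans, the seed parameter, the two-dimensional toy points and the choice to start from k randomly sampled points are all hypothetical, and real pages would first have to be turned into numeric feature vectors.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Partition points (tuples of numbers) into k clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random start: the source of non-determinism
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so the loop has converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for cluster in clusters:
    print(sorted(cluster))
```

Changing the seed changes the starting centroids. On harder data that can change the final clusters, which is why K-means is often run several times.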

A. Advantages of the K-means clustering machine learning algorithm

(1) When the clusters are globular, K-means produces tighter clusters than hierarchical clustering.

(2) Given a small value of k, K-means computes faster than hierarchical clustering when there is a large number of variables.

B. Applications of K-means clustering

The K-means clustering algorithm is used by most search engines (such as Yahoo and Google) to cluster web pages by similarity and to identify the "relevance rate" of search results. This helps search engines reduce the computation time for users.

Data science libraries in Python that implement K-means clustering: SciPy, scikit-learn and Python wrappers.

Data science library in R that implements K-means clustering: stats.

3.3 Support Vector Machine Learning Algorithm

Support Vector Machine (SVM) is a supervised machine learning algorithm for classification or regression problems, in which the data set teaches the SVM about the classes so that it can classify new data. It works by finding a line (hyperplane) that separates the training data into classes. Because many such separating hyperplanes exist, the SVM algorithm tries to maximize the distance between the classes involved, which is called margin maximization. If the line that maximizes the distance between the classes is identified, the probability of generalizing well to unseen data increases.
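
The notion of margin can be illustrated without training anything. The sketch below does not implement an SVM; it only measures, for two hand-picked separating lines, the distance from the line to the closest training point. The function name geometric_margin, the four points and the two candidate lines are assumptions made for the example.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. The result is positive only when the hyperplane
    separates the two classes correctly.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: two points per class.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]

# Two candidate separating lines: x1 = 3 and x1 + x2 = 5.
margin_vertical = geometric_margin((1.0, 0.0), -3.0, points, labels)
margin_diagonal = geometric_margin((1.0, 1.0), -5.0, points, labels)
print(margin_vertical, margin_diagonal)
```

Both lines classify the toy data correctly, but the diagonal one keeps the closest point farther away. That is the kind of hyperplane margin maximization prefers.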

A. SVMs fall into two categories:

Linear SVM: in a linear SVM, the training data can be separated by a hyperplane, which serves as the classifier.

Nonlinear SVM: in a nonlinear SVM, the training data cannot be separated by a hyperplane. For example, the training data for face detection consists of one set of images that are faces and another set of images that are not faces (in other words, every other image in the world). Under such conditions the training data is too complex for a simple representation of each feature vector to be found, and linearly separating the set of faces from the set of non-faces is a complex task.

B. Advantages of using SVM

(1) SVM offers very good classification performance (accuracy) on the training data.

(2) SVM classifies future data correctly with higher efficiency.

(3) The best thing about SVM is that it makes no strong assumptions about the data.

(4) It is less prone to over-fitting the data.

C. Applications of support vector machines

SVM is commonly used by financial institutions for stock market forecasting. For example, it can be used to compare the performance of a stock with that of other stocks in the same sector. These relative comparisons, based on the classifications made by the SVM learning algorithm, help inform investment decisions.

Data science libraries in Python that implement support vector machines: scikit-learn, PyML, SVMStruct Python, LIBSVM.

Data science libraries in R that implement support vector machines: klaR, e1071.

3.4 Apriori machine learning algorithm

The Apriori algorithm is an unsupervised machine learning algorithm that generates association rules from a given data set. An association rule states that if item A occurs, then item B also occurs with a certain probability. Most generated association rules are in IF-THEN format. For example, IF people buy an iPad THEN they also buy an iPad case. To reach this conclusion, the algorithm first counts the people who bought an iPad and then the people among them who also bought a case. The resulting ratio might be that of 100 people who bought an iPad, 85 also bought an iPad case.
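
The 85-out-of-100 ratio is what is usually called the confidence of the rule "iPad, therefore iPad case". A minimal sketch of that arithmetic, with hypothetical helper names (support, confidence) and a synthetic list of purchases built to match the example:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Synthetic data: 100 iPad buyers, 85 of whom also bought a case.
transactions = [{"iPad", "iPad case"}] * 85 + [{"iPad"}] * 15
rule_confidence = confidence(transactions, {"iPad"}, {"iPad case"})
print(rule_confidence)
```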

A. Basic principle of the Apriori machine learning algorithm

If an itemset occurs frequently, then all subsets of the itemset also occur frequently.

If an itemset occurs infrequently, then all supersets of the itemset also occur infrequently.
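
These two statements are what let Apriori skip most candidate itemsets. The sketch below is one simple way to write the level-by-level search in plain Python. The function name apriori, the min_count threshold and the five toy baskets are assumptions for illustration, not the interface of any library.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def keep_frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([item]) for t in transactions for item in t})
    all_frequent = dict(current)
    k = 2
    while current:
        previous = set(current)
        # Join step: combine frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: drop a candidate if any (k-1)-subset is infrequent,
        # because a superset of an infrequent itemset cannot be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        all_frequent.update(current)
        k += 1
    return all_frequent

# Hypothetical shopping baskets.
baskets = [
    {"sugar", "flour", "eggs"},
    {"sugar", "flour", "eggs"},
    {"sugar", "flour"},
    {"sugar", "milk"},
    {"flour", "eggs"},
]
frequent_itemsets = apriori(baskets, min_count=3)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy data the three-item candidate {sugar, flour, eggs} is never counted. Its subset {sugar, eggs} is infrequent, so the prune step discards it first.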

B. Advantages of the Apriori algorithm

(1) It is easy to implement and easy to parallelize.

(2) The Apriori implementation makes use of the large itemset property.

C. Applications of the Apriori algorithm

Detection of adverse drug reactions

The Apriori algorithm is used for association analysis of medical data, such as the drugs taken by patients, the characteristics of each patient, the adverse reactions patients experience, the initial diagnosis and so on. This analysis produces association rules that help identify the combinations of patient characteristics and drugs that lead to adverse side effects.

Market basket analysis

Many e-commerce giants, such as Amazon, use Apriori to draw insights from their data about which products are likely to be purchased together and which are most responsive to promotion. For example, a retailer might use Apriori to predict that people who buy sugar and flour are likely to buy eggs to bake a cake.

Auto-complete applications

Google auto-complete is another popular application of Apriori. When a user types a word, the search engine looks for other associated words that people usually type after that word.

Data science libraries in Python that implement the Apriori machine learning algorithm: a Python implementation of Apriori is available on PyPI.

Data science library in R that implements the Apriori machine learning algorithm: arules.

3.5 Linear Regression Machine Learning Algorithm

The linear regression algorithm shows the relationship between two variables and how a change in one variable affects the other. It shows the effect on the dependent variable of changing the independent variable. Independent variables are called explanatory variables because they explain the factors that influence the dependent variable. The dependent variable is often called the factor of interest or the predicted variable.
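
For a single independent variable, the best-fit line can be computed directly with ordinary least squares. The sketch below assumes that setting; the function name fit_line and the six months of sales figures are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: month number (independent) and units sold (dependent).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept
print(slope, intercept, forecast_month_7)
```

The slope is the interpretable part: it estimates how much the dependent variable changes when the independent variable increases by one unit.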

A. Advantages of the linear regression machine learning algorithm

(1) It is one of the most interpretable machine learning algorithms, which makes it easy to explain to others.

(2) It is easy to use because it requires minimal tuning.

(3) It is the most widely used machine learning technique and runs fast.

B. Applications of the linear regression algorithm

Estimating sales

Linear regression is very useful in business for forecasting sales from trends. If a company's monthly sales increase steadily, a linear regression analysis of the monthly sales data helps the company forecast sales for the coming months.

Risk assessment

Linear regression helps assess risk in insurance and finance. A health insurance company can run a linear regression analysis on the number of claims per customer against customer age. This analysis helps the insurer find that older customers tend to make more insurance claims. Results like these play a vital role in important business decisions and are used to account for risk.


Data science libraries in Python that implement linear regression: statsmodels and scikit-learn.

Data science library in R that implements linear regression: stats.

3.6 Decision Tree Machine Learning Algorithm

Suppose your parents are visiting and you are planning a weekend outing to the best restaurant in the city, but you cannot decide which restaurant to choose. Whenever you want to go to a restaurant, you ask your friend Tyrion whether he thinks you will like a particular place. To answer, Tyrion must first find out what kind of restaurants you like. You give him a list of restaurants you have been to and tell him whether you liked each one (a labeled training data set). When you then ask Tyrion whether you will like a particular restaurant R, he asks you questions such as "Is R a rooftop restaurant?", "Does R serve Italian food?", "Does R have live music?", "Is R open until midnight?" and so on. Tyrion asks several informative questions to maximize the information gain and gives you a yes or no answer based on your replies. Here, Tyrion is a decision tree for your restaurant preferences.

A decision tree is a graphical representation that uses a branching method to illustrate every possible outcome of a decision under given conditions. In a decision tree, each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label, that is, the decision reached after all attributes have been evaluated. The classification rules are represented by the paths from the root to the leaf nodes.
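
One common way to decide which attribute a node should test is to pick the question with the highest information gain, measured as the reduction in entropy. The sketch below assumes that criterion; the function names (entropy, information_gain) and the four toy restaurants are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical restaurants you have visited and whether you liked them.
restaurants = [
    {"rooftop": True, "italian": True},
    {"rooftop": True, "italian": False},
    {"rooftop": False, "italian": True},
    {"rooftop": False, "italian": False},
]
liked = ["yes", "yes", "no", "no"]

gains = {attr: information_gain(restaurants, liked, attr) for attr in ("rooftop", "italian")}
best_question = max(gains, key=gains.get)
print(gains, best_question)
```

In this toy data the rooftop question splits the restaurants perfectly and the cuisine question tells you nothing, so a tree would ask about the rooftop first.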

A. Types of decision trees

(1) Classification trees: these are the default kind of decision tree, used to separate a data set into classes according to the response variable. They are generally used when the response variable is categorical in nature.

(2) Regression trees: these are used when the response or target variable is continuous or numerical. They are generally used for prediction problems rather than classification.

Decision trees can also be divided into two types by the kind of target variable: continuous-variable decision trees and binary-variable decision trees. It is the target variable that determines which kind of decision tree a particular problem needs.

B. Why choose the decision tree algorithm?

(1) These machine learning algorithms help you make decisions under uncertainty and improve communication, because they give a visual representation of a decision.

(2) Decision tree machine learning algorithms help a data scientist see how the operational nature of a situation or model would change drastically if a different decision were taken.

(3) Decision tree algorithms help make optimal decisions by letting a data scientist traverse the calculation paths forward and backward.

C. When to use the decision tree machine learning algorithm

(1) Decision trees are robust to errors. If the training data contains errors, decision tree algorithms are well suited to the problem.

(2) Decision trees are best suited to problems in which instances are represented by attribute-value pairs.

(3) If the training data has missing values, decision trees can be used, because they handle missing values well by looking at the data in other columns.

(4) Decision trees are best suited when the target function has discrete output values.

D. Advantages of decision trees

(1) Decision trees are very intuitive and can be explained to anyone with ease. People without a technical background can also interpret the hypotheses drawn from a decision tree, because they are self-explanatory.

(2) With decision tree machine learning algorithms, data type is not a constraint, because they can handle both categorical and numerical variables.

(3) Decision tree machine learning algorithms require no assumption of linearity in the data, so they can be used when the parameters are related nonlinearly. They make no assumptions about the classifier structure or the spatial distribution of the data.

(4) These algorithms are useful in data exploration. A decision tree implicitly performs feature selection, which is very important in predictive analytics. When a decision tree is fit to a training data set, the nodes on which the top of the tree splits are considered the important variables in that data set, so feature selection is completed by default.

(5) Decision trees help save data preparation time, because they are not sensitive to missing values and outliers. Missing values do not prevent you from splitting the data to build a decision tree. Outliers do not affect the tree either, because splits are based on the samples that fall within the split ranges rather than on exact absolute values.

E. Disadvantages of decision trees

(1) The more decisions there are in a tree, the less accurate any expected outcome is likely to be.

(2) A major drawback of decision tree machine learning algorithms is that the outcomes may be based on expectations. When decisions are made in real time, the payoffs and resulting outcomes might differ from what was expected or planned. This can lead to unrealistic decision trees and bad decisions. Any irrational expectation can lead to major errors and flaws in decision tree analysis, because it is not always possible to plan for every eventuality that may follow from a decision.

(3) Decision trees are not a good fit for continuous variables and can result in instability and classification plateaus.

(4) Decision trees are easy to use compared with other decision-making models, but creating a large decision tree with many branches is a complex and time-consuming task.

(5) Decision tree machine learning algorithms consider only one attribute at a time, which may not suit the actual data in the decision space.

(6) Large decision trees with many branches are hard to comprehend and difficult to present.

F. Applications of the decision tree machine learning algorithm

(1) Decision trees are among the popular machine learning algorithms and are very useful for option pricing in finance.

(2) Remote sensing is an application area for pattern recognition based on decision trees.

(3) Banks use decision tree algorithms to classify loan applicants by their probability of defaulting on payments.

(4) Gerber Products, a popular baby products company, used a decision tree machine learning algorithm to decide whether it should continue using the plastic PVC (polyvinyl chloride) in its products.

(5) Rush University Medical Center has developed a tool called Guardian, which uses a decision tree machine learning algorithm to identify at-risk patients and disease trends.

Data science libraries in Python that implement the decision tree machine learning algorithm: SciPy and scikit-learn.

Data science library in R that implements the decision tree machine learning algorithm: caret.

3.7 Random Forest Machine Learning Algorithm

Let's continue with the example from the decision tree section to explain how the random forest machine learning algorithm works. Tyrion is a decision tree for your restaurant preferences. However, being a single person, Tyrion does not always generalize your restaurant preferences accurately. To get more accurate recommendations, you ask a couple of friends and decide to go to restaurant R if most of them say you will like it. Besides Tyrion, you ask Jon Snow, Sandor, Bronn and Bran, who vote on whether you will like restaurant R. You have now built an ensemble classifier of decision trees, also known as a forest.

You don't want all your friends to give you the same answer, so you give each friend slightly different data. You are also not sure of your own restaurant preferences. You told Tyrion that you like rooftop restaurants, but perhaps you liked that restaurant only because you visited it in summer, and you might not be a fan of it in the cold winter. So not all of your friends should use the data point that you like rooftop restaurants when making their recommendations.

By giving your friends slightly different data on your restaurant preferences, you make them ask you different questions at different times. Just by altering your restaurant preferences slightly, you inject randomness at the model level (unlike a decision tree, where the randomness is at the data level). Your friends now form a random forest of your restaurant preferences.

Random forest is a machine learning algorithm that uses a bagging approach to create a bunch of decision trees from random subsets of the data. The model is trained several times on random samples of the data set to achieve good prediction performance. In this ensemble learning method, the outputs of all the decision trees in the random forest are combined to make the final prediction. The final prediction is obtained by polling the results of the individual decision trees, that is, by taking the prediction that appears most often among them.

For example, if five friends decide that you will like restaurant R and only two friends decide that you won't, then the final prediction is that you will like restaurant R, because the majority always wins.
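
The bagging-and-voting idea can be sketched with very small trees. The code below is a simplified illustration, not a full random forest. Each "tree" is a one-question stump on a single numeric feature, every stump is trained on its own bootstrap sample, and the forest returns the majority vote. Real implementations also pick a random subset of features at each split. All function names and the toy data are assumptions for the example.

```python
import random
from collections import Counter

def most_common(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(samples):
    """samples is a list of (x, label). Returns (threshold, left_label, right_label)."""
    majority = most_common([label for _, label in samples])
    best = (None, majority, majority)  # fallback: always predict the majority class
    best_errors = sum(1 for _, label in samples if label != majority)
    xs = sorted({x for x, _ in samples})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [label for x, label in samples if x <= threshold]
        right = [label for x, label in samples if x > threshold]
        left_label, right_label = most_common(left), most_common(right)
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if errors < best_errors:
            best, best_errors = (threshold, left_label, right_label), errors
    return best

def stump_predict(stump, x):
    threshold, left_label, right_label = stump
    if threshold is None:
        return left_label
    return left_label if x <= threshold else right_label

def train_forest(samples, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: each tree sees a bootstrap sample drawn with replacement.
        bootstrap = [rng.choice(samples) for _ in samples]
        forest.append(train_stump(bootstrap))
    return forest

def forest_predict(forest, x):
    # Majority vote over the individual trees.
    return most_common([stump_predict(stump, x) for stump in forest])

# Hypothetical data: one numeric score per restaurant; low scores were disliked.
samples = [(x, "dislike") for x in range(1, 11)] + [(x, "like") for x in range(11, 21)]
forest = train_forest(samples)
print(forest_predict(forest, 2), forest_predict(forest, 19))
```

Because every stump is trained on a different resample, the thresholds differ slightly from tree to tree, and the vote smooths out those differences.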

A. Why use the random forest machine learning algorithm?

(1) There are many good open source implementations of the algorithm in Python and R.

(2) It maintains accuracy when data is missing and is also resistant to outliers.

(3) It is simple to use: the basic random forest algorithm can be implemented in just a few lines of code.

(4) Random forest machine learning algorithms help data scientists save data preparation time, because they need no input preparation and can handle numerical, binary and categorical features without scaling, transformation or modification.

(5) It performs implicit feature selection, because it gives an estimate of which variables are important for the classification.

B. Advantages of using the random forest machine learning algorithm

(1) Unlike with decision tree machine learning algorithms, over-fitting is much less of a problem for random forests. There is no need to prune a random forest.

(2) These algorithms are fast, but not in all cases. Run on an 800 MHz machine with a data set of 100 variables and 50,000 cases, a random forest algorithm produced 100 decision trees in 1 minute.

(3) Random forests are among the most effective and versatile machine learning algorithms for a wide variety of classification and regression tasks, because they are more robust to noise.

(4) It is difficult to build a bad random forest. When implementing a random forest, it is easy to decide which parameters to use, because the algorithm is not sensitive to the parameters it is run with. A decent model can be built without much tuning.

(5) Random forest machine learning algorithms can be grown in parallel.

(6) The algorithm runs efficiently on large databases.

(7) The classification accuracy is high.

C. Disadvantages of using the random forest machine learning algorithm

Random forests may be easy to use, but they are difficult to analyze theoretically.

A large number of decision trees in a random forest can slow the algorithm down when making real-time predictions.

If the data consists of categorical variables with different numbers of levels, the algorithm is biased in favor of the attributes with more levels. In such cases the variable importance scores are not reliable.

When used for regression tasks, the random forest algorithm does not predict beyond the range of the response values in the training data.

D. Applications of the random forest machine learning algorithm

(1) Banks use random forest algorithms to predict whether a loan applicant is likely to be high-risk.

(2) In the automobile industry they are used to predict the failure of mechanical parts.

(3) In the health care industry these algorithms are used to predict whether a patient is likely to develop a chronic disease.

(4) They can also be used for regression tasks, such as predicting the average number of social media shares and performance scores.

(5) Recently, the algorithm has also been used to predict patterns in speech recognition software and to classify images and texts.

Data science library in Python that implements the random forest machine learning algorithm: scikit-learn.

Data science library in R that implements the random forest machine learning algorithm: randomForest.