How to become a "master" of data mining through self-study
Basics:

1. Read Introduction to Data Mining. The book is easy to follow and has no complicated or obscure formulas, so it is well suited to beginners. You can also use Data Mining: Concepts and Techniques as a reference; it is coarser in its treatment and covers more data warehouse material. If you enjoy algorithms, go on to read Introduction to Machine Learning.

2. Implement the classic algorithms. They fall into several groups:

A. Association rule mining (Apriori, FP-tree, etc.)

B. Classification (C4.5, KNN, logistic regression, SVM, etc.)

C. Clustering (k-means, DBSCAN, spectral clustering, etc.)

D. Dimensionality reduction (PCA, LDA, etc.)

E. Recommender systems (content-based recommendation; collaborative filtering, e.g. matrix factorization; etc.)

Then test your implementations on public data sets to see how well they perform. A large number of public data sets are available online.

3. Take part in a few introductory (101-level) competitions to learn how to abstract a problem into a model and how to build effective features from raw data (feature engineering).
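As an illustration of step 2, here is a minimal sketch of Apriori frequent-itemset mining, the first algorithm in group A. The function name apriori, its parameters, and the toy shopping baskets are all made up for this example; in practice you would replace the baskets with a public data set and compare your output against an established library.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    min_support is an absolute count, not a fraction (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: scan the transactions once per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy data, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(transactions, min_support=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The sketch rescans every transaction at each level, which is fine for learning but slow on real data; avoiding those repeated scans is exactly what FP-tree based methods are for.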

At this point, several large domestic companies will generally be willing to give you an interview.

Advanced:

1. Read books. The following are all weighty volumes, but you will make great progress by working through them.

A. Pattern Recognition and Machine Learning

B. Elements of Statistical Learning

C. Machine Learning: A Probabilistic Perspective

The first book is more Bayesian; the second is more frequentist; the third sits somewhere in between, though I find it closer to the first, with a lot of newer content added. Besides these large, comprehensive books, there are many that cover specific areas, such as Boosting: Foundations and Algorithms and Probabilistic Graphical Models: Principles and Techniques, as well as books on the theoretical foundations of machine learning and on optimization for machine learning. The end-of-chapter exercises in these books are also very useful. Only by working through them will you be able to derive formulas when you write your own papers.
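To make the Bayesian versus frequentist contrast concrete, here is a small sketch that estimates a coin's probability of heads both ways. It is not taken from any of the books above; the function names and the uniform Beta(1, 1) prior are assumptions chosen for illustration.

```python
def mle_estimate(heads, tosses):
    """Frequentist point estimate: the maximum-likelihood value heads / tosses."""
    return heads / tosses


def posterior_mean(heads, tosses, alpha=1.0, beta=1.0):
    """Bayesian estimate: mean of the Beta posterior.

    A Beta(alpha, beta) prior combined with a binomial likelihood gives a
    Beta(alpha + heads, beta + tails) posterior, whose mean is returned here.
    """
    return (alpha + heads) / (alpha + beta + tosses)


heads, tosses = 3, 3  # three tosses, all heads
mle = mle_estimate(heads, tosses)      # 1.0
bayes = posterior_mean(heads, tosses)  # 0.8
print(mle, bayes)
```

With only three tosses, the maximum-likelihood estimate claims the coin always lands heads, while the prior pulls the Bayesian estimate back toward 0.5. As the number of tosses grows, the two estimates converge.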

2. Read papers. This includes several relevant conferences: KDD, ICML, NIPS, ICDM, AAAI, WWW, SIGIR, IJCAI; and several relevant journals: TKDD, TKDE, JMLR, PAMI, etc. Use them to track new techniques and new hot topics. If you do research in the area, this step is of course mandatory. For example, my group's routine is to read papers in the first half of the year, identify problems over the summer vacation, run experiments in the autumn, and write and submit papers around the Spring Festival.

3. Track hot topics. In recent years these have included recommender systems, social networks, and behavioral targeting, all of which many companies' businesses touch on. Some techniques are also in vogue, such as deep learning, which is very popular right now.

4. Learn large-scale parallel computing technologies such as MapReduce, MPI, and GPU computing. Nearly every large company uses them, because real-world data volumes are very large and the processing is mostly done on computing clusters.
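For readers new to MapReduce, the programming model can be sketched in a single process before touching a cluster. The word count below is only an illustration of the map, shuffle, and reduce phases; the function names are invented for this example and do not correspond to the API of Hadoop or any other framework.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster the mappers and reducers run in parallel on many machines;
    # here they run one after another in a single process.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Toy documents, invented for illustration only.
docs = [
    "data mining finds patterns",
    "mining data at scale",
    "patterns at scale",
]
counts = word_count(docs)
print(counts)
```

Because each mapper sees only one document and each reducer sees only one key, both phases can be spread across machines without shared state, which is what makes the model practical for very large data.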

5. Take part in real data mining competitions, such as KDD Cup or the competitions mentioned above. This will train you to solve a practical problem in a short time and familiarize you with the full workflow of a data mining project.

6. Contribute to an open source project, such as Shogun, scikit-learn, or Apache Mahout, or provide more efficient and faster implementations of popular algorithms, such as an SVM on the MapReduce platform. This also exercises your coding ability.

At this point, you can get into more or less any large domestic company you want, and the pay is not bad; if your English is good, getting into a company in America is not difficult either.