1.1 What is a crawler? "Crawler" usually refers to a web crawler: a program or script that automatically fetches information from the World Wide Web according to certain rules. Generally it crawls according to predefined behavior, and smarter crawlers can also analyze the structure of the target website automatically. Crawlers go by some other names as well, such as web spiders, ants, automatic indexers, and web robots.
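As a concrete illustration, here is a minimal sketch of such a program. It uses the third-party Requests and BeautifulSoup libraries (both discussed later in this article); the target URL is a placeholder, not one from the original text.

    import requests
    from bs4 import BeautifulSoup

    def crawl(url):
        """Fetch one page, then extract its title and outgoing links."""
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        title = soup.title.string if soup.title else ""
        links = [a["href"] for a in soup.find_all("a", href=True)]
        return title, links

    if __name__ == "__main__":
        title, links = crawl("https://example.com")  # placeholder URL
        print(title, len(links))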
1.2 Reasons for learning crawlers:
1.2.1 Learning crawlers is very interesting. I have collected many interesting things with crawlers, and interest is the best teacher: what interests me I learn quickly and remember well, and there is a sense of accomplishment after learning it.
1.2.2 By learning crawlers you can build a tailor-made personal search engine and gain a deeper understanding of how search engines collect data. Some friends want to understand how search-engine crawlers work, or want to develop a private search engine; in that case learning crawlers is essential. Simply put, once we can write a crawler, we can automatically collect information from the Internet, store or process it, and retrieve from that collection whenever we need something, which amounts to a private search engine. Of course, how to capture the information, how to store it, how to tokenize it, and how to compute relevance still need to be designed; crawler technology mainly solves the problem of capturing the information. A toy sketch of the storage and retrieval side follows below.
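As the toy sketch promised above (not from the original text; the whitespace tokenizer is a deliberate simplification), the storage and retrieval side of a private search engine can be reduced to an inverted index:

    from collections import defaultdict

    index = defaultdict(set)  # word -> ids of documents containing it
    docs = {}                 # id -> raw text of the crawled page

    def add_document(doc_id, text):
        """Store a crawled page and index every word in it."""
        docs[doc_id] = text
        for word in text.lower().split():  # naive tokenization
            index[word].add(doc_id)

    def search(query):
        """Return ids of documents containing every query word."""
        words = query.lower().split()
        result = index[words[0]].copy() if words else set()
        for word in words[1:]:
            result &= index[word]
        return result

    add_document(1, "Python web crawler tutorial")
    add_document(2, "Search engines use web crawlers")
    print(search("web crawler"))  # -> {1}

A real engine would replace the naive split with proper word segmentation and rank results by a relevance score such as TF-IDF, exactly the design questions mentioned above.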
1.2.3 Learning crawlers gives you more data sources, collected for your own purposes and with irrelevant data stripped out. When doing big data analysis or data mining, data can be obtained from websites that publish statistics, or from literature and internal materials, but those channels sometimes cannot satisfy our demand for data, and finding the data manually on the Internet takes too much effort. With crawler technology we can automatically fetch the content we are interested in and bring it back as our data source, enabling deeper analysis and more valuable insights.
1.2.4 For many SEO practitioners, learning crawlers gives a deeper understanding of how search-engine crawlers work, and therefore enables better search engine optimization. Since SEO is about optimizing for search engines, you must be very clear about how search engines work, and in particular how their crawlers work; only then can you know both yourself and your counterpart when optimizing.
1.2.5 Employment. Crawler engineers are currently in short supply and generally well paid, so mastering this technology in depth is very beneficial for employment. Some friends learn crawling in order to find a job or change jobs. From this point of view, crawler engineering is a good direction: demand is growing while qualified candidates are relatively few, so it is a comparatively scarce specialty, and as the era of big data arrives, crawler technology will be applied ever more widely and should enjoy good room for growth.
Besides the common reasons above, you may have other reasons of your own for learning crawlers. In short, whatever the reason, it will help you learn the technology better and stick with it.
1.3 How to learn crawlers
1.3.1 Choose a programming language. The prerequisite for getting started with crawlers is learning a programming language, and Python is recommended; as of May 2018 Python ranked first as the most popular language, and many people associate Python directly with crawlers. Compared with languages such as Java, PHP, and Node.js, Python has more crawler libraries and provides more APIs for accessing web pages: a crawler that takes dozens of lines elsewhere often takes only a dozen lines of Python. And in the increasingly severe anti-crawler environment, knowing how to disguise your crawler (User-Agent, Cookies, IP address, and so on) is particularly important; Python libraries encapsulate these details neatly and eliminate most of that code, as the sketch below shows.
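A hedged sketch with the Requests library; the header, cookie, and proxy values are placeholders rather than recommendations from the original text:

    import requests

    # Placeholder values: substitute a realistic User-Agent, real cookies,
    # and a proxy you are actually allowed to use.
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    cookies = {"sessionid": "xxxx"}
    proxies = {"http": "http://127.0.0.1:8080",
               "https": "http://127.0.0.1:8080"}

    resp = requests.get(
        "https://example.com",  # placeholder URL
        headers=headers,
        cookies=cookies,
        proxies=proxies,
        timeout=10,
    )
    print(resp.status_code, len(resp.text))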
1.3.2 Learn the knowledge points a crawler needs: HTTP fundamentals, browser developer tools, and packet capture; installing and using Python third-party libraries such as Scrapy, Requests, and BeautifulSoup; character encodings and conversion between the bytes and str types; capturing content generated dynamically by JavaScript; simulating POST and GET requests; handling headers, cookies, and logins; proxy access; multi-threaded access and asyncio-based asynchrony; regular expressions and XPath; distributed crawler development; and so on. The asyncio point is sketched below.
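As one small illustration of the asyncio point, here is a sketch that assumes the third-party aiohttp library (not named in the original text); the URLs are placeholders:

    import asyncio
    import aiohttp

    async def fetch(session, url):
        """Download one page and return its size."""
        async with session.get(url) as resp:
            body = await resp.text()
            return url, len(body)

    async def main(urls):
        # One shared session; the pages download concurrently, not one by one.
        async with aiohttp.ClientSession() as session:
            results = await asyncio.gather(*(fetch(session, u) for u in urls))
            for url, size in results:
                print(url, size)

    # Placeholder URLs.
    asyncio.run(main(["https://example.com", "https://example.org"]))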
1.3.3 Basic method for learning crawlers. Map out the knowledge system a crawler developer needs, then break it down and work through it piece by piece; it is a good idea to buy a well-known book first, so that you learn the subject systematically. At the beginning, start with the basic libraries; once you understand them, move on to crawling with a framework, because a framework is built on the same foundations but integrates many mature modules, improving crawling efficiency and adding functionality (a minimal framework example follows below). Do plenty of practical exercises and summarize them: how the target website is built, what anti-crawling mechanisms it uses, how to analyze that kind of site, and how to work around its anti-crawling measures.
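As the framework example promised above, here is a minimal Scrapy spider; it targets the public scraping sandbox quotes.toscrape.com, a choice of ours rather than of the original text:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # Adapt the name, start_urls, and CSS selectors to a real target site.
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Yield one item per quote block on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination; Scrapy schedules and deduplicates requests.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, self.parse)

It can be run without a full project from the command line, e.g. scrapy runspider quotes_spider.py -o quotes.json, which shows how much scheduling and output plumbing the framework provides for free.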
2 Why choose Python?
Baidu Zhidao already has many introductions on this topic. Compared with other programming languages, my brief reasons are:
2.1 Python is a scripting language. Because the development and testing workflow of a scripting language differs from that of a compiled language, it can greatly improve programming efficiency. A programmer should master at least one general-purpose scripting language, and Python is currently the most popular one. Similar languages include Ruby, Tcl, and Perl, but Python is called the king of scripting languages.
2.2 Python has a broad community. For nearly any problem you can think of that needs a third-party library, there is almost certainly a Python interface.
2.3 Python has high development efficiency: for the same task, roughly 10 times that of Java and 10 to 20 times that of C++.
2.4 Python is widely used in scientific research, with many packages for big data computing, simulation, and scientific computing. Python comes installed on almost every Linux operating system and by default on most Unix systems, which makes it very convenient to use.
2.5 Python has a rich and powerful standard library. Most system operations and common development tasks can be completed almost without relying on third-party software, and the Python help documentation contains many code samples that can be put to real use with only a few modifications. The sketch below illustrates the point.
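To illustrate, here is a link extractor that uses nothing outside the standard library; the URL is a placeholder:

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect the href of every <a> tag, standard library only."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    with urlopen("https://example.com") as resp:  # placeholder URL
        html = resp.read().decode("utf-8", errors="replace")

    parser = LinkParser()
    parser.feed(html)
    print(parser.links)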