Why is Python called a crawler?
"Crawler" generally refers to a program that fetches network resources. Because Python is a scripting language that is easy to configure, flexible at handling strings, and rich in web-scraping modules, the two are often mentioned together. A simple crawler can be written with nothing more than Python's built-in urllib library; a search engine written in Python is a complex crawler. From this you can see what a "Python crawler" is: a way of fetching network resources through Python programming. Python itself is not a crawler.
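As a minimal sketch of the idea, here is a tiny fetcher built only on the standard library's urllib, as the paragraph describes. The function name and the example URL are illustrative, not from the original text.

```python
# Minimal crawler sketch using only the standard library (urllib).
from urllib.request import urlopen

def fetch(url: str, timeout: float = 10.0) -> str:
    """Download a resource and return its body decoded as text."""
    with urlopen(url, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server declares no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

# Example (requires network access):
#   html = fetch("https://example.com/")
```

A full crawler would add link extraction and a queue of pages to visit, but the core "grab a resource" step really is this short.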

Why is Python well suited to writing crawlers?

1) Fetching the web page itself

Compared with static compiled languages such as Java, C#, and C++, Python offers a simpler interface for fetching web documents; compared with other scripting languages such as Perl and shell, Python's urllib2 package (urllib.request in Python 3) provides a fairly complete API for accessing web documents. (Of course, Ruby is also a good choice.)

In addition, crawling a page sometimes requires imitating a browser, because many websites block crude crawlers. You then need to simulate user-agent behavior to construct appropriate requests, for example simulating a login, or storing and sending session cookies. Excellent third-party Python packages such as Requests and mechanize can help with this.
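The same browser-imitating tricks can be sketched with only the standard library: a browser-like User-Agent header plus a cookie jar that persists session cookies across requests. (Requests and mechanize wrap these ideas in a friendlier API; the function name and User-Agent string below are illustrative.)

```python
# Simulating browser behavior with the standard library:
# a custom User-Agent header and a cookie jar for session cookies.
import http.cookiejar
import urllib.request

def make_browser_opener(user_agent: str):
    """Return an opener that sends a browser-like User-Agent and keeps cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)  # stores/sends cookies automatically
    )
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar

opener, jar = make_browser_opener("Mozilla/5.0 (compatible; demo-crawler)")
# Example (requires network access):
#   opener.open("https://example.com/login")
```

Every request made through this opener carries the chosen User-Agent, and any Set-Cookie headers from the server are replayed on later requests, which is exactly the session behavior a login flow needs.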

2) Processing the fetched pages

Fetched web pages usually need further processing, such as filtering out HTML tags and extracting the text. Python's Beautiful Soup provides concise document-processing functions that accomplish most such tasks in very little code.
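Beautiful Soup makes tag filtering a one-liner; to keep this sketch self-contained, here is the same idea using only the standard library's html.parser. The class and function names are illustrative.

```python
# Stripping HTML tags to recover plain text, using the stdlib html.parser.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

def strip_tags(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text()

print(strip_tags("<p>Hello, <b>crawler</b> world</p>"))
# Prints: Hello, crawler world
```

With Beautiful Soup the equivalent is roughly `BeautifulSoup(html, "html.parser").get_text()`, plus far more convenient tag navigation and searching.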

In fact, many languages and tools can do all of the above, but Python does it fastest and most cleanly. Life is short, use Python.