Current location - Education and Training Encyclopedia - Graduation thesis - Seeking python crawler articles
Seeking python crawler articles
The overall progress is as follows:

1. added Cron: used to tell the program to wake up a task every 30 minutes and run to the specified blog to get the latest update.

2. Use google's Datastore to store the content that the crawler crawls down every time. . Only new content is stored. .

As I said last time, the performance has been greatly improved: the crawler is awakened after each request, so it takes about 17 seconds to output from the background to the foreground, and now it is less than 2 seconds.

3. The crawler is optimized.

1.Cron.yaml to arrange the time for each program to wake up.

After browsing the document and asking questions, I finally figured out how Google's cron works-in fact, it's just that Google virtually visited a url… at each designated time, we specify it ourselves …

So, under Django, you don't need to write pure python programs at all. Don't write:

if __name__=="__main__ ":

Just configure a url and put it in views.py:

def updatePostsDB(request):# delete all()SiteInfos =[]SiteInfo = { } SiteInfo[' post site ']= " l2z story " SiteInfo[' feed URL ']= " feed:// l2zstory.wordpress.com/feed/" SiteInfo[' blog _ type ']= " WordPress " SiteInfos . append(SiteInfo)SiteInfo = { } SiteInfo[' post site '] = " YukiLife " SiteInfo[' feed URL ']= " feed://blog . Sina . com . cn/RSS/65839028+0583902except Exception,e:Msg = str(e)return HttpResponse(Msg)

Cron.yaml should be placed at the same level as app.yaml:

Krone:

-Description: Retrieve the latest posts

url: /task_updatePosts/

Timetable: Every 30 minutes.

In url.py, just point to this and point task_updatePostsDB to the url.

The process of debugging this cron is tragic. . . There are many, many people on stackoverflow asking why their cron doesn't work. . . I was sweating at first, too, and I couldn't find my head. . . Finally, it is done by luck, and the general steps are very vague. . But it's simple:

First, make sure there are no grammatical errors in your program ... then you can try to access the url yourself manually. If cron is normal, the task should have been carried out by this time. If it really doesn't work, read the log … ...

2. Configuration and utilization of data storage-using data storage through Django

My demand here is very simple-there is no join…… so I only use the simplest django-helper. ..

This models.py is the key point:

The copy code code is as follows:

Import the basic model from appengine_django.models

Import database from google.appengine.ext

ClassPostsDB (basic model):

link=db。 LinkProperty()

title=db。 StringProperty()

Author =db. StringProperty()

Date =db. Datetime property ()

Description =db. TextProperty()

postSite=db。 StringProperty()

The first two lines are the key points. . . . I was naive not to write the second line at first. . . It took me more than two hours to understand what was going on. . The gain is outweighed by the loss. . .

When reading and writing, don't forget. . . PostDB.put()

At first, in order to save trouble, I just deleted all the data every time cron was awakened, and then rewritten the newly climbed data. . .

Result. . . A day later. . . There are 40 thousand reading and writing records. . . . And there are only 50 thousand free articles every day. . . .

Therefore, change it to check for updates before inserting. Write if you have it, and don't write if you don't. . Finally, this part of the database is done. . .

3. The improvement of reptiles:

At first, reptiles just climbed the things given in the feed. . So, if a blog has 24 * 30 articles. . . You can only get 10 at most. . . .

This time, the improved version can climb all the articles. . I made experiments with lonely blogs of Ling Chuan, Han Han, Yuki and Z respectively. . This is a great success. . . Lonely Lingchuan has more than 720 articles. . . What I completely missed was climbing down. .

import URL lib # from beautiful soup import beautiful soup from PyQuery import PyQuery as pqdef getarticle list(URL):lst articles =[] URL _ prefix = URL[:-6]Cnt = 1 response = urllib . urlopen(URL)html = response . read()d = pq(html)try:page Cnt = d(" ul。 SG_pages”)。 find(' span ')page CNT = int(d(page CNT)。 Text()[ 1:- 1]) except: pagecnt =1for I in range (1,pagecnt+1): URL = URL _ prefix+str (I)+". html " # print URL response = urllib . urlopen(URL)html = response . read()d = pq(html)title _ spans = d("。 atc_title”)。 find('a') date_spans=d('。 atc_tm') for j in range(0,len(title _ spans)):title obj = title _ spans[j]date obj = date _ spans[j]article = { } article[' link ']= d(title obj)。 attr(' href ')article[' title ']= d(title obj)。 text()article[' date ']= d(date obj)。 text()article[' desc ']= getpage Content(article[' link '])lst articles . append(article)return lst articles def getpage Content(URL):# get Page Content response = urllib . URL open(URL)html = response . read()d = pq(html)Page Content = d(" div . artic Content ")。 Text () # print page content return page content def main (): URL ='/s/articlelist _119123 _ 0 _1.html' # Han Han url="/s/ Article list _1225833283 _ 0 _1.html "# Gudulingchuan URL = "/s/article list _ 65438txt", W')f. write (article ['desc']. Encode ('utf-8') # Pay special attention to the handling of Chinese f.close () # printarticle ['desc'] if _ _ name _ =' _ _ main _ _': main ().

PyQuery recommended. .

I'm sorry to say that BueautifulSoup let me down. . . When I was writing an article, there was a small bug ... I couldn't find the reason. . After I got home, I spent a lot of time trying to figure out why BueautifulSoup couldn't catch what I wanted. . . Later, I glanced at the source code of its selector part, and I think it should be it or not.

I gave up this library and tried lxml ... xpath-based is very easy to use. . But I always need to check the documentation about xpath. . . So I found another library, PyQuery… I can use jQuery selector as a tool. . . Very, very useful. . . . See above for specific usage. . . This library has a future. . .

secret worry

Because pyquery is based on lxml… and the bottom layer of lxml is c…, it is estimated that it will not be used on gae. . . I am a reptile, and now I can only climb things on the computer. . . And then pushed to the server. . .