How to get started with Python web crawling?


Recently, I've noticed a lot of people asking how to learn web crawling, so I suspect what everyone is missing is a crawler tutorial.

So here is a simple introductory tutorial. It doesn't go very deep, but it's enough to teach you to crawl the content of a small page.

First, let’s talk about how to learn

First, you need a small goal. For example, my small goal at the time was to crawl Zhihu articles. What's yours?

Second, it's best to have some foundation and understand the general syntax. When I started, I didn't even understand import and from.

Third, learning while doing is very important, because some things are like this: my eyes can see it, but my brain can't keep up. If you learn by doing, you can look things up or search on Baidu whenever you hit a problem; otherwise it gets very uncomfortable.

Fourth, learn to use GitHub. There are many excellent Python libraries there that are very much worth using.

Without further ado, let's get to the point.

But first, a digression: I've actually written a similar introductory article before, on the basics of crawlers. You're welcome to read it.

First of all, for libraries I recommend requests_html, though you can also use requests. These two should be the most popular right now.

Installation is the usual routine:

pip install requests

Let's talk about requests first; installation really is that simple. Then we can pull the library in with a plain import requests. I won't go into the specific retrieval methods here, since Baidu has plenty, but of course I recommend the authoritative docs: the official Requests quickstart.
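To make that concrete, here is a minimal sketch of fetching a page with requests; the URL and headers are just placeholders for illustration:

import requests

# The URL and User-Agent here are placeholders for illustration.
url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}  # many sites reject requests with no UA

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an error on 4xx/5xx responses
response.encoding = response.apparent_encoding  # helps with non-UTF-8 pages

print(response.status_code)  # 200 on success
print(response.text[:200])   # first 200 characters of the HTML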

Generally speaking, these docs are very practical. I used to like reading books, but these days I actually prefer reading documentation.

Now that we know this library and understand how to use it, what's next? Next, you need to learn HTML. Haha, really, I'm not kidding you: what you crawl is, after all, a web page, and web pages are written in HTML + CSS + JavaScript. You don't need to know how to write it; you just need to know what it does.
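To see why a little HTML knowledge pays off, here is a minimal sketch with requests_html (installed via pip install requests-html) that pulls elements out of a page by tag; the URL is a placeholder:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com")  # placeholder URL

# A page is a tree of HTML tags; find() takes a CSS selector.
title = r.html.find("title", first=True)  # the <title> tag
links = r.html.find("a")                  # every <a> (link) tag

print(title.text)
for link in links[:5]:
    print(link.attrs.get("href"))  # the href attribute holds each link's URL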

Then learn the re module (regular expressions), which is genuinely useful for crawlers. For example, in my article on crawling Zhihu Salt Selection, I used a regex to grab the jump link to the next page. If you're just getting started, you don't need to study it deeply; you just need to know how to grab a specified value, and then pick up the rest slowly.
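As a sketch of what "grabbing a specified value" means, suppose we already have a chunk of HTML and want the links out of it; the fragment and pattern below are made up for illustration:

import re

# A made-up HTML fragment for illustration.
html = '<a href="/page/2">next</a> <a href="/page/3">later</a>'

# Capture whatever sits inside href="..." on each tag.
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['/page/2', '/page/3']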

Then learn scrapy for further crawling. That's the general idea. Getting started is actually easy, and crawling simple sites is very easy too, but sites with defenses in place are a different story.

In addition, read other people's code when you have nothing else to do; it helps a lot. If someone writes nonsense, learn not to write like them; if someone writes well, learn from their good ideas and reflect on how they approached the problem. Keep learning, and bravely climb to the peak!
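For a taste of Scrapy: a spider is just a class that declares its start URLs and a parse method. Here is a minimal sketch crawling quotes.toscrape.com, a practice site often used in Scrapy tutorials; the selectors match that site's layout:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the "next page" link, if any, and parse it the same way.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

You can run a single-file spider like this with scrapy runspider quotes_spider.py -o quotes.json, no full project needed.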
