Web Scraping with Python Scrapy Module

Web scraping has become a golden skill today, so let's learn how to pull the data we need out of web pages. In this article, we will talk about the Scrapy Python library, what it can do, and how to use it. Let's get started.

Why Scrapy?

Scrapy is a robust web scraping library that can download web pages, images, and any other data you could think of, at lightning speed. Speed matters in computation, and Scrapy achieves it by visiting websites asynchronously and doing a lot of background work, making the whole task look easy.

It should be said that Python has other libraries that can be used to scrape data from websites, but none is comparable to Scrapy when it comes to efficiency.

Installation

Let's have a quick look at how this powerful library can be installed on your machine.

As with most Python libraries, you can install Scrapy using pip:

pip install Scrapy

You can check if the installation was successful by importing scrapy in Python's interactive shell.

$ python
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux

Type “help”, “copyright”, “credits” or “license” for more information.

>>> import scrapy

Now that we are done with the installation, let's get into the thick of things.

Creating a Web Scraping Project

During installation, the scrapy command was added to your PATH, so you can use it directly from the command line. We will be taking advantage of this throughout our use of the library.

From the directory of your choice run the following command:

scrapy startproject webscraper

This creates a webscraper project directory in the current directory, containing a scrapy.cfg file and a webscraper package with __init__.py, items.py, middlewares.py, pipelines.py, and settings.py files, plus a directory called spiders, as shown below.
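For reference, a freshly generated project typically looks like the tree below (the exact contents may vary slightly between Scrapy versions):

webscraper/
    scrapy.cfg            # deploy configuration file
    webscraper/           # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # the directory where our spiders will live
            __init__.py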

Our spider files, i.e. the scripts that do the web scraping for us, will be stored in the spiders directory.

Writing Our Spider

Before we go ahead and write our spider, we should already know what website we want to scrape. For the purpose of this article, we will scrape a sample web scraping website: http://example.webscraping.com.

This website simply lists country names and their flags across several pages, and we are going to scrape three of those pages. The three pages we will be working on are:

http://example.webscraping.com/places/default/index/0
http://example.webscraping.com/places/default/index/1
http://example.webscraping.com/places/default/index/2

Back to our spider: we are going to create a sample_spider.py file in the spiders directory. From the terminal, a simple touch sample_spider.py command will create the new file.

After creating the file, we would populate it with the following lines of code:

import scrapy
 
class SampleSpider(scrapy.Spider):
  name = "sample"
  start_urls = [
      "http://example.webscraping.com/places/default/index/0",
      "http://example.webscraping.com/places/default/index/1",
      "http://example.webscraping.com/places/default/index/2"
  ]
 
  def parse(self, response):
      # the last path segment of the URL is the page number
      page_number = response.url.split('/')[-1]
      file_name = "page{}.html".format(page_number)
      with open(file_name, 'wb') as file:
          file.write(response.body)

From the top level of the project's directory, run the following command:

scrapy crawl sample

Recall that we gave our SampleSpider class a name attribute of sample.

After running that command, you will notice that three files named page0.html, page1.html, and page2.html have been saved to the directory.

Let's take a look at what happens in the code:

import scrapy

First we import the library into our namespace.

class SampleSpider(scrapy.Spider):
  name = "sample"

Then we create a spider class, which we call SampleSpider. Our spider inherits from scrapy.Spider; all our spiders have to inherit from scrapy.Spider. After creating the class, we give our spider a name attribute. This name attribute is used to summon the spider from the terminal; if you recall, we ran the scrapy crawl sample command to run our code.

start_urls = [
    "http://example.webscraping.com/places/default/index/0",
    "http://example.webscraping.com/places/default/index/1",
    "http://example.webscraping.com/places/default/index/2"
]

We also have a list of URLs for the spider to visit. The list must be called start_urls. If you want to give the list a different name, you have to define a start_requests method instead, which gives you some more capabilities (see the sketch after the next paragraph). To learn more, you can check out the Scrapy documentation.

Regardless, do not forget to include http:// or https:// in your links, or you will have to deal with a missing scheme error.
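As a quick illustration, here is a minimal sketch of how a start_requests method could replace start_urls. The attribute name custom_urls is just an assumption for this example; any name works, since we yield the requests ourselves:

import scrapy

class SampleSpider(scrapy.Spider):
    name = "sample"
    # hypothetical attribute name, not required by Scrapy
    custom_urls = [
        "http://example.webscraping.com/places/default/index/0",
    ]

    def start_requests(self):
        # yield a Request for each URL; Scrapy calls self.parse with each response
        for url in self.custom_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass  # same parsing logic as before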

def parse(self, response):

We then declare a parse method and give it a response parameter. When the code is run, the parse method is invoked and passed a response object, which contains all the information about the visited web page.

page_number = response.url.split('/')[-1]
file_name = "page{}.html".format(page_number)

What we have done here is split the string containing the URL on '/' and keep the last piece, the page number, in a page_number variable. Then we create a file_name variable, inserting the page_number into the string that will become the name of the file we create.
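To see why we take the last element, here is what the split produces for one of our URLs in an interactive session:

>>> url = "http://example.webscraping.com/places/default/index/0"
>>> url.split('/')
['http:', '', 'example.webscraping.com', 'places', 'default', 'index', '0']
>>> url.split('/')[-1]
'0'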

with open(file_name, 'wb') as file:
    file.write(response.body)

We open the file and write the contents of the web page into it using the body attribute of the response object.

We can do more than just save the web page. The BeautifulSoup library can be used to parse response.body. You can check out a BeautifulSoup tutorial if you are not familiar with the library.

From the page to be scraped, here is an excerpt of the HTML containing the data we need:

<div id="results">
<table>
<tr><td><div><a href="/places/default/view/Afghanistan-1">
<img src="/places/static/images/flags/af.png" /> Afghanistan</a></div></td>
<td><div><a href="/places/default/view/Aland-Islands-2">
<img src="/places/static/images/flags/ax.png" /> Aland Islands</a></div></td>
</tr>


</table>
</div>

You'll notice that all of the needed data is enclosed in div tags, so we are going to rewrite the code to parse the HTML.
 
Here's our new script:

import scrapy
from bs4 import BeautifulSoup
 
class SampleSpider(scrapy.Spider):
    name = "sample"
 
    start_urls = [
        "http://example.webscraping.com/places/default/index/0",
        "http://example.webscraping.com/places/default/index/1",
        "http://example.webscraping.com/places/default/index/2"
    ]
 
    def parse(self, response):
        page_number = response.url.split('/')[-1]
        file_name = "page{}.txt".format(page_number)
        with open(file_name, 'w') as file:
            html_content = BeautifulSoup(response.body, "lxml")
            div_tags = html_content.find("div", {"id": "results"})
            country_tags = div_tags.find_all("div")
            country_name_position = zip(range(len(country_tags)), country_tags)
            for position, country_name in country_name_position:
                file.write("country number {} : {}\n".format(position + 1, country_name.text))

The code is pretty much the same as the initial one; however, I have added BeautifulSoup to our namespace and changed the logic in the parse method.

Let's have a quick look at the logic.

def parse(self, response):

Here we have defined the parse function, and given it a response parameter.

page_number = response.url.split('/')[-1]
file_name = "page{}.txt".format(page_number)
with open(file_name, 'w') as file:

This does the same thing as discussed in the initial code; the only difference is that we are working with a text file instead of an HTML file. We will save only the scraped data in the text file, rather than the whole web page content in HTML as we did previously.

html_content = BeautifulSoup(response.body, "lxml")

What we've done in this line is pass response.body as an argument to the BeautifulSoup constructor (using the lxml parser) and assign the result to the html_content variable.
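Note that BeautifulSoup and the lxml parser are separate packages; assuming you do not already have them installed, a typical way to get them is:

pip install beautifulsoup4 lxml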

div_tags = html_content.find("div", {"id": "results"})

Taking the HTML content, we parse it here by searching for a div tag that has an id attribute with results as its value, then we save it in a div_tags variable.

country_tags = div_tags.find_all("div")

Remember that the countries were in div tags as well; now we simply get all of those div tags and save them as a list in the country_tags variable.

country_name_position = zip(range(len(country_tags)), country_tags)

for position, country_name in country_name_position:
    file.write("country number {} : {}\n".format(position + 1, country_name.text))

Here, we iterate over the country tags together with their positions, then write each country's position and name to the text file.
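As a side note, the same pairing of positions and tags can be written more idiomatically with Python's built-in enumerate. A small sketch of the equivalent loop, placed inside the same with block:

# equivalent loop using enumerate, starting the count at 1
for position, country_name in enumerate(country_tags, start=1):
    file.write("country number {} : {}\n".format(position, country_name.text))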

So in your text file, you would have something like:

country number 1 :  Afghanistan
country number 2 :  Aland Islands
country number 3 :  Albania
……..

Conclusion

Scrapy is undoubtedly one of the most powerful libraries out there; it is very fast and essentially downloads the web page for you. It then gives you the freedom to do whatever you wish with the web content.

We should note that Scrapy can do much more than we have covered here. You can parse data with Scrapy's CSS or XPath selectors if you wish, and you can read the documentation if you need to do something more complex.
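For example, here is a minimal sketch of a parse method that extracts the same country names using Scrapy's built-in CSS selectors instead of BeautifulSoup, assuming the same #results markup shown earlier (printing rather than writing to a file to keep it short):

def parse(self, response):
    # select the text of every link inside the results div
    names = response.css("div#results div a::text").getall()
    for position, name in enumerate(names, start=1):
        print("country number {} : {}".format(position, name.strip()))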
