Python Web Scraping Tutorial

28/12/2020
The web is a major source of data, and as it grows daily, one can only expect the amount of data on it to increase. For every techie, the ability to get at this information is highly valuable. From system administrators to database administrators to data scientists and software developers, being able to scrape the web gives you an edge over others.

That's why in this article we are going to take a look at the process of web scraping using the Python programming language.

To follow along, you need basic knowledge of Python and HTML.

Before we proceed, it is worth noting that web scraping is not the only way of getting data from websites. Top websites such as Google, Spotify, and Twitter actually provide APIs, giving users easy access to their data. However, the number of API calls allowed per day is usually limited, unless you pay for the service.

Beyond that, there are lots of other websites that do not offer an API at all, so we may be left with little or no choice but to do the scraping ourselves.

Before proceeding, it is important to know that some sites object to having their information scraped without permission, so be careful which websites you decide to scrape.

In this tutorial, we will be making use of the requests library as well as the BeautifulSoup library. There are other library choices for web scraping in Python besides BeautifulSoup, such as Selenium (which is preferred for Quality Assurance testing of websites), Scrapy, Mechanize, and a host of others. There is also urllib, which can serve the same purpose as the requests library.

So let's get started.

To install the requests library, you can use the pip command:

pip install requests

That should install the requests library.
Then to install the BeautifulSoup library, you can use the pip command as well:

pip install beautifulsoup4

That's it. Our libraries are ready for making our soup... a soup of data, more like. Let's take a look at what we are going to use the requests library for. We actually don't need it for much: all we will use it for is loading the contents of the desired webpage.

In every script you write, you would import the requests and BeautifulSoup libraries using:

import requests
from bs4 import BeautifulSoup as bs

What this does is add the name requests to the namespace, so that Python understands what requests refers to when you use it. It does the same for BeautifulSoup, except that the "as bs" part lets us refer to it by the shorter name bs.

webpage = requests.get(url)

What the code above does is fetch the webpage at url and assign the response to the variable webpage; url can be a string literal or a variable holding a string.

webcontent = webpage.content

The content attribute extracts the raw content from the webpage response, and we assign it to the variable webcontent. There is more going on under the hood, but let's keep it at this for the sake of simplicity.

That's all we will be doing with the requests library. Next, we simply convert the response content into a BeautifulSoup object.

htmlcontent = bs(webcontent, "html.parser")

What this does is parse the response content into a structure of readable HTML. Now we can begin scraping from the HTML using the methods BeautifulSoup provides.
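The fetch-and-parse flow so far can be sketched as follows. A small inline HTML string stands in for webpage.content here, so the sketch runs without a network connection; the commented-out URL is a placeholder assumption, not a real target.

```python
from bs4 import BeautifulSoup as bs

# Inline HTML standing in for webpage.content, so no network is needed.
webcontent = "<html><body><div class='Tech_head'>Technology</div></body></html>"

# With a live page it would instead be (URL is a placeholder):
#   import requests
#   webcontent = requests.get("https://example.com").content

# Parse the raw content into a navigable BeautifulSoup object.
htmlcontent = bs(webcontent, "html.parser")
print(htmlcontent.find("div").text)  # Technology
```

Separating the fetch step from the parse step like this also makes it easy to test the parsing logic on saved HTML.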

In order to understand things better, let's work with this snippet of HTML code.

<div class="Tech_head">Technology</div>
<div class="content">
<h1>XYZ praised after inspirational comment</h1>
<img src="xyzlady.jpg" alt="lady" align="right" />
ABC spoke emotionally about the incident after her room was cancelled
"It’s why we have DEF," said host who must now attend a course on entrepreneurial studies.
13 July 1997
From the section Technology comments
Related content
Video
ABC account hijackers burgle homes
DEF time-cap ‘unworkable in London’
Video
‘I can’t find a home because of ABC’

</div>

BeautifulSoup lets us access the content of this HTML snippet by calling different methods on the variable htmlcontent.

htmlcontent.find("div")

The code above looks for tags with the name div; if there is more than one tag with that name, it simply returns the first one it comes across.

So this should return:

<div class="Tech_head">Technology</div>

What if we want to get all the tags with the name div and save them in a list, so we can pick the one we need? That is what the find_all() method is for.

htmlcontent.find_all("div")

After running this piece of code, we would have:

[<div class="Tech_head">Technology</div>,
<div class="content">
<h1>XYZ praised after inspirational comment</h1>
<img src="xyzlady.jpg" alt="lady" align="right"/>
ABC spoke emotionally about the incident after her room was cancelled
"It’s why we have DEF," said host who must now attend a course on entrepreneurial studies.
13 July 1997
From the section Technology comments
Related content
Video
ABC account hijackers burgle homes
DEF time-cap ‘unworkable in London’
Video
‘I can’t find a home because of ABC’
</div>]

Now, to get any one of the <div> tags, we can simply index the list and extract the one we need. But what if we want to pick a <div> tag by its attributes? For instance, say we want <div class="Tech_head">, that is, <div> tags whose class attribute has the value "Tech_head".

The following code would do:

htmlcontent.find("div", attrs={"class": "Tech_head"})

This would get us the <div class="Tech_head"> tag. If we need only the contents of the tag, without the tag itself in the result, we access its .text attribute. So:

htmlcontent.find("div", attrs={"class": "Tech_head"}).text

Would return:
Technology
Instead of:

<div class="Tech_head">
Technology
</div>
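Run against the snippet above (rebuilt here as an inline string), indexing, attribute filtering, and .text look like this:

```python
from bs4 import BeautifulSoup as bs

# The article's snippet, rebuilt as an inline string.
snippet = """
<div class="Tech_head">Technology</div>
<div class="content">
<h1>XYZ praised after inspirational comment</h1>
<img src="xyzlady.jpg" alt="lady"/>
</div>
"""

htmlcontent = bs(snippet, "html.parser")

# find_all() returns a list-like ResultSet; pick tags by index.
divs = htmlcontent.find_all("div")
print(divs[0])    # <div class="Tech_head">Technology</div>
print(len(divs))  # 2

# Filtering by attribute, then taking .text, drops the tag itself.
tech = htmlcontent.find("div", attrs={"class": "Tech_head"})
print(tech.text)  # Technology
```

Note that the class names and text here come straight from the snippet above; on a real page you would inspect the HTML first to find the attributes worth filtering on.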

The last thing we will look at is extracting the value of an attribute in a tag. Looking at the snippet, you would see this tag:

<img src="xyzlady.jpg" alt="lady" align="right">

We could very easily extract the value of the src attribute with the following code:

htmlcontent.find("img")["src"]

This would return the following answer:

"xyzlady.jpg"

In cases where there are lots of <img> tags, and we want to get one by the attribute, we can use the attrs parameter as shown earlier.
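For example, with two images on a page, a plain find("img") would return only the first; filtering by the alt attribute pins down the one we want (the inline HTML and file names here are made up for illustration):

```python
from bs4 import BeautifulSoup as bs

# Two images, so a plain find("img") would return the first one.
snippet = '<img src="logo.png" alt="logo"/><img src="xyzlady.jpg" alt="lady"/>'
htmlcontent = bs(snippet, "html.parser")

# Narrow the search down with attrs, then read the src attribute.
lady = htmlcontent.find("img", attrs={"alt": "lady"})
print(lady["src"])  # xyzlady.jpg
```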

There you have it: find() and find_all() will be the most useful methods for you when it comes to web scraping with BeautifulSoup. There are, however, some other tricks you will discover as you tackle more difficult websites.
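Putting the pieces together, a minimal scraper can wrap the parsing in a function that takes raw HTML, so it can be exercised without a network call; the URL in the comment is a placeholder assumption, and extract_headlines is a name made up for this sketch.

```python
from bs4 import BeautifulSoup as bs

def extract_headlines(content):
    """Return the text of every <h1> tag in the given HTML content."""
    soup = bs(content, "html.parser")
    return [h1.text for h1 in soup.find_all("h1")]

# Live usage would fetch the page first (URL is a placeholder):
#   import requests
#   print(extract_headlines(requests.get("https://example.com/news").content))

# Offline check against the snippet from this article:
sample = "<div class='content'><h1>XYZ praised after inspirational comment</h1></div>"
print(extract_headlines(sample))  # ['XYZ praised after inspirational comment']
```

Keeping the parsing in its own function like this makes the scraper easy to re-test whenever the target site changes its markup.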
