BeautifulSoup Scraping Example: A How-To Guide on Web Scraping

Scraping is the process of extracting and structuring data for later retrieval and analysis.

It automates the process of copying and pasting selected sections of a page, and is often used to collect data such as phone numbers and emails. Scraping can also be used to find the best price for an item among different retailers. Ever heard of a website that promises to find your ideal hotel at the best rate? Well, I guess you can thank web scrapers for that.

 

Lately, thanks to the large amount of data available on the web, scraping has allowed us to build complex regression systems that estimate the value of almost anything that is for sale. It is often a required skill for data scientists.

 

In this article, we'll interact with the web to extract useful information from a site.

 

Before beginning to scrape, it is important to know the basics. How is it done?

Web pages are rendered by the browser from HTML and CSS code, but much of that markup is not interesting when scraping a site and actually makes data extraction harder.

 

BeautifulSoup allows us to easily access the information that we need by providing idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

 

 

If you don't have BeautifulSoup installed, don't worry about it! Just follow the tutorial and we'll get you there step by step.

 

For Windows and Linux users, try installing with pip.

 

pip install beautifulsoup4

 

You can also download the source tarball from https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/, unpack it, cd into the extracted directory, and run:

 

python setup.py install

 

If the steps above fail and you are on a Debian- or Ubuntu-based Linux system, try installing Beautiful Soup with the system package manager instead. Just type on the command line:

 

apt-get install python-bs4
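
A quick way to check that the installation worked is to import the package and print its version. This is only a minimal sketch; the version number you see depends on what pip or the package manager installed.

```python
# Sanity check: this import raises ImportError if the install did not work.
import bs4

# The exact version string depends on your installation.
print(bs4.__version__)
```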

 

 

Now let's examine the site that we are going to scrape. Please visit http://quotes.toscrape.com/page/1/. You can see a list of quotes with their corresponding authors and some tags. This website was specifically designed to be scraped, so it is a good starting point for beginners. Let's start by extracting the title!

 

 

We begin by reading the source code of a web page and creating a Beautiful Soup object with the BeautifulSoup constructor. The listings in this tutorial use Python 2 syntax. It is also a good idea to read the documentation for the two libraries involved, urllib and BeautifulSoup, to fully understand what's going on behind the code.

 

from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('http://quotes.toscrape.com/page/1/').read()
soup = BeautifulSoup(r)
print type(soup)

Output: <class 'bs4.BeautifulSoup'>

 

Lines 1 and 2 import the packages that we'll need to extract the data.

Line 3 introduces the urllib.urlopen() function. It takes the URL as a string and lets us download the whole HTML of the page.

Line 4 creates our BeautifulSoup object, which represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in the "Navigating the tree" and "Searching the tree" sections of the Beautiful Soup documentation.
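
The listing above is written for Python 2 (note the print statement and urllib.urlopen). For readers on Python 3, a rough equivalent might look like the sketch below. The helper names fetch_html and make_soup are invented for this example, and passing "html.parser" is my own choice: it selects Python's built-in parser and avoids the warning bs4 emits when no parser is named. The sketch parses a small inline sample so that it runs without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def make_soup(html):
    """Parse HTML (bytes or str) with Python's built-in parser (hypothetical helper)."""
    return BeautifulSoup(html, "html.parser")


# A tiny hand-written stand-in for the real page.
sample = b"<html><head><title>Quotes to Scrape</title></head><body></body></html>"
soup = make_soup(sample)
print(type(soup))

# To parse the live page instead:
# soup = make_soup(fetch_html("http://quotes.toscrape.com/page/1/"))
```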

 

The soup object contains all of the HTML in the original document. We can pretty-print the first 1000 characters with:

 

print soup.prettify()[0:1000]

 

 

If BeautifulSoup did not exist, we would probably need to locate the title by its character positions. The code for that would be something like this:

 

>>> print(soup.prettify()[69:110])
<title>
   Quotes to Scrape
</title>

 

We had to hard-code the character positions 69 to 110 to get the title.

For almost any real application, this is far too time-consuming.

This might not sound that bad at first, but imagine that you needed to scrape 1000 pages and that the character positions were different on every page. Automating even this easy task could take days.

 

 

 

However, BeautifulSoup makes it easy to get the data. We just need to use:

>>> print(soup.title)
<title>Quotes to Scrape</title>

 

What if we wanted just the text, without the surrounding title tags? We can get only what's inside with:

>>> print(soup.title.string)
Quotes to Scrape
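
To make the contrast with hard-coded character positions concrete, here is a small sketch in Python 3 syntax. The two HTML strings, page_a and page_b, are invented for the illustration and are not taken from the site. The offsets that isolate the title in one page point at unrelated markup in the other, while looking the tag up by name works for both.

```python
from bs4 import BeautifulSoup

# Two made-up pages with the same title but different markup before it.
page_a = "<html><head><title>Quotes to Scrape</title></head></html>"
page_b = (
    '<html lang="en"><head><meta charset="utf-8">'
    "<title>Quotes to Scrape</title></head></html>"
)

# Offsets computed for page_a only.
start = page_a.index("<title>")
end = page_a.index("</title>") + len("</title>")

slice_a = page_a[start:end]  # the title element of page_a
slice_b = page_b[start:end]  # an unrelated fragment of page_b
print(slice_a)
print(slice_b)

# Looking the tag up by name does not depend on its position.
titles = [BeautifulSoup(page, "html.parser").title.string for page in (page_a, page_b)]
print(titles)
```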

 

Let's suppose that you now want to extract all the links on this page.

You can do that pretty easily by looking for the <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))

 

Now we’ll get something like this:
/tag/value/page/1/
/author/Andre-Gide
/tag/life/page/1/
/tag/love/page/1/
/author/Thomas-A-Edison
/tag/edison/page/1/
/tag/failure/page/1/
/tag/inspirational/page/1/
/tag/paraphrased/page/1/
/author/Eleanor-Roosevelt
/tag/misattributed-eleanor-roosevelt/page/1/
/author/Steve-Martin
/tag/humor/page/1/
/tag/obvious/page/1/
/tag/simile/page/1/
/page/2/
/tag/love/
/tag/inspirational/
/tag/life/
/tag/humor/
/tag/books/
/tag/reading/
/tag/friendship/
/tag/friends/
/tag/truth/
/tag/simile/
https://www.goodreads.com/quotes
https://scrapinghub.com

 

 

Okay, this looks weird now. The last two URLs look good, but what about the rest?

Websites often use relative URLs in href attributes.

A relative URL, as its name implies, only makes sense relative to the page it appears on. This means that if your current URL is http://quotes.toscrape.com/page/1/ and the code above outputs /tag/truth/, the full URL is http://quotes.toscrape.com/tag/truth/.

 

How can we get all our urls displayed correctly then?

 

You could try saving the site's base URL in a variable and prepending it to every result that does not start with http. But that only works when every relative link starts from the site root; links written relative to the current page (such as ../2/) would come out wrong. Luckily, someone has already solved this problem, so there's no need to reinvent the wheel. For this part we are going to import another module from the standard library, called urlparse (in Python 3 it was renamed urllib.parse).

 

According to its documentation:

“This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL.”

 

This is exactly what we need!

We need to import a function called urljoin. It takes the base URL ('http://quotes.toscrape.com/page/1/') and a relative URL (for example, /tag/simile/) as parameters and constructs a full ("absolute") URL.
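
Before the full script, here is a short sketch of how urljoin behaves on its own. It uses the Python 3 import path; in Python 2 the same function comes from urlparse. The "../2/" link is a hypothetical example added to show a page-relative path, and is not one of the links on the site.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://quotes.toscrape.com/page/1/"

hrefs = [
    "/tag/truth/",                       # root-relative link, as on the site
    "https://www.goodreads.com/quotes",  # already absolute, left unchanged
    "../2/",                             # hypothetical page-relative link
]

resolved = [urljoin(base, href) for href in hrefs]
for url in resolved:
    print(url)
# http://quotes.toscrape.com/tag/truth/
# https://www.goodreads.com/quotes
# http://quotes.toscrape.com/page/2/
```

Because absolute URLs pass through unchanged, there is no need to check whether a link starts with http before calling urljoin.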

 

 

from bs4 import BeautifulSoup
import urllib
from urlparse import urljoin
r = urllib.urlopen('http://quotes.toscrape.com/page/1/').read()
soup = BeautifulSoup(r)
for link in soup.find_all('a'):
    print(urljoin('http://quotes.toscrape.com/page/1/', link.get('href')))

 

And now our output becomes:

http://quotes.toscrape.com/author/Marilyn-Monroe
http://quotes.toscrape.com/tag/be-yourself/page/1/
http://quotes.toscrape.com/tag/inspirational/page/1/
http://quotes.toscrape.com/author/Albert-Einstein
http://quotes.toscrape.com/tag/adulthood/page/1/
http://quotes.toscrape.com/tag/success/page/1/
http://quotes.toscrape.com/tag/value/page/1/
http://quotes.toscrape.com/author/Andre-Gide
http://quotes.toscrape.com/tag/life/page/1/
http://quotes.toscrape.com/tag/love/page/1/
http://quotes.toscrape.com/author/Thomas-A-Edison
http://quotes.toscrape.com/tag/edison/page/1/
http://quotes.toscrape.com/tag/failure/page/1/
http://quotes.toscrape.com/tag/inspirational/page/1/
http://quotes.toscrape.com/tag/paraphrased/page/1/
http://quotes.toscrape.com/author/Eleanor-Roosevelt
http://quotes.toscrape.com/tag/misattributed-eleanor-roosevelt/page/1/
http://quotes.toscrape.com/author/Steve-Martin
http://quotes.toscrape.com/tag/humor/page/1/
http://quotes.toscrape.com/tag/obvious/page/1/
http://quotes.toscrape.com/tag/simile/page/1/
http://quotes.toscrape.com/page/2/
http://quotes.toscrape.com/tag/love/
http://quotes.toscrape.com/tag/inspirational/
http://quotes.toscrape.com/tag/life/
http://quotes.toscrape.com/tag/humor/
http://quotes.toscrape.com/tag/books/
http://quotes.toscrape.com/tag/reading/
http://quotes.toscrape.com/tag/friendship/
http://quotes.toscrape.com/tag/friends/
http://quotes.toscrape.com/tag/truth/
http://quotes.toscrape.com/tag/simile/
https://www.goodreads.com/quotes
https://scrapinghub.com

 

 

Much better, don't you think?

For the final part of this tutorial, we are going to scrape all the quotes.

First of all, visit the site and right-click any of the quotes.

If you click Inspect, you'll be able to see the HTML of the page.

We can see that each quote is in a span element with the class text.

Let's use soup.find_all('span', class_='text') to get the elements.

 

for quote in soup.find_all('span', class_='text'):
    quote.text

Typed into the interactive interpreter, this will output:

 

u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d'
u'\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d'
u'\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d'
u"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
u'\u201cTry not to become a man of success. Rather become a man of value.\u201d'
u'\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d'
u"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
u"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
u'\u201cA day without sunshine is like, you know, night.\u201d'
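
The same call can be tried offline on a small hand-written snippet. The HTML below is a simplified approximation of the page's markup that I wrote for this sketch, not a copy of it, and the code uses Python 3 syntax. It shows that class_='text' picks out only the quote spans and skips the other span elements.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for two quote blocks on the page.
html = """
<div class="quote">
  <span class="text">\u201cA day without sunshine is like, you know, night.\u201d</span>
  <span>by <small class="author">Steve Martin</small></span>
</div>
<div class="quote">
  <span class="text">\u201cTry not to become a man of success. Rather become a man of value.\u201d</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

all_spans = soup.find_all("span")                  # every span, including the "by ..." ones
quote_spans = soup.find_all("span", class_="text")  # only spans whose class is "text"

quotes = [span.get_text() for span in quote_spans]
for quote in quotes:
    print(quote)
```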

 

 

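This collected all the quotes, but the output contains some strange-looking sequences. \u201c and \u201d are the escape codes for the curly opening and closing quotation marks that surround each quote on the page. The interpreter shows them escaped because it displays the repr of each unicode string; nothing actually failed to parse. If we want plain ASCII text, we can encode each string as ASCII and ignore any character that has no ASCII equivalent.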

 

 

from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('http://quotes.toscrape.com/page/1/').read()
soup = BeautifulSoup(r)
for quote in soup.find_all('span', class_='text'):
    quote.text.encode("ascii", "ignore")

 

This returns:

 

'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'
'It is our choices, Harry, that show what we truly are, far more than our abilities.'
'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.'
'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.'
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
'Try not to become a man of success. Rather become a man of value.'
'It is better to be hated for what you are than to be loved for what you are not.'
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
'A day without sunshine is like, you know, night.'
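
One caveat with encode("ascii", "ignore") is that it drops every character without an ASCII equivalent, not only the curly quotation marks. The sketch below uses Python 3 syntax, and both strings are example values chosen for the illustration. It shows the side effect on an accented name and one possible alternative: stripping only the two quote characters.

```python
quote = "\u201cTry not to become a man of success.\u201d"
author = "Andr\u00e9 Gide"  # example value containing an accented letter

# encode(..., "ignore") silently removes anything that is not ASCII.
ascii_quote = quote.encode("ascii", "ignore").decode("ascii")
ascii_author = author.encode("ascii", "ignore").decode("ascii")
print(ascii_quote)   # Try not to become a man of success.
print(ascii_author)  # Andr Gide

# Alternative: remove only the surrounding curly quotation marks.
stripped_quote = quote.strip("\u201c\u201d")
print(stripped_quote)  # Try not to become a man of success.
```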

 

 

I hope that this article was able to introduce you to BeautifulSoup. Now that you know the basics, you can start building more complex scrapers. You should probably learn how to store your data in a JSON file or a database. A very good follow-up would be learning to use Scrapy, a full crawling framework that handles requests, link following, and data export for you, which makes it better suited than BeautifulSoup alone to large scraping jobs.

