Using BeautifulSoup to Web Scrape
If you're going to spend time crawling the web, one task you might encounter is stripping out visible text content from HTML. If you're working in Python, we can accomplish this using BeautifulSoup.

Nowadays, there are APIs for nearly everything. If you wanted to build an app that told people the current weather in their area, you could find a weather API and use the data from the API to give users the latest forecast.

But what do you do when the website you want to use doesn't have an API? That's where Web Scraping comes in. Web pages are built using HTML to create structured documents, and these documents can be parsed using programming languages to gather the data you want.

Web Scraping with Python and Beautiful Soup

There are two basic steps to web scraping for getting the data you want:

  1. Load the web page (i.e. the HTML) into a string
  2. Parse the HTML string to find the bits you care about

Python provides two very powerful tools for doing both of these tasks. We can use the Requests library to retrieve the web page containing our data, and we can use the awesome Beautiful Soup package for parsing and extracting the data. If you'd like to know a bit more about the Requests library and how it works, check out this post for a bit more depth.

Using Beautiful Soup we can easily select any links, tables, lists or whatever else we require from a page with the library's powerful built-in methods. So let's get started!

HTML basics

Before we get into the web scraping, it's important to understand how HTML is structured so we can appreciate how to extract data from it. The following is a simple example of an HTML page:
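Something like this (the exact markup is illustrative, but it uses the same ingredients described below: headings, a paragraph, a link, and id and class attributes):

<!DOCTYPE html>
<html>
<head>
<title>Example Page</title>
</head>
<body>
<h1 id="page-title">My Heading</h1>
<p class="intro">Some text with a <a href="https://example.com">link</a>.</p>
</body>
</html>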

An HTML document will always start with a type declaration of <!DOCTYPE html> and will be contained between <html> / </html> tags.

The <body> tags wrap around the visible part of a website, which is made up of various combinations of header tags (<h1> to <h6>), paragraphs (<p>), links (<a>) and several others not shown in this example, such as <input> and <table> tags.

HTML tags can also be given attributes, like the id and class attributes in the example above. These attributes can help with styling by identifying specific elements or groups of elements.

If these tags are new to you, it might be worth taking some time to quickly get up to speed with HTML. Codecademy and W3Schools both offer excellent introductions to HTML (and CSS) that will be more than enough for this tutorial.

Analyzing the HTML

Have you ever followed one of those links on your social media to a 'Top 10 films of 2017', only to find it's one of those sites where each listing is on a different page? Part of you wants to find out what they thought was number one; the other part wants to give up waiting for all the ads to load. Well, web scraping can help you with that.

We are going to use this article from CinemaBlend to find out the 10 Greatest Movies of All-Time.

Take a look at the link. It should bring you to a page where you can see that Taxi Driver was ranked 10th in the list. We want to grab this, so the first thing we need to do is look at the page structure. Right click on the page in the link above, and select the Page Source option.

This will bring up the HTML document for the entire page, side-menus and all. Don't be alarmed, I don't expect you to read all that. Instead press Ctrl + F and search for 10. Taxi Driver.

You should find something like this:
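(Reconstructed from the class name we rely on later in the tutorial; the live page may include extra attributes.)

<div class="liststyle">10. Taxi Driver</div>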

This part of the HTML represents the rank and title found underneath the movie image as shown below:

The easiest way to be sure we have the right element is that this search should return only one result, which means we must be looking at the same part of the page.

So the 10th entry in our list is Taxi Driver, but how do we get the other 9 without having to click through every page?

Open the page source again, but this time search for Continued On Next Page. You should find something like this:
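(Again reconstructed from the class names we rely on later; the href holds the URL of the next page in the article.)

<div class="nextpage">
<a class="next-story" href="...">Continued On Next Page</a>
</div>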

This section is rendered as the link we need to click on to see the next entry:

Again, we can tell this is the same element because it is the only result in the whole page source that should match.


Believe it or not, with just those two HTML segments we can create a Python script that will get us all the results from the article.

Scraping the HTML

Before we can write our scraping script, we need to install the necessary packages. Type the following into the console:

pip install requests
pip install beautifulsoup4

Now we can write our web scraper. Create a script called scraper.py and open it in your development environment. We'll start by importing Requests and BeautifulSoup:
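import requests
from bs4 import BeautifulSoup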

Let's use the Requests library to grab the page holding the 10. Taxi Driver entry and store it in a variable called page. We'll also create a variable called results, which will store the film rankings in a list for us:
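page = requests.get('https://www.cinemablend.com/new/10-Greatest-Movies-All-Time-According-Actors-73867.html')
results = []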

Do you remember when we looked at the HTML for the web article using page source? Essentially, we now have that page's HTML stored in our variable, and we're going to use BeautifulSoup to parse through the response to find the data we care about.

The next step is to feed page into BeautifulSoup:
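(html.parser is Python's built-in parser; any parser BeautifulSoup supports would work here.)

soup = BeautifulSoup(page.text, 'html.parser')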

Now we can use the BeautifulSoup built-in methods to extract the film and its ranking from the snippet we examined earlier.

To do this, we can use CSS selector syntax. In CSS, selectors are used to select elements for styling. Notice how the div element has a class of liststyle? We can use this to select the div tag, since a div tag with this exact class only appears once on the page.

Note: Usually, class attributes aren't unique and are used to style multiple elements in a similar way. If you want to guarantee uniqueness, try to use an id attribute.
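element = soup.select('div.liststyle')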

Here, we have used the BeautifulSoup select method to grab the div element we want. The select method returns a list containing any matching elements. In our case, element returns: [<div>10. Taxi Driver</div>].

Since our list only contains one item, we get the element with index 0. We then use the BeautifulSoup get_text method to return just the text inside the div element, which will give us '10. Taxi Driver'.

Finally, let's append the result to our results list:
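results.append(element[0].get_text())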

Crawling the HTML

Another key part of web scraping is crawling. In fact, the terms web scraper and web crawler are used almost interchangeably; however, they are subtly different. A web crawler gets web pages, whereas a web scraper extracts data from web pages. The two are often used together, since usually when you crawl some web pages you also want to get some data from them, hence the confusion.

In order for us to determine the other 9 rankings in the article, we will need to crawl through the web pages to find them. To do that, we are going to use the snippet we discovered before:
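<div class="nextpage">
<a class="next-story" href="...">Continued On Next Page</a>
</div>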

An <a> tag represents a link, and the destination for that link when clicking on it is held by the href attribute. We want to pass the value held by the href attribute to the Requests library, just like we did for the first page. We can do that with the following:
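next_page = soup.select('div.nextpage a.next-story')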

Here we have selected any a tag that has the class next-story and is within a parent div element that itself has a class of nextpage. This will return just a single result, since a link matching these criteria occurs just once on the page, for our Continued On Next Page link.

We can then get the value of the href attribute by calling the get method on the a tag and storing it in a variable called url.
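In code, that looks like:

url = next_page[0].get('href')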

The next step would be to pass the url variable into the Requests library's get method like we did at the beginning, but in order to do that we are going to need to refactor our code slightly to avoid repeating ourselves.


Refactoring the Scraper

Right now, our scraper successfully grabs our chosen page and extracts the movie title and ranking, but to do the same for the remaining pages we need to repeat the process without just duplicating our code. To do this we are going to use recursion.

Recursion involves coding a function that calls itself one or more times, something that Python is able to take advantage of very easily. Here is our scraper refactored as a recursive function:
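(A sketch of the refactor that matches the walkthrough below; defaulting results to None and creating the list inside the function avoids Python's mutable-default-argument pitfall.)

def scraper(url, results=None):
    # On the first call, start with an empty results list
    if results is None:
        results = []

    # Grab the page and extract the ranking and title
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    element = soup.select('div.liststyle')
    results.append(element[0].get_text())

    # If there is a 'Continued On Next Page' link, follow it
    next_page = soup.select('div.nextpage a.next-story')
    if next_page:
        return scraper(next_page[0].get('href'), results)

    # Last page reached: no next link, so return everything we collected
    return results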

Let's go through each section of the code and see what is happening.

The scraper function takes two arguments. The first, url, is the URL of the page you want to extract information from, which gets passed into requests.


The second argument results is optional but is key to the operation of our recursive function. When the function is first called, it should be called as follows:

scraper('https://www.cinemablend.com/new/10-Greatest-Movies-All-Time-According-Actors-73867.html')

The results parameter is not provided, and thus is set to an empty list. The function then grabs the page and extracts the information from it, appending it to the results list.

The next vital part of our recursive function lies here:
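next_page = soup.select('div.nextpage a.next-story')
if next_page:
    return scraper(next_page[0].get('href'), results)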

If we find a link on the page matching the CSS selector div.nextpage a.next-story, then we will call the scraper function again, this time with the href of the link to the next page AND the results list we have generated so far. This means when scraper runs for any subsequent calls, the results parameter will not be empty and instead we will continue to append new results to it.

When the scraper reaches the last page of the article (i.e. the movie ranked number one), there will be no link matching the CSS selector and our recursive function will return the final results list.

Note: Take care when using recursion. If you don't create a condition that will eventually end the function calls, a recursive function will keep calling itself until Python raises a RecursionError. Python limits recursion depth to prevent an issue known as stack overflow.

A complete working script could look something like this:
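import requests
from bs4 import BeautifulSoup

def scraper(url, results=None):
    if results is None:
        results = []
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    element = soup.select('div.liststyle')
    results.append(element[0].get_text())
    next_page = soup.select('div.nextpage a.next-story')
    if next_page:
        return scraper(next_page[0].get('href'), results)
    return results

rankings = scraper('https://www.cinemablend.com/new/10-Greatest-Movies-All-Time-According-Actors-73867.html')
for entry in rankings:
    print(entry)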

Scraper limitations

So now you've seen how easily you can extract information from a web page, why wouldn't you use it all the time? Well, sadly, there are downsides.

For starters, web scraping can be slower at obtaining information than an equivalent API, and some sites don't like you scraping information from their pages, so you need to check their policies to see if it's okay.

But perhaps the most significant drawback is changes to the HTML page structure. One of the advantages of APIs is that they are designed with developers in mind, and are therefore less likely to change how they work. Web pages, on the other hand, can change quite dramatically. If the web page author decides to change the class names of their elements, such as the nextpage and next-story classes we used in our CSS selectors, our scraper will break. This can be frustrating if a website updates regularly.

That being said, web sites have improved their structures a lot over the years with the popularity of many easy-to-use frameworks, which means pages are unlikely to change too much over time.

Summary

Hopefully you've seen enough that you can now use web scraping confidently in your own projects. The advantage of web scraping is that what you see is what you get, and if you know the information you are after, you don't need to dig around trying to figure out an API to get it. Just code a simple scraper and it's yours!


Web Scraping Python BeautifulSoup Tutorial with Example

The data on web pages is often unstructured, and web scraping helps to collect that data and store it. There are two main ways of scraping websites and online services. The first is to use the API of the website; for example, Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook. The second is to access the HTML of the webpage directly and extract useful data from it. This technique is called web scraping, web harvesting, or web data extraction.

Steps involved in web scraping with Python and BeautifulSoup:

  1. Send a request to the URL of the webpage you want to access.
  2. The server will respond to the request by returning the HTML content of the webpage.
  3. Once we have the HTML content, we are left with the task of parsing the data.
  4. To do that, we need to navigate and search the parse tree that BeautifulSoup creates.

Installing the required third-party libraries:

The easiest way to install a library in Python is with pip, which is used to install and manage packages:

pip install requests
pip install html5lib
pip install bs4

Then access the HTML content of the webpage:

import requests
URL = 'http://www.geeksforgeeks.org/data-structures/'
r = requests.get(URL)
print(r.content)

  1. The first step is to import the requests library and specify the URL of the webpage you want to scrape.
  2. Then send an HTTP request to the URL and save the response from the server in a response object called r.
  3. Finally, print r.content to get the raw HTML content of the webpage.

Parsing the HTML content:

import requests
from bs4 import BeautifulSoup
URL = 'http://www.values.com/inspirational-quotes'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

BeautifulSoup is built on top of HTML parsing libraries such as html.parser and lxml, and you specify the parser library when creating the object. In the example above, soup = BeautifulSoup(r.content, 'html5lib') creates a BeautifulSoup object by passing two arguments:

r.content: the raw HTML content.
html5lib: the parser library we want to use.


Libraries used for web scraping with Python and BeautifulSoup:

We will use the following libraries:

  1. Selenium: a web testing library, used to automate browser activities.
  2. BeautifulSoup: a Python package for parsing HTML and XML documents; it creates parse trees that make it easy to extract data.
  3. Pandas: a library for data manipulation and analysis, used here to store the extracted data in the desired format.

There is a lot of information on the Web, and new information is added all the time. If you need large amounts of data from websites that are regularly updated with new content, gathering it manually means a lot of time spent clicking, scrolling, and searching, with plenty of repetition. Automated web scraping speeds up the data collection process: you write your code once and it will fetch the information you want many times, from many pages. Web scraping is the process of gathering information from the Internet and presenting it in a format you can easily make sense of, and Python's Beautiful Soup and Requests libraries are both powerful tools for the job. To follow the hands-on examples, you need only a basic understanding of Python and HTML.

HTML tags:

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>First Scraping</h1>
<p>Hello World</p>
</body>
</html>

1. <!DOCTYPE html>: starts the document with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The script and meta declarations of the HTML document sit between <head> and </head>.
4. The visible part of the HTML document is between the <body> and </body> tags.
5. Headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.

Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table columns.
HTML tags sometimes come with id or class attributes. The id attribute specifies a unique id for an HTML tag, and the value must be unique within the HTML document. The class attribute is used to define equal styles for tags with the same class. We can make use of these ids and classes to help us locate the data we want.

The rules for scraping:

You should check a website's Terms and Conditions before you scrape it, and be careful to read the statements about the legal use of data; often the data should not be used for commercial purposes. Do not request data from the website too aggressively with your program, as this may break the website. The layout may change from time to time, so make sure to revisit the site and rewrite your code as needed.

Scraping the Flipkart website:

Find the URL that you want to scrape.
We are going to scrape the Flipkart website to extract the Price, Name, and Rating of laptops.
The URL for this page is https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.

Inspecting the page:
The data is usually nested in tags, so we inspect the page to see in which tag the data we want to scrape is nested.
To inspect the page, just right click on the element and click on "Inspect".

A "Browser Inspector Box" will open.
Find the data you want to extract.
We will extract the Price, Name, and Rating, which are each nested in a "div" tag.

Web scraping with Python and BeautifulSoup example:

Import the libraries:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

Configuration:

driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver')
products = []
prices = []
ratings = []
driver.get('https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2')

The code is as follows:

content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)

Run the code and extract the data:

To run the code, use the below command:

python web-s.py

Store the data in the required format:

df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')

APIs: an alternative to web scraping:

The Web has grown out of many sources and combines a ton of different technologies, styles, and personalities. Some providers offer APIs (application programming interfaces) that allow you to access their data in a predefined manner, so you can avoid parsing HTML and instead access the data directly in a machine-readable format; HTML is primarily a way to visually present content to users. APIs are made to be consumed by programs rather than by human eyes, and the process is generally more stable than gathering the data through web scraping.
Scraping the Monster job site:

We will build a web scraper that fetches Software Developer job listings from the Monster job aggregator site. The web scraper will parse the HTML to pick out the pieces of information we want and filter the content for specific words.

Inspect your data source:
Click through the site and interact with it just like any normal user would. For example, you could search for Software Developer jobs in Australia using the site's native search interface:

Query parameters generally consist of three things:

  1. Start: the query parameters are denoted by a question mark (?).
  2. Information: the pieces of information constituting one query parameter are encoded as a key-value pair, where related keys and values are joined together by an equals sign (key=value).
  3. Separator: every URL can have multiple query parameters, which are separated from each other by an ampersand (&).
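For example, in the Monster search URL used later in this tutorial, q=software-developer and where=Australia are two key-value query parameters, joined by an ampersand:

https://www.monster.com/jobs/search/?q=software-developer&where=Australia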

Hidden websites:
Some information is hidden behind a login, and you need to be signed in to see it on the page. An HTTP request from a Python script is different from accessing a page in the browser, so some advanced techniques are needed with the request to access content behind a login.

Dynamic websites:
Static websites are easy to work with because the server sends you an HTML page that already contains all the information as a response. You can then parse the HTML response with Beautiful Soup and begin to pick out the relevant data. With a dynamic website, the server might not send back HTML at all; instead, you may receive JavaScript code as a response.

Parse HTML code with Beautiful Soup:

pip3 install beautifulsoup4

After installing it, import the library and create a Beautiful Soup object:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.monster.com/jobs/search/?q=software-developer&where=Australia'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

Find the URL you want to scrape:

Suppose you want to scrape the web for speeches by famous politicians, then scrape the text of each speech and analyze it for how often they approach certain topics or use certain phrases. Before you try to start scraping a site, you should check the rules of the website first. The rules can be found in the robots.txt file, which is located by adding a /robots.txt path to the main domain of the site.
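For example, a couple of lines of Python will show you a site's rules (using example.com as a stand-in domain):

import requests
print(requests.get('https://www.example.com/robots.txt').text)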

Identify the structure of the site's HTML:

After finding a site to scrape, use Chrome's developer tools to inspect the site's HTML structure.
This is important because most of the time you will want to scrape data from certain HTML elements, or elements with specific classes or IDs.
Using the inspect tool, you can identify which elements you need to target.

Install Beautiful Soup and Requests:


There are other packages and frameworks, like Scrapy, but Beautiful Soup is all you need to parse the HTML.
Alongside Beautiful Soup, we need to install the Requests library, which will fetch the URL content.
The Beautiful Soup documentation has a lot of examples to help get you started as well.

$ pip install requests
$ pip install beautifulsoup4

Web scraping code:
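(A minimal sketch, reusing the soup object from the Monster example above:)

paragraphs = soup.find_all('p')      # all of the <p> elements
text = [p.text for p in paragraphs]  # just the text inside each one
print(text)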


This finds all of the <p> elements in the HTML.
Using .text selects only the text from inside all the <p> elements.


The raw result is messy, so filtering it down with Beautiful Soup's text gives us a cleaner return.
There are other ways to search, filter, and isolate the results you want from the HTML.
You can also be more specific, finding an element with a specific class:
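soup.find_all('div', class_='cool_paragraph')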


This would find all the <div> elements with the class "cool_paragraph".