R For Web Scraping

Want to scrape the web with R? You’re at the right place!

We will teach you from the ground up how to scrape the web with R, and will take you through the fundamentals of web scraping (with examples in R).

Throughout this article, we won’t just take you through prominent R libraries like rvest and Rcrawler, but will also walk you through how to scrape information with barebones code.

Overall, here’s what you are going to learn:

  1. R web scraping fundamentals
  2. Handling different web scraping scenarios with R
  3. Leveraging rvest and Rcrawler to carry out web scraping

Let’s start the journey!

Introduction

The first step towards scraping the web with R requires you to understand HTML and web scraping fundamentals. You’ll learn how to get browsers to display the source code, then you will learn the logic of markup languages, which sets you on the path to scraping that information. And, above all, you’ll master the vocabulary you need to scrape data with R.

We will look at the following basics that’ll help you scrape with R:

  1. HTML Basics
  2. Browser presentation
  3. Parsing HTML data in R

So, let’s get into it.

HTML Basics

HTML is behind everything on the web. Our goal here is to briefly understand how syntax rules, browser presentation, tags and attributes help us parse HTML and scrape the web for the information we need.

Browser Presentation

Before we scrape anything using R, we need to know the underlying structure of a webpage. The first thing to notice is that what you see when you open a webpage isn’t the HTML document itself. It is the browser’s rendering of the underlying HTML code. You can open any HTML document in a text editor like Notepad to see that code.

HTML tells a browser how to show a webpage, what goes into a headline, what goes into a text, etc. The underlying marked up structure is what we need to understand to actually scrape it.

For example, here’s what ScrapingBee.com looks like when you see it in a browser.


And, here’s what the underlying HTML looks like for it


Looking at this source code might seem like a lot of information to digest at once, let alone scrape! But don’t worry. The next section shows exactly how to read this information better.

HTML elements and tags

If you carefully checked the raw HTML of ScrapingBee.com earlier, you would notice something like <title>...</title>, <body>...</body> etc. Those are tags that HTML uses, and each of those tags has its own purpose. For example, the <title> tag helps a browser render the title of a web page; similarly, the <body> tag defines the body of an HTML document.

Once you understand those tags, that raw HTML will start talking to you, and you’ll already get a feeling for how you would scrape the web using R. All you need to take away from this section is that a page is structured with the help of HTML tags, and while scraping, knowing these tags can help you locate and extract the information easily.

Parsing a webpage using R

With what we know, let’s use R to scrape an HTML webpage and see what we get. Keep in mind that so far we only know about HTML page structure and what raw HTML looks like. That’s why, with the code, we will simply scrape a webpage and get the raw HTML. It is the first step towards scraping the web as well.

Earlier in this post, I mentioned that we can even use a text editor to open an HTML document. And in the code below, we will parse HTML in the same way we would parse a text document and read it with R.

I want to scrape the HTML code of ScrapingBee.com and see how it looks. We will use readLines() to map every line of the HTML document and create a flat representation of it.

Now, when you see what flat_html looks like, you should see something like this in your R Console:

The whole output would be a hundred pages so I’ve trimmed it for you. But, here’s something you can do to have some fun before I take you further towards scraping web with R:

  1. Scrape www.google.com and try to make sense of the information you received
  2. Scrape a very simple web page like https://www.york.ac.uk/teaching/cws/wws/webpage1.html and see what you get

Remember, scraping is only fun if you experiment with it. So, as we move forward with the blog post, I’d love it if you try out each and every example as you go through them and bring your own twist. Share in comments if you found something interesting or feel stuck somewhere.

While our output above looks great, it still is something that doesn’t closely reflect an HTML document. In HTML we have a document hierarchy of tags which looks something like

But clearly, our output from readLines() discarded the markup structure and hierarchy of HTML. Since I just wanted to give you a barebones look at scraping, this code is still a good illustration.

However, in reality, our code is a lot more complicated. But fortunately, we have a lot of libraries that simplify web scraping in R for us. We will go through four of these libraries in later sections.

First, we need to go through different scraping situations that you’ll frequently encounter when you scrape data through R.

Common web scraping scenarios with R

Access web data using R over FTP

FTP is one of the ways to access data over the web. And with the help of CRAN FTP servers, I’ll show you how you can request data over FTP with just a few lines of code. Overall, the whole process is:

  1. Save ftp URL
  2. Save names of files from the URL into an R object
  3. Save files onto your local directory

Let’s get started now. The URL that we are trying to get data from is ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/.

Let’s check the name of the files we received with get_files

Looking at the string above can you see what the file names are?

The screenshot from the URL shows real file names


It turns out that when you download those file names you get carriage return representations too. And it is pretty easy to solve this issue. In the code below, I used str_split() and str_extract_all() to get the HTML file names of interest.

Let’s print the file names to see what we have now:

extracted_html_filenames

Great! So, we now have a list of HTML files that we want to access. In our case, it was only one HTML file.

Now, all we have to do is to write a function that stores them in a folder and a function that downloads HTML docs in that folder from the web.

We are almost there now! All we now have to do is to download these files to a specified folder on your local drive. Save those files in a folder called scrapingbee_html. To do so, use getCurlHandle().

After that, we’ll use plyr package’s l_ply() function.

And, we are done!

I can see that on my local drive I have a folder named scrapingbee_html, where I have the index.html file stored. But, if you don’t want to manually go and check the scraped content, use this command to retrieve a list of HTMLs downloaded:

That was via FTP, but what about retrieving specific data from an HTML webpage? That’s what our next section covers.

Scraping information from Wikipedia using R

In this section, I’ll show you how to retrieve information from Leonardo Da Vinci’s Wikipedia page https://en.wikipedia.org/wiki/Leonardo_da_Vinci.

Let’s take the basic steps to parse information:

Leonardo Da Vinci’s Wikipedia HTML has now been parsed and stored in parsed_wiki.

But, let’s say you wanted to see what text we were able to parse. A very simple way to do that would be:

By doing that, we have essentially parsed everything that exists within the <p> nodes. And since it is an XML node set, we can easily use subsetting rules to access different paragraphs. For example, let’s say we pick the 4th element at random. Here’s what you’ll see:

Reading text is fun, but let’s do something else - let’s get all links that exist on this page. We can easily do that by using getHTMLLinks() function:

Notice what you see above is a mix of actual links and links to files.

You can also see the total number of links on this page by using length() function:

I’ll throw in one more use case here which is to scrape tables off such HTML pages. And it is something that you’ll encounter quite frequently too for web scraping purposes. XML package in R offers a function named readHTMLTable() which makes our life so easy when it comes to scraping tables from HTML pages.

Leonardo’s Wikipedia page has no HTML table though, so I will use a different page to show how we can scrape HTML tables from a webpage using R. Here’s the new URL:

As usual, we will read this URL:

If you look at the page you’ll disagree with the number “108”. For a closer inspection I’ll use the names() function to get the names of all 108 tables:

Our suspicion was right, there are too many “NULL” and only a few tables. I’ll now read data from one of those tables in R:

Here’s how this table looks in HTML


Awesome isn’t it? Imagine being able to access census, pricing, etc data over R and scraping it. Wouldn’t it be fun? That’s why I took a boring one, and kept the fun part for you. Try something much cooler than what I did. Here’s an example of table data that you can scrape https://en.wikipedia.org/wiki/United_States_Census

Let me know how it goes for you. But it usually isn’t that straightforward. We have forms and authentication that can block your R code from scraping. And that’s exactly what we are going to learn to get through here.

Handling HTML forms while scraping with R

Often we come across pages that aren’t that easy to scrape. Take a look at the Meteorological Service Singapore’s page (that lack of SSL though :O). Notice the dropdowns here


Imagine if you want to scrape information that you can only get upon clicking on the dropdowns. What would you do in that case?

Well, I’ll be jumping a few steps forward and will show you a preview of rvest package while scraping this page. Our goal here is to scrape data from 2016 to 2020.

Let’s check what type of data we have been able to scrape. Here’s what our data frame looks like:

From the dataframe above, we can now easily generate URLs that provide direct access to data of our interest.

Now, we can download those files at scale using lapply().

Note: This is going to download a ton of data once you execute it.

Web scraping using Rvest

Inspired by libraries like BeautifulSoup, rvest is probably one of the most popular packages in R that we use to scrape the web. While it is simple enough that it makes scraping with R look effortless, it is complex enough to enable any scraping operation.

Let’s see rvest in action now. I will scrape information from IMDB and we will scrape Sharknado (because it is the best movie in the world!) https://www.imdb.com/title/tt8031422/

Awesome movie, awesome cast! Let's find out what was the cast of this movie.

Awesome cast! Probably that’s why it was such a huge hit. Who knows.

Still, there are skeptics of Sharknado. I guess the rating would prove them wrong? Here’s how you extract ratings of Sharknado from IMDB

I still stand by my words. But I hope you get the point, right? See how easy it is for us to scrape information using rvest, while we were writing 10+ lines of code in much simpler scraping scenarios.

Next on our list is Rcrawler.

Web Scraping using Rcrawler

Rcrawler is another R package that helps us harvest information from the web. But unlike rvest, we use Rcrawler for network graph related scraping tasks a lot more. For example, if you wish to scrape a very large website, you might want to try Rcrawler in a bit more depth.

Note: Rcrawler is more about crawling than scraping.

We will go back to Wikipedia and we will try to find the date of birth, date of death and other details of scientists.

Output looks like this:

And that’s it!

You pretty much know everything you need to get started with Web Scraping in R.

Try challenging yourself with interesting use cases and uncover challenges. Scraping the web with R can be really fun!

While this article covers the main aspects of web scraping with R, it does not cover how to scrape the web without getting blocked.

If you want to learn how to do that, we have written a complete guide, and if you don’t want to take care of this yourself, you can always use our web scraping API.

Happy scraping.

Once you’ve put together enough web scrapers, you start to feel like you can do it in your sleep. I’ve probably built hundreds of scrapers over the years for my own projects, as well as for clients and students in my web scraping course.

Occasionally though, I find myself referencing documentation or re-reading old code looking for snippets I can reuse. One of the students in my course suggested I put together a “cheat sheet” of commonly used code snippets and patterns for easy reference.

I decided to publish it publicly as well – as an organized set of easy-to-reference notes – in case they’re helpful to others.

While it’s written primarily for people who are new to programming, I also hope that it’ll be helpful to those who already have a background in software or Python, but who are looking to learn some web scraping fundamentals and concepts.

Table of Contents:

  1. Extracting Content from HTML
  2. Storing Your Data
  3. More Advanced Topics

Useful Libraries

For the most part, a scraping program deals with making HTTP requests and parsing HTML responses.

I always make sure I have requests and BeautifulSoup installed before I begin a new scraping project. From the command line:

Then, at the top of your .py file, make sure you’ve imported these libraries correctly.

Making Simple Requests

Make a simple GET request (just fetching a page)

Make a POST request (usually used when sending information to the server like submitting a form)

Pass query arguments aka URL parameters (usually used when making a search query or paging through results)
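The three kinds of request above can be sketched as follows. This is a minimal sketch, assuming the requests library is installed; example.com, the form field and the query parameters are placeholders that are not part of the original notes. The requests are prepared rather than sent, so the snippet runs without touching the network.

```python
import requests

# Placeholder URL for illustration only.
BASE_URL = "https://example.com/search"

# A simple GET request, prepared but not sent.
get_request = requests.Request("GET", BASE_URL).prepare()

# A POST request carrying form data, as when submitting a form.
post_request = requests.Request(
    "POST", BASE_URL, data={"username": "alice"}
).prepare()

# A GET request with query arguments (URL parameters).
search_request = requests.Request(
    "GET", BASE_URL, params={"q": "web scraping", "page": 2}
).prepare()

print(get_request.method, get_request.url)
print(post_request.method, post_request.body)
print(search_request.url)

# To actually send them, the usual shortcuts are:
#   requests.get(BASE_URL)
#   requests.post(BASE_URL, data={"username": "alice"})
#   requests.get(BASE_URL, params={"q": "web scraping", "page": 2})
```

Printing the prepared URL is also a handy way to check how your query arguments end up encoded before you send anything.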

Inspecting the Response

See what response code the server sent back (useful for detecting 4XX or 5XX errors)

Access the full response as text (get the HTML of the page in a big string)

Look for a specific substring of text within the response

Check the response’s Content Type (see if you got back HTML, JSON, XML, etc)
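The checks above might be bundled into one helper like this. It is a sketch, not the original snippets: inspect_response is a hypothetical name, and the response is a stand-in object with made-up values. It relies only on the status_code, text and headers attributes, which the object returned by requests.get() also exposes.

```python
from types import SimpleNamespace


def inspect_response(response, needle):
    """Summarize a response exposing .status_code, .text and .headers."""
    return {
        "status_code": response.status_code,
        # 4XX and 5XX codes signal client and server errors.
        "is_error": response.status_code >= 400,
        "length": len(response.text),
        "contains_needle": needle in response.text,
        "content_type": response.headers.get("Content-Type", ""),
    }


# Stand-in for the object returned by requests.get(); values are made up.
fake_response = SimpleNamespace(
    status_code=200,
    text="<html><body><h1>Hello World</h1></body></html>",
    headers={"Content-Type": "text/html; charset=utf-8"},
)

summary = inspect_response(fake_response, "Hello World")
print(summary)
```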

Extracting Content from HTML

Now that you’ve made your HTTP request and gotten some HTML content, it’s time to parse it so that you can extract the values you’re looking for.

Using Regular Expressions

Using Regular Expressions to look for HTML patterns is famously NOT recommended at all.

However, regular expressions are still useful for finding specific string patterns like prices, email addresses or phone numbers.

Run a regular expression on the response text to look for specific string patterns:
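As an illustration, here is a small sketch that pulls prices and email addresses out of some made-up response text. Both patterns are deliberately simple assumptions for the example; real-world data usually needs more forgiving patterns.

```python
import re

# Made-up response text for illustration.
response_text = """
<p>Widget: $19.99 (was $24.50)</p>
<p>Contact: sales@example.com or support@example.org</p>
"""

# Dollar amounts such as $19.99
PRICE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")
# A deliberately simple email pattern; real addresses can be messier.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

prices = PRICE_PATTERN.findall(response_text)
emails = EMAIL_PATTERN.findall(response_text)

print(prices)
print(emails)
```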

Using BeautifulSoup

BeautifulSoup is widely used due to its simple API and its powerful extraction capabilities. It has many different parser options that allow it to understand even the most poorly written HTML pages – and the default one works great.

Compared to libraries that offer similar functionality, it’s a pleasure to use. To get started, you’ll have to turn the HTML text that you got in the response into a nested, DOM-like structure that you can traverse and search

Look for all anchor tags on the page (useful if you’re building a crawler and need to find the next pages to visit)

Look for all tags with a specific class attribute (eg <li>...</li>)

Look for the tag with a specific ID attribute (eg: <div>...</div>)

Look for nested patterns of tags (useful for finding generic elements, but only within a specific section of the page)

Look for all tags matching CSS selectors (similar query to the last one, but might be easier to write for someone who knows CSS)
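The lookups described so far might look like this on a small page. This is a sketch that assumes the beautifulsoup4 package is installed; the HTML, the search-result class and the results ID are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div id="results">
  <ul>
    <li class="search-result"><a href="/item/1">First</a></li>
    <li class="search-result"><a href="/item/2">Second</a></li>
    <li class="ad"><a href="/promo">Sponsored</a></li>
  </ul>
</div>
<a href="/next">Next page</a>
"""

# "html.parser" ships with Python, so no extra parser needs installing.
soup = BeautifulSoup(html, "html.parser")

# All anchor tags on the page.
all_links = soup.find_all("a")

# All <li> tags with a specific class attribute.
results = soup.find_all("li", class_="search-result")

# The tag with a specific ID attribute.
results_div = soup.find(id="results")

# Nested pattern: anchors, but only inside that section.
nested_links = results_div.find_all("a")

# The same kind of query written as a CSS selector.
css_links = soup.select("#results li.search-result a")

print(len(all_links), len(results), len(nested_links))
print([link["href"] for link in css_links])
```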

Get a list of strings representing the inner contents of a tag (this includes both the text nodes as well as the text representation of any other nested HTML tags within)

Return only the text contents within this tag, but ignore the text representation of other HTML tags (useful for stripping out pesky <span>, <strong>, <i>, or other inline tags that might show up sometimes)

Convert the text that you are extracting from unicode to ascii if you’re having issues printing it to the console or writing it to files
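One way to do that conversion with only the standard library is sketched below. The to_ascii helper is a hypothetical name, and the normalization step is an added assumption: it keeps the base letter of accented characters instead of dropping them entirely.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only version of text.

    Accented letters are first split into a base letter plus combining
    marks; anything that still is not ASCII is then dropped.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Café crème"))
```

Note that this is lossy by design: characters with no ASCII base letter simply disappear.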

Get the attribute of a tag (useful for grabbing the src attribute of an <img> tag or the href attribute of an <a> tag)

Putting several of these concepts together, here’s a common idiom: iterating over a bunch of container tags and pulling out content from each of them
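A sketch of that idiom, assuming the beautifulsoup4 package is installed. The product markup and the field names are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up listing page for illustration.
html = """
<div class="product"><h2>Blue <strong>Widget</strong></h2>
  <span class="price">$19.99</span><a href="/p/1">Details</a></div>
<div class="product"><h2>Red Gadget</h2>
  <span class="price">$5.00</span><a href="/p/2">Details</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

items = []
for container in soup.find_all("div", class_="product"):
    items.append({
        # get_text() drops inline tags such as <strong>; the " " separator
        # keeps a space between the pieces of text it joins.
        "title": container.find("h2").get_text(" ", strip=True),
        "price": container.find("span", class_="price").get_text(strip=True),
        # Attributes are read with dictionary-style access.
        "url": container.find("a")["href"],
    })

for item in items:
    print(item)
```

Searching from each container, rather than from the whole soup, is what keeps the title, price and link of one product together.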

Using XPath Selectors

BeautifulSoup doesn’t currently support XPath selectors, and I’ve found them to be really terse and more of a pain than they’re worth. I haven’t found a pattern I couldn’t parse using the above methods.

If you’re really dedicated to using them for some reason, you can use the lxml library instead of BeautifulSoup, as described here.

Storing Your Data

Now that you’ve extracted your data from the page, it’s time to save it somewhere.

Note: The implication in these examples is that the scraper went out and collected all of the items, and then waited until the very end to iterate over all of them and write them to a spreadsheet or database.

I did this to simplify the code examples. In practice, you’d want to store the values you extract from each page as you go, so that you don’t lose all of your progress if you hit an exception towards the end of your scrape and have to go back and re-scrape every page.

Writing to a CSV

Probably the most basic thing you can do is write your extracted items to a CSV file. By default, each row that is passed to the csv.writer object to be written has to be a Python list.

In order for the spreadsheet to make sense and have consistent columns, you need to make sure all of the items that you’ve extracted have their properties in the same order. This isn’t usually a problem if the lists are created consistently.
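A minimal sketch of writing list rows with csv.writer. The rows and column names are made up, and an in-memory buffer stands in for a real file so the result can be printed directly.

```python
import csv
import io

# Made-up extracted items; every row lists its values in the same order.
rows = [
    ["Blue Widget", "$19.99", "/p/1"],
    ["Red Gadget", "$5.00", "/p/2"],
]


def write_rows(file_obj, header, data_rows):
    """Write a header line followed by one CSV line per row."""
    writer = csv.writer(file_obj)
    writer.writerow(header)
    writer.writerows(data_rows)


buffer = io.StringIO()
write_rows(buffer, ["title", "price", "url"], rows)
csv_text = buffer.getvalue()
print(csv_text)

# With a real file you would open it like this instead:
#   with open("items.csv", "w", newline="") as f:
#       write_rows(f, ["title", "price", "url"], rows)
```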

If you’re extracting lots of properties about each item, sometimes it’s more useful to store the item as a Python dict instead of having to remember the order of columns within a row. The csv module has a handy DictWriter that keeps track of which column is for writing which dict key.
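A sketch of the DictWriter approach, again with invented items and an in-memory buffer in place of a file. The second item lists its keys in a different order on purpose, to show that the column order comes from fieldnames rather than from the dicts.

```python
import csv
import io

# Made-up items; the key order differs between them on purpose.
items = [
    {"title": "Blue Widget", "price": "$19.99", "url": "/p/1"},
    {"url": "/p/2", "title": "Red Gadget", "price": "$5.00"},
]

FIELDNAMES = ["title", "price", "url"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()      # first line: the column names
writer.writerows(items)   # one line per dict, ordered by FIELDNAMES

dict_csv_text = buffer.getvalue()
print(dict_csv_text)
```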

Writing to a SQLite Database

You can also use a simple SQL insert if you’d prefer to store your data in a database for later querying and retrieval.
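A minimal sketch using the sqlite3 module from the standard library. The table layout and rows are made up, and an in-memory database is used so the example leaves no file behind; pass a filename to sqlite3.connect() to keep the data.

```python
import sqlite3

# Made-up extracted items as (title, price, url) tuples.
items = [
    ("Blue Widget", 19.99, "/p/1"),
    ("Red Gadget", 5.00, "/p/2"),
]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE IF NOT EXISTS items (title TEXT, price REAL, url TEXT)"
)

# The ? placeholders let sqlite3 handle quoting of scraped values safely.
connection.executemany(
    "INSERT INTO items (title, price, url) VALUES (?, ?, ?)", items
)
connection.commit()

row_count = connection.execute("SELECT COUNT(*) FROM items").fetchone()[0]
cheap = connection.execute(
    "SELECT title FROM items WHERE price < ? ORDER BY title", (10,)
).fetchall()
connection.close()

print(row_count, cheap)
```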

More Advanced Topics

These aren’t really things you’ll need if you’re building a simple, small scale scraper for 90% of websites. But they’re useful tricks to keep up your sleeve.

Javascript Heavy Websites

Contrary to popular belief, you do not need any special tools to scrape websites that load their content via Javascript. In order for the information to get from their server and show up on a page in your browser, that information had to have been returned in an HTTP response somewhere.

It usually means that you won’t be making an HTTP request to the page’s URL that you see at the top of your browser window, but instead you’ll need to find the URL of the AJAX request that’s going on in the background to fetch the data from the server and load it into the page.

There’s not really an easy code snippet I can show here, but if you open the Chrome or Firefox Developer Tools, you can load the page, go to the “Network” tab and then look through all of the requests that are being sent in the background to find the one that’s returning the data you’re looking for. Start by filtering the requests to only XHR or JS to make this easier.

Once you find the AJAX request that returns the data you’re hoping to scrape, then you can make your scraper send requests to this URL, instead of to the parent page’s URL. If you’re lucky, the response will be encoded with JSON which is even easier to parse than HTML.
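To show why JSON is the easy case, here is a sketch that parses a made-up AJAX response body. The payload shape (results, name, price, next_page) is purely an assumption for illustration; every site structures its responses differently.

```python
import json

# Made-up body of an AJAX response; real payloads differ per site.
ajax_body = (
    '{"results": [{"name": "Blue Widget", "price": 19.99},'
    ' {"name": "Red Gadget", "price": 5.0}], "next_page": 2}'
)


def parse_products(body):
    """Return (list of (name, price) tuples, next page number or None)."""
    payload = json.loads(body)
    products = [(p["name"], p["price"]) for p in payload.get("results", [])]
    return products, payload.get("next_page")


products, next_page = parse_products(ajax_body)
print(products, next_page)

# With the requests library, response.json() performs the json.loads()
# step on the response body for you.
```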

Content Inside Iframes

This is another topic that causes a lot of hand wringing for no reason. Sometimes the page you’re trying to scrape doesn’t actually contain the data in its HTML, but instead it loads the data inside an iframe.

Again, it’s just a matter of making the request to the right URL to get the data back that you want. Make a request to the outer page, find the iframe, and then make another HTTP request to the iframe’s src attribute.
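The "find the iframe" step can be sketched as below. To keep the example free of dependencies it uses the standard library's HTMLParser rather than BeautifulSoup; the page URL and HTML are made up, and IframeFinder and iframe_urls are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, page_html):
    """Return absolute URLs for all iframes found in page_html."""
    finder = IframeFinder()
    finder.feed(page_html)
    # src is often relative, so resolve it against the outer page's URL.
    return [urljoin(page_url, src) for src in finder.sources]


# Made-up outer page for illustration.
outer_html = (
    "<html><body><h1>Report</h1>"
    '<iframe src="/embed/table.html"></iframe></body></html>'
)
urls = iframe_urls("https://example.com/report", outer_html)
print(urls)
```

Each URL returned is the one to request next in order to get the content shown inside the frame.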

Sessions and Cookies

While HTTP is stateless, sometimes you want to use cookies to identify yourself consistently across requests to the site you’re scraping.

The most common example of this is needing to log in to a site in order to access protected pages. Without the correct cookies sent, a request to the URL will likely be redirected to a login form or presented with an error response.

However, once you successfully log in, a session cookie is set that identifies who you are to the website. As long as future requests send this cookie along, the site knows who you are and what you have access to.
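The login flow described above might be wrapped up like this. It is a sketch: fetch_protected_page is a hypothetical helper, and the URLs and form field names in the usage comment are placeholders. The session argument is expected to be a requests.Session(), which stores cookies set by the server and sends them back on later requests.

```python
def fetch_protected_page(session, login_url, credentials, protected_url):
    """Log in once, then reuse the same session for a protected page.

    `session` is expected to be a requests.Session(); any object with
    compatible post() and get() methods will also work.
    """
    login_response = session.post(login_url, data=credentials)
    login_response.raise_for_status()  # stop early if the login failed
    # The session now holds the cookie and sends it along automatically.
    return session.get(protected_url)


# Typical use (URLs and form field names are placeholders):
#   import requests
#   with requests.Session() as session:
#       page = fetch_protected_page(
#           session,
#           "https://example.com/login",
#           {"username": "alice", "password": "secret"},
#           "https://example.com/account",
#       )
#       print(page.status_code)
```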

Delays and Backing Off

If you want to be polite and not overwhelm the target site you’re scraping, you can introduce an intentional delay or lag in your scraper to slow it down.

Some also recommend adding a backoff that’s proportional to how long the site took to respond to your request. That way if the site gets overwhelmed and starts to slow down, your code will automatically back off.
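A sketch of both ideas together: a fixed base delay plus a backoff proportional to how long the last request took. The function names, the default numbers and the injectable sleep and clock arguments are all assumptions made for this example.

```python
import time


def polite_delay(response_seconds, base_delay=1.0, factor=2.0, max_delay=30.0):
    """Return how long to wait before the next request.

    The wait grows with the site's response time, up to max_delay.
    """
    return min(max_delay, base_delay + factor * response_seconds)


def timed_fetch(fetch, url, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url), then pause in proportion to how long it took."""
    started = clock()
    result = fetch(url)
    elapsed = clock() - started
    sleep(polite_delay(elapsed))
    return result


print(polite_delay(0.5))   # a fast response means a short pause
print(polite_delay(100))   # a very slow response is capped at max_delay

# Typical use with the requests library:
#   response = timed_fetch(requests.get, "https://example.com/page/1")
```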

Spoofing the User Agent

By default, the requests library sets the User-Agent header on each request to something like “python-requests/2.12.4”. You might want to change it to identify your web scraper, perhaps providing a contact email address so that an admin from the target website can reach out if they see you in their logs.

More commonly, this is used to make it appear that the request is coming from a normal web browser, and not a web scraping program.
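Either choice comes down to sending a different User-Agent header. A small sketch follows; the helper name and both header values are illustrative, not taken from the original notes.

```python
def headers_with_user_agent(user_agent, extra_headers=None):
    """Return a headers dict with the User-Agent set to user_agent."""
    headers = dict(extra_headers or {})
    headers["User-Agent"] = user_agent
    return headers


# Identify the scraper and give the site's admin a way to reach you.
polite_headers = headers_with_user_agent(
    "my-scraper/1.0 (contact: admin@example.com)"
)

# Or present a browser-like string; this one is only an example value.
browser_headers = headers_with_user_agent(
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    {"Accept-Language": "en-US"},
)

print(polite_headers)
print(browser_headers)

# With the requests library, pass the dict on each call:
#   requests.get("https://example.com", headers=polite_headers)
```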

Using Proxy Servers

Even if you spoof your User Agent, the site you are scraping can still see your IP address, since they have to know where to send the response.

If you’d like to obfuscate where the request is coming from, you can use a proxy server in between you and the target site. The scraped site will see the request coming from that server instead of your actual scraping machine.

If you’d like to make your requests appear to be spread out across many IP addresses, then you’ll need access to many different proxy servers. You can keep track of them in a list and then have your scraping program simply go down the list, picking off the next one for each new request, so that the proxy servers get even rotation.
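The rotation described above can be sketched with itertools.cycle. The proxy addresses are placeholders, and proxy_rotation is a hypothetical helper; the dict it yields follows the shape that the requests library accepts for its proxies argument.

```python
from itertools import cycle

# Placeholder proxy addresses; substitute servers you actually have access to.
PROXY_URLS = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


def proxy_rotation(proxy_urls):
    """Yield a proxies dict for each new request, round-robin."""
    for proxy_url in cycle(proxy_urls):
        yield {"http": proxy_url, "https": proxy_url}


rotation = proxy_rotation(PROXY_URLS)
first_four = [next(rotation)["http"] for _ in range(4)]
print(first_four)

# With the requests library, take the next entry for each request:
#   requests.get("https://example.com", proxies=next(rotation))
```

After the last proxy in the list, the rotation wraps around to the first one, so each server gets an even share of requests.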

Setting Timeouts

If you’re experiencing slow connections and would prefer that your scraper moved on to something else, you can specify a timeout on your requests.

Handling Network Errors

Just as you should never trust user input in web applications, you shouldn’t trust the network to behave well on large web scraping projects. Eventually you’ll hit closed connections, SSL errors or other intermittent failures.
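One way to cope is to retry a failed request a few times before giving up. The sketch below is an assumption about how you might structure that, not a prescribed approach: fetch_with_retries is a hypothetical helper that takes the fetching function and the exception types to retry on as arguments, and it passes a timeout on every attempt.

```python
import time


def fetch_with_retries(fetch, url, retry_on, attempts=3, timeout=10,
                       pause=1.0, sleep=time.sleep):
    """Call fetch(url, timeout=timeout), retrying on the given exceptions.

    Waits a little longer after each failure and re-raises the last
    error once all attempts are used up.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url, timeout=timeout)
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                sleep(pause * attempt)
    raise last_error


# Typical use with the requests library, whose network errors all
# derive from requests.exceptions.RequestException:
#   response = fetch_with_retries(
#       requests.get,
#       "https://example.com",
#       retry_on=requests.exceptions.RequestException,
#   )
```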

Learn More

If you’d like to learn more about web scraping, I currently have an ebook and online course that I offer, as well as a free sandbox website that’s designed to be easy for beginners to scrape.

You can also subscribe to my blog to get emailed when I release new articles.