Reddit Web Scraping Python

Reddit web scraper

This is a little side project I did to try to scrape images out of Reddit threads. A few subreddits discuss shows, notably /r/anime, where users post screenshots of the episodes. I thought it'd be cool to see how much effort it would take to automatically collate those screenshots from a thread and display them in a simple gallery. The result looked like this

PRAW

PRAW, the Python Reddit API Wrapper, provides a nice set of bindings for talking to Reddit.

Once you have registered a Reddit app, you can write Python to scrape data from any subreddit you want. PRAW makes it very easy to access Reddit data. First, connect to Reddit by calling praw.Reddit and storing the result in a variable; I'm calling mine reddit.

Scraping anything and everything from Reddit used to be as simple as pointing Scrapy and a Python script at the site and extracting as much data as a single IP address was allowed. The trick was to crawl from page to page on Reddit's subdomains based on the page number.

To scrape Reddit you need credentials. You generate them at https://www.reddit.com/prefs/apps, a somewhat hidden page where you register a new 'app' with Reddit. With those credentials, connecting takes a single call to praw.Reddit.
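A minimal sketch of what the connection might look like, assuming the credentials from the app page are stored in environment variables. The variable names (REDDIT_CLIENT_ID and so on), the helper names, and the default user agent string are hypothetical; only praw.Reddit and its client_id, client_secret and user_agent arguments come from PRAW itself.

```python
import os


def load_credentials(env=os.environ):
    """Collect the keyword arguments for praw.Reddit from a mapping.

    The environment variable names are illustrative, not a PRAW convention.
    """
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Reddit asks for a descriptive user agent; this default is a placeholder.
        "user_agent": env.get("REDDIT_USER_AGENT", "thread-image-gallery/0.1"),
    }


def make_reddit(env=os.environ):
    """Return a read-only Reddit instance (needs `pip install praw`)."""
    import praw  # imported here so load_credentials works without PRAW installed

    return praw.Reddit(**load_credentials(env))
```

Calling reddit = make_reddit() then gives a read-only instance, which is enough for reading public threads and comments.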

The API makes traversing Reddit simple; for example, you can print all of the comments in a thread.
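As a sketch of that traversal, the helper below walks every comment in a thread. The function name iter_comment_bodies is made up for illustration; replace_more, list and body are PRAW's own names on a submission's comment forest and on comments.

```python
def iter_comment_bodies(submission):
    """Yield the text of every comment in a PRAW submission.

    `submission` is expected to behave like the object returned by
    reddit.submission(url=...).
    """
    # Resolve the "load more comments" placeholders so the whole thread is
    # visible. limit=None fetches all of them, which can be slow on big threads.
    submission.comments.replace_more(limit=None)
    # list() flattens the comment tree, replies included.
    for comment in submission.comments.list():
        yield comment.body


def print_comments(submission):
    """Print each comment body, one after another."""
    for body in iter_comment_bodies(submission):
        print(body)
```

With a connected instance this would be used as print_comments(reddit.submission(url=thread_url)), where thread_url is the address of the discussion thread.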

Finding links

99% of the images I was looking for were posted to Imgur, so I just matched on those. I used a regular expression to extract the links. I always recommend a tool like RegEx101, which makes it really easy to debug regular expressions, as they can be pretty brain-bending.
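The original expression is not shown here, so the pattern below is only one plausible way to match Imgur links. It assumes that direct image links (i.imgur.com/ID.ext) and single-image pages (imgur.com/ID) are wanted, and that albums and galleries should be skipped; find_imgur_links is a hypothetical name.

```python
import re

# Assumed pattern: an optional i./www./m. prefix, an alphanumeric image ID,
# and an optional image extension. Album (/a/) and gallery links are excluded.
IMGUR_LINK = re.compile(
    r"https?://(?:i\.|www\.|m\.)?imgur\.com/(?!a/|gallery/)"
    r"[A-Za-z0-9]+(?:\.(?:jpe?g|png|gif))?",
    re.IGNORECASE,
)


def find_imgur_links(text):
    """Return the unique Imgur image links in `text`, in order of appearance."""
    links = []
    for match in IMGUR_LINK.finditer(text):
        url = match.group(0)
        if url not in links:  # skip links repeated within the same text
            links.append(url)
    return links
```

Because comment bodies are Markdown, a link written as [screenshot](https://i.imgur.com/abc1234.jpg) is matched as well; the closing bracket is not part of the pattern and is left out.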

Check if an image still exists

One of the problems I found was dead image links, so I created a simple helper that checks the status_code of the response for each link.
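A helper along those lines might look like the sketch below. It assumes the requests library, a HEAD request, and that only a 200 response counts as alive; the session parameter is an addition for illustration so that a requests.Session (or a stub in tests) can be passed in.

```python
def image_exists(url, session=None, timeout=5.0):
    """Return True if a HEAD request for `url` answers with status 200."""
    if session is None:
        import requests  # third-party: pip install requests

        session = requests  # the module itself exposes head()
    try:
        # Redirects are not followed, so an image that redirects elsewhere
        # is treated as missing (an assumption of this sketch).
        response = session.head(url, allow_redirects=False, timeout=timeout)
    except OSError:
        # requests' exceptions derive from OSError, so network failures land here.
        return False
    return response.status_code == 200
```

A HEAD request keeps the check cheap because the image body is never downloaded.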

Getting Thumbnails

To save bandwidth and your mobile data, I wanted to return a smaller version of the image. On Imgur you can append a size character to the image ID in a URL, just before the file extension, to get the image at a different size: for example 'l' for large and 's' for small.
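The size character can be added with a small helper such as the sketch below. The name imgur_thumbnail is hypothetical, only the 'l' and 's' sizes mentioned above are accepted, and falling back to a .jpg extension for links that have none is an assumption.

```python
import re

# Matches a single Imgur image link and captures its ID and optional extension.
_IMGUR_IMAGE = re.compile(
    r"^https?://(?:i\.|www\.|m\.)?imgur\.com/([A-Za-z0-9]+)(?:\.(jpe?g|png|gif))?$",
    re.IGNORECASE,
)


def imgur_thumbnail(url, size="s"):
    """Return the URL of a resized copy of an Imgur image.

    The size character goes between the image ID and the file extension.
    """
    if size not in ("s", "l"):
        raise ValueError(f"unsupported size character: {size!r}")
    match = _IMGUR_IMAGE.match(url)
    if match is None:
        raise ValueError(f"not a single-image Imgur URL: {url!r}")
    image_id = match.group(1)
    ext = (match.group(2) or "jpg").lower()  # assumed default extension
    return f"https://i.imgur.com/{image_id}{size}.{ext}"
```

For example, imgur_thumbnail("https://i.imgur.com/abc1234.jpg") returns "https://i.imgur.com/abc1234s.jpg".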

Putting it all together

Putting all of these bits together gives a single function that takes a thread and returns its images.
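The combined script is not reproduced here, so the following is a self-contained sketch of how the pieces could fit together rather than the original code. The names collect_gallery and thumbnail_url, the regular expression, and the shape of the returned dictionaries are all assumptions; the exists argument stands in for the dead-link check and defaults to accepting every link.

```python
import re

# Assumed pattern: single Imgur images only; albums and galleries are skipped.
IMGUR_IMAGE = re.compile(
    r"https?://(?:i\.|www\.|m\.)?imgur\.com/(?!a/|gallery/)"
    r"([A-Za-z0-9]+)(?:\.(jpe?g|png|gif))?",
    re.IGNORECASE,
)


def thumbnail_url(image_id, ext, size="s"):
    """Build an Imgur URL with a size character after the image ID."""
    return f"https://i.imgur.com/{image_id}{size}.{ext}"


def collect_gallery(submission, exists=lambda url: True):
    """Return one entry per unique, live Imgur image in a thread's comments.

    `submission` should behave like a PRAW submission; `exists` is any
    callable that takes a URL and returns True when the image is reachable.
    """
    submission.comments.replace_more(limit=None)  # expand "load more comments"
    seen = set()
    gallery = []
    for comment in submission.comments.list():
        for match in IMGUR_IMAGE.finditer(comment.body):
            image_id = match.group(1)
            ext = (match.group(2) or "jpg").lower()  # assumed default extension
            if image_id in seen:
                continue  # the same image posted more than once
            seen.add(image_id)
            full = f"https://i.imgur.com/{image_id}.{ext}"
            if exists(full):
                gallery.append(
                    {"image": full, "thumbnail": thumbnail_url(image_id, ext)}
                )
    return gallery
```

Passing the dead-link check in as a parameter keeps the network access in one place, which also makes the function easy to try against a fake thread.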

Trying it out

I decided to stand up a quick demo of this, using an Azure Function to host my new function and a simple web form to allow people to try it out. Just copy and paste a Reddit URL and the function will return any images.

The demo app uses Bulma for the look and feel, and a little bit of jQuery for loading the page.

In a future article I'll look at providing a show-name search, so you don't have to paste individual episode URLs. Happy Reddit scraping!