Web Scraping IMDb

Web scraping with requests and BeautifulSoup. We will use requests and BeautifulSoup to access and scrape the content of IMDb's homepage. What is BeautifulSoup? It is a Python library for pulling data out of HTML and XML files. It provides methods to navigate the document's tree structure that we discussed before and scrape its content.

Some R scripts to begin with web scraping:
1) imdb250.R contains basic scripts to fetch the Top 250 movie list and the corresponding details.
2) amazonsearch.R contains more advanced scraping with R for products on Amazon's website, based on a search term.

imdb.py
from bs4 import BeautifulSoup
import requests
import re

# Download IMDB's Top 250 data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

# One entry per table row; these selectors depend on the page's markup
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

# Store each item into dictionary (data), then put those into a list (imdb)
for index in range(len(movies)):
    # Separate movie into: 'place', 'title', 'year'
    movie_string = movies[index].get_text()
    # Collapse whitespace and drop the dot after the rank (also drops dots in titles)
    movie = ' '.join(movie_string.split()).replace('.', '')
    # The rank shown on the page is 1-based, so measure index + 1
    rank_width = len(str(index + 1))
    # Skip the rank and the following space; drop the trailing ' (YYYY)'
    movie_title = movie[rank_width + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    place = movie[:rank_width]
    data = {'movie_title': movie_title,
            'year': year,
            'place': place,
            'star_cast': crew[index],
            'rating': ratings[index],
            'vote': votes[index],
            'link': links[index]}
    imdb.append(data)

for item in imdb:
    print(item['place'], '-', item['movie_title'], '(' + item['year'] + ') -', 'Starring:', item['star_cast'])
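
The slicing in the loop above depends on the width of the rank and on the year being exactly four digits. As an alternative, here is a minimal sketch that pulls the three fields out with one regular expression. The helper name parse_title_cell and the assumed cell format "<rank>. <title> (<year>)" are illustrative assumptions, not part of the original script.

```python
import re

# Assumed cell format after collapsing whitespace: "<rank>. <title> (<year>)"
TITLE_CELL = re.compile(r'^(\d+)\.\s+(.*?)\s+\((\d{4})\)$')


def parse_title_cell(text):
    """Split a title cell's text into place, title and year.

    Hypothetical helper: returns None when the text does not match
    the assumed format instead of raising.
    """
    collapsed = ' '.join(text.split())
    match = TITLE_CELL.match(collapsed)
    if match is None:
        return None
    place, title, year = match.groups()
    return {'place': place, 'movie_title': title, 'year': year}


print(parse_title_cell('\n  10.\n  Fight Club\n  (1999)\n'))
```

Because the pattern is anchored at both ends, dots and parentheses inside a title are kept rather than stripped.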


Web Scraping - IMDb, Wiki

This was written to collect data for my friend's translation project, in which she analyses the differences between the official movie title translations in China, Hong Kong and Taiwan.

Learnings:

  • find() and find_all() only work on bs4.BeautifulSoup or bs4.element.Tag objects
  • A bs4.element.ResultSet can simply be treated as a list of bs4.element.Tag objects
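
The two points above can be illustrated with a small sketch. The HTML snippet is made up for the example, and the built-in 'html.parser' is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Made-up markup, only for illustration
html = '<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

rows = soup.find_all('tr')   # a ResultSet, which behaves like a list of Tags
first_row = soup.find('tr')  # a single Tag

print(isinstance(rows, ResultSet), isinstance(rows, list), isinstance(first_row, Tag))

# A ResultSet has no find_all() of its own, so search each Tag in turn
cells_per_row = [[td.get_text() for td in row.find_all('td')] for row in rows]
print(cells_per_row)
```

Calling rows.find_all('td') directly would raise an AttributeError, which is why the loop over the individual rows is needed.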

Challenges:

  1. Directing to the correct Wiki page of the movie.
  2. The tags containing the title translations are table elements without distinguishing features.

Sol 1

Using try/except, first try the URL with the _(film) suffix. Where there are multiple entries for the movie title, this directs us to the correct movie page. If there is only one entry (the film itself), this URL leads to an error page, which raises errors in the subsequent bs4 scraping code. Handle this exception by falling back to the plain URL with only the title appended.
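
A minimal sketch of this fallback, under stated assumptions: WIKI_BASE is a placeholder base URL, and scrape is a hypothetical callable that takes a URL and raises when the page is an error page. Neither name comes from the original notebook.

```python
from urllib.parse import quote

# Placeholder base URL (assumption); swap in the Wiki edition actually scraped
WIKI_BASE = 'https://en.wikipedia.org/wiki/'


def wiki_urls(title):
    """Return the _(film) URL first, then the plain title URL."""
    slug = quote(title.replace(' ', '_'))
    return [WIKI_BASE + slug + '_(film)', WIKI_BASE + slug]


def scrape_with_fallback(title, scrape):
    """Try the _(film) URL; on any scraping error, retry with the plain URL.

    scrape is a hypothetical callable: url -> result, raising on an error page.
    """
    film_url, plain_url = wiki_urls(title)
    try:
        return scrape(film_url)
    except Exception:  # broad on purpose in this sketch; narrow it in real code
        return scrape(plain_url)


print(wiki_urls('Fight Club'))
```

Passing the scraping function in as an argument keeps the fallback logic separate from the network and parsing code, so it can be exercised with a stub.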

Sol 2

After a few observations, the translation elements seemed to always occupy the last three slots of the table. Simple backward indexing did the trick.
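
A minimal sketch of the backward indexing, using made-up table markup. The placeholder cell values and their order are assumptions for illustration, not the layout of any real Wiki page.

```python
from bs4 import BeautifulSoup

# Made-up table; the last three cells stand in for the three translations
html = """
<table>
  <tr><td>Directed by</td><td>Some Director</td></tr>
  <tr><td>title-CN</td></tr>
  <tr><td>title-HK</td></tr>
  <tr><td>title-TW</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

cells = soup.find('table').find_all('td')
# Negative slicing takes the last three cells regardless of table length
translations = [cell.get_text(strip=True) for cell in cells[-3:]]
print(translations)
```

Note that the slice never raises: on a page without translations it simply returns whatever the last three cells contain.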

Limitations:

  1. Failed to consider instances where there are multiple versions of a movie, in which case the correct URL ends with _(year_film).
  2. Following from the first bug: since we use simple indexing to retrieve text rather than a bs4 search, irrelevant text (whatever the last three table elements happen to be) is appended to the results instead of the 'null' marker applied for other exceptions. This created multiple entries with erroneous translations, which could not be removed along with the entries correctly marked 'null'.
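
One possible way to address the first limitation is to try the most specific URL first and record 'null' only when every candidate fails. This is a sketch, not what the original notebook did: the function names are hypothetical, and it assumes scrape raises on a wrong or missing page, which the simple indexing described above does not guarantee.

```python
def candidate_slugs(title, year):
    """Page names to try, most specific first."""
    slug = title.replace(' ', '_')
    return [f'{slug}_({year}_film)', f'{slug}_(film)', slug]


def scrape_first_match(title, year, scrape):
    """Return the first successful scrape result, or 'null' if all fail.

    scrape is a hypothetical callable: page name -> result, assumed to
    raise when the page is missing or is not the expected movie page.
    """
    for slug in candidate_slugs(title, year):
        try:
            return scrape(slug)
        except Exception:  # broad on purpose in this sketch
            continue
    return 'null'


print(candidate_slugs('Fight Club', 1999))
```

For the 'null' marker to be reliable, the scrape step would also need some check that the cells it reads really are translations.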
