Spider Web Scraping


Introduction

Before reading this post, please read the warnings in my blog post Learning Python: Web Scraping.

Unlike Beautiful Soup or Scrapy, pyspider is a powerful spider (web crawler) system in Python:


  • Write script in Python
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend
  • RabbitMQ, Beanstalk, Redis and Kombu as message queue
  • Task priority, retry, periodical, recrawl by age, etc…
  • Distributed architecture, Crawl Javascript pages, Python 2&3, etc…

Installation and Start


Use pip to install it:
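```bash
# install pyspider from PyPI
pip install pyspider
```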


Start it with the command below or run the run.py in the module directory:
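```bash
# "all" runs every component (scheduler, fetcher, processor, result worker and WebUI) in one process
pyspider all
```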

You may encounter the error below:

There are two ways to fix this error (see Error to start webui service).

Change line 209 in the file pyspider/webui/webdav.py (near the end of the file).
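The commonly cited patch replaces the deprecated domaincontroller option in the WsgiDAVApp config dict with the wsgidav 3.x http_authenticator setting; a sketch (verify the exact line against your installed version):

```python
# pyspider/webui/webdav.py, inside the config dict passed to WsgiDAVApp
# before:
#     'domaincontroller': NeedAuthController(app),
# after:
'http_authenticator': {
    'HTTPAuthenticator': NeedAuthController(app),
},
```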

Or you can downgrade wsgidav (I do not recommend this option, since version 3.x has already been published).
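For example, pinning a 2.x release (2.4.1 is the version usually suggested for this workaround):

```bash
pip install wsgidav==2.4.1
```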

After that, you can visit http://localhost:5000/ to use the system. In the directory where you start pyspider, a data directory will be auto-generated to store the databases of projects, tasks and results.

Get more help information with:
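```bash
# list subcommands and global options
pyspider --help
```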

Starting Scraping

Creating a New Project


The first time you open the dashboard, there are no projects on the page. You need to create a new one by clicking the “Create” button. Input the project name and the URL you want to scrape:

Click the “Create” button to enter the script editing page:


The right panel shows an auto-generated sample script.
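It should look much like the default template from the pyspider quickstart; the start URL is the one you entered when creating the project (http://example.com/ below is only a placeholder):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # follow every absolute link found on the index page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # the returned dict is stored in resultdb
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```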

  • def on_start(self) is the entry point of the script. It will be called when you click the run button on the dashboard.
  • self.crawl(url, callback=self.index_page) is the most important API here. It adds a new task to be crawled. Most of the options are specified via self.crawl arguments.
  • def index_page(self, response) gets a Response object. response.doc is a pyquery object, which has a jQuery-like API to select the elements to be extracted.
  • def detail_page(self, response) returns a dict object as the result. The result is captured into resultdb by default. You can override the on_result(self, result) method to manage the result yourself.

Other configuration:

  • @every(minutes=24*60, seconds=0) is a helper to tell the scheduler that the on_start method should be called every day.
  • @config(age=10 * 24 * 60 * 60) specifies the default age parameter of self.crawl for the index_page type (when callback=self.index_page). The age parameter can also be passed via self.crawl(url, age=10*24*60*60) (highest priority) or crawl_config (lowest priority).
  • age=10 * 24 * 60 * 60 tells the scheduler to discard the request if it has already been crawled within the last 10 days. By default pyspider will not crawl the same URL twice (it discards it forever), even if you have modified the code. It is very common for beginners to run the project once, modify the script and run it a second time, only to find that it will not crawl again (read about itag for the solution; see the sketch after this list).
  • @config(priority=2) marks that the detail pages should be crawled first.
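A minimal sketch of the itag workaround mentioned above: set itag in Handler.crawl_config and change its value whenever you modify the script and want previously crawled URLs to be fetched again.

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Bump this string (e.g. 'v1' -> 'v2') after editing the script:
    # tasks whose stored itag differs from the current value are re-crawled.
    crawl_config = {
        'itag': 'v2',
    }

    # ... on_start / index_page / detail_page unchanged ...
```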

Running the Script

If you have modified the script, then click the “save” button.

Click the green “run” button on the left panel. After that, you will see a red 1 above “follows”:


Click the “follows” button to switch to the follows panel. It lists an index page.

Click the green play button on the right of the URL (this will invoke the index_page method). It will list all the URLs in the panel:

We can choose any one of the detail pages and click the green play button on its right (this will invoke the detail_page method). It will show the final result. In this example, we get the title and URL in JSON format.

Project Management

Back on the dashboard, you will find the newly created project. Change the status of the project from “TODO” to “RUNNING”:

Click the “run” button. Then the project will start to run:


The output log in the background:

Click the “Results” button and check all the scraping results:

Click one of the results and the new page will show the result in detail.

Example

The UEFA European Cup Coefficients Database lists links for matches, country ranking and club ranking since season 1955/1956. The sample program below extracts the match data from season 2004/2005 to season 2017/2018.
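A minimal sketch of how such a script could be structured with pyspider; the start URL and the selectors below are placeholders, not the real markup of the coefficients database:

```python
# Sketch only: URL and selectors are placeholders for the real site structure.
from pyspider.libs.base_handler import *

# seasons 2004/2005 .. 2017/2018
SEASONS = ['%d/%d' % (year, year + 1) for year in range(2004, 2018)]


class Handler(BaseHandler):
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://example.com/uefa/matches/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # follow only the links whose text matches one of the wanted seasons
        for each in response.doc('a').items():
            if each.text() in SEASONS:
                self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # hypothetical selector: one table row per match
        return {
            "url": response.url,
            "season": response.doc('h1').text(),
            "matches": [row.text() for row in response.doc('table tr').items()],
        }
```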

Output looks like:


Further

Some web content has become more complicated through technologies like AJAX. The page then looks different from what you see in the browser: the information you want to extract is not in the HTML of the page. In this case, you need the browser developer tools (such as the Web Developer Tools in Firefox or Chrome) to find the request and its parameters yourself.
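A sketch of crawling such a request directly, assuming the developer tools revealed a JSON endpoint; the URL and field names below are hypothetical:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # crawl the AJAX endpoint found in the Network tab, not the HTML page
        self.crawl('http://example.com/api/items?page=1',
                   callback=self.json_page)

    def json_page(self, response):
        # response.json parses the response body as JSON
        return {
            "items": [{'title': item['title'], 'url': item['url']}
                      for item in response.json['items']],
        }
```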

Sometimes the web page is too complex to figure out the API request. pyspider provides an option to use PhantomJS. To use PhantomJS, you should have PhantomJS installed. If you are running pyspider in all mode, PhantomJS is enabled automatically if the executable is in the PATH.
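For example, passing fetch_type='js' to self.crawl asks pyspider to render the page with PhantomJS before invoking the callback (the URL below is a placeholder):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # render the page with PhantomJS so response.doc sees the final DOM
        self.crawl('http://example.com/js-page',
                   fetch_type='js', callback=self.index_page)

    def index_page(self, response):
        return {"title": response.doc('title').text()}
```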


More detailed information about pyspider can be found in the pyspider Official Documentation or on its GitHub.

References
