Today I am sharing my experience and the code of a simple web crawler: using Scrapy to scrape a web domain,
built in the PyCharm IDE, storing the data in MongoDB, and finally deploying to the Heroku Scheduler.
(1) Setup your Pycharm IDE environment
refer to this
Here’s my run/debug configuration. The project name is “caissSpider”.
(2) The spider
There are lots of tutorials on Scrapy; one of the projects most useful to a beginner is this.
Here’s my simple project’s structure.
Let’s study them briefly one by one.
As you can see, this item class is like a collection in a database: it defines the fields of the data structure scraped from the web. Here, all the fields are simply strings.
2.2.1. This file defines the settings of the spider. Note that you had better put
in order to avoid Scrapy’s “urlopen error timed out” errors.
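For reference, the timeout-related lines in settings.py could look like this; these are real Scrapy setting names, but the exact values below are my guess, not the author's:

```python
# settings.py (fragment) -- values are illustrative
DOWNLOAD_TIMEOUT = 30   # seconds before a request is abandoned, avoids long hangs
RETRY_ENABLED = True
RETRY_TIMES = 3         # retry a few times instead of failing on a single timeout
```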
2.2.2. Note that for MONGODB_SERVER you have to set up a database username and password for your MongoDB; otherwise Scrapy will show a “user is empty” error.
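The MongoDB-related settings might then look like this; the setting names follow a common community convention, and the host and credentials below are placeholders:

```python
# settings.py (fragment) -- placeholders, not real credentials
MONGODB_SERVER = "mongodb://dbuser:dbpassword@localhost:27017/test"
MONGODB_PORT = 27017
MONGODB_DB = "test"
MONGODB_COLLECTION = "news"
```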
2.2.3. Refer to this to set the MongoDB username and password
“This answer is for Mongo 3.2.1 Reference
if you want to add without roles (optional):
to check if authenticated or not:
it should give you:
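Putting the quoted steps together, the Mongo 3.2 shell session might look like this (username and password are placeholders):

```
// run in the mongo shell against the target database
use test
db.createUser({
  user: "dbuser",
  pwd: "dbpassword",
  roles: [{ role: "readWrite", db: "test" }]
})

// or, without roles (optional):
// db.createUser({ user: "dbuser", pwd: "dbpassword", roles: [] })

// to check whether you are authenticated:
db.auth("dbuser", "dbpassword")
// a return value of 1 means authentication succeeded
```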
2.2.4. For my spider, I store the data in the “test” database, in the “news” collection.
The pipeline.py aims to filter out sensitive words, clean up the current collection before storing data, and insert the newly scraped data into the collection.
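A sketch of what such a pipeline could look like; the sensitive-word list, URI, and database/collection names are placeholders, and pymongo is imported lazily so the filtering logic works standalone:

```python
class MongoNewsPipeline:
    # hypothetical sensitive-word list -- the real one is not in the article
    SENSITIVE = {"spamword", "badword"}

    def __init__(self, mongo_uri="mongodb://dbuser:dbpassword@localhost:27017/test"):
        self.mongo_uri = mongo_uri
        self.client = None

    def open_spider(self, spider):
        import pymongo  # imported lazily so the filter logic is testable without a DB
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client["test"]
        # clean up the current collection before storing fresh data
        self.db["news"].delete_many({})

    def is_clean(self, item):
        """True if no sensitive word appears in any field of the item."""
        text = " ".join(str(v) for v in item.values()).lower()
        return not any(word in text for word in self.SENSITIVE)

    def process_item(self, item, spider):
        if not self.is_clean(item):
            # in a real Scrapy pipeline you would raise scrapy.exceptions.DropItem here
            return None
        if self.client is not None:
            self.db["news"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()
```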
These are the packages needed to deploy to the Heroku Scheduler.
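For reference, a minimal requirements.txt for this stack might contain the following (unpinned here; in practice, pin the versions you tested with):

```
Scrapy
pymongo
```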
I didn’t change anything from this for the above two files; they are simple setup scripts.
Finally, let’s look at the core spider file.
2.7.1 Note that start_urls contains the URLs that we want to scrape.
2.7.2 parse(self, response) will keep scraping the pages in start_urls, but it may not follow the sequence in which the URLs are listed. That is, the response for “infoworld” may come earlier than the one for “techcrunch”.
Thus, I use simple parallel arrays to maintain the sequence of the URLs for comparison: “xpathSel”, “compareUrls”, “sites”, and “domains”.
2.7.3 For each site in start_urls, the spider will grab “GRABNO” entries from the response, which is 3 here.
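The ordering trick can be sketched in plain Python: since responses arrive out of order, the callback maps response.url back to its index in the parallel arrays (the URLs and site names below are placeholders):

```python
# hypothetical parallel arrays -- one entry per site, kept in a fixed order
compareUrls = ["https://www.infoworld.com", "https://techcrunch.com"]
sites = ["infoworld", "techcrunch"]
GRABNO = 3  # number of entries to keep per site

def site_index(response_url, compare_urls):
    """Return the position of the site this response belongs to, or -1."""
    for i, prefix in enumerate(compare_urls):
        if response_url.startswith(prefix):
            return i
    return -1

# inside parse(self, response) one would then do something like:
#   idx = site_index(response.url, compareUrls)
#   titles = response.xpath(xpathSel[idx]).extract()[:GRABNO]
```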
XPath is an easy-to-learn web/XML path selector. While I am not an XPath expert, here are some tips that I can share.
1) “/” selects a node that is a direct child of its parent
2) “//” selects a node anywhere under the parent, no matter how deep
3) “[@class=]” is useful if “//../” doesn’t work. For example,
When “//h2/a/text()” doesn’t work, you should try this,
4) “text()” grabs text and “@href” scrapes the URL. Remember to put “/” before them in some cases.
5) If we try to scrape a node whose class we only know partially, use “[contains(@class, "<some words>")]” instead of “[@class="<full words>"]”
To conclude, XPath is not hard to learn. We can figure it out quickly by scraping something, printing it, and analyzing it.
(3) Results in the MongoDB
The fields of the items saved in MongoDB are exactly the same as the fields defined in “item.py”.
To run the spider, there are three options, of which the latter two will also dump the output to files.
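For reference, the three ways of running the spider from the project root could look like this (the spider name "news" is a placeholder; use the name declared in your spider class):

```
# option 1: just run the spider
scrapy crawl news
# options 2 and 3: run and also dump the scraped items to a file
scrapy crawl news -o result.json
scrapy crawl news -o result.csv
```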
My .json file result
(5) Heroku Deploy
We can easily set up a free Heroku Scheduler, as the image below illustrates. Since a web spider generates a lot of throughput and AWS charges by it, deploying on Heroku will save some bucks if your spider runs only once per day.
First, we need to create a Procfile to tell Heroku which command runs this application. So create a Procfile and put the following command in it.
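A one-line Procfile along these lines would work (again, "news" is a placeholder for your actual spider name):

```
worker: scrapy crawl news
```

With the worker process defined, the Heroku Scheduler can invoke the same `scrapy crawl` command on its daily schedule.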
Then we can set a free Scheduler in Heroku.
The scheduler will run it daily at 0:00 and push the results to the database set up before.
(6) Post to your website
Congratulations! You now know how to write a web spider and deploy it as a product. Enjoy Scrapy and XPath!