Web Scraping Using Node.js

Scrapingdog is a web scraping API that lets you scrape any website in a single API call. It handles millions of proxies, headless browsers, and CAPTCHAs, so developers and even non-developers can focus on data collection, and you can start with 1,000 free API calls. Two common pain points motivate tooling like this. First, a script with scraping functions that processes a large input (say, a CSV with a single column of 200 URLs) can take long enough to return its output that the calling request times out. Second, it is easy to struggle to capture all the text from a particular section of a page, for instance where part of the content sits behind a “Read More” control: CSS selectors identified with a tool such as Selector Gadget may capture only part of the available text.
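As a quick illustration of the single-API-call idea, a Node.js sketch might look like the following. The endpoint and parameter names here are assumptions, so check Scrapingdog’s documentation for the exact API:

```javascript
// Hypothetical sketch of calling a scraping API from Node.js.
// The endpoint and parameter names are assumptions; consult the provider's docs.
const axios = require('axios');

async function scrapePage(targetUrl) {
    const response = await axios.get('https://api.scrapingdog.com/scrape', {
        params: {
            api_key: process.env.SCRAPINGDOG_API_KEY, // your API key
            url: targetUrl,                           // page to scrape
        },
    });
    return response.data; // raw HTML of the target page
}

scrapePage('https://example.com/').then((html) => console.log(html.length));
```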

For web scraping you can use PhantomJS, Nightmare.js, and similar tools. In some cases, however, a server will detect that the request comes from a bot rather than a real user when you use PhantomJS or Nightmare. You can often (though not always) avoid this by using Selenium instead. Selenium is a web testing framework that automatically drives a real browser to mimic a normal user; once a page loads, you can scrape its content. To use Selenium in your project you need to follow these steps:

You can check the documentation for this here:

After that you need to install the Selenium WebDriver package (selenium-webdriver on npm).

For a detailed description of installation and usage, you can go through the following link:

If you get the following error:

then you need to download the latest version of geckodriver, or first check your PATH. If you are using Ubuntu, you can install geckodriver directly from the following link:

After that you also need to install a compatible Firefox version, which you can easily download via the following link:

That issue is related to the Firefox version and the geckodriver version being used together, so I upgraded my Firefox browser to the stable version (51.0.1), upgraded the driver to 0.16.1, and set the PATH again in .bashrc; after that, the issue we were facing was resolved. If everything works fine, you can then get the HTML content of any webpage via the page source (getPageSource() in the Node.js bindings).
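A minimal sketch in Node.js, assuming the selenium-webdriver npm package is installed (npm install selenium-webdriver) and geckodriver is on your PATH:

```javascript
// Minimal sketch: fetch a page's HTML with Selenium WebDriver in Node.js.
const { Builder } = require('selenium-webdriver');

(async () => {
    // Launch a Firefox instance driven by geckodriver.
    const driver = await new Builder().forBrowser('firefox').build();
    try {
        await driver.get('https://example.com/');   // load the target page
        const html = await driver.getPageSource();  // full HTML after the page loads
        console.log(html);
    } finally {
        await driver.quit(); // always close the browser
    }
})();
```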

In this way you can get the page source using Selenium WebDriver in Node.js.

Hope this will help. Thanks!

We recently composed a scraper that extracts data from a static site. By a static site, we mean one that does not use JS scripting to load or transform on-page data.

If you are interested in scraping a JS-rendered site, please read the following: Scraping a JavaScript-dependent website with Puppeteer.

Technology stack

  1. Node.js, the server-side JS runtime environment. The main characteristic of Node.js is asynchronous code execution.
  2. Apify SDK, the scalable web scraping and crawling library for JavaScript/Node.js. Let’s highlight its excellent characteristics:
  • automatically scales a pool of headless Chrome/Puppeteer instances
  • maintains queues of URLs to crawl (handled, pending) – this makes it possible to accommodate crawler failures and resume interrupted runs.
  • saves crawl results to a convenient [JSON] dataset (local or in the cloud)
  • allows proxy rotation
    We’ll use Apify’s Cheerio crawler to crawl and extract data from the target site. The target is https://www.ebinger-gmbh.com/.
Read about a project that uses the Apify SDK to scrape JS-rendered content.

Get category links – initial URLs for the crawler

First, we find all the category links on the website. For that we used Scraper, a Google Chrome extension:

This will provide us with some links based on an XPath expression. Edit the XPath so that all possible categories are included.

Now we have a categories file, categories.txt

Cheerio Crawler

Let’s copy a Cheerio Crawler example from the official Apify site (do not forget npm i apify --save). As usual, in a new folder we init a Node project and create an index.js file. Now we’ll customize the crawler. Let’s highlight its important features.
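A minimal CheerioCrawler skeleton, roughly along the lines of the official example, might look like this (the exact boilerplate and option values below are a sketch, not the original code):

```javascript
// Sketch of a basic CheerioCrawler, using the classic Apify SDK API (npm package `apify`).
const Apify = require('apify');

Apify.main(async () => {
    // Request queue to be filled with the category URLs (next section).
    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        // Called once for every page fetched; $ is a Cheerio handle over the HTML.
        handlePageFunction: async ({ request, $ }) => {
            console.log(`Processing ${request.url}`);
            // ...extraction logic goes here (see below)...
        },
        // Optional safety limits; the values are illustrative.
        maxRequestRetries: 3,
        maxConcurrency: 10,
    });

    await crawler.run();
});
```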

Init the crawler queue with category URLs

The following code inserts the category links into the crawler’s request queue:
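A sketch of this step, assuming the URLs sit one per line in categories.txt and using a hypothetical 'CATEGORY' label to mark them, could look like this:

```javascript
// Sketch: load categories.txt and enqueue each URL.
// Runs inside Apify.main's async function, after the request queue is opened.
// The userData label 'CATEGORY' is an assumption used to tell page types apart later.
const fs = require('fs');

const categoryUrls = fs
    .readFileSync('categories.txt', 'utf8')
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);

for (const url of categoryUrls) {
    await requestQueue.addRequest({ url, userData: { label: 'CATEGORY' } });
}
```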

Main crawler function


handlePageFunction() is the main crawler function in the index.js file; it is called once for each page fetched by a request and performs the crawler’s logic.

Discern between (1) category pages and (2) product pages


Since we store both kinds of URLs (category pages and product pages) in the same queue, we need to apply an if/else when processing each request inside the crawler:

  1. From category pages we retrieve pagination links and links to the single product pages.
  2. At a product page we retrieve the product data and push it both into the Apify dataset and into the custom array (see the handlePageFunction sketch below).
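A sketch of what such a handlePageFunction might look like follows. The CSS selectors, field names, and the 'CATEGORY'/'PRODUCT' labels are placeholders rather than the original site’s markup, and requestQueue and total_data are assumed to come from the surrounding crawler setup:

```javascript
// Sketch of the crawler's handlePageFunction with the category/product if/else.
// Selectors and field names are illustrative placeholders, not the original code.
const handlePageFunction = async ({ request, $ }) => {
    if (request.userData.label === 'CATEGORY') {
        // (1) Category page: enqueue single product pages (pagination links
        // would be enqueued the same way, with the 'CATEGORY' label).
        const productLinks = $('a.product-link') // placeholder selector
            .map((i, el) => $(el).attr('href'))
            .get();
        for (const href of productLinks) {
            await requestQueue.addRequest({
                url: new URL(href, request.url).href,
                userData: { label: 'PRODUCT' },
            });
        }
    } else {
        // (2) Product page: extract the data and store it in both places.
        const item = {
            url: request.url,
            title: $('h1').first().text().trim(),     // placeholder field
            price: $('.price').first().text().trim(), // placeholder field
        };
        await Apify.pushData(item); // Apify's default [JSON] dataset
        total_data.push(item);      // custom array, written to CSV later
    }
};
```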

Save results into CSV file

Since Apify stores data locally as JSON, and we need to deliver the data as CSV, we’ll use the csv-writer package. The array where we push the data is total_data.
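A sketch of this step with csv-writer, assuming the output path and that the column set shown here matches the fields pushed into total_data, could be:

```javascript
// Sketch: write the accumulated total_data array to a CSV file with csv-writer.
// Runs after crawler.run() finishes, still inside an async function.
// The output path and column definitions are assumptions for illustration.
const { createObjectCsvWriter } = require('csv-writer');

const csvWriter = createObjectCsvWriter({
    path: 'products.csv',
    header: [
        { id: 'url', title: 'URL' },
        { id: 'title', title: 'Title' },
        { id: 'price', title: 'Price' },
    ],
});

await csvWriter.writeRecords(total_data); // one row per scraped product
```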

Storing the data in the total_data array and later writing it to CSV eliminates the need to use Apify’s own dataset, but we have still left the dataset call inside the crawler code.

Execution result and code repo

The result (~1500 entries) was fetched in a local run, through asynchronous requests, within a minute. The code repo is here: https://github.com/igorsavinkin/ebinger.


Conclusion


The asynchronous Node.js runtime environment offers a new era of web development, where no extra threads are needed and parallel asynchronous runs do their job excellently.