NodeJS Scraping

Every programmer at some point in their career will need to scrape at least one webpage, guaranteed; it’s almost a rite of passage. Recently I started a side-project that required data, data which unfortunately couldn’t be provided in a programmatic manner and instead needed extracting from two completely different websites.

An annoyance? Maybe. An opportunity to try out some different scraping frameworks? Definitely! We wanted to do this scraping in a lean manner and we don’t anticipate using the final code in anger, so the search was limited to NodeJS, and the following two libraries were found:

[Node Osmosis](https://github.com/rchipka/node-osmosis) - Describes itself as an "HTML/XML parser and web scraper for NodeJS", lovely.

[NightmareJS](http://www.nightmarejs.org/) - Lists itself as "A high-level browser automation library", which just about describes every scraper ever written...

The applications are very different. Node Osmosis is a simple HTML/XML parser; it doesn’t process JS and it doesn’t render CSS. NightmareJS, however, uses Electron to load, render and then parse webpages (JS, CSS, the works), which makes it very useful for scraping JS/AJAX-powered sites.

Node Osmosis

Overall I found Node Osmosis easier to get along with, probably because of the reduced complexity that comes from not having to deal with dynamic JS content. Simply navigate to a URL:

const osmosis = require('osmosis');
osmosis.get(url);

Use a CSS selector to find an element:

osmosis.find('.row div ul li a');

Follow the href in the previously targeted element:

osmosis.follow('@href');

Find another element:

osmosis.find('.store-address-container h5');

Assign a single property to a variable:

osmosis.set('location');

Or assign multiple properties to an object:

osmosis.set({
    'name': 'div:nth-child(1)',  // :nth-child is 1-indexed; (0) would match nothing
    'firstLine': 'div:nth-child(2)'
});

Process the set data:

osmosis.data(function (listing) {
    console.log(listing.location, listing.name, listing.firstLine);
});

This is an oversimplification, but it does demonstrate how easily you can make use of the library.
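Put together, those calls chain into a single pipeline. Below is a minimal sketch combining the steps above; the URL and selectors are illustrative, standing in for a hypothetical store-listing page:

const osmosis = require('osmosis');

osmosis
    .get('https://example.com/stores')        // hypothetical listing page
    .find('.row div ul li a')                 // each store link on the page
    .follow('@href')                          // visit each store's own page
    .find('.store-address-container h5')
    .set('location')
    .set({
        'name': 'div:nth-child(1)',
        'firstLine': 'div:nth-child(2)'
    })
    .data(function (listing) {
        console.log(listing.location, listing.name, listing.firstLine);
    })
    .error(console.error);

Each step runs against the context produced by the previous one, so the whole scrape reads top to bottom like the page traversal it performs.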

NightmareJS

NightmareJS is much more powerful and, in my opinion, much more interesting to watch. Because it uses a browser to render the site (Chromium wrapped in the Electron runtime), you can opt to watch the application load and process pages (click links, fill in forms, etc.). Below is an example of a simple scraping process:

Navigate to a URL:

nightmare.goto(url);

Type something into an input:

nightmare.type('#inputPostcode', 'SW1A 2AA');

Click a button:

nightmare.click('#find-a-store-btn');

Wait for an element to become visible:

nightmare.wait('div.well.white.relative.full-width-xs.store-list-section');

Run JavaScript natively within the browser:

nightmare.evaluate(function () {
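    // This function runs inside the page; only the return value is passed back to Node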
    return {
        'foo': 'Data which',
        'bar': 'Can be manipulated'
    };
});

Use the data returned by the evaluate:

nightmare.then(function (resultData) {
    console.log(resultData);
});
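Again, the calls chain together into one flow. Here is a minimal end-to-end sketch, assuming the same hypothetical store-finder page (the URL is illustrative); passing show: true to the constructor opens the Electron window so you can watch it work:

const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true }); // open the Electron window so you can watch

nightmare
    .goto('https://example.com/find-a-store') // hypothetical store finder
    .type('#inputPostcode', 'SW1A 2AA')
    .click('#find-a-store-btn')
    .wait('div.well.white.relative.full-width-xs.store-list-section')
    .evaluate(function () {
        return document.querySelector('.store-list-section').innerText;
    })
    .end()                                    // shut down Electron when finished
    .then(function (resultData) {
        console.log(resultData);
    })
    .catch(function (error) {
        console.error('Scrape failed:', error);
    });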

The biggest issue I had with NightmareJS was actually the asynchronous nature of JS. You expect everything to happen in a linear manner, a->b->c, but because assets load dynamically and we’re trying to scrape across multiple pages, you can end up with processes that run a->c->b.
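The fix is to make the ordering explicit. One approach is to chain the promises so each page is only visited after the previous one has finished; a minimal sketch, assuming the nightmare instance from above and a hypothetical list of store URLs gathered in an earlier step:

const urls = [
    'https://example.com/store/1', // hypothetical URLs from an earlier step
    'https://example.com/store/2'
];

// Reduce over the URLs so each .goto() only fires once the previous page
// has been scraped, guaranteeing a->b->c ordering.
urls.reduce(function (chain, url) {
    return chain.then(function (results) {
        return nightmare
            .goto(url)
            .evaluate(function () {
                return document.querySelector('h5').textContent;
            })
            .then(function (text) {
                results.push(text);
                return results;
            });
    });
}, Promise.resolve([]))
    .then(function (results) {
        console.log(results);
        return nightmare.end();
    });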
