Class: Crawler

new Crawler(args)

Crawler

Parameters:
Name Type Description
args Object

arguments to the constructor

Properties
Name Type Description
concurrency Integer

maximum number of parallel network operations

loadPageList function

a function with signature loadPageList( cb ) that asynchronously returns a new batch of page URLs to crawl. When loadPageList returns an empty array, the Crawler treats it as the end of the page list and stops once all loaded pages have been scraped.

pageListFilter function

a function with signature pageListFilter( data, cb ). If provided, all data returned by loadPageList is passed through this function. Filtering cannot be folded into loadPageList itself, because an empty list returned by loadPageList is treated as the end of the page list and stops the Crawler. An empty result from the filter is not treated as the end: the Crawler keeps calling loadPageList until it returns an empty list or the filter yields non-empty data. If this takes a long time and all loaded tasks have already been processed, the Crawler pauses and resumes when it receives data.

scrapePage function

a function with signature scrapePage( url, cb ) that asynchronously extracts the data from a URL.

onError function

a function with signature onError( err, url, worker ). It is called when an error occurs during the crawling process.

onStopped function

a function with signature onStopped(). It is called when the Crawler stops after successfully crawling all its targets.

sink BufferedSink

An instance of the BufferedSink class that the Crawler uses to save the extracted data.
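The callback properties above can be illustrated with a minimal, self-contained sketch. It does not use the library itself: `batches` and `crawlAll` are hypothetical names, the URLs are placeholders, and the Node.js-style `cb(err, result)` convention is an assumption, since the callback arguments are not specified here. The sequential driver only demonstrates the documented stop rule (an empty batch from loadPageList ends the crawl, an empty filtered batch does not); it is not how the Crawler is implemented.

```javascript
// Hypothetical in-memory data standing in for a real site; not part of the library.
const batches = [
  ['http://example.com/a', 'http://example.com/skip'],
  ['http://example.com/skip'], // filters down to nothing, but is not the end
  ['http://example.com/b'],
  []                           // empty batch: signals the end of the page list
];

// loadPageList(cb): asynchronously hands back the next batch of URLs.
// Assumption: cb follows the Node.js (err, result) convention.
function loadPageList(cb) {
  const next = batches.length > 0 ? batches.shift() : [];
  setImmediate(() => cb(null, next));
}

// pageListFilter(data, cb): drops URLs that should not be crawled.
function pageListFilter(data, cb) {
  const kept = data.filter((url) => !url.endsWith('/skip'));
  setImmediate(() => cb(null, kept));
}

// scrapePage(url, cb): asynchronously produces a record for one URL.
// A real implementation would fetch and parse the page here.
function scrapePage(url, cb) {
  setImmediate(() => cb(null, { url: url, length: url.length }));
}

// Simplified sequential driver that illustrates the stop rule only.
// The real Crawler runs workers in parallel and writes results to a sink.
function crawlAll(done) {
  const results = [];

  function nextBatch() {
    loadPageList((err, urls) => {
      if (err) return done(err);
      if (urls.length === 0) return done(null, results); // end of page list
      pageListFilter(urls, (filterErr, kept) => {
        if (filterErr) return done(filterErr);
        scrapeEach(kept, 0);
      });
    });
  }

  function scrapeEach(urls, i) {
    // Reaching the end of a batch (including an empty filtered batch)
    // loads the next batch instead of stopping.
    if (i === urls.length) return nextBatch();
    scrapePage(urls[i], (err, record) => {
      if (err) return done(err);
      results.push(record);
      scrapeEach(urls, i + 1);
    });
  }

  nextBatch();
}
```

Calling `crawlAll(function (err, results) { ... })` with the data above yields records for `/a` and `/b` only: the second batch is filtered to nothing, yet the run continues until loadPageList itself returns an empty array.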

Methods

scrap(arguments, cb) → {JobManager}

Starts scraping pages.

Parameters:
Name Type Description
arguments mixed

Arguments passed to the loadPageList function.

cb function

Callback executed after completing the whole process.

Returns:

A new instance of JobManager that can pause and resume the crawl, change concurrency dynamically, add or remove workers, and so on. See JobManager.

Type
JobManager
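A call to scrap might look roughly like the sketch below. `startCrawl` and `fakeCrawler` are hypothetical names introduced only so the example runs without the library, and passing an error as the first argument of the completion callback is an assumption. No JobManager methods are called, because their names are not given on this page.

```javascript
// Starts a crawl and returns whatever scrap() hands back (a JobManager).
// `crawler` is any object exposing scrap(arguments, cb) as documented above.
function startCrawl(crawler, loadArgs, onDone) {
  const manager = crawler.scrap(loadArgs, function (err) {
    // Assumption: the completion callback receives an error, if any, first.
    onDone(err || null);
  });
  return manager;
}

// Stand-in crawler used only to make this sketch runnable without the library.
const fakeCrawler = {
  scrap(args, cb) {
    setImmediate(cb);             // pretend the crawl finishes asynchronously
    return { startedWith: args }; // placeholder for a JobManager instance
  }
};
```

With a real Crawler instance in place of `fakeCrawler`, the returned object is the JobManager, which should be kept if the crawl needs to be paused or resumed later.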