I do a lot of web scraping with Node.js in my work. Usually, you do not want to fire all your API calls in one go, as doing so is likely to overwhelm other people’s servers, trigger their DDoS protection, or worse, take them offline.

Data scraping on the web is usually done in two steps:

  1. You visit an index page where you find a “listing” of all the sub-resources you can call to fetch the details. Take a property portal as an example: the site might organize the property information by location, or every property might have its own page.
  2. One by one, you visit the items on that list to fetch the details, as sketched below.
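
For concreteness, here is a minimal sketch of that two-step flow. It assumes a hypothetical property portal that serves its index at /properties and each detail page at /properties/:id; substitute whatever the target site actually exposes:

	// Minimal sketch of the two steps, using Node's built-in fetch (Node 18+).
	// The endpoints are hypothetical placeholders.
	async function scrape () {
	  // Step 1: fetch the index and collect the detail URLs
	  const index = await fetch('https://example.com/properties').then(res => res.json())
	  const detailUrls = index.map(item => `https://example.com/properties/${item.id}`)

	  // Step 2: visit the detail pages one by one (the rest of this post
	  // is about how to pace these calls)
	  const details = []
	  for (const url of detailUrls) {
	    details.push(await fetch(url).then(res => res.json()))
	  }
	  return details
	}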

Wrapping all the step 2 API calls in one giant Promise.all is certainly a bad idea. The correct approach is to stagger the calls and/or rate-limit them. JavaScript’s Promise lends itself well to staggering because it is easy to create waterfall behavior: simply chain your invocations with .then so they run one after another. For example:

	Promise.waterfall = function (array, invoke) {
	  // Start from an already-resolved promise and chain every call onto it,
	  // so the calls run one after another instead of all at once.
	  let pending = Promise.resolve()
	  const results = []

	  for (const [i, item] of array.entries()) {
	    pending = pending
	      .then(() => invoke(item, i))
	      .then(result => results.push(result))
	  }

	  return pending.then(() => results)
	}
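
With this helper, step 2 of the scrape becomes a single call. A hypothetical usage, where fetchDetails stands in for whatever function loads and parses one detail page:

	// fetchDetails is a placeholder; detailUrls is the list gathered in step 1.
	const fetchDetails = url => fetch(url).then(res => res.json())

	Promise.waterfall(detailUrls, fetchDetails)
	  .then(details => console.log(details.length, 'items scraped'))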

To rate limit, you can introduce a timed delay:

	function delay (invoke, ms) {
	  // Wait ms, then make the call: the call itself is postponed.
	  return (...args) => new Promise(resolve => {
	    setTimeout(resolve, ms)
	  }).then(() => invoke(...args))
	}

	function delay2 (invoke, ms) {
	  // Make the call immediately, but hold the resulting promise
	  // open for at least ms before it resolves.
	  return (...args) => new Promise(resolve => {
	    setTimeout(resolve, ms, invoke(...args))
	  })
	}

	/* Usage
	Promise.waterfall(array, delay(invoke, 1000))
	Promise.waterfall(array, delay2(invoke, 1000))
	*/
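
Note the difference between the two wrappers: delay waits a full second before each call starts, while delay2 fires the call immediately and simply holds the chain for at least a second before moving on. Putting everything together, an end-to-end sketch (with the same hypothetical endpoints as above) might look like this:

	// Fetch the index, then walk the detail pages at roughly one request per second.
	const fetchDetails = url => fetch(url).then(res => res.json())

	fetch('https://example.com/properties')
	  .then(res => res.json())
	  .then(index => index.map(item => `https://example.com/properties/${item.id}`))
	  .then(urls => Promise.waterfall(urls, delay(fetchDetails, 1000)))
	  .then(details => console.log('scraped', details.length, 'properties'))
	  .catch(err => console.error('scrape failed', err))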

#javascript #nodejs #web-scraping #promises #programming
