I do a lot of web scraping with Node.js in my work. Usually, you do not want to fire all your API calls in one go, as doing so is likely to overwhelm other people’s servers, trigger their DDoS protection, or worse, take them offline.
Data scraping on the web is usually done in two steps:

1. Fetch an index (a listing page or search endpoint) to collect the URLs or IDs of the items you want.
2. Make one API call per item to fetch its details.
It is certainly a bad idea to wrap all the step 2 API calls in a giant `Promise.all`. The correct approach is to stagger the calls and/or apply a rate limit. JavaScript's `Promise` lends itself well to staggering because it is easy to create waterfall behavior: simply chain all your invocations with `.then` so they occur one after another. For example:
```js
Promise.waterfall = function (array, invoke) {
  let pending = Promise.resolve()
  const results = []
  array.forEach((item, i) => {
    // Chain each invocation onto the previous one so they run sequentially
    pending = pending
      .then(() => invoke(item, i))
      .then(result => results.push(result))
  })
  // Resolve with the collected results once the last invocation settles
  return pending.then(() => results)
}
```
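As a quick illustration, here is how it might be used on its own. `fetchUser` and the URL are hypothetical stand-ins for whatever call step 2 makes, and I am assuming Node.js 18+, where `fetch` is available globally:

```js
// Hypothetical helper: fetch one user's details from a placeholder API
const fetchUser = id =>
  fetch(`https://api.example.com/users/${id}`).then(res => res.json())

Promise.waterfall([1, 2, 3], fetchUser)
  .then(users => console.log(users)) // resolves in order, one request at a time
```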
To rate-limit, you can wrap the invocation in a timed delay:
```js
// Waits `ms`, then starts the invocation
function delay (invoke, ms) {
  return (...args) => new Promise(resolve => {
    setTimeout(resolve, ms)
  }).then(() => invoke(...args))
}

// Starts the invocation immediately, but resolves with its result
// no sooner than `ms` from now
function delay2 (invoke, ms) {
  return (...args) => new Promise(resolve => {
    setTimeout(resolve, ms, invoke(...args))
  })
}

/* Usage
Promise.waterfall(array, delay(invoke, 1000))
Promise.waterfall(array, delay2(invoke, 1000))
*/
```
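The two wrappers pace the waterfall differently: with `delay`, each cycle takes the pause plus the request time, while with `delay2` the request runs during the pause, so each cycle takes whichever of the two is longer. A minimal sketch to see this, using a fake request in place of a real API call:

```js
// Stand-in for a real API call that takes ~300 ms, logging when it resolves
const t0 = Date.now()
const fakeRequest = id =>
  new Promise(resolve => setTimeout(() => {
    console.log(`item ${id} done at ${Date.now() - t0}ms`)
    resolve(`item ${id}`)
  }, 300))

// Logs at roughly 1300, 2600, 3900 ms: each cycle is the 1000 ms pause plus the request
Promise.waterfall([1, 2, 3], delay(fakeRequest, 1000))

// Logs at roughly 300, 1300, 2300 ms: the request overlaps the pause
// Promise.waterfall([1, 2, 3], delay2(fakeRequest, 1000))
```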
#javascript #nodejs #web-scraping #promises #programming