Creating a new dataset is one of the challenges data scientists and analysts face today. One popular way to build a dataset is to gather information from different websites. As you may know, this is time-consuming, so being able to speed up the process without resorting to multiple VMs or clusters is a useful addition to your data science toolset. Here we dig into how to make parallel API connections using R.

This is a simple guide to using parallel computing for that purpose. The idea is that each worker (a thread or core of your computer) uses its own connection to access the web, so you don’t need to wait for an API response or for a dynamic website to load before starting on the next one. By adding a few simple commands to your code and wrapping it in a smart way, you can save a lot of time.
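To make the idea concrete, here is a minimal sketch using the `parallel` package that ships with R. It is not the code from the original test: `fake_request` is a hypothetical stand-in that only sleeps to imitate network latency, and the worker count of 4 is an arbitrary choice for illustration.

```r
library(parallel)

# Hypothetical stand-in for a slow API call: it waits, then returns a fake result.
fake_request <- function(id) {
  Sys.sleep(0.5)  # imitates waiting for a server response
  paste0("response-", id)
}

ids <- 1:8

# Sequential version: each request waits for the previous one to finish.
seq_time <- system.time(seq_res <- lapply(ids, fake_request))[["elapsed"]]

# Parallel version: start 4 worker processes (arbitrary number for this sketch).
cl <- makeCluster(4)
par_time <- system.time(par_res <- parLapply(cl, ids, fake_request))[["elapsed"]]
stopCluster(cl)  # always release the workers when done

cat("sequential:", seq_time, "s; parallel:", par_time, "s\n")
```

Because the work here is mostly waiting rather than computing, the parallel run should finish in a fraction of the sequential time, and the results come back in the same order as the input.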

Parallel API Connection

We are going to start by using R to make API connections in parallel. This lets us download information more efficiently; think of it as opening multiple tabs in your web browser. For our example, we use the OMDb API, which gives you access to a lot of information about movies. For our test, we used its Poster API to download movie posters and save them locally so we can process them later.
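As a rough sketch of what such a download could look like, the snippet below builds one Poster API URL per movie and fetches them with `clusterMap` from the `parallel` package. Several things here are assumptions rather than details from the original test: the URL pattern (`img.omdbapi.com` with `apikey` and `i` parameters) should be checked against the OMDb documentation, the `OMDB_API_KEY` environment variable and the `posters` folder are hypothetical names, and the IMDb IDs are only illustrative.

```r
library(parallel)

# Assumed Poster API URL pattern; verify it against the OMDb documentation.
build_poster_url <- function(imdb_id, api_key) {
  paste0("http://img.omdbapi.com/?apikey=", api_key, "&i=", imdb_id)
}

# Download one file; return its path, or NA if the request fails.
fetch_one <- function(url, dest) {
  tryCatch({
    utils::download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
    dest
  }, error = function(e) NA_character_)
}

api_key  <- Sys.getenv("OMDB_API_KEY")                  # hypothetical variable name
imdb_ids <- c("tt0111161", "tt0068646", "tt0468569")    # illustrative IMDb IDs
out_dir  <- "posters"                                   # hypothetical output folder

urls  <- build_poster_url(imdb_ids, api_key)
dests <- file.path(out_dir, paste0(imdb_ids, ".jpg"))

# Only touch the network when an API key is actually configured.
if (nzchar(api_key)) {
  dir.create(out_dir, showWarnings = FALSE)
  cl <- makeCluster(3)                                  # one worker per request here
  saved <- clusterMap(cl, fetch_one, urls, dests)
  stopCluster(cl)
  print(unlist(saved))
}
```

Building the URLs on the main process and passing them to the workers keeps each worker's job small: it only needs `fetch_one`, a URL, and a destination path.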

