Internet of Things (IoT) and 5G technologies will produce an astounding amount of data recording the links between people, devices, events, locations, and time. According to IDC's forecast, 41.6 billion IoT devices will generate 79.4 zettabytes of data in 2025. With the rapid growth of IoT data and IoT applications, there is an ever-rising demand for highly efficient spatial-temporal data science workflows to gain insights from this overwhelming volume of data.
Scalability is key to building productive data science pipelines. To address this scalability challenge, we launched Arctern, an open-source spatial-temporal analytics framework for boosting end-to-end data science performance. Arctern aims to improve scalability from two aspects:
The rest of this article examines the current geospatial data science pipeline and reviews the tools, libraries, and systems used in each stage. By discussing the deficiencies of existing workflows, we underscore the importance of scalability. We will show that scalable interfaces, algorithms, and models not only reduce the time needed to solve mathematical and technical problems, but also improve collaboration and communication between data scientists and engineers.
To better understand the scalability problem in current workflows, consider the pipeline of spatial-temporal data science, illustrated in the figure above. Raw data is first generated by IoT devices and then collected into a data store. Data scientists perform exploratory analysis on this data. Based on their technical hypotheses and the data's features, they select appropriate models and develop a prototype for answering business questions. After a few iterations of evaluation and adjustment, the model is finally deployed on a data processing system, which in turn delivers better services to end users via IoT devices.
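To make the exploratory-analysis stage concrete, here is a minimal sketch of what a data scientist might do with raw IoT location records: filter points to a region of interest with a bounding-box test, then aggregate per-device travel distance with the haversine formula. The record layout, the `in_bbox` and `haversine_km` helpers, and the sample values are all hypothetical illustrations, not part of Arctern's API; a real workflow would use a library such as GeoPandas or Arctern itself for these primitives.

```python
import math
from collections import defaultdict

# Hypothetical raw IoT records: (device_id, lon, lat, timestamp).
records = [
    ("dev-1", -73.99, 40.75, "2025-01-01T08:00:00"),
    ("dev-1", -73.98, 40.76, "2025-01-01T08:05:00"),
    ("dev-2", -73.50, 40.20, "2025-01-01T08:01:00"),
]

# Bounding box of the area of interest: (min_lon, min_lat, max_lon, max_lat).
BBOX = (-74.05, 40.70, -73.90, 40.80)

def in_bbox(lon, lat, bbox=BBOX):
    """The simplest spatial filter: point-in-rectangle test."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two WGS84 points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Filter to the region of interest, grouping retained points per device.
tracks = defaultdict(list)
for device_id, lon, lat, ts in records:
    if in_bbox(lon, lat):
        tracks[device_id].append((lon, lat))

# Aggregate: total distance travelled per device inside the region.
distance_km = {
    dev: sum(haversine_km(*a, *b) for a, b in zip(pts, pts[1:]))
    for dev, pts in tracks.items()
}
print(distance_km)
```

At city or national scale, this per-point Python loop is exactly the kind of code that stops scaling, which is the gap a vectorized, parallel framework is meant to fill.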
The whole process breaks down into three stages:
#geopandas #postgis #arctern-project #geospatial #data-science #data-analysis