Introduction

Apache Spark is a distributed big data analytics framework designed to transform, engineer, and process massive amounts of data (think terabytes and petabytes) across a cluster of machines. When working with diverse datasets, you will come across complex data types and formats that require expensive compute and transformations (think IoT device telemetry). Under the hood, Apache Spark is highly specialized at scaling these big data engineering efforts. In this blog, using the native Scala API, I will walk you through examples of 1.) how to flatten and normalize semi-structured JSON data with a nested schema (arrays and structs), 2.) how to pivot your data, and 3.) how to save the data to storage as Parquet for downstream analytics. Note that the same exercises can be achieved using the Python API and Spark SQL.
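
To set the stage, here is a minimal Scala sketch of the three steps covered in this post. The column names, schema, and file paths are hypothetical placeholders, not the dataset used later in the blog.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NestedJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NestedJsonFlattenPivot")
      .getOrCreate()

    // Read semi-structured JSON with nested arrays/structs (hypothetical path)
    val raw = spark.read
      .option("multiLine", "true")
      .json("/data/devices.json")

    // 1) Flatten: explode the array column, then promote struct fields to top-level columns
    //    (assumes a column `readings: array<struct<metric: string, value: double>>`)
    val flat = raw
      .withColumn("reading", explode(col("readings")))
      .select(
        col("deviceId"),
        col("reading.metric").as("metric"),
        col("reading.value").as("value")
      )

    // 2) Pivot: one column per metric, averaging the values per device
    val pivoted = flat
      .groupBy("deviceId")
      .pivot("metric")
      .agg(avg("value"))

    // 3) Persist as Parquet for downstream analytics (hypothetical output path)
    pivoted.write.mode("overwrite").parquet("/data/output/devices_pivoted")

    spark.stop()
  }
}
```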

#big-data #data-engineering #data-science #scala #spark
