Jamison Fisher

1620441300

An Accelerated Big Data Workflow for the Data Analyst

Analyze billions of records faster with RAPIDS & Nvidia GPUs on Google Cloud’s AI Platform Notebooks.

“Of the cast of characters mentioned … the only ones every business with data needs are decision-makers and analysts”

— Cassie Kozyrkov (Data Science’s Most Misunderstood Hero)

Analysts and “citizen data scientists” are the often-forgotten heroes of every organization. They tend to have a wide range of responsibilities spanning business domain knowledge, data extraction & analysis, predictive analytics & machine learning, and reporting & communication to stakeholders. Piece of cake, right?

As the size of data has grown, many of these practitioners have also had to learn parts of big data frameworks and infrastructure management. This increased scope of work is not sustainable, and it directly cuts into the most important steps of the workflow: data exploration & experimentation. The result can be rudimentary reports, less accurate predictive models & forecasts, and less innovative insights & ideas.

Am I really suggesting yet another big data framework/service? Don’t we already have Hive, Impala, Presto, Spark, Beam, BigQuery, Athena, and the list goes on? Don’t get me wrong. For teams running data platforms for a large organization, one or more of these frameworks/services is essential for managing hundreds of batch and streaming jobs, a vast ecosystem of data sources, and production pipelines.

My focus here, however, is the data analyst who wants a flexible and scalable solution that accelerates their existing workflows with minimal code changes. Before thinking about multi-node clusters and new frameworks, you’d be surprised how much can be done with your existing code on one machine with some help from GPUs. Using myself as a guinea pig, I wanted to explore a workflow with the following constraints:

  1. I want the setup (both hardware and software) to be easy and quick (< 30 min)
  2. I don’t want to manage a distributed cluster or learn a new framework
  3. I want full flexibility to interact with the Python data & machine learning ecosystem (Jupyter, pandas, XGBoost, TensorFlow, PyTorch, scikit-learn)
  4. I want to be able to scale to hundreds of millions of rows of data without waiting overnight for the results

These constraints led me to RAPIDS with the help of the new & powerful Nvidia A100 GPU.

#rapids-ai #big-data #gpu #google-cloud-platform #pandas