Introduction to Big Data with Vaex — A Simple Code to Read 1.25 Billion Rows

In this article, we will discuss how to deal with big data in practice. We will use galaxy simulation data from the Gaia Universe Model Snapshot (GUMS), which contains about 3.5 billion objects and is publicly accessible. As an example, we will read just 1.25 billion rows.

Nowadays, we are said to be entering several new eras at once. Some people say that we are in the Disruption Era. To understand it, we can borrow a term from Schwartz (1999), the title of his book Digital Darwinism. The term describes an era in which businesses struggle to adapt to the evolution of technology and science. Digital platforms and globalization change customers' paradigms and, with them, their needs.

On the other hand, some people say that we are entering the Big Data Era. Almost every discipline has experienced a data boom, and astronomy is one of them. Astronomers worldwide have realized that they need to build ever bigger telescopes and observatories, often as consortia, to collect more data. For example, in the early 2000s, an all-sky survey called the Two Micron All Sky Survey (2MASS) cataloged about 470 million objects. In April 2018, Gaia, a space-based telescope, published its second data release (DR2), consisting of about 1.7 billion objects. How do astronomers handle it?

Read 1 billion rows with Vaex

First, I must thank Maarten Breddels, who built Vaex, a Python module for reading and visualizing big data. It can process about a billion rows per second. You can read more in the Vaex documentation.

Vaex reads files most efficiently in the HDF5 and Apache Arrow formats. Here, we will use several HDF5 files. To open a single file in Vaex, you can use the following code:

import vaex as vx

df = vx.open('filename.hdf5')

If you want to open several HDF5 files at once, use this code instead:

df = vx.open_many(['file1.hdf5', 'file2.hdf5'])
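When the data set is split across many files, typing every name by hand becomes tedious. The following is a minimal sketch, not part of the original workflow, that gathers the file names first and then hands them to open_many. The helper names collect_hdf5_files and open_all are hypothetical, and the sketch assumes all the HDF5 files sit in one directory and share the same columns.

```python
from pathlib import Path


def collect_hdf5_files(directory):
    """Return the paths of all .hdf5 files in `directory`, sorted by name."""
    return sorted(str(path) for path in Path(directory).glob("*.hdf5"))


def open_all(directory):
    """Open every .hdf5 file in `directory` as one Vaex DataFrame.

    Hypothetical helper: it only wraps vx.open_many, as used above.
    """
    files = collect_hdf5_files(directory)
    if not files:
        raise FileNotFoundError(f"no .hdf5 files found in {directory!r}")
    # Imported here so that collect_hdf5_files works even without Vaex.
    import vaex as vx

    return vx.open_many(files)
```

With Vaex installed, df = open_all('gums') followed by len(df) would report the combined number of rows. The directory name 'gums' is only a placeholder for wherever the files were downloaded.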
