Big data:

As the name suggests, big data refers to a massive amount of data that cannot be stored and processed with a traditional computer system. But how do we decide whether a dataset is big data? It depends on three components:

  • Volume
  • Velocity
  • Variety

Volume: refers to the size of the data, e.g. 10 GB or 1 TB.
Velocity: refers to the rate at which the data is produced, e.g. 1 KB/microsecond or 1 MB/s.
Variety: refers to the type of the data, e.g. structured, unstructured, or semi-structured.

Based on these three components, we can decide whether or not a dataset counts as big data.

For example: if you want to attach a 50 MB document to an email but the attachment limit is 25 MB, then relative to that system the 50 MB document is "big data" — its volume exceeds what the system can handle.

So, in real-world usage, there are situations where we simply cannot read and process the data with our traditional computer systems. To solve these kinds of problems, Google introduced a game-changing design called the Google File System (GFS). After that, the Hadoop Distributed File System (HDFS) and many more distributed systems came to the industry.

You can read the Google File System paper at the link below:

https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
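To make this concrete, here is a minimal sketch of processing a file stored on HDFS with Spark. It assumes PySpark is installed, and the HDFS URL and file path are hypothetical placeholders; the point is only that Spark distributes the work across the cluster's workers instead of loading the whole file onto one machine.

```python
# A minimal sketch. Assumes PySpark is installed; the HDFS URL and
# file path below are hypothetical placeholders.
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("BigDataIntro").getOrCreate()

# Spark reads the file as a distributed dataset: each worker
# processes its own blocks of the file, so no single machine
# has to hold the whole file in memory.
lines = spark.read.text("hdfs://namenode:9000/data/logs.txt")

# A simple distributed computation: count the lines in the file.
print(lines.count())

spark.stop()
```

Even for a file far larger than any one machine's memory, this same code works unchanged, because both the storage (HDFS) and the processing (Spark) are distributed.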

OK, now let's see what the role of Apache Spark is here.

