An introduction to Big Data and the main techniques employed to handle it, such as MapReduce, Apache Spark and Hadoop.

Big Data

According to Forbes, about 2.5 quintillion bytes of data are generated every day, and this figure is only projected to keep growing in the coming years (90% of the data stored today has been produced within the last two years) [1].

What makes Big Data different from any other large amount of data stored in relational databases is its heterogeneity. The data comes from different sources and has been recorded using different formats.

Three different ways of formatting data are commonly employed:

  • Unstructured = unorganised data (e.g. videos).
  • Semi-structured = the data is organised, but not in a fixed format (e.g. JSON).
  • Structured = the data is stored in a fixed, tabular format (e.g. in an RDBMS).
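The difference between semi-structured and structured data can be seen in a few lines of Python. Below is a minimal sketch (the records and field names are illustrative) showing JSON objects whose fields vary from record to record, something a fixed relational schema would not allow:

```python
import json

# Hypothetical semi-structured records: each JSON object may carry
# different fields, unlike rows bound to a fixed relational schema.
records = [
    '{"user": "alice", "age": 30}',
    '{"user": "bob", "city": "Berlin"}',
]

parsed = [json.loads(r) for r in records]

# The set of fields differs per record.
fields = [sorted(p.keys()) for p in parsed]
print(fields)  # [['age', 'user'], ['city', 'user']]
```

A relational table would force both records into one schema (with NULLs for missing columns); semi-structured formats simply carry each record's own keys.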

Big Data is defined by three properties:

  1. **Volume** = because of the sheer amount of data, storing it on a single machine is impossible. How can we process data across multiple machines while ensuring fault tolerance?
  2. **Variety** = how can we deal with data coming from varied sources that has been formatted using different schemas?
  3. **Velocity** = how can we store and process new data quickly?
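To make the Volume question concrete, here is a toy sketch of the standard answer: split the data into partitions and replicate each partition onto several machines, so that losing one machine loses no data. All names and constants here are illustrative, not taken from any real system:

```python
# Toy placement scheme: each record is assigned to REPLICATION
# distinct machines out of NUM_MACHINES, so a single machine
# failure never loses the only copy of a record.
NUM_MACHINES = 4
REPLICATION = 2

def place(record_key: str) -> list[int]:
    """Return the ids of the machines holding replicas of this record."""
    # Deterministic toy hash: sum of the key's bytes.
    first = sum(record_key.encode()) % NUM_MACHINES
    return [(first + i) % NUM_MACHINES for i in range(REPLICATION)]

replicas = place("order-42")
print(replicas)
assert len(set(replicas)) == REPLICATION  # replicas land on distinct machines
```

Real systems such as HDFS follow the same idea (a default replication factor of 3), with far more care about rack placement and re-replication after failures.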

Big Data can be analysed using two different processing techniques:

  • Batch processing = usually used when we are concerned with the volume and variety of our data. We first store all the needed data and then process it in one go (this can lead to high latency). A common example is calculating monthly payroll summaries.
  • Stream processing = usually employed when we need fast response times. We process data as soon as it is received (low latency). An example is determining whether a bank transaction is fraudulent.
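The two approaches can be contrasted in a few lines. This is a deliberately simplified sketch: the fraud rule (any transaction above a threshold) and all names are illustrative, not a real detection method:

```python
# Toy fraud rule: flag any transaction above a fixed threshold.
THRESHOLD = 1000
transactions = [120, 40, 2500, 300, 9000]

# Batch: collect everything first, then process in one pass.
# Results only exist once the whole dataset is available (high latency).
def batch_flag(txns):
    return [t for t in txns if t > THRESHOLD]

# Stream: decide for each transaction the moment it arrives (low latency).
def stream_flag(txns):
    for t in txns:
        yield t > THRESHOLD  # decision available immediately

print(batch_flag(transactions))         # [2500, 9000]
print(list(stream_flag(transactions)))  # [False, False, True, False, True]
```

The batch version is simpler and can look at the whole dataset at once; the stream version answers per event, which is what a fraud check needs.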

Big Data can be processed using different tools such as MapReduce, Spark, Hadoop, Pig, Hive, Cassandra and Kafka. Each of these tools has advantages and disadvantages that determine how companies might decide to employ them [2].
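To give a flavour of the MapReduce model mentioned above, here is the classic word-count example as a single-process Python sketch. In a real cluster (Hadoop, Spark) the map and reduce phases run in parallel across many machines; here they run locally just to show the data flow, and the function names are illustrative:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would
    # when routing pairs to reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "data beats ideas"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 2, 'beats': 1}
```

Because each map call touches only one line and each reduce call only one key, both phases can be distributed across machines with no coordination beyond the shuffle, which is the core idea behind MapReduce's scalability.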

Big Data analysis is now commonly used by many companies to predict market trends, personalise customer experiences, and speed up their workflows.


Big Data Analysis: Spark and Hadoop - Experfy Insights