When last we dove into the world of Big Data, we spoke about its implementation with Python and Hive, and how that combination can be used to develop really robust applications.

Today we would like to switch gears a bit and get our feet wet with another Big Data combo: Python and Impala. The reason is that Hive has some limitations that might prove a deal-breaker for your specific solution, and Impala might be a better route to take instead.

To begin, we have to understand a few core concepts, from MapReduce all the way to an overview of Impala, before we take a quick glance at some Python-Impala coding.

MapReduce concept overview

First on our list of concepts is MapReduce. Simply stated, it is a software framework and programming model used for processing huge amounts of data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.

Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature, thus are very useful for performing large-scale data analysis using multiple machines in the cluster.

The input to each phase is a set of key-value pairs. In addition, the programmer needs to specify two functions: a map function and a reduce function.
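As an illustration, here is a minimal word-count pair of such functions in plain Python. This is only a sketch of the programming model (the function names are my own, not part of any Hadoop API):

```python
# Illustrative word-count map and reduce functions; a real Hadoop job
# would run these via Hadoop Streaming or the Java API.

def map_fn(key, value):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in value.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce: sum all counts collected for a single word."""
    yield (key, sum(values))

# Example: one input line, keyed by its byte offset in the file.
pairs = list(map_fn(0, "Deer Bear River Deer"))
# pairs == [("deer", 1), ("bear", 1), ("river", 1), ("deer", 1)]
```

The framework, not the programmer, is responsible for calling these functions on every split and for grouping the mapper output before it reaches the reducer.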

The whole process goes through four phases of execution:

Input Splits:

  • The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.

Mapping:

  • This is the very first phase in the execution of the MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list in the form of <word, frequency>.

Shuffling:

  • This phase consumes the output of the mapping phase. Its task is to consolidate the related records from the mapping phase output. In our example, occurrences of the same word are grouped together along with their respective frequencies.

Reducing:

  • Here, output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
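Taken together, the four phases above can be sketched in a few lines of Python. This is a toy, single-process simulation of the word-count example, not how Hadoop actually distributes the work across a cluster:

```python
from collections import defaultdict

def run_word_count(splits):
    """Simulate the four MapReduce phases on a list of input splits."""
    # Mapping: each split yields a (word, 1) pair per word.
    mapped = []
    for split in splits:
        for word in split.split():
            mapped.append((word.lower(), 1))

    # Shuffling: group the emitted counts by word.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reducing: aggregate each word's counts into a single value.
    return {word: sum(counts) for word, counts in grouped.items()}

result = run_word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])
# result == {"deer": 2, "bear": 2, "river": 2, "car": 3}
```

In a real cluster, the mapping step runs in parallel on the machines holding each split, and the shuffle moves data over the network so that all counts for a given word land on the same reducer.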

#python #big-data #impala #data-engineering

Python and Impala: Quick Overview and Samples