When last we dove into the world of Big Data, we spoke about its implementation with Python and Hive, and how that combination can be used to develop really robust applications.

Today we would like to switch gears a bit and get our feet wet with another Big Data combo: Python and Impala. The reason is that Hive has some limitations that might prove a deal-breaker for your specific solution, and Impala might be a better route to take instead.

To begin, we have to understand a few core concepts, from MapReduce all the way to an overview of Impala, before we take a quick glance at some Python-Impala coding.

MapReduce concept overview

First on our list of concepts is MapReduce. Simply stated, it is a software framework and programming model used for processing huge amounts of data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.

Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature, thus are very useful for performing large-scale data analysis using multiple machines in the cluster.

The input to each phase is a set of key-value pairs. In addition, the programmer needs to specify two functions: a map function and a reduce function.
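As an illustration, here is a minimal word-count pair of such functions in plain Python. This is only a sketch of the programming model (the function names are my own, not part of any Hadoop API):

```python
# Illustrative word-count map and reduce functions; a real Hadoop job
# would run these via Hadoop Streaming or the Java API.

def map_fn(key, value):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in value.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce: sum all counts collected for a single word."""
    yield (key, sum(values))

# Example: one input line, keyed by its byte offset in the file.
pairs = list(map_fn(0, "Deer Bear River Deer"))
# pairs == [("deer", 1), ("bear", 1), ("river", 1), ("deer", 1)]
```

The framework, not the programmer, is responsible for calling these functions on every split and for grouping the mapper output before it reaches the reducer.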

The whole process goes through four phases of execution:

Input Splits:

  • The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.

Mapping:

  • This is the very first phase in the execution of the MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list in the form of <word, frequency>.

Shuffling:

  • This phase consumes the output of the mapping phase. Its task is to consolidate the related records from the mapping phase output. In our example, occurrences of the same word are grouped together along with their respective frequencies.

Reducing:

  • Here, output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
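Taken together, the four phases above can be sketched in a few lines of Python. This is a toy, single-process simulation of the word-count example, not how Hadoop actually distributes the work across a cluster:

```python
from collections import defaultdict

def run_word_count(splits):
    """Simulate the four MapReduce phases on a list of input splits."""
    # Mapping: each split yields a (word, 1) pair per word.
    mapped = []
    for split in splits:
        for word in split.split():
            mapped.append((word.lower(), 1))

    # Shuffling: group the emitted counts by word.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reducing: aggregate each word's counts into a single value.
    return {word: sum(counts) for word, counts in grouped.items()}

result = run_word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])
# result == {"deer": 2, "bear": 2, "river": 2, "car": 3}
```

In a real cluster, the mapping step runs in parallel on the machines holding each split, and the shuffle moves data over the network so that all counts for a given word land on the same reducer.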

#python #big-data #impala #data-engineering

Python and Impala: Quick Overview and Samples