Hive table is one of the big data tables which relies on structural data. By default, it stores the data in a Hive warehouse. To store it at a specific location, the developer can set the location using a location tag during the table creation. Hive follows the same SQL concepts like row, columns, and schema.
Developers working on big data applications have a prevalent problem when reading Hadoop file systems data or Hive table data. The data is written in Hadoop clusters using spark streaming, Nifi streaming jobs, or any streaming or ingestion application. A large number of small data files are written in the Hadoop Cluster by the ingestion job. These files are also called part files.
These part files are written across different data nodes, and when the number of files increases in the directory, it becomes tedious and a performance bottleneck if some other app or user tries to read this data. One of the reasons is that the data is distributed across nodes. Think about your data residing in multiple distributed nodes. The more scattered it is, the job takes around “N * (Number of files)” time to read the data, where N is the number of nodes across each Name Nodes. For example, if there are 1 million files, when we run the MapReduce job, the mapper has to run for 1 million files across data nodes and this will lead to full cluster utilization leading to performance issues.
#apache hadoop #performance tuning #big data #hive #development #ai # ml & data engineering #article