In this article, see how to manage small files in your data lake.

Big Data faces an ironic small file problem that hampers productivity and wastes valuable resources.

If not managed well, small files slow down the performance of your data systems and leave you with stale analytics. This kind of defeats the purpose, doesn’t it? HDFS stores small files inefficiently: they bloat Namenode memory, inflate RPC call volume, degrade block-scanning throughput, and reduce application-layer performance. If you are a big data administrator on any modern data lake, you will invariably come face to face with the small file problem. Distributed file systems are great, but let’s face it: the more a dataset is split across the storage layer, the greater the overhead of reading it back. So the idea is to optimize file sizes to best serve your use case, while also actively optimizing your data lake.
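In a real data lake you would typically compact files with a framework job (for example, rewriting a dataset through Spark with fewer output partitions). As a minimal local sketch of the same idea, the plain-Python snippet below merges a directory of tiny files into one large file; the directory layout and function name are hypothetical, purely for illustration.

```python
import os
import tempfile

def compact_files(src_dir: str, dst_path: str) -> int:
    """Concatenate every small file in src_dir into one large file at dst_path.
    Returns the number of source files merged."""
    names = sorted(os.listdir(src_dir))
    with open(dst_path, "wb") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as f:
                out.write(f.read())
    return len(names)

# Demo: create 1,000 tiny files, then compact them into a single file.
work = tempfile.mkdtemp()
src = os.path.join(work, "small")
os.mkdir(src)
for i in range(1000):
    with open(os.path.join(src, f"part-{i:05d}"), "wb") as f:
        f.write(b"x" * 512)  # a 512-byte "small file"

merged = os.path.join(work, "compacted.bin")
n = compact_files(src, merged)
print(n, os.path.getsize(merged))  # 1000 files -> one 512,000-byte file
```

The payoff is that downstream readers now open one file handle instead of a thousand, which is exactly the seek-and-open overhead the article describes.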

Small Files and the Business Impact

  • Slowing down reads — Reading many small files requires a separate seek for each one, which is an inefficient way of accessing data.
  • Slowing down processing — Small files can slow down Spark, MapReduce, and Hive jobs. For example, MapReduce map tasks process one block at a time, and each file needs at least one map task; with a large number of small files, each map task processes very little input. The larger the number of files, the larger the number of tasks.
  • Wasted storage — Running jobs can create hundreds of thousands of files of 5 KB or even 1 KB every day, which adds up quickly. The lack of visibility into where those files live adds further complexity.
  • Stale data — All of this results in stale data, which weighs down the entire process of extracting value through reporting and analytics. If jobs run slowly or responses lag, decision making slows and the data loses value. You lose the edge that the data is meant to bring in the first place.
  • Spending more time on operational issues than on strategic improvements — In any production system the focus is on keeping it up and running, so as issues crop up, resources are deployed to tackle them and to actively monitor jobs. If that dependency could be removed, the same resources could be used to optimize the jobs themselves, so that a job that earlier took 4 hours now takes only 1 hour. This has a cascading effect.
  • Impacting ability to scale — Operational costs do not grow linearly: if your data grows 10x, operating costs can rise far faster, which drives up your cost to scale. While small files are a massive problem, they aren’t completely avoidable either. Applying the best practices effectively across your organization gives you control rather than constant firefighting.
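The task explosion described above is easy to quantify. The sketch below uses a rough model, assuming the default 128 MB HDFS block size and one map task per block (with at least one task per file); the figures are illustrative, not measurements.

```python
# Rough task-count model: one map task per block, at least one per file.
# Assumes the default 128 MB HDFS block size; figures are illustrative.
BLOCK = 128 * 1024 * 1024  # 128 MB

def map_tasks(total_bytes: int, file_size: int) -> int:
    """Approximate MapReduce map tasks for a dataset split into equal files."""
    n_files = total_bytes // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK))  # ceiling division
    return n_files * blocks_per_file

ten_gb = 10 * 1024**3
large = map_tasks(ten_gb, BLOCK)    # 10 GB stored as 128 MB files
small = map_tasks(ten_gb, 1024**2)  # the same 10 GB as 1 MB files
print(large, small)  # 80 vs 10240 map tasks for identical data
```

Same data, 128x the scheduling overhead: each of those 10,240 tasks pays its startup cost to process just 1 MB of input.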

The Small File Problem

Let’s take the case of HDFS, a distributed file system that is part of the Hadoop infrastructure, designed to handle large data sets. In HDFS, data is distributed over several machines and replicated to optimize parallel processing. Because data and metadata are stored separately, every file created, irrespective of size, occupies a full metadata entry in the name node’s memory. Small files are files smaller than one HDFS block, typically 128 MB. Small files, even as small as 1 KB, cause excessive load on the name node (which translates file system operations into block operations on the data nodes) and consume as much metadata space as a file of 128 MB. Smaller file sizes also mean smaller clusters, as there are practical limits on the number of files, irrespective of size, that a name node can manage.
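The name node pressure can be sketched with the commonly cited rule of thumb of roughly 150 bytes of name node heap per namespace object (one object per file plus one per block). The exact figure varies by Hadoop version, so treat this as an order-of-magnitude estimate, not a sizing formula.

```python
# Rough name node heap model: ~150 bytes per namespace object
# (one object per file plus one per block). Illustrative only.
OBJ_BYTES = 150
BLOCK = 128 * 1024 * 1024  # 128 MB default block size

def namenode_heap(total_bytes: int, file_size: int) -> int:
    """Approximate name node heap (bytes) needed to track equal-size files."""
    n_files = total_bytes // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK))  # ceiling division
    return n_files * (1 + blocks_per_file) * OBJ_BYTES

one_tb = 1024**4
as_large = namenode_heap(one_tb, BLOCK)    # 1 TB stored as 128 MB files
as_small = namenode_heap(one_tb, 1024**2)  # the same 1 TB as 1 MB files
print(as_large, as_small)  # ~2.4 MB vs ~300 MB of heap for identical data
```

Storing the same terabyte as 1 MB files instead of block-sized files multiplies the metadata footprint by the ratio of the file sizes, which is why name node capacity, not disk capacity, becomes the cluster’s limiting factor.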

#bigdata #spark #mapreduce #hive #datalake #hdfs

When Small Files Crush Big Data — How to Manage Small Files in Your Data Lake