While exploring the world of MapReduce, I landed on MongoDB's MapReduce documentation page. The first thing mentioned there is:

An aggregation pipeline provides better performance and usability than a map-reduce operation.

So here I am, thinking: "Should I carry on exploring MongoDB's MapReduce method, or dive a little deeper into the Aggregation Pipeline?" For my own satisfaction, I decided it would be interesting to put MapReduce and the Aggregation Pipeline side by side and compare them. In this post, I describe how I compared the two in terms of CPU and memory utilization when running a simple query over a large data set, and which one returned the result faster.

The problem statement: count the Swedish pronouns “den”, “denne”, “denna”, “det”, “han”, “hon” and “hen” (case-insensitive) in Twitter tweets. The data set holds approximately 4 million tweets.
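To make the counting task concrete, here is a minimal pure-Python sketch of the map and reduce phases. It is not the MongoDB code used in the comparison. The function names, the sample tweets, and the choice to count every occurrence (rather than the number of tweets containing a pronoun) are assumptions made for illustration.

```python
import re
from collections import Counter
from functools import reduce

PRONOUNS = ("den", "denne", "denna", "det", "han", "hon", "hen")

# Word boundaries keep "den" from matching inside "denna";
# IGNORECASE implements the case-insensitive requirement.
PRONOUN_RE = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def map_tweet(text):
    """Map phase (hypothetical helper): count pronouns in one tweet."""
    return Counter(m.group(1).lower() for m in PRONOUN_RE.finditer(text))


def reduce_counts(left, right):
    """Reduce phase (hypothetical helper): merge two partial counts."""
    return left + right


def count_pronouns(tweets):
    """Run the map phase over every tweet, then fold the partial counts."""
    return reduce(reduce_counts, map(map_tweet, tweets), Counter())


# Made-up sample tweets, not taken from the real data set.
sample = [
    "Han sa att det var bra.",
    "Hon och hen gillar den.",
    "Denna bok? Det är DEN!",
]
totals = count_pronouns(sample)
print(dict(totals))
```

The same split carries over to a database job in principle: the map step works on one tweet at a time, so it can run in parallel, and the reduce step only merges small per-pronoun counts.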

When I followed a simple approach, running the MapReduce and Aggregation Pipeline code for the above problem on a single small VM (4 GB RAM and 2 vCPUs), the MapReduce job returned the result in about 8 minutes, while the Aggregation Pipeline returned the same result in about 5 minutes.


MongoDB Aggregation vs MapReduce in a Sharded setup on Docker containers