Improve the MongoDB Aggregation Framework


1. MongoDB Aggregation Framework

The MongoDB Aggregation Framework draws on the well-known Linux pipeline concept, where the output of one command is piped, or redirected, to be used as the input of the next command. In the case of MongoDB, multiple operators are combined into a single pipeline that is responsible for processing a stream of documents. Some operators, such as $match, $limit and $skip, take a document as input and output the same document if a certain set of criteria is met.

Other operators, such as $project and $unwind, take a single document as input and reshape that document or emit multiple documents based upon a certain projection. Finally, the $group operator takes multiple documents as input and groups them into a single document by aggregating the relevant values. Expressions can be used within some of these operators to calculate new values or execute string operations.

Multiple operators are combined into a single pipeline that is applied to a list of documents. The pipeline itself is executed as a MongoDB command, resulting in a single MongoDB document that contains an array of all documents that came out at the end of the pipeline. The next paragraphs detail the refactoring of the molecular similarity algorithm as a pipeline of operators. Make sure to (re)read the previous two articles to fully grasp the implementation logic.

2. Molecular Similarity Pipeline

When applying a pipeline to a certain collection, all documents contained within this collection are given as input to the first operator. It is considered best practice to filter this list as quickly as possible, to limit the total number of documents that are passed through the pipeline. In our case, this means filtering out all documents that can never satisfy the target Tanimoto coefficient. Hence, as a first step, we match all documents for which the fingerprint count is within a certain threshold. If we target a Tanimoto coefficient of 0.8 with a target compound containing 40 unique fingerprints, the $match operator looks as follows:

{ "$match" : { "fingerprint_count" : { "$gte" : 32 , "$lte" : 50}} }

Only compounds that have a fingerprint count between 32 and 50 will be streamed to the next pipeline operator. To perform this filtering, the $match operator can use the index that we defined on the fingerprint_count property. To compute the Tanimoto coefficient, we need to calculate the number of shared fingerprints between a certain input compound and the compound we are targeting. In order to work at the fingerprint level, we use the $unwind operator. $unwind peels off the elements of an array one by one, returning a stream of documents where the specified array is replaced by one of its elements. In our case, we apply $unwind to the fingerprints property. Hence, each compound document will result in n compound documents, where n is the number of unique fingerprints it contains.

{ "$unwind" : "$fingerprints"}

In order to calculate the number of shared fingerprints, we start by filtering out all documents that do not have a fingerprint in the list of fingerprints of the target compound. To do so, we again apply the $match operator, this time filtering on the fingerprints property, so that only documents containing a fingerprint that is in the list of target fingerprints are retained.

{ "$match" : { "fingerprints" : { "$in" : [ 1960 , 15111 , 5186 , 5371 , 756 , 1015 , 1018 , 338 , 325 , 776 , 3900 , ..., 2473] } } }

As we only match fingerprints that are in the list of target fingerprints, the output can be used to count the total number of shared fingerprints. For this, we apply the $group operator on compound_cid, through which we create a new type of document containing the number of matching fingerprints (by summing the number of occurrences), the total number of fingerprints of the input compound and the smiles representation.

{ "$group" : { "_id" : "$compound_cid" , "fingerprintmatches" : { "$sum" : 1} , "totalcount" : { "$first" : "$fingerprint_count"} , "smiles" : { "$first" : "$smiles"} } }

We now have all parameters in place to calculate the Tanimoto coefficient. For this we use the $project operator which, in addition to copying the compound id and smiles properties, adds a new, computed property named tanimoto. Its expression implements the Tanimoto definition c / (a + b - c), where a is the number of unique fingerprints in the target compound (40), b is the fingerprint count of the input compound (totalcount) and c is the number of shared fingerprints (fingerprintmatches).

{ "$project" : { "_id" : 1 , "tanimoto" : { "$divide" : [ "$fingerprintmatches" , { "$subtract" : [ { "$add" : [ 40 , "$totalcount"] } , "$fingerprintmatches"] } ] } , "smiles" : 1 } }
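For example, an input compound with a totalcount of 45 that shares 36 fingerprints with the target scores 36 / (40 + 45 - 36) = 36 / 49 ≈ 0.73, and will therefore be discarded by the final filtering step below.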

As we are only interested in compounds that reach a Tanimoto coefficient of at least 0.8, we apply an additional $match operator to filter out all compounds below this threshold.

{ "$match" : { "tanimoto" : { "$gte" : 0.8} }

The full pipeline command can be found below.

{ "aggregate" : "compounds" , "pipeline" : [ { "$match" : { "fingerprint_count" : { "$gte" : 32 , "$lte" : 50} } }, { "$unwind" : "$fingerprints"}, { "$match" : { "fingerprints" : { "$in" : [ 1960 , 15111 , 5186 , 5371 , 756 , 1015 , 1018 , 338 , 325 , 776 , 3900, ... , 2473] } } }, { "$group" : { "_id" : "$compound_cid" , "fingerprintmatches" : { "$sum" : 1} , "totalcount" : { "$first" : "$fingerprint_count"} , "smiles" : { "$first" : "$smiles"} } }, { "$project" : { "_id" : 1 , "tanimoto" : { "$divide" : [ "$fingerprintmatches" , { "$subtract" : [ { "$add" : [ 89 , "$totalcount"]} , "$fingerprintmatches"] } ] } , "smiles" : 1 } }, { "$match" : { "tanimoto" : { "$gte" : 0.05} } } ] }

The output of this pipeline contains the list of compounds that have a Tanimoto coefficient of 0.8 or higher with respect to the target compound.
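For reference, the same pipeline can also be executed from application code. A minimal PyMongo sketch, assuming a locally running MongoDB instance; the database name chem and the shortened fingerprint list are placeholders:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
compounds = client["chem"]["compounds"]  # database name is a placeholder

# Shortened stand-in for the full target fingerprint list used above.
target_fingerprints = [1960, 15111, 5186, 5371, 756, 1015, 1018, 338, 325, 776, 3900, 2473]
target_count = 40      # unique fingerprints in the target compound
target_tanimoto = 0.8

pipeline = [
    {"$match": {"fingerprint_count": {"$gte": 32, "$lte": 50}}},
    {"$unwind": "$fingerprints"},
    {"$match": {"fingerprints": {"$in": target_fingerprints}}},
    {"$group": {"_id": "$compound_cid",
                "fingerprintmatches": {"$sum": 1},
                "totalcount": {"$first": "$fingerprint_count"},
                "smiles": {"$first": "$smiles"}}},
    {"$project": {"_id": 1,
                  "tanimoto": {"$divide": ["$fingerprintmatches",
                                           {"$subtract": [{"$add": [target_count, "$totalcount"]},
                                                          "$fingerprintmatches"]}]},
                  "smiles": 1}},
    {"$match": {"tanimoto": {"$gte": target_tanimoto}}},
]

for doc in compounds.aggregate(pipeline):
    print(doc)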

3. Conclusion

The new MongoDB Aggregation Framework provides a set of easy-to-use operators that allow users to express map-reduce-style algorithms in a more concise fashion. The pipeline concept underneath it offers an intuitive way of processing data. It is no surprise that this pipeline paradigm has been adopted by various NoSQL approaches, including Tinkerpop's Gremlin Framework and Neo4J's Cypher implementation.

Performance-wise, the pipeline solution is a major improvement over the map-reduce implementation. The operators employed are natively supported by the MongoDB platform, which results in a huge performance improvement compared to interpreted JavaScript. As the Aggregation Framework is also able to work in a sharded environment, it easily beats the performance of my initial implementation, especially when the number of input compounds is high and the target Tanimoto coefficient is low. Great work from the MongoDB team!
