MongoDB Atlas Online Archive is a new feature of the MongoDB Cloud Data Platform. It allows you to set a rule to automatically archive data off of your Atlas cluster to fully-managed cloud object storage. In this blog post, I’ll demonstrate how you can use Online Archive to tier your data for a cost-effective data management strategy.

The MongoDB Cloud data platform also provides Atlas Data Lake, a serverless and scalable service that allows you to natively query your data in place, across cloud object storage and MongoDB Atlas clusters.

In this blog post, I will use one of the MongoDB Open Data COVID-19 time series collections to demonstrate how you can combine Online Archive and Atlas Data Lake to save on storage costs while retaining easy access to query all of your data.

Prerequisites

For this tutorial, you will need:

- A MongoDB Atlas cluster. Note that Online Archive is available on dedicated clusters (M10 tier and above).
- The MongoDB Database Tools (mongodump and mongorestore) to copy the sample dataset into your own cluster.
Let’s get some data

To begin with, let’s retrieve a time series collection. For this tutorial, I will use one of the time series collections that I built for the MongoDB Open Data COVID-19 project.

The covid19.global_and_us collection is the most complete COVID-19 time series in our open data cluster, as it combines all the data that Johns Hopkins University (JHU) keeps in separate CSV files.

As I would like to retrieve the entire collection and its indexes, I will use mongodump.

mongodump --uri="mongodb+srv://readonly:readonly@covid-19.hip2i.mongodb.net/covid19" --collection='global_and_us'


This will create a dump folder in your current directory. Let’s now import this collection into our cluster.

mongorestore --uri="mongodb+srv://<USER>:<PASSWORD>@clustername.1a2bc.mongodb.net"

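To make sure the restore worked, you can run a quick sanity check with the mongo shell. This is a minimal sketch assuming the same placeholder connection string as above; the covid19 database name comes from the dump.

mongo "mongodb+srv://<USER>:<PASSWORD>@clustername.1a2bc.mongodb.net/covid19" --eval "db.global_and_us.countDocuments({})"

If the count matches the number of documents reported by mongorestore, you are good to go.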

Now that our time series collection is here, let’s see what a document looks like:

{
  "_id": {
    "$oid": "5f077868c3bda701aca1a3a7"
  },
  "uid": 175,
  "country_iso2": "YT",
  "country_iso3": "MYT",
  "country_code": 175,
  "state": "Mayotte",
  "country": "France",
  "combined_name": "Mayotte, France",
  "population": 272813,
  "loc": {
    "type": "Point",
    "coordinates": [
      45.1662,
      -12.8275
    ]
  },
  "date": {
    "$date": "2020-06-03T00:00:00.000Z"
  },
  "confirmed": 1993,
  "deaths": 24,
  "recovered": 1523
}


Note here that the date field is an ISODate in MongoDB Extended JSON relaxed notation.

This time series collection is fairly simple. For each day and each place, we have a measurement of the number of confirmed, deaths, and recovered cases, when available. You can find more details in our documentation.
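For example, here is a minimal query sketch (run from the mongo shell) that retrieves the most recent measurement for a given place, using only the fields shown in the document above:

// Most recent entry for Mayotte, France
db.global_and_us.find({ "combined_name": "Mayotte, France" }).sort({ "date": -1 }).limit(1)

Since mongodump also captured the collection’s indexes, a sort on date like this one should be cheap if the original collection indexed that field.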

What’s the problem?

The problem is, it’s a time series! Each day, we add a new entry for each place in the world, so our collection gets bigger every single day. But as time goes on, the older data is likely less important and less frequently accessed, so we could benefit from archiving it off of our Atlas cluster.

Today, July 10th 2020, this collection contains 599,760 documents, which corresponds to 3,528 places times 170 days (3,528 × 170 = 599,760), and it’s only 181.5 MB thanks to the WiredTiger compression algorithm.

While this is not really an issue with this trivial example, it would definitely force you to upgrade your MongoDB Atlas cluster to a higher tier if an extra GB of data were landing in your cluster each day.

Upgrading to a higher tier costs more money, and you may not need to keep all this cold data in your cluster.

Online Archive to the Rescue!

Manually archiving a subset of this dataset is tedious. I actually wrote a blog post about this.

It works, but you will need to extract and remove the documents from your MongoDB Atlas cluster yourself and then use the new $out operator or the s3.PutObject MongoDB Realm function to write your documents to S3.
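To give you an idea, here is a minimal sketch of that manual approach, run through an Atlas Data Lake connection (the $out-to-S3 stage is only available there). The bucket name, region, and cutoff date are hypothetical placeholders:

// Sketch only: copies cold documents to S3, it does NOT delete them from the cluster
db.global_and_us.aggregate([
  // hypothetical cutoff: archive everything older than May 1st, 2020
  { $match: { date: { $lt: ISODate("2020-05-01T00:00:00Z") } } },
  { $out: {
      s3: {
        bucket: "covid19-archive",   // hypothetical bucket
        region: "us-east-1",         // hypothetical region
        filename: "global_and_us/",
        format: { name: "json" }
      }
  } }
])

After verifying the copy, you would still have to remove these documents from the Atlas cluster yourself.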

Lucky for you, MongoDB Atlas Online Archive does this for you automatically!

Let’s head to MongoDB Atlas and click on our cluster to access our cluster details. Currently, Online Archive is not set up on this cluster.
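To give a sense of where we are heading, the rule we will build in the UI boils down to a configuration like this one. The following is only an illustrative sketch in the shape of the Atlas Online Archive API payload; the field values (the 60-day threshold, the partition fields) are assumptions, not the final settings:

{
  "dbName": "covid19",
  "collName": "global_and_us",
  "criteria": {
    "type": "DATE",
    "dateField": "date",
    "expireAfterDays": 60
  },
  "partitionFields": [
    { "fieldName": "country", "order": 0 },
    { "fieldName": "uid", "order": 1 }
  ]
}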
