How to Effectively Store & Index JSON Data in PostgreSQL

JSON stands for JavaScript Object Notation. It is an open standard format which organizes data into key/value pairs and arrays detailed in RFC 7159. JSON is the most common format used by web services to exchange data, store documents, unstructured data, etc. In this post, we are going to show you tips and techniques on how to effectively store and index JSON data in PostgreSQL.

You can also check out our Working with JSON Data in PostgreSQL vs. MongoDB webinar in partnership with PostgresConf to learn more on the topic, and check out our SlideShare page to download the slides.

What’s in this article?

Why Store JSON in PostgreSQL?

Why should a relational database even care about unstructured data? It turns out that there are a few scenarios where it is useful.

Schema flexibility

One of the main reasons to store data using the JSON format is schema flexibility. Storing your data in JSON is useful when your schema is fluid and is changing frequently. If you store each of the keys as columns, it will result in frequent DML operations – this can be difficult when your data set is large - for example, event tracking, analytics, tags, etc. Note: If a particular key is always present in your document, it might make sense to store it as a first class column. We discuss more about this approach in section “JSON Patterns & Antipatterns” below.

Nested objects

If your data set has nested objects (single or multi-level), in some cases, it is easier to handle them in JSON instead of denormalizing the data into columns or multiple tables.

Syncing with external data sources

Often times an external system is providing data as JSON, so it might be a temporary store before data is ingested into other parts of the system. For example, Stripe transactions.

Timeline of JSON Support in PostgreSQL

JSON support in PostgreSQL was introduced in 9.2 and has steadily improved in every release going forward.

Wave 1: PostgreSQL 9.2 (2012) added support for JSON data type

JSON database in 9.2 was fairly limited (and probably overhyped at that point) – basically a glorified string with some JSON validation thrown in. It is useful to validate incoming JSON and store in the database. More details are provided below.

Wave 2: PostgreSQL 9.4 (2014) added support for JSONB data type

JSONB stands for “JSON Binary” or “JSON better” depending on whom you ask. It is a decomposed binary format to store JSON. JSONB supports indexing the JSON data, and is very efficient at parsing and querying the JSON data. In most cases, when you work with JSON in PostgreSQL, you should be using JSONB.

Wave 3: PostgreSQL 12 (2019) added support for SQL/JSON standard and JSONPATH queries

JSONPath brings a powerful JSON query engine to PostgreSQL.

When Should You Use JSON vs. JSONB?

In most cases, JSONB is what you should be using. However, there are some specific cases where JSON works better:

JSON preserves the original formatting (a.k.a whitespace) and ordering of the keys.
JSON preserves duplicate keys.
JSON is faster to ingest vs. JSONB – however, if you do any further processing, JSONB will be faster.

For example, if you’re just ingesting JSON logs and not querying them in any way, then JSON might be a better option for you. For the purposes of this blog, when we refer to JSON support in PostgreSQL, we will refer to JSONB going forward.

Using JSONB in PostgreSQL: How to Effectively Store & Index JSON Data in PostgreSQL

CLICK TO TWEET

JSONB Patterns & Antipatterns

If PostgreSQL has great support for JSONB, why do we need columns anymore? Why not just create a table with a JSONB blob and get rid of all columns like the schema below:

**CREATE** **TABLE** test(id **int**``, data JSONB, **PRIMARY** **KEY** (id));

At the end of the day, columns are still the most efficient technique to work with your data. JSONB storage has some drawbacks vs. traditional columns:

PostreSQL does not store column statistics for JSONB columns

PostgreSQL maintains statistics about the distributions of values in each column of the table - most common values (MCV), NULL entries, histogram of distribution. Based on this data, the PostgreSQL query planner makes smart decisions on the plan to use for the query. At this point, PostgreSQL does not store any stats for JSONB columns or keys. This can sometimes result in poor choices like using nested loop joins vs. hash joins, etc. A more detailed example of this is provided in this blog post – When To Avoid JSONB In A PostgreSQL Schema.

JSONB storage results in a larger storage footprint

JSONB storage does not deduplicate the key names in the JSON. This can result in considerably larger storage footprint compared to MongoDB BSON on WiredTiger or traditional column storage. I ran a simple test with the below JSONB model storing about 10 million rows of data, and here are the results – In some ways this is similar to the MongoDB MMAPV1 storage model where the keys in JSONB were stored as-is without any compression. One long-term fix is to move the key names to a table level dictionary and refer this dictionary instead of storing the key names repeatedly. Until then, the workaround might be to use more compact names (unix-style) instead of more descriptive names. For example, if you’re storing millions of instances of a particular key, it would be better storage-wise to name it “pb” instead of “publisherName”.

The most efficient way to leverage JSONB in PostgreSQL is to combine columns and JSONB. If a key appears very frequently in your JSONB blobs, it is probably better off being stored as a column. Use JSONB as a “catch all” to handle the variable parts of your schema while leveraging traditional columns for fields that are more stable.

JSONB Data Structures

Both JSONB and MongoDB BSON are essentially tree structures, using multi-level nodes to store the parsed JSONB data. MongoDB BSON has a very similar structure.

Images source

#jsonpath #nested objects #data analysis