Python Pydantic: Handling large and heavily nested JSON in Python

Hello World!! 👋

Today, I’d like to share with you an interesting journey I embarked on recently, involving some massive JSON data I had to work with for one of our mobile clients, in a language I don't speak or understand (German; I'm an English speaker), and the various solutions I tried to actually make things work. Buckle up and let's dive in! 🚀

The Problem: A 200MB+ JSON Beast 🗃️

The task was a simple one, or so we thought. We were to build a mobile app for an existing website, and we would of course use a REST API for the data. Very straightforward 😃😃. Unfortunately, not. Upon testing the API endpoints, I started to realize the mountain we had to climb.

  1. The JSON data was in German, field names and all, a language none of us was proficient in.
  2. The response was taking a long time: over a minute on my local machine, about 20 seconds in a GitHub Codespace, and about 5 seconds in a Google Colab. Upon saving the response to a file, I realized it was over 200 MB. This is not good 😓😓
  3. The JSON response was an object with 4 fields, each an object or array, but nested up to twelve levels deep. I'm sure at this point you can see the tears in my eyes 😭. The mobile app could never handle this kind of data as-is.

Brainstorming Solutions

At this point, we really had no time to waste. We had already set up our timeline and communicated deadlines, milestones, and deliverables. We absolutely had to come through.

We put our heads together, strategized, and came up with a number of ideas. The main one was to set up a proxy server that would stand in for our mobile client, handle all the requests, trim down the data, and cache it, thus improving response time. It was to be like a culinary genius, expertly filleting the response down to manageable bites and caching it for easy retrieval later. Diagram below:

Architecture diagram

Other things we also needed to consider were:

  1. Pagination: The mobile app would by default have to paginate data. This would help with our response-time problem. We would initially load, trim, and cache a few small chunks of data to be consumed by the app, then go on to load a bigger chunk and make all of it ready for the app to consume 😁😁.
  2. Asynchronous Loading: The mobile app would be designed to request and load data in the background, improving user experience by reducing perceived load times 🤭🤭.
  3. Realtime database: Using a realtime database like Firebase Firestore to store data that is constantly changing.

All this was easy to plan but tough to execute given all the handicaps we had with the deeply nested structure, language barrier, and such colossal data. But we had to start somewhere.
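To make the proxy idea concrete, here is a minimal sketch of its trim-and-cache core. All names, fields, and the TTL are hypothetical; the real service would also handle pagination, authentication, and error handling:

```python
import time

# In-memory cache: key -> (timestamp, trimmed payload). Hypothetical design.
_cache: dict[str, tuple[float, object]] = {}
CACHE_TTL_SECONDS = 300  # assumed refresh interval, not from the article


def trim(record: dict, wanted_fields: set[str]) -> dict:
    """Keep only the fields the mobile app actually needs."""
    return {k: v for k, v in record.items() if k in wanted_fields}


def get_companies(fetch_upstream) -> list[dict]:
    """Return trimmed data, serving from the cache while it is still fresh."""
    entry = _cache.get("companies")
    if entry and time.monotonic() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]  # cache hit: skip the slow upstream call
    raw = fetch_upstream()  # the slow, 200 MB+ upstream request
    slim = [trim(r, {"id", "name", "adresse"}) for r in raw]
    _cache["companies"] = (time.monotonic(), slim)
    return slim
```

The app only ever talks to this layer, so the heavy upstream call happens once per TTL window instead of once per screen.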

Dealing with the data 📃

Of course, the choice of language had been set from the get-go: I was going to use Python, which has a huge number of libraries for handling data. My first pick was Pandas, because its DataFrame seemed like an attractive solution. After all, it’s designed to handle JSON. But our nested JSON was a wild beast, one that proved too tricky even for this powerful library. It didn’t really solve our language-barrier issue, and the flattened DataFrame was unfortunately hard to make sense of. I had to explore more options.
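To illustrate why the DataFrame route was painful, here is a small sketch (with made-up sample data) of what `pandas.json_normalize` does to nesting like ours: every level becomes a dotted column name, which gets unreadable fast at twelve levels:

```python
import pandas as pd

# Hypothetical fragment shaped like the German API response
nested = [{
    "id": 5,
    "adresse": {
        "strasse": "Hauptstrasse 1",
        "ortStrasse": {"plz": "10115", "land": {"name": "Deutschland"}},
    },
}]

# json_normalize flattens nested dicts into dot-separated columns
df = pd.json_normalize(nested)
print(list(df.columns))
# Columns come out like 'adresse.ortStrasse.land.name' — and the names
# are still in German, so flattening alone solves neither problem.
```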

After a lot of research, I came across a library called Pydantic, which seemed like the magical solution I was looking for, the light at the end of a very long tunnel. Pydantic is a library that makes data parsing and validation extremely easy and straightforward. It leans heavily on typing and does type validation at runtime. 💫 Here’s some information from their website that caught my attention:

Powered by type hints — with Pydantic, schema validation and serialization are controlled by type annotations; less to learn, less code to write, and integration with your IDE and static analysis tools. Learn more…

Speed — Pydantic’s core validation logic is written in Rust. As a result, Pydantic is among the fastest data validation libraries for Python. Learn more…

JSON Schema — Pydantic models can emit JSON Schema, allowing for easy integration with other tools. Learn more…

Strict and Lax mode — Pydantic can run in either strict=True mode (where data is not converted) or strict=False mode where Pydantic tries to coerce data to the correct type where appropriate. Learn more…

Dataclasses, TypedDicts and more — Pydantic supports validation of many standard library types including dataclass and TypedDict. Learn more…

Customisation — Pydantic allows custom validators and serializers to alter how data is processed in many powerful ways. Learn more…

Ecosystem — around 8,000 packages on PyPI use Pydantic, including massively popular libraries like FastAPI, huggingface, Django Ninja, SQLModel, & LangChain. Learn more…

Battle tested — Pydantic is downloaded over 70M times/month and is used by all FAANG companies and 20 of the 25 largest companies on NASDAQ. If you’re trying to do something with Pydantic, someone else has probably already done it. Learn more…

And this is how we utilized it:

Suppose you have this JSON data, nested three levels deep, with fields in German:

land_data = {
    "id": 5,
    "name": "String",
    "laenderCode": "String",
    "bundeslandCode": "String",
    "code": "String"
}

town_data = {
    "id": 5,
    "plz": "String",
    "name": "String",
    "land": land_data
}


coordinates_data = {
    "lat": 0.00,
    "lon": 0.00
}

address_data = {
    "id": 5,
    "_bezeichnung": None,
    "strasse": "String",
    # "hinweise": None,
    "ortStrasse": town_data,
    "koordinaten": coordinates_data
}

Notice that “hinweise” is commented out and “_bezeichnung” starts with an underscore. This is relevant and I will explain why.

You would need the following classes, inheriting from the Pydantic BaseModel, to hold that data:

from typing import Any, Optional

from pydantic import BaseModel, Field


class Land(BaseModel):
    id: Optional[int] = None
    name: Optional[str] = None
    laenderCode: Optional[str] = None
    bundeslandCode: Optional[str] = None
    code: Optional[str] = None


class TownStreet(BaseModel):
    id: Optional[int] = None
    plz: Optional[str] = None
    name: Optional[str] = None
    land: Optional[Land] = None


class Coordinates(BaseModel):
    lat: Optional[float] = None
    lon: Optional[float] = None


class Address(BaseModel):
    id: Optional[int] = None
    bezeichnung: Optional[Any] = Field(default=None, alias="_bezeichnung")
    strasse: Optional[str] = None
    hinweise: Optional[Any] = None
    ortStrasse: Optional[TownStreet] = None
    koordinaten: Optional[Coordinates] = None
  1. For data I was unsure would be present in the final response, I assigned a default value and also used the Optional type, just in case 🤭🤭.
  2. For JSON fields that started with an underscore, I used ``Field(alias="actual_field_name")`` from Pydantic, because Pydantic treats class fields whose names start with an underscore as private and thus ignores them.
  3. Pydantic also provides methods to create dictionaries from these classes, which was a plus on my side as I wanted to save some of the data to Firestore.
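To show those three points in action, here is a sketch using a trimmed-down model (Pydantic v2 method names, `model_validate` and `model_dump`; in v1 these were `parse_obj` and `.dict()`):

```python
from typing import Any, Optional

from pydantic import BaseModel, Field

class MiniAddress(BaseModel):
    # alias lets Pydantic read the underscore-prefixed JSON key
    bezeichnung: Optional[Any] = Field(default=None, alias="_bezeichnung")
    strasse: Optional[str] = None
    hinweise: Optional[Any] = None  # default covers responses missing the field

addr = MiniAddress.model_validate({"_bezeichnung": None, "strasse": "Hauptstr. 1"})
doc = addr.model_dump()  # plain dict, ready to write to Firestore
print(doc)
# {'bezeichnung': None, 'strasse': 'Hauptstr. 1', 'hinweise': None}
```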

I then went ahead and defined some English classes, and did some manual mapping between the German data classes and the English data classes, like so:

Address(
  label=company.adresse.bezeichnung,
  street=company.adresse.strasse,
  hints=company.adresse.hinweise,
  placeStreet=Street(
    zip_code=company.adresse.ortStrasse.plz,
    name=company.adresse.ortStrasse.name,
    country=company.adresse.ortStrasse.land.name
  ),
  lat=company.adresse.koordinaten.lat,
  lon=company.adresse.koordinaten.lon
)
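For context, the English-side models assumed by that mapping might look roughly like this (the names are reconstructed from the snippet above; the real models carried more fields):

```python
from typing import Any, Optional

from pydantic import BaseModel

class Street(BaseModel):
    zip_code: Optional[str] = None
    name: Optional[str] = None
    country: Optional[str] = None

class Address(BaseModel):  # English counterpart of the German Address model
    label: Optional[Any] = None
    street: Optional[str] = None
    hints: Optional[Any] = None
    placeStreet: Optional[Street] = None
    lat: Optional[float] = None
    lon: Optional[float] = None
```

Keeping the two model sets separate meant the translation happened exactly once, at the proxy, and the app only ever saw English field names.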

This approach proved to be a game-changer. Pydantic handled the nested structure with ease and precision. Moreover, it also gracefully handled the unexpected, providing default values when the data was not available.

Wrapping Up

Looking back at the entire experience, it was a rollercoaster ride of learning and discovery. It was a testament to the fact that no problem is too big when you have the right tools in your arsenal. Pydantic came to my rescue this time round, transforming a daunting challenge into an exciting learning journey. 🎉

If you've faced challenges like these that had you scratching your head bald, then I’d love to hear about them in the comments.

Remember, with each challenge, we only become better. So, here’s to tackling the next big one! 🥂 Happy coding! 🚀
