Hello World!! 👋
Today, I’d like to share an interesting journey I recently embarked on: working with some massive JSON data for one of our mobile clients, in a language I don't speak or understand (German; I'm an English speaker), and the various solutions I tried to actually make things work. Buckle up and let's dive in! 🚀
The Problem: A 200MB+ JSON Beast 🗃️
The task was a simple one, or so we thought. We were to build a mobile app for an existing website, and we would of course use a REST API for the data. Very straightforward 😃😃. Unfortunately, not. Upon testing the API endpoints, I started to realize the mountain we had to climb.
Brainstorming Solutions
At this point, we really had no time to waste: we had already set up our timeline and communicated deadlines, milestones, and deliverables. We absolutely had to come through.
We put our heads together and strategized and came up with a number of ideas. The main one was to set up a proxy server that would stand in for our mobile client, handle all the requests, trim down the data and cache it thus improving response time. It was to be like a culinary genius, expertly filleting the response down to manageable bites and caching it for easy retrieval later. Diagram below:
Architecture diagram
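The trim-and-cache flow of that proxy can be sketched in a few lines of Python. This is only an illustration of the idea, not our actual server code; the function names, the field list, and the TTL are all made up for the example:

```python
import time

CACHE_TTL_SECONDS = 300
_cache = {}  # endpoint name -> (expires_at, trimmed_payload)

def trim_payload(payload, wanted_fields):
    """Keep only the fields the mobile client actually uses."""
    return [{k: item.get(k) for k in wanted_fields} for item in payload]

def get_companies(fetch_upstream, wanted_fields=("id", "name")):
    """Serve a trimmed copy of the upstream response, cached for a few minutes."""
    now = time.time()
    entry = _cache.get("companies")
    if entry and entry[0] > now:  # cache hit, still fresh
        return entry[1]
    trimmed = trim_payload(fetch_upstream(), wanted_fields)
    _cache["companies"] = (now + CACHE_TTL_SECONDS, trimmed)
    return trimmed
```

The key point is that the expensive upstream call happens once per TTL window; every other request gets the small, pre-filleted payload.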
There were a few other things we needed to consider as well.
All of this was easy to plan but tough to execute, given all the handicaps we had: the deeply nested structure, the language barrier, and such colossal data. But we had to start somewhere.
Dealing with the data 📃
Of course, the choice of language had been settled from the get-go: I was going to use Python, with its huge ecosystem of libraries for handling data. My first pick was pandas, because its DataFrame seemed like an attractive solution; after all, it's designed to handle JSON. But our nested JSON was a wild beast, one that proved too tricky even for this powerful library. It didn't really solve our issue with the language barrier, and the resulting DataFrame was unfortunately hard to make sense of. I had to explore more options.
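For what it's worth, pandas can flatten nested JSON with `json_normalize`. A quick sketch with made-up sample data (mirroring the German structure, not the real payload) shows both what it does and why it didn't help much here:

```python
import pandas as pd

# Illustrative sample only, not the real API response
nested = [
    {"id": 1, "adresse": {"strasse": "Hauptstr. 1",
                          "ortStrasse": {"plz": "10115", "name": "Berlin"}}}
]

df = pd.json_normalize(nested)
print(df.columns.tolist())
# The nesting becomes dotted column names like "adresse.ortStrasse.plz":
# flat, but the German field names are just as opaque as before.
```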
After a lot of research, I came across a library called Pydantic, which seemed like the magical solution I was looking for, the light at the end of a very long tunnel. Pydantic is a library that makes parsing and validating data extremely easy and straightforward. It leans heavily on typing and does type validation at runtime. 💫 Here’s some information from their website that caught my attention:
- Powered by type hints — with Pydantic, schema validation and serialization are controlled by type annotations; less to learn, less code to write, and integration with your IDE and static analysis tools.
- Speed — Pydantic’s core validation logic is written in Rust. As a result, Pydantic is among the fastest data validation libraries for Python.
- JSON Schema — Pydantic models can emit JSON Schema, allowing for easy integration with other tools.
- Strict and Lax mode — Pydantic can run in either `strict=True` mode (where data is not converted) or `strict=False` mode, where Pydantic tries to coerce data to the correct type where appropriate.
- Dataclasses, TypedDicts and more — Pydantic supports validation of many standard library types including `dataclass` and `TypedDict`.
- Customisation — Pydantic allows custom validators and serializers to alter how data is processed in many powerful ways.
- Ecosystem — around 8,000 packages on PyPI use Pydantic, including massively popular libraries like FastAPI, huggingface, Django Ninja, SQLModel, & LangChain.
- Battle tested — Pydantic is downloaded over 70M times/month and is used by all FAANG companies and 20 of the 25 largest companies on NASDAQ. If you’re trying to do something with Pydantic, someone else has probably already done it.
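That strict vs. lax distinction is easy to see in a toy example (assuming Pydantic v2, where `model_validate` accepts a `strict` flag; the `Item` model is made up for illustration):

```python
from pydantic import BaseModel, ValidationError

class Item(BaseModel):
    qty: int

# Lax mode (the default) coerces the string "3" into the int 3.
item = Item.model_validate({"qty": "3"})
print(item.qty)  # 3

# Strict mode refuses to convert and raises a ValidationError instead.
try:
    Item.model_validate({"qty": "3"}, strict=True)
except ValidationError:
    print("strict mode rejected the string")
```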
And this is how we utilized it:
Suppose you have this JSON data, nested three levels deep, with fields in German:
```python
land_data = {
    "id": 5,
    "name": "String",
    "laenderCode": "String",
    "bundeslandCode": "String",
    "code": "String"
}

town_data = {
    "id": 5,
    "plz": "String",
    "name": "String",
    "land": land_data
}

coordinates_data = {
    "lat": 0.00,
    "lon": 0.00
}

address_data = {
    "id": 5,
    "_bezeichnung": None,
    "strasse": "String",
    # "hinweise": None,
    "ortStrasse": town_data,
    "koordinaten": coordinates_data
}
```
Notice that “hinweise” is commented out and “_bezeichnung” starts with an underscore. This is relevant and I will explain why.
You would need the following classes, inheriting from Pydantic's BaseModel, to hold that data:
```python
from typing import Any, Optional

from pydantic import BaseModel, Field

class Land(BaseModel):
    id: Optional[int] = None
    name: Optional[str] = None
    laenderCode: Optional[str] = None
    bundeslandCode: Optional[str] = None
    code: Optional[str] = None

class TownStreet(BaseModel):
    id: Optional[int] = None
    plz: Optional[str] = None
    name: Optional[str] = None
    land: Optional[Land] = None

class Coordinates(BaseModel):
    lat: Optional[float] = None
    lon: Optional[float] = None

class Address(BaseModel):
    id: Optional[int] = None
    bezeichnung: Optional[Any] = Field(default=None, alias="_bezeichnung")
    strasse: Optional[str] = None
    hinweise: Optional[Any] = None
    ortStrasse: Optional[TownStreet] = None
    koordinaten: Optional[Coordinates] = None
```
1. Every field is `Optional` with a default of `None`, so fields that may be missing from the response, like the commented-out `hinweise`, don't break validation; Pydantic simply falls back to the default.
2. For JSON fields that start with an underscore, I used `Field(alias="actual_field_name")` from Pydantic, because it specifically treats class fields that start with an underscore as private and thus ignores them.
3. Pydantic also provides methods to create dictionaries from these models, which was a plus for me, as I wanted to save some of the data to Firestore.
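Putting those points together, here is a compact, self-contained version showing the alias, the missing-field default, and the dictionary export (assuming Pydantic v2, so `model_validate` and `model_dump`; the sample values are made up):

```python
from typing import Any, Optional

from pydantic import BaseModel, Field

class Coordinates(BaseModel):
    lat: Optional[float] = None
    lon: Optional[float] = None

class Address(BaseModel):
    id: Optional[int] = None
    bezeichnung: Optional[Any] = Field(default=None, alias="_bezeichnung")
    strasse: Optional[str] = None
    hinweise: Optional[Any] = None
    koordinaten: Optional[Coordinates] = None

raw = {
    "id": 5,
    "_bezeichnung": "Zentrale",  # underscore key is picked up via the alias
    "strasse": "Hauptstr. 1",
    # no "hinweise" key at all
    "koordinaten": {"lat": 52.52, "lon": 13.40},
}

addr = Address.model_validate(raw)
print(addr.bezeichnung)   # "Zentrale", despite the leading underscore in the JSON
print(addr.hinweise)      # None, the missing field fell back to its default
print(addr.model_dump())  # a plain dict, ready to be written to Firestore
```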
I then went ahead and defined some English classes, and did some manual mapping between the German data classes and the English ones, like so:
```python
Address(
    label=company.adresse.bezeichnung,
    street=company.adresse.strasse,
    hints=company.adresse.hinweise,
    placeStreet=Street(
        zip_code=company.adresse.ortStrasse.plz,
        name=company.adresse.ortStrasse.name,
        country=company.adresse.ortStrasse.land.name
    ),
    lat=company.adresse.koordinaten.lat,
    lon=company.adresse.koordinaten.lon
)
```
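The English-side classes aren't shown in that snippet; here is a minimal sketch of what they could look like, with field names inferred from the mapping (in practice you'd keep them in a separate module so the English `Address` doesn't clash with the German one):

```python
from typing import Optional

from pydantic import BaseModel

class Street(BaseModel):
    zip_code: Optional[str] = None
    name: Optional[str] = None
    country: Optional[str] = None

class Address(BaseModel):
    label: Optional[str] = None
    street: Optional[str] = None
    hints: Optional[str] = None
    placeStreet: Optional[Street] = None
    lat: Optional[float] = None
    lon: Optional[float] = None
```

With everything `Optional`, the English models are just as forgiving of missing German data as the originals.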
This approach proved to be a game-changer. Pydantic handled the nested structure with ease and precision, and it gracefully handled the unexpected, providing default values when the data was not available.
Wrapping Up
Looking back at the entire experience, it was a rollercoaster ride of learning and discovery. It was a testament to the fact that no problem is too big when you have the right tools in your arsenal. Pydantic came to my rescue this time round, transforming a daunting challenge into an exciting learning journey. 🎉
If you've faced such challenges that had you scratching your head bald, I’d love to hear about them in the comments.
Remember, with each challenge, we only become better. So, here’s to tackling the next big one! 🥂 Happy coding! 🚀