Text Corpus Visualization Tool in Python

Recently, I was in need of an image for our blog and wanted it to have some _wow effect _or at least a _better fit than anything typical we’ve been using. _Pondering over ideas for a while, a word cloud flashed in my mind. 💡

Usually, you would just need a long string of text to generate one, but I thought of parsing our entire blog data to see if anything interesting pops out and to also get the holistic view of the keywords our blog uses in its entirety. So, I took this as a weekend fun project for myself.

PS: Images have a lot of importance in marketing. Give it quality!👀

Image for post

Photo by Henry Dick on Unsplash

Getting your hands dirty:

Our blog is hosted on **Ghost**and it allows to export all the posts and settings into a single, glorious JSON file. And, we have in-built json package in python for parsing JSON data. Our stage is set. 🤞

For other popular platforms like **_WordPress, Blogger, Substack, etc. _**it could be one or many XML files, you might need to switch the packages and do the groundwork in python accordingly.

Before you read into that JSON in python, you should get the idea of how it’s structured, what you need to read, what you need to filter out, etc. For that, use some JSON processor to pretty print your json file, I’d used jqplay.org and it helped me figure out where my posts are located ➡

data['db'][0]['data']['post']

Next, you’d like to call upon pd.json.normalize() to convert your data into a flat table and save it as a data frame.

👉 Make sure that you have updated version of pandas installed for _pd.json.normalize()_ to work as it has tweaked names in older versions.

Also, keep the encoding as UTF-8, as otherwise, you’re likely to run into UnicodeDecodeErrors. (We have these bad guys: ‘\xa0’ , ‘\n’, and ‘\t’ etc.)

import pandas as pd
import json

with open('fleetx.ghost.2020-07-28-20-18-49.json', encoding='utf-8') as file: 
    data = json.load(file)  
posts_df = pd.json_normalize(data['db'][0]['data']['posts']) 
posts_df.head()

Image for post

Posts Dataframe

Looking at the dataframe you can see that ghost is keeping three formats of the posts we created, mobiledoc (simple and fast renderer without an HTML parser), HTML and plaintext, and range of other attributes of the post. I choose to work with the plaintext version as it would require the least cleaning.

#blog #data-science #marketing #data-visualization #python

How to make a wordcloud of your blog, programmatically?
1.30 GEEK