A Name Across Time and Numbers

This post is a loving tribute to my daughter, whose eyes are shining stars in these dark and troubled nights.

Names evolve. Parents would take a name from pop culture, history, sports, current events, or sacred texts and add their own spin to it. The act of naming both evokes the meaning of the original name but also leaves blank pages for the newborn to write.

Take Cassandra for example. There are shorter variants such as Cass. Some consonants might be replaced. In some countries, ‘c’ can be replaced with ‘k’, like Kassandra. Interestingly, it is common in the Philippines, for example, to add an ‘h’ to the spelling (Cassandhra). Here’s a quick plot of the variants over time.

Image for post

Read on to see how I did it! Here’s the Kaggle Notebook, if anyone’s interested.

Identifying Variants

With data, we can see a glimpse of how names and their variants move in popularity over time. I used the US Baby Names dataset which is gathered from US Social Security data. I then use the Double Metaphone algorithm to group together words by their English pronunciation. Designed by Lawrence Phillips in 1990, the original Metaphone algorithm does its phonetic matching through complex rules for variations in vowel and consonant sounds. Since then, there have been two updates to the algorithm. Fortunately for us, there is a Python port from C/C++ code, the fuzzy library. The result is a grouping of words like:

Mark -> MRK 
Marc -> MRK 
Marck -> MRK 
Marco -> MRK

In the following code, we first get the fingerprint (a.k.a. hash code) of all the names in the data:

names = df["Name"].unique()
fingerprint_algo = fuzzy.DMetaphone()

list_fingerprint = []
for n in names:
    list_fingerprint.append(fingerprint_algo(n)[0])

The result is having an index for each of the names. Then with simple filtering, we can extract variants of both Cassandra and Cass.

def get_subset(df, df_fp, names):
    fingerprint_candidates = []
    for name in names:
        matches = df_fp[df_fp["name"] == name]["fingerprint"]
        fingerprint_candidates.extend(matches.values.tolist())

    name_candidates = df_fp.loc[df_fp["fingerprint"].isin(
fingerprint_candidates), "name"]

    df_subset = df[(df["Name"].isin(name_candidates)) & (df["Gender"] == "F")]
    return df_subset

## using my function
df_fp_names = pd.DataFrame([list_fingerprint, names]).T df_fp_names.columns=["fingerprint", "name"] df_subset = get_subset(df, df_fp_names, ["Cass", "Cassandra"])

#baby #code #names #visualization #kaggle

Identifying Variants

towardsdatascience.com

A Name Across Time and Numbers