Python Tutorial: A Name Lookup Table for Fuzzy Name Data Sets

Increase accuracy in person name matching by using name component combinations

This is the sixth article of our journey into the Python data exploration world. You can find a list of the published articles here (and the source code here). So let’s move on.

The Goal of this Tutorial

Let’s recap what we achieved in our last lesson. We established a “fuzzy” matching algorithm for person names by combining a normalization step with the power of the double metaphone algorithm.

All Name Components available

The algorithm can handle typos, special characters and other differences in the name attribute, BUT it fails if any of the name components are missing on the Twitter account name side.

Two Name Components Available, one only available as an abbreviation

The problem is that the Twitter account name is composed freely by the user and is therefore not deterministic (we assume the user isn’t using a pseudonym, so there is a chance for a match). As we saw in the last tutorial, we tried to fix certain anomalies, but the resulting solution is not robust enough. So we have to invest additional time to build a more robust solution, which tackles the “Missing Name Components” problem by using combinatorics.

Get Jupyter Notebook for additional tutorial explanations

As part of my own ramp-up into the Python data science world, I recently started to play around with Jupyter Notebook for Python.

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. It’s just awesome, and via the Anaconda distribution it is installed in minutes (well, you need some spare disk space on your machine).

I integrated it into my tutorial work-out process and from now on will describe the build-out of algorithms and the coding in more detail in dedicated Jupyter notebooks.

You can find the Jupyter notebook for this tutorial in my Github project. Using the Jupyter Notebook side by side with this tutorial allows you to play around and modify code sequences in the notebook in real time. Conveniently, Github renders a Jupyter Notebook file in its original, easily readable web view format, in case you want to look at the read-only version.

Jupyter Notebook rendered in Github

Handling Missing Name Components in the Name Matching Process

What we want to achieve is that our algorithm is capable of managing missing name components. In the below example we could achieve a match by omitting the 2nd name component on both sides.

Obviously, we want a generic algorithm which can handle any complexity in the Twitter account name, i.e. a match should also be possible for a naming construct like:

  • “Dr. med. Peter A. Escher (Zurich)” or
  • “Dr. med. Peteer A. Escher (Zurich)”, where the user had a typo.

The first problem we tackle with combinatorics; the second one we already solved by applying the well-known double metaphone algorithm.

A Gentle Introduction to Combinations and Permutations

To derive all possible name component constellations in case of missing elements, we have to understand the basic concepts of combinatorics.

Should we use combinations or permutations?

Let’s answer this question first. Too long away from school? Check out this helpful six-minute video (from betterExplained.com), which refreshes the various concepts.

Python’s standard library module itertools already provides the functions permutations and combinations, which we want to inspect a little bit further.

Let’s apply them to our name sample to understand the mechanism. First the combinations function:

As well as the permutations function:

As you can see, within a permutation the order of the name components plays a role: there is a tuple for ('peter','alfred') as well as one for ('alfred','peter'). Within a combination the order of the name components is irrelevant.

For us, the order plays no role; ('peter','alfred') is treated the same as ('alfred','peter'). We sort the name components anyway before we apply the double metaphone algorithm. Therefore we use the combinations function.
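
The difference between the two itertools functions can be reproduced with a few lines. This is a minimal sketch; the three-component sample name is an assumption chosen for illustration, not necessarily the sample used in the notebook.

```python
import itertools

# Hypothetical sample name, used for illustration only.
name_tuple = ("peter", "alfred", "escher")

# Combinations: order is irrelevant, so each pair appears only once.
pairs = list(itertools.combinations(name_tuple, 2))
print(pairs)

# Permutations: order matters, so each pair appears in both orders.
ordered_pairs = list(itertools.permutations(name_tuple, 2))
print(ordered_pairs)
```

With three name components we get three 2-element combinations but six 2-element permutations.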

Always think of a “pin code” as a crib for “permutations”. Any pin code lock mechanism is based on the number of permutations of a given set of digits out of 10 elements: “1234” doesn’t unlock a screen which is locked with “4321”.

ATM — Photo by Mirza Babic on Unsplash

Build Up a Lookup Directory For Name Combinations

We now build up a lookup directory class which allows us to store name component combinations and compare them to a person name which we got from the Twitter API.

Adding a person to the lookup directory

The method add_person_to_lookup_directory calculates all combinations of the provided name tuple, generates the associated double metaphone key for each combination, and stores the key together with the unique person id in the lookup directory.

The sequence diagram of the method is shown below. Let’s look at the various helper methods which are required.

Method: generate_combinations

As a first step, we generate all combinations of a name tuple. For example, for a 3-element tuple, the resulting list consists of

  • the 3 element tuple itself
  • all tuples of the combination with 2 out of 3 elements
  • all tuples of the combination with 1 out of 3 elements
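
# Requires "import itertools" at the top of the module.
@classmethod
def generate_combinations(cls, name_tuple):
    # Start with the full name tuple itself.
    coms = [name_tuple]
    # Add all shorter combinations, from n-1 elements down to 1.
    i = len(name_tuple) - 1
    while i > 0:
        coms.extend(itertools.combinations(name_tuple, i))
        i -= 1
    return coms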

For a three-element name tuple, the result is a list of seven combinations: the tuple itself, three pairs, and three single-element tuples.
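
To see the resulting list, the method can be run as a plain function. This is a sketch under assumptions: the function is detached from its class for brevity, and the sample name is hypothetical.

```python
import itertools


def generate_combinations(name_tuple):
    """Return the full tuple plus every shorter combination of its elements."""
    combos = [name_tuple]
    for size in range(len(name_tuple) - 1, 0, -1):
        combos.extend(itertools.combinations(name_tuple, size))
    return combos


# Hypothetical sample name, used for illustration only.
combos = generate_combinations(("peter", "alfred", "escher"))
for combo in combos:
    print(combo)
```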

Method: add_combinations_to_directory

In this method, we build up the directory by adding each tuple to our lookup directory.

Our lookup directory is made of a tuple of two dictionaries, which are used to store the key-value pairs. Our keys are taken from the double metaphone tuple created from a name, and the value is an array of unique person identifiers.
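
A minimal sketch of such a data structure follows. The variable names, the key strings and the person id are assumptions for illustration; we also assume that the first dictionary is keyed by the first entry of the double metaphone tuple and the second dictionary by the second entry.

```python
# One dictionary per entry of the double metaphone tuple.
lookup_directory = ({}, {})

# Hypothetical key tuple and person id, for illustration only.
primary_key, secondary_key = ("PTR", "PTR2")
person_id = 1

lookup_directory[0][primary_key] = [person_id]
lookup_directory[1][secondary_key] = [person_id]

# A lookup with the first key returns the list of person ids.
print(lookup_directory[0].get("PTR"))
```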

The method is shown below

On code line 3 we create a normalized name string out of a name component tuple. The method looks as follows

The result is a lower-case string of the sorted, concatenated name elements.

The string is used to generate the double metaphone tuple, whose entries are the keys for our dictionaries.
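
The normalization step described above could be sketched as follows. The function name is hypothetical, and we assume that the elements are joined without a separator; the original method may differ in such details.

```python
def normalize_name_tuple(name_tuple):
    """Lower-case the components, sort them, and join them into one string."""
    return "".join(sorted(component.lower() for component in name_tuple))


# The order of the components does not influence the result.
first = normalize_name_tuple(("Peter", "Alfred", "Escher"))
second = normalize_name_tuple(("Escher", "Peter", "Alfred"))
print(first)  # alfredescherpeter
```

Because the components are sorted before they are joined, two tuples with the same components always produce the same string and therefore the same double metaphone key.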

On code lines 5 and 10, we check if we already have an entry with the same key in the dictionary.

  • If not, we add the new key-value pair.
  • If there is already an entry, we first check if the person id is already stored. If not, we add the person id to the value array.
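
The two cases above can be sketched for a single dictionary as follows. The helper name add_key and the sample key are assumptions for illustration, not the original method.

```python
def add_key(directory, key, person_id):
    """Store person_id under key, avoiding duplicate ids for the same key."""
    if key not in directory:
        # No entry yet: add the new key-value pair.
        directory[key] = [person_id]
    elif person_id not in directory[key]:
        # Entry exists: append the id only if it is not stored yet.
        directory[key].append(person_id)


directory = {}
add_key(directory, "PTR", 1)
add_key(directory, "PTR", 2)
add_key(directory, "PTR", 1)  # already stored, so nothing changes
print(directory)  # {'PTR': [1, 2]}
```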

Method: add_person_to_lookup_directory

Finally, we have the method ready, which allows us to add name tuples to the lookup directory.

As an example, we add the following three persons to our lookup table.

As we can see in the output, the entry for the key generated from “Peter” has an array of three person identifiers.
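
The whole add step can be sketched in one small class. Everything here is an assumption for illustration: the class name, the method names and the three sample persons are hypothetical, and phonetic_key is a trivial stand-in for the double metaphone call so that the sketch runs with the standard library only.

```python
import itertools


def phonetic_key(text):
    # Stand-in for the double metaphone call: returns a (first, second) tuple.
    return (text, "")


class NameLookupDirectory:
    def __init__(self, key_func=phonetic_key):
        self.key_func = key_func
        self.directory = ({}, {})

    @staticmethod
    def generate_combinations(name_tuple):
        combos = [name_tuple]
        for size in range(len(name_tuple) - 1, 0, -1):
            combos.extend(itertools.combinations(name_tuple, size))
        return combos

    @staticmethod
    def normalize(name_tuple):
        return "".join(sorted(part.lower() for part in name_tuple))

    def add_person(self, person_id, name_tuple):
        for combo in self.generate_combinations(name_tuple):
            keys = self.key_func(self.normalize(combo))
            for index, key in enumerate(keys[:2]):
                if not key:
                    continue  # assumption: empty keys are not stored
                ids = self.directory[index].setdefault(key, [])
                if person_id not in ids:
                    ids.append(person_id)


lookup = NameLookupDirectory()
lookup.add_person(1, ("Peter", "Alfred", "Escher"))
lookup.add_person(2, ("Peter", "Meier"))
lookup.add_person(3, ("Peter", "Keller"))
print(lookup.directory[0]["peter"])  # [1, 2, 3]
```

The key for the single component “peter” collects all three person ids, while a key such as “escherpeter” points to one person only.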

Matching a Name in our Lookup Directory

Our lookup directory is now ready and potentially filled with the person name data, which we retrieved via our Government API.

What is missing is the match method, which allows us to look up a name that we retrieved via the Twitter API. We are going to tackle this now.

Our first step is to collect all lookup directory entries which match our searched name tuple into a result list.

  • In code line 3 we generate all name component combinations via our existing method and then iterate over all combinations.
  • In code lines 5 and 6 we prepare the key tuple of a combination tuple for the lookup.
  • In code line 7 we check if our key exists. As you can see, we do the check only with the first entry of the double metaphone tuple, which is stored in the first entry of the lookup directory. We leave the full implementation based on the ranking feature of the double metaphone tuple as an exercise.
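
The collection step described in the bullets above could look roughly like this. It is a sketch under assumptions: the function names and the hand-built sample dictionary are hypothetical, and phonetic_key is again a trivial stand-in for the double metaphone call.

```python
import itertools


def phonetic_key(text):
    # Stand-in for the double metaphone call: returns a (first, second) tuple.
    return (text, "")


def generate_combinations(name_tuple):
    combos = [name_tuple]
    for size in range(len(name_tuple) - 1, 0, -1):
        combos.extend(itertools.combinations(name_tuple, size))
    return combos


def normalize(name_tuple):
    return "".join(sorted(part.lower() for part in name_tuple))


def collect_matches(name_tuple, directory, key_func=phonetic_key):
    """Return (combination, person ids) for every combination found."""
    matches = []
    for combo in generate_combinations(name_tuple):
        key = key_func(normalize(combo))[0]  # first key entry only
        if key in directory:
            matches.append((combo, directory[key]))
    return matches


# Hand-built first dictionary of a lookup directory, for illustration only.
first_directory = {"escherpeter": [1], "escher": [1], "peter": [1, 2, 3]}

matches = collect_matches(("Peter", "A.", "Escher"), first_directory)
for combo, person_ids in matches:
    print(combo, person_ids)
```

The abbreviation “A.” never produces a hit, but the combinations without it still do.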

Running the match method on our sample lookup directory produces the following output:

As we can see, we have two tuples which each point to one person and one tuple which points to three persons (obviously, a single name component can be shared by multiple persons). The two tuples pointing to one person have the same person id. That means our match identified exactly one person. If we had single-person tuples in our result which pointed to different persons, our match would not be unique.

So we enhance our method to do this uniqueness check, as well (code line 12–20):

  • Do we have one or multiple tuples in our match list which always point to one single person?
  • If yes, we found a unique record; otherwise we return None.
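
One possible way to express this uniqueness check is sketched below. The function name and the sample match lists are assumptions for illustration; the original method may be structured differently.

```python
def unique_person_id(matches):
    """Return the person id if all single-person matches agree, else None."""
    # Only entries that point to exactly one person are considered.
    single_ids = {ids[0] for _, ids in matches if len(ids) == 1}
    if len(single_ids) == 1:
        return single_ids.pop()
    return None


# Hypothetical match list: two single-person tuples with the same id.
matches = [
    (("Peter", "Escher"), [1]),
    (("Peter",), [1, 2, 3]),
    (("Escher",), [1]),
]
print(unique_person_id(matches))  # 1

# A single-person tuple pointing to another id makes the match ambiguous.
conflicting = matches + [(("Keller",), [3])]
print(unique_person_id(conflicting))  # None
```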

Let’s verify the algorithm on our test sample (as explained above all the content is also available as an interactive Jupyter notebook)

So we are ready to apply the new lookup strategy to our program.

Refactor Our Existing Classes

GovAPI class extension

We enhance our abstract GovAPI class by integrating an instance of the lookup directory class.

We enhance the GovAPI class with the code sequence that builds up our lookup directory (code line 22–29).

We also add a new method for the name match check, which will be called when we try to merge table records.

The previous matching method is not required anymore.

SocialMediaAnalyzer class refactoring

In this class we have to refactor the row matching method as well: for the matching, we now call the new match method of the govAPI class instance (line 5).

If we have a match (7–14), we retrieve the full person record from the govAPI class.

Remember that this method gets called via the pandas apply method on each row record and, as a result, the row is complemented with the additional new columns:

When we execute our program once again, our table retrieved from Twitter looks like this:

In one of the new columns we list the govAPI unique id and in the other our result tuple list, which was analyzed.

Assess our Algorithm for False Positives

Let’s check our algorithm for false positives.

What are false positives?

A false positive is a match reported by our algorithm although the two records do not belong to the same person. False positives mean there are problems with the accuracy of our algorithm.

False Positive 1

When we browse through our Twitter table, we encounter the records below.

They are both pointing to the same person id. When checking the govAPI table for the record, we get back the record “74 Christoph Eymann”, who doesn’t have a Twitter account and therefore cannot be found in the Twitter table.

What went wrong:

“Christophe Darbellay” as well as “Christoph Mörgeli” were politicians of the Swiss council in the past and are therefore not part of the govAPI list, which we filtered for active members only.

“Christophe” as well as “Christoph” are converted to the same double metaphone string and match the govAPI record 74 of “Christoph Eymann”. Because the govAPI list has only one person with the first name “Christoph”, our algorithm returns a false positive for any person with the first name “Christoph(e)” and matches it to “Christoph Eymann”. If the govAPI list had two persons with the first name “Christoph”, the match record would point to two person ids and wouldn’t be unique anymore. In that case our algorithm wouldn’t return a false positive.

Solution:

Well, we have to readjust our algorithm and make it stricter. That means we change the condition of our name tuple generator so that we only keep combinations which contain the last name.

So we readjust our method and let the caller pass the position of the last name in the tuple when a person is added to the directory.

In the add_combinations_to_directory method we now add only tuples which contain the last_name (line 3).
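
A minimal sketch of this stricter rule follows. The function name and the parameter last_name_pos are hypothetical, and the filter is applied while generating the combinations rather than while adding them, which is a simplification for illustration.

```python
import itertools


def generate_combinations_with_last_name(name_tuple, last_name_pos):
    """Return the full tuple plus all shorter combinations with the last name."""
    last_name = name_tuple[last_name_pos]
    combos = [name_tuple]
    for size in range(len(name_tuple) - 1, 0, -1):
        combos.extend(
            combo
            for combo in itertools.combinations(name_tuple, size)
            if last_name in combo
        )
    return combos


combos = generate_combinations_with_last_name(("christoph", "eymann"), 1)
print(combos)  # [('christoph', 'eymann'), ('eymann',)]
```

The tuple ('christoph',) is no longer stored on its own, so a first name alone can no longer produce a unique match.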

For the Twitter name ‘Christian Levrat’ we found three entries in our lookup table:

  • ‘christian levrat’, which maps to 1 person id (1150)
  • ‘christian’, which maps to 5 person ids
  • ‘levrat’, which maps to 1 person id (1150)

Our matching algorithm had a positive match because both single-person entries point to the same person id.

Rerunning our program results in the following matching statistics.

Matching via Lookup Directory

This is actually not better than our first attempt (refer to the previous tutorial), but we got a more robust matching algorithm. It looks like the Twitter politician list we used isn’t really up to date with regard to the active members of the federal assembly. However, that’s something for the next lesson, where we want to finalize the topic of data matching and move on.

You can find the source code in the corresponding Github project.

Happy coding then.
