Increase accuracy in person name matching by using name component combinations
This is the sixth article of our journey into the Python data exploration world. You can find a list of the published articles here (and the source code here). So let’s move on.
Let’s recap what we achieved in the last lesson. We established a “fuzzy” matching algorithm for person names, using a normalization step as well as the power of the double metaphone algorithm.
All Name Components available
The algorithm can handle typos, special characters and other differences in the name attribute, BUT it fails if any of the name components are missing on the Twitter account name side.
Two Name Components Available, one only available as an abbreviation
The problem is that the Twitter account name is concatenated by the user and is therefore not deterministic (we assume the user isn’t using a pseudonym, so there is a chance for a match). As we saw in the last tutorial, we tried to fix certain anomalies, but the solution we worked out does not seem robust enough. So we have to invest additional time to build a more robust solution, which tackles the “Missing Name Components” problem by using combinatorics.
As part of my own ramp-up into the Python data science world, I recently started to play around with Jupyter Notebook for Python.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
It’s just awesome, and via the Anaconda distribution it is installed in minutes (well, you need some spare disk space on your machine).
I integrated it into my tutorial work-out process and from now on will describe the build-out of algorithms and the coding in more detail in dedicated Jupyter notebooks.
You can find the Jupyter notebook for this tutorial in my Github project. Using the Jupyter Notebook side by side with this tutorial allows you to play around and modify code sequences in the notebook in real time. Amazingly, Github renders a Jupyter Notebook file in its original and easily readable web view format, in case you just want to look at the read-only version!
Jupyter Notebook rendered in Github
What we want to achieve is that our algorithm is capable of managing missing name components. In the example below we could achieve a match by omitting the 2nd name component on both sides.
Obviously, we want a generic algorithm which can handle any complexity in the Twitter account name, i.e. a match should also be possible in a naming construct like:
The first problem we tackle with combinatorics; the second one we already solved by applying the well-known double metaphone algorithm.
To derive any kind of name component constellation in case of missing elements, we have to understand the basic concepts of combinatorics.
Let’s answer this question first. Too long away from school? Check out this helpful six-minute video (from betterExplained.com), which refreshes the various concepts.
Python’s standard library module itertools already provides the functions permutations and combinations, which we want to inspect a little bit further.
Let’s apply them to our name sample to understand the mechanism. First the combinations function:
As well as the permutations function:
As you can see, within a permutation the order of the name components plays a role: there is a tuple for ('peter','alfred') as well as one for ('alfred','peter'). Within a combination the order of the name components is irrelevant.
For us the order plays no role: ('peter','alfred') is treated the same as ('alfred','peter'), because we sort the name components anyway before we apply the double metaphone algorithm. Therefore we use the combinations function.
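To see the difference for yourself, here is a minimal sketch using the two itertools functions. The third name component 'escher' is only an illustrative assumption, added to make the result a bit more interesting.

```python
import itertools

# 'escher' is a made-up third component, used only for illustration
names = ("peter", "alfred", "escher")

# Combinations: order is irrelevant, so each pair shows up exactly once
pairs = list(itertools.combinations(names, 2))
print(pairs)
# [('peter', 'alfred'), ('peter', 'escher'), ('alfred', 'escher')]

# Permutations: order matters, so each pair shows up in both orders
ordered_pairs = list(itertools.permutations(names, 2))
print(len(ordered_pairs))                     # 6
print(("alfred", "peter") in ordered_pairs)   # True
print(("alfred", "peter") in pairs)           # False
```

Three pairs versus six ordered pairs: since we sort the components anyway, the extra permutations would only produce duplicate keys.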
Always think about “pin code” as a crib for “permutations”. Any pin code lock mechanism is based on the number of permutations of a given set of “digits” out of 10 elements: “1234” doesn’t unlock the screen which is locked with “4321”.
ATM — Photo by Mirza Babic on Unsplash
We now build up a lookup directory class which allows us to store name component combinations and compare them to a person name which we got from the Twitter API.
The method that adds a person calculates all combinations of the provided name tuple, generates the associated double metaphone key for each combination, and stores the key together with the unique person id in the lookup directory.
The sequence diagram of the method is shown below. Let’s look at the various helper methods which are required.
As a first step, we generate all combinations of a name tuple. For a 3-element tuple, for example, the list built up by the following method consists of the tuple itself, all 2-element combinations and all 1-element combinations.
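    import itertools

    @classmethod
    def generate_combinations(cls, name_tuple):
        # Start with the full tuple, then add all shorter combinations,
        # from (n - 1) elements down to single elements
        coms = [name_tuple]
        i = len(name_tuple) - 1
        while i > 0:
            coms.extend(itertools.combinations(name_tuple, i))
            i -= 1
        return coms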
For example, the 3-element name tuple below results in the following list of combinations.
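As a quick check, here is the same logic as a standalone function (outside of the class, purely for illustration), applied to a sample tuple. The name 'escher' is an assumed third component and not taken from the original data.

```python
import itertools


def generate_combinations(name_tuple):
    """Return the full tuple followed by all shorter combinations."""
    coms = [name_tuple]
    for size in range(len(name_tuple) - 1, 0, -1):
        coms.extend(itertools.combinations(name_tuple, size))
    return coms


combos = generate_combinations(("peter", "alfred", "escher"))
for combo in combos:
    print(combo)
# ('peter', 'alfred', 'escher')
# ('peter', 'alfred')
# ('peter', 'escher')
# ('alfred', 'escher')
# ('peter',)
# ('alfred',)
# ('escher',)
```

For n name components this yields 2^n - 1 tuples, so 7 for three components, which stays small for realistic person names.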
In this method, we build up the directory by adding each tuple to our lookup directory.
Our lookup directory is made of a tuple of two dictionaries, which are used to store the key-value pairs. Our keys come from the double metaphone tuple created from a name, and the value is the unique person identifier.
First, we create a normalized name string out of a name component tuple.
The result is a lower-case string of the sorted, concatenated name elements of the tuple.
The string is used to generate the double metaphone tuple, which provides our keys for the dictionaries.
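A minimal sketch of such a normalization step could look as follows. The function name normalize_name is hypothetical, and the normalization from the last lesson may well do more (for example, treat special characters); this version only covers the lower-casing, sorting and concatenation described here.

```python
def normalize_name(name_tuple):
    """Lower-case the name components, sort them and glue them together.

    Hypothetical helper: only the three steps described in the text.
    """
    return "".join(sorted(part.lower() for part in name_tuple))


print(normalize_name(("Peter", "Alfred")))   # alfredpeter
print(normalize_name(("alfred", "peter")))   # alfredpeter
```

Because of the sorting, both orders of the components end up in the same string and therefore in the same key.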
Before storing, we check whether we already have an entry with the same key in the dictionary. If there is no entry yet, we create one; otherwise we check whether the person identifier is already stored and, if not, add it to the value array.
Finally, we have the method ready, which allows us to add name tuples to the lookup directory.
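Putting these steps together, a lookup directory could be sketched as below. This is not the original class: the names NameLookupDirectory, add_person and phonetic_keys are assumptions, and phonetic_keys is only a crude placeholder that returns a (primary, secondary) pair in the shape of a double metaphone result. Swap in a real double metaphone implementation to get phonetic matching.

```python
import itertools


def phonetic_keys(text):
    """PLACEHOLDER, not double metaphone: primary key = consonants only."""
    primary = "".join(c for c in text.upper() if c.isalpha() and c not in "AEIOU")
    return (primary, "")  # an empty secondary key means "no second encoding"


def generate_combinations(name_tuple):
    coms = [name_tuple]
    for size in range(len(name_tuple) - 1, 0, -1):
        coms.extend(itertools.combinations(name_tuple, size))
    return coms


def normalize_name(name_tuple):
    return "".join(sorted(part.lower() for part in name_tuple))


class NameLookupDirectory:
    """Hypothetical sketch: a tuple of two dicts, one per phonetic key."""

    def __init__(self):
        self.lookup = ({}, {})

    def add_person(self, person_id, name_tuple):
        for combo in generate_combinations(name_tuple):
            keys = phonetic_keys(normalize_name(combo))
            for table, key in zip(self.lookup, keys):
                if not key:
                    continue  # nothing to store for an empty key
                ids = table.setdefault(key, [])  # create the entry if missing
                if person_id not in ids:         # store each person only once
                    ids.append(person_id)


# Three made-up persons who share the name component 'peter'
directory = NameLookupDirectory()
directory.add_person(1, ("peter", "alfred", "escher"))
directory.add_person(2, ("peter", "meier"))
directory.add_person(3, ("peter", "huber"))
print(directory.lookup[0]["PTR"])   # [1, 2, 3]
```

With the placeholder keys, the entry for the single component 'peter' ends up holding all three person identifiers, while a key such as the one for ('peter', 'escher') points to one person only.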
As an example, we add the following three persons to our lookup table.
As we can see in the output, our entry for Peter has an array of three person identifiers.
Our lookup directory is now ready and potentially filled with the person name data which we retrieved via our Government API.
What is missing is the method which allows us to look up a name which we retrieved via the Twitter API. We are going to tackle this now.
Our first step is to collect all lookup directory entries which match our searched name tuple into a result list.
Running the method on our sample lookup directory produces the following output:
As we can see, two tuples point to one person each, and one tuple points to three persons (obviously a surname is shared by multiple persons). The two tuples pointing to one person carry the same id. That means our match identified exactly one person. If the result contained single-person tuples pointing to different persons, our match would not be unique.
So we enhance our method to do this uniqueness check as well:
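The lookup including the uniqueness check can be sketched roughly like this. All names (build_directory, match_name, phonetic_key) and the three sample persons are assumptions made for illustration. To keep the sketch short it uses one dictionary with a single key per name, and phonetic_key is a crude placeholder rather than the double metaphone algorithm.

```python
import itertools


def phonetic_key(text):
    """PLACEHOLDER, not double metaphone: keeps only the consonants."""
    return "".join(c for c in text.upper() if c.isalpha() and c not in "AEIOU")


def generate_combinations(name_tuple):
    coms = [name_tuple]
    for size in range(len(name_tuple) - 1, 0, -1):
        coms.extend(itertools.combinations(name_tuple, size))
    return coms


def normalize_name(name_tuple):
    return "".join(sorted(part.lower() for part in name_tuple))


def build_directory(persons):
    """Map the key of every name combination to the set of person ids."""
    directory = {}
    for person_id, name_tuple in persons.items():
        for combo in generate_combinations(name_tuple):
            key = phonetic_key(normalize_name(combo))
            directory.setdefault(key, set()).add(person_id)
    return directory


def match_name(directory, name_tuple):
    """Return (person_id or None, list of (combination, matching ids))."""
    results = []
    for combo in generate_combinations(name_tuple):
        ids = directory.get(phonetic_key(normalize_name(combo)))
        if ids:
            results.append((combo, sorted(ids)))
    # Uniqueness check: all tuples that point to exactly one person
    # have to point to the same person
    single_ids = {ids[0] for _, ids in results if len(ids) == 1}
    person_id = single_ids.pop() if len(single_ids) == 1 else None
    return person_id, results


persons = {
    1: ("peter", "alfred", "escher"),
    2: ("peter", "meier"),
    3: ("peter", "huber"),
}
directory = build_directory(persons)

person_id, results = match_name(directory, ("peter", "escher"))
for combo, ids in results:
    print(combo, ids)
# ('peter', 'escher') [1]
# ('peter',) [1, 2, 3]
# ('escher',) [1]
print(person_id)   # 1
```

A lookup for ('peter',) alone returns None in this sketch, because the only hit points to three persons and no tuple singles out one of them.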
Let’s verify the algorithm on our test sample (as explained above, all the content is also available as an interactive Jupyter notebook).
So we are ready to apply the new lookup strategy to our program.
We enhance our abstract class by integrating an instance of the lookup directory class.
We also add the code sequence which builds up our lookup directory.
We also add a new method for the match check, which will be called when we try to merge table records.
The matching method from the previous tutorial is no longer required.
In this class we have to refactor the calling method as well: for the matching, we now call the new match method of the class instance.
If we have a match, we retrieve the full person record from the govAPI class.
Remember that this method gets called by pandas on each row record and, as a result, the row will be complemented with the additional new columns:
When we execute our program once again, our Twitter table looks like this:
One of the new columns lists the govAPI unique id, the other our result tuple list, which was analyzed.
Let’s check our algorithm for false positives.
What are false positives?
False positive means there are problems with the accuracy of our algorithm.
When we browse through our Twitter table, we encounter the records below.
They are both pointing to the same person id. When checking the govAPI table for the record, we get back the record “74 Christoph Eymann”, who doesn’t have a Twitter account and therefore cannot be found in the Twitter table.
What went wrong:
“Christophe Darbellay” as well as “Christoph Mörgeli” were politicians of the Swiss council in the past and are therefore not part of the govAPI list, which we filtered for active members only.
“Christophe” as well as “Christoph” are converted to the same double metaphone string and match the govAPI record 74 of “Christoph Eymann”. Because the govAPI list has only one person with the first name “Christoph”, our algorithm returns a false positive for any person with the first name “Christoph(e)” and matches them to “Christoph Eymann”. If the govAPI list contained two persons with the first name “Christoph”, the match record would point to two person ids and wouldn’t be unique anymore. In that case our algorithm wouldn’t produce a false positive.
Solution:
Well, we have to readjust our algorithm and make it more strict. That means we change the condition of our name tuple generator so that it only keeps tuples which contain the last name.
So we readjust our method and ask the caller for the position of the last name in the tuple when a person is added to the directory.
In the method, we now add only tuples which contain the last_name.
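One way to express this stricter rule is sketched below. The function name and the parameter last_name_pos are assumptions for illustration. The sketch builds the combinations over the positions in the tuple rather than over the values, so that a first name that happens to be spelled like the last name cannot slip through.

```python
import itertools


def generate_combinations_with_last_name(name_tuple, last_name_pos):
    """Return only those combinations that contain the last name.

    last_name_pos is the index of the last name within name_tuple
    (hypothetical parameter, supplied by the caller).
    """
    positions = range(len(name_tuple))
    coms = []
    for size in range(len(name_tuple), 0, -1):
        for picked in itertools.combinations(positions, size):
            if last_name_pos in picked:
                coms.append(tuple(name_tuple[i] for i in picked))
    return coms


print(generate_combinations_with_last_name(("christoph", "eymann"), 1))
# [('christoph', 'eymann'), ('eymann',)]
```

The single-element tuple ('christoph',) is no longer stored, so a Twitter name that shares only the first name cannot produce a match on its own.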
For the Twitter Name ‘Christian Levrat’ we found three entries in our lookup table:
Our matching algorithm had a positive match because both entries are pointing to the same person id.
Rerunning our program results in the following matching statistics.
Matching via Lookup Directory
This is actually not better than our first attempt (refer to the following tutorial), but we got a more robust matching algorithm. It looks like the Twitter politician list we used isn’t really up to date with respect to the active members of the federal assembly. However, that’s something for the next lesson, where we want to finalize the topic of data matching and move on.
You can find the source code in the corresponding Github project.
Happy coding then.