This dataset contains the votes cast by each country (From country) for every other country (To country) in Eurovision 2016. Two kinds of votes are recorded: the Jury votes and the Televote. Because we want to see how the public voted, we will consider only the Televote. Our ultimate goal is to create a dendrogram that shows the relationships between countries, using hierarchical clustering.
We will load the data and keep only three columns: From country, To country and Televote Rank. Then we will reshape the data so that the rows are the From country, the columns are the To country and the values are the Televote Rank. Notice that a country cannot vote for itself, so those cells will be NA values. We will impute the NAs with Televote Rank = 1, assuming that each country would have given the highest score to itself if that were allowed. Bear in mind that we want to cluster the countries based on their voting preferences.
from scipy.cluster.hierarchy import linkage, dendrogram
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.vq import whiten
%matplotlib inline
eurovision = pd.read_csv("eurovision-2016.csv")
televote_Rank = eurovision.pivot(index='From country', columns='To country', values='Televote Rank')
# A country cannot vote for itself, so those cells are NaN; fill them with rank 1
televote_Rank.fillna(1, inplace=True)
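To make the reshape and imputation steps concrete, here is a minimal sketch on a made-up four-country table. The country names A to D and their ranks are hypothetical, and the choice of average linkage with Euclidean distance is an assumption for illustration only, not necessarily the setting used for the real data.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy votes in long format: one row per (voter, recipient) pair.
# Country names and ranks are invented purely for illustration.
toy = pd.DataFrame({
    "From country": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "To country":   ["B", "C", "D", "A", "C", "D", "A", "B", "D", "A", "B", "C"],
    "Televote Rank": [2, 3, 4, 2, 4, 3, 4, 3, 2, 3, 4, 2],
})

# Wide format: one row per voting country, one column per recipient country.
ranks = toy.pivot(index="From country", columns="To country", values="Televote Rank")

# The diagonal (a country voting for itself) is missing; impute it with rank 1.
ranks = ranks.fillna(1)

# Assumed settings: average linkage on Euclidean distances between rank rows.
Z = linkage(ranks.values, method="average", metric="euclidean")

# no_plot=True returns the tree layout without drawing anything.
tree = dendrogram(Z, labels=ranks.index.tolist(), no_plot=True)
print(ranks)
print(tree["ivl"])  # leaf labels in dendrogram order
```

Each row of the pivoted table is one country's full ranking of all countries, so two countries end up close together when they rank the field in a similar way. Dropping no_plot=True draws the dendrogram with matplotlib instead of only returning its layout.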