Detecting regions of Neanderthal ancestry with Deep Learning

Detecting regions of Neanderthal ancestry with Deep Learning

Detecting regions of Neanderthal ancestry with Deep Learning.I demonstrated how to use Deep Learning for Ancient DNA, Single Cell Biology OMICs Data Integration,

In the previous posts, I demonstrated how to use Deep Learning for Ancient DNA, Single Cell Biology OMICs Data Integration, Clinical Diagnostics and Microscopy Imaging Today we are going to dive into the exciting History of Human Evolution and learn that it is straightforward to borrow methodology from the Natural Language Processing (NLP) and apply it to Human Population Genetics in order to infer regions of [Neanderthal introgression]

Brief History: Out of Africa

When ancestors of Modern Humans migrated out of Africa ~ 50 000 - 70 000 years ago, they encountered Neanderthals and Denisovans, two groups of ancient hominins that populated Europe and Asia at that time. We know that Modern Humans interbred with both Neanderthals and Denisovans since there is evidence of the presence of their DNA in genomes of Modern Humans of non-African origin. These great genetic advances in the History of Human Evolution became possible due to the sequencing of Neanderthal and Denisovan genomes in 2010 and 2012, respectively, by the group of Svante Pääbo.

There were a few attempts made to infer the exact positions of DNA segments inherited from Neanderthals and Denisovans. They utilized various genomics resources such as 1000 Genomes project and computed different measures of local similarity of a non-African genome toward the high-quality Neanderthal genome and divergence with respect to African genomes (right figure below).

S* / S’ statistics and Conditional Random Field (CRF) were applied to detect Neanderthal introgression

Those similarity measures (summary statistics) despite being interpretable and efficient result in the loss of information because they attempt to capture a stretch of DNA in a single number without considering the ordered sequence of nucleotides itself. Other attempts to scan Neanderthal introgression across a genome use Hidden Markov Model (HMM) which is a memoryless model that again does not account for long-range correlations between nucleotides along the DNA sequence. This is where the power of Deep Learning to process raw genomic sequences and utilize the long memory of nucleotide linkage across the genome with RNNs / LSTMs and CNNs can be incredibly beneficial.

Potential of Deep Learning for Ancient Genomics

Currently, Deep Learning is basically ignored in Evolutionary Biology despite its great potential for working with Genomics data that represent one of the major data sources in the present Evolutionary Science and Ancient DNA research areas. Recent attempts utilize simulated genomic data despite the availability of real Genomics data that often have annotated regions with well understood biological function. Deep Learning is ideally suited for working with Genomics and Ancient Genomics simply because one genome is actually a Big Data. To realize this, just think about chopping the 310⁹ nucleotides long genome into stretches of 1000 nucleotides which brings 310⁶ training examples to be used for Deep Learning providing that an annotation (labels) can be figured out for each of the DNA stretches. That is a massive amount of training data!

Deep Learning for Genomics from Zou et al. Nature Genetics 51, pages12–18 (2019)

I demonstrated how to use this idea for discriminating ancient from modern DNA sequences in my previous post. There I used a 1D Convolutional Neural Network (CNN) and a one-hot-encoded representation of the sequences. Here I am going to demonstrate another way to represent DNA sequences for input into Deep Learning. Please look at the left figure below, this is what we usually mean by text. However, this is not what people in Genomics mean by text. What they mean by text is shown to the right. DNA sequence is a text! It looks boring but with time you start seeing things in strings. What if I told you that I see a gene there? What if I told you that I see a gene strongly linked to Type 2 Diabetes (T2D) and we likely inherited this gene from Neanderthals?

What people usually mean by text (left), vs. what bioinformaticians mean by text (right), SLC16A11 gene

Now, if DNA sequence is a text we can apply all the apparatus from the Natural Language Processing (NLP) to such texts. However, where are sentences in the DNA texts and where are words? If we were to predict whether a bunch of DNA sequences (the bunch can be considered as a text) was inherited from Neanderthals or not, one DNA sequence from the bunch can be considered as a sentence of the text, and a k-mer (sub-sequence) can be treated as a word.

DNA sequence is a sentence that can be split into k-mers, space-delimited k-mers can be seen as words

Once we converted DNA sequences into space-delimited k-mers / words, we are done. Now we can turn to advanced NLP techniques and use e.g. a simple Bag Of Words model for comparing frequencies of words / k-mers between the sequences inherited from Neanderthals and sequences of depleted archaic ancestry.

Prepare Sequences for Sentiment Analysis

Now let us demonstrate how to practically use Machine / Deep Learning and Natural Language Processing (NLP) for identifying the regions of Neanderthal introgression in modern human genomes. Here I will formulate the following problem to solve:

Show me a stretch of your DNA and I will predict how likely it was inherited from Neanderthals

As a training data set we are going to use coordinates of the candidate regions for Neanderthal introgression from Vernot and Akey, Science 2016 identified using S*-statistic on Europeans and Asians from the 1000 Genomes Project, the data can be downloaded from here We used the introgressed haplotypes file LL.callsetEUR.mr_0.99.neand_calls_by_hap.bed.merged.by_chr.bed and selected only unique coordinates, so we ended up with 83 601 regions of Neanderthal ancestry in modern Europeans. Let us read the coordinates and have a look at the length distribution of the Neanderthal introgressed regions.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
intr_coords = pd.read_csv('Akey_intr_coords.bed', header = None, sep = "\t")
intr_lengths = intr_coords.iloc[:, 2]-intr_coords.iloc[:, 1]
plt.title("Distribution of Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.xlabel("Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)

We can see that the length of Neanderthal introgression segments varies from 10 kbp to 1.2 Mbp with the mean of ~100 kbp. Now we are going to use those coordinates and extract the actual sequences from the hg19 version of the human reference genome which has a fasta-file format and can be downloaded from We can of course use Python for sequence extraction but it is much faster to do it with the C++ based samtools.

import subprocess
a = 0
with open('hg19_intr_regions.fa', 'a') as fp:
    for i in range(intr_coords.shape[0]):
        coord = str(str(intr_coords.iloc[i, 0]) + ':' 
                    + str(intr_coords.iloc[i, 1]) + '-' + str(intr_coords.iloc[i, 2]))['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
        a = a + 1
        if a%10000 == 0:
            print('Finished ' + str(a) + ' Neanderthal introgressed haplotypes')

Now we are going to build a Pandas DataFrame with coordinates of segments outside of the Neanderthal introgression coordinates. For this purpose, we are going to randomly draw DNA stretches of the same length as the introgression regions on the same chromosome and check whether these stretches intersect with any of the introgressed regions. In this way, we construct two sets of non-overlapping coordinates: introgressed and depleted regions.

import numpy as np
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header = None, sep = "\t")
chr_sizes = chr_sizes.drop([2, 3, 4], axis = 1)
chr_list = []; start_list = []; end_list = []
intr_lengths = list(intr_coords.iloc[:, 2] - intr_coords.iloc[:, 1])
a = 0
for i in range(intr_coords.shape[0]):
    chr_df = intr_coords[intr_coords[0].isin([intr_coords.iloc[i,0]])]
    overlap = True
    while overlap == True:
        reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == 
        reg_end = reg_start + intr_lengths[i]
        for j in range(chr_df.shape[0]):
            b1 = chr_df.iloc[j,1]
            b2 = chr_df.iloc[j,2]
            if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2) or \
            (b1 > reg_start and b1 < reg_end) or (b2 > reg_start and b2 < reg_end):
                overlap = True
                overlap = False
    a = a + 1
    if a%10000 == 0:
            print('Finished ' + str(a) + ' Neanderthal introgressed haplotypes')
depl_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
depl_coords.to_csv("Akey_depl_coords.bed", index = False, header = False, sep = "\t")

Once the DataFrame containing regions of depleted Neanderthal ancestry has been built, we can again extract the actual DNA sequences from the hg19 fasta corresponding to the depleted segments with samtools, here I skip for brevity this step but check the complete Jupyter notebook on github to see the details. Taking a closer look at the extracted introgressed and depleted sequences we can notice quite a few N-nucleotide containing sequences. This is the missing data in Genomics, i.e. positions not resolved by the sequencing technologies. In order not to let the missing data affect our analysis I omitted the sequences containing at least one N-nucleotide, this can be efficiently done in BioPython:

from Bio import SeqIO
intr_file = 'hg19_intr_regions.fa'
depl_file = 'hg19_depl_regions.fa'
a = 0; i = 0
with open('hg19_intr_clean.fa', 'a') as intr_out,open('hg19_depl_clean.fa', 'a') as depl_out:
    for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
        upper_intr = intr.seq.upper()
        upper_depl = depl.seq.upper()
        a = a + 1
        if a%10000 == 0:
            print('Finished ' + str(a) + ' entries')
        if 'N' not in str(upper_intr) and 'N' not in str(upper_depl):
            intr.seq = upper_intr
            SeqIO.write(intr, intr_out, 'fasta')
            depl.seq = upper_depl
            SeqIO.write(depl, depl_out, 'fasta')
            i = i + 1
print('We have processed ' + str(a) + ' entries and written ' + str(i) 
      + ' entries to two fasta-files')

Now the data is prepared for inputting into NLP analysis. Let us proceed with the simple Bag Of Words model, i.e. looking at the difference in frequency of k-mers between Neanderthal introgressed vs. depleted sequences.

Sentiment Analysis: Introgressed vs. Depleted

We are going to start with formatting the sequences from the two fasta-files as texts with space-delimited k-mers as words. Because of memory limitations of my laptop I read only first 10,000 nucleotides from each sequence, then I used the function getKmers that splits each sequence into k-mers and merged the k-mers in space-delimited manner, so at the end, I had a list of sentences, each of them represents a list of words / k-mers.

from Bio import SeqIO
from Bio.Seq import Seq
intr_file = 'hg19_intr_clean.fa'; depl_file = 'hg19_depl_clean.fa'; e = 0
intr_seqs = []; depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    cutoff = 10000
    my_intr_seq = str(intr.seq)[0:cutoff]
    my_depl_seq = str(depl.seq)[0:cutoff]
    e = e + 1
    if e%20000 == 0:
        print('Finished ' + str(e) + ' entries')
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]

kmer = 5
intr_texts = [' '.join(getKmers(i, kmer)) for i in intr_seqs]
depl_texts = [' '.join(getKmers(i, kmer)) for i in depl_seqs]

Now we can easily visualize k-mer frequencies in Neanderthal introgressed and depleted sequences using the Counter class in Python for efficient word counting.

from collections import Counter
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (20,18))
fig.subplots_adjust(hspace = 0.4, wspace = 0.4)

plt.subplot(2, 1, 1)
intr_sentences = [item.split(' ') for item in intr_texts]
D = dict(Counter([item for sublist in intr_sentences for item in sublist]).most_common(20)), list(D.values()), align='center')
plt.title('Most Common K-mers for Neanderthal Introgressed Regions', fontsize = 20)
plt.ylabel("Counts", fontsize = 20); plt.xticks(rotation = 90)
plt.xticks(range(len(D)), list(D.keys()), fontsize = 20)

plt.subplot(2, 1, 2)
depl_sentences = [item.split(' ') for item in depl_texts]
D = dict(Counter([item for sublist in depl_sentences for item in sublist]).most_common(20)), list(D.values()), align='center')
plt.title('Most Common K-mers for Neanderthal Depleted Regions', fontsize = 20)
plt.ylabel("Counts", fontsize = 20); plt.xticks(rotation = 90)
plt.xticks(range(len(D)), list(D.keys()), fontsize = 20)

Although A- and T-rich k-mers seem to be most common for both Neanderthal introgressed and depleted regions, we observe small differences in the k-mer counts between the two cases which can be indicative for non-identical k-mer composition in the introgressed and depleted sequences. Next, we encode the words / k-mers as integers with CountVectorizer class which simply builds a vocabulary of the text (number of unique k-mers) and counts occurrences of each word / k-mer in each sentence/sequence.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
merge_texts = intr_texts + depl_texts
labels = list(np.ones(len(intr_texts))) + list(np.zeros(len(depl_texts)))
cv = CountVectorizer()
X = cv.fit_transform(merge_texts)
X = np.int32(X.toarray())
X_train, X_test, y_train, y_test = train_test_split(X, labels, 
                                                    test_size = 0.20, random_state = 42)

After we have split the data set into train and test subsets, we define a feed-forward neural network with Stochastic Gradient Descent (SGD) optimizer + Momentum and L1 weight regularizer.

from keras.models import Sequential
from keras.regularizers import l2, l1
from keras.callbacks import ModelCheckpoint
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Dense, Flatten, Dropout

model = Sequential()
model.add(Dense(3000, input_shape = (X.shape[1], ), activation = 'sigmoid', 
                kernel_regularizer = l1(0.00001)))
model.add(Dense(1, activation = 'sigmoid'))
sgd = SGD(lr = 0.0001, momentum = 0.9, nesterov = False)
model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['binary_accuracy'])
checkpoint = ModelCheckpoint("", monitor='val_binary_accuracy', verbose=1, 
                             save_best_only = True, mode = 'max')
history =, y_train, 
                    epochs = 200, verbose = 1, validation_split = 0.2, batch_size = 32, 
                    shuffle = True, callbacks = [checkpoint])

MLP Accuracy training curve (left) and confusion matrix of evaluation on test data set (right)

The model reached the accuracy of 82.2% on the test data set classifying the sequences of Neanderthal introgressed vs. depleted origin. This was a clearly better result compared to linear models such as Logistic Regression, Support Vector Machines (SVM) or Naive Bayes Classifier, however, unexpectedly, the Random Forest Classifier reached even higher accuracy of 84.5% on the same test data set. Displaying the feature importances of the Random Forest model we observe such k-mers as AAAAA, CAAAA, CATTT and TTTTT to be the most predictive. We immediately get an intuition that the prediction of Neanderthal introgressed/depleted regions has something to do with the GC/AT content because the most predictive k-mers are extremely AT-rich.

import pickle
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators = 500), y_train)
y_pred = classifier.predict(X_test)
pickle.dump(classifier, open('RF_model_Neand_Intr_vs_Depl.sav', 'wb'))

importances = classifier.feature_importances_
std = np.std([tree.feature_importances_ for tree in classifier.estimators_], axis = 0)
indices = np.argsort(importances)[::-1]
plt.title("Feature importances", fontsize = 20)[1])[0:50], importances[indices][0:50], 
        yerr = std[indices][0:50], align = "center")
plt.xticks(rotation = 90); plt.ylabel('Score', fontsize = 20)
plt.xticks(range(X_train.shape[1])[0:50], np.array(names)[indices][0:50])

Random Forest classifier confusion matrix (left) and feature importances dominated by AT-rich k-mers (right)

From the Confusion Matrix for Random Forest Classifier (above) it becomes clear that the model had a very high accuracy when predicting segments of depleted Neanderthal ancestry, however, performed very poorly (worse than the neural network) on classifying introgressed regions.

Predict Genes Inherited from Neanderthals

Now let us come back to the question of predicting from a given DNA sequence whether it was inherited from Neanderthals or not. Here I will generate such predictions for human protein-coding genes. Let us download the RefSeq gene annotation file for hg19 here, clean it and use the coordinates of the genes in order to extract their sequences from the reference genome fasta-files similarly to how we did it for the introgressed and depleted regions, see details in the Jupyter notebook on github. Again we construct the gene texts via space-delimited concatenation of the k-mers.

from Bio import SeqIO
gene_seqs = []; gene_ids = [], a = 0
for gene in SeqIO.parse('hg19_gene_clean.fa', 'fasta'):
    cut = 8800
    if len(str(gene.seq)) < cut:
    s_gene = str(gene.seq)[0:cut]
    if s_gene.count('A')>0 and s_gene.count('C')>0 and s_gene.count('G')>0 \
    and s_gene.count('T')>0:
    a = a + 1
    if a%10000 == 0:
        print('Finished ' + str(a) + ' genes')

def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]

kmer = 5
gene_texts = [' '.join(getKmers(i, kmer)) for i in gene_seqs]

Next, we are going to use the trained Random Forest classifier to generate predictions for the sequences of the genes that were previously converted to texts. Hence, the gene texts are converted to integers with CountVectorizer.

import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

classifier = pickle.load(open('RF_model_Neand_Intr_vs_Depl.sav', 'rb'))
cv = CountVectorizer(); X_gene = cv.fit_transform(gene_texts)
gene_predictions = classifier.predict(X_gene.toarray())
gene_predictions_prob = classifier.predict_proba(X_gene.toarray())
gene_predictions_prob_0 = [i[0] for i in gene_predictions_prob]
gene_predictions_prob_1 = [i[1] for i in gene_predictions_prob]

gene_ids = []; gene_symbol = []
with open('gene_ids.txt','r') as fin:
    for line in fin:
        line = line.split('\t')

gene_pred_df = pd.DataFrame({'Gene': gene_ids, 'Gene_Symbol': gene_symbol, 
                            'Predict': gene_predictions, 
                            'Prob_0': gene_predictions_prob_0, 
                            'Prob_1': gene_predictions_prob_1})
gene_pred_df = gene_pred_df.sort_values(['Prob_1'], ascending = False)

Genes predicted to have high (left) and low (right) probability to be inherited from Neanderthals

In this way, for each gene we generate a probability (Prob_1) to be inherited from Neanderthals and to be a native human gene (Prob_0). We can easily display the genes with highest Prob_1 (lowest Prob_0) and highest Prob_0 (lowest Prob_1), please see the gene lists above. Remarkably, out of ~22000 protein-coding genes, only 477 were predicted to contain Neanderthal DNA. That is a very interesting observation, we will revisit it later.

Visualization of K-mer / Word Embeddings

The vocabulary of the genomic texts produced by concatenating k-mers can be visualized via a Word Embedding model Word2Vec which makes intelligent guesses about the similarity between words (k-mers in our case). Once fit on our data Word2Vec represents each k-mer as a 100-dimensional latent vector, the vectors can be viewed as another numeric representation of our data that can be input into any dimension reduction technique for further exploration. Here we are going to embed and visualize with UMAP and tSNE the k-mers used for the Neanderthal introgression vs. depletion Sentiment Analysis. We can detect two obvious k-mer clusters. A closer look shows that one of them seems to be enriched for AT-rich k-mers and the other one (smaller one) is predominantly composed of GC-rich k-mers.

from umap import UMAP
import matplotlib.pyplot as plt
from gensim.models import Word2Vec

kmer = 5
sequences = intr_seqs + depl_seqs; sentences = []
for i in range(len(sequences)):
    sentences.append(getKmers(sequences[i], kmer))

model = Word2Vec(sentences, min_count = 2, workers = 4)
X = model[model.wv.vocab]

X_reduced = PCA(n_components = 5).fit_transform(X)
umap_model = UMAP(n_neighbors = 30, min_dist = 0.2, n_components = 2)
umap = umap_model.fit_transform(X_reduced)
plt.scatter(umap[:, 0], umap[:, 1], s = 10, cmap = 'tab10')
plt.title('UMAP: All K-mers', fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20); plt.ylabel("UMAP2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
    if word == 'AAAAA':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'CAAAA':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'CATTT':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TTTTT':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 10, c = 'red')

I specifically highlighted by green the most predictive k-mers according to the Random Forest classifier that discriminates between Neanderthal introgressed vs. depleted sequences. As expected they all fall into the bigger AT-rich cluster. This Word Embedding visualization is yet another evidence of some structure present in the k-mer sentences, which is capable of predicting the Neanderthal ancestry and has something to do with the balance between GC- and AT-rich segments of DNA.

Evolution Eliminates Neanderthal DNA from Genes

Summarizing everything discussed in the previous sections one can conclude that the presence of Neanderthal ancestry in our DNA suspiciously correlates with the GC / AT-content across the genome. It is well known that our genes are GC-rich, their GC-content is around 47% compared to 41% genome-wide, which is a dramatic difference in GC-content.

Human genes are typically GC-rich which is indicative for absence of Neanderthal ancestry

Therefore we can ask how much actually the Neanderthal introgression and depletion regions overlap with the genes? We can quickly compute the number of intersects between the genes and introgression coordinates with bedtools intersect. However, since the coordinates of the Neanderthal depleted regions were previously selected randomly (we only demanded their non-overlapping with the introgression coordinates), we should repeat the selection procedure a number of times in order to estimate the significance of the intersects.

import subprocess
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

perm_n = []
for k in range(20):
    chr_list = []; start_list = []; end_list = []
    intr_lengths = list(intr_coords.iloc[:, 2] - intr_coords.iloc[:, 1])
    a = 0
    for i in range(intr_coords.shape[0]):
        chr_df = intr_coords[intr_coords[0].isin([intr_coords.iloc[i,0]])]
        overlap = True
        while overlap == True:
            reg_end = reg_start + intr_lengths[i]
            for j in range(chr_df.shape[0]):
                b1 = chr_df.iloc[j,1]; b2 = chr_df.iloc[j,2]
                if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2) \ 
                or (b1 > reg_start and b1 < reg_end) or (b2 > reg_start and b2 < reg_end):
                    overlap = True
                    overlap = False
        start_list.append(reg_start); end_list.append(reg_end)
        a = a + 1
        if a%20000 == 0:
            print('Finished ' + str(a) + ' Neanderthal haplotypes')
    depl_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
    depl_coords.to_csv("depl_temp.txt", index = False, header = False, sep = "\t")

    with open('n_intersects.txt', 'w') as fp:['bedtools', 'intersect', '-a', 'depl_temp.txt', '-b', 
                        'gene_coords.txt'], stdout = fp)
    akey_n = pd.read_csv('n_intersects.txt', header = None, sep = "\t")
    print(k, akey_n.shape[0])
plt.axvline(x = 140821, linewidth = 4, color = 'r') 
plt.title("Distribution of Gene-Depletion Intersects: Vernot and Akey, Science 2016", 
          fontsize = 30)
plt.xlabel("Number of Intersects Between Gene and Neanderthal Depleted Regions", 
           fontsize = 30)
plt.ylabel("Frequency", fontsize = 30)

We can see that the number of intersects between the genes and Neanderthal introgression segments, 140 821, is significantly lower than the number of intersects between genes and Neanderthal depleted regions. In fact, none of the 17 randomly drawn regions of Neanderthal depletion had the number of intersects with the genes below or equal 140 821, so we can compute p-value as p-value < 1 / 17. What we see is that the Neanderthal introgressed regions fall predominantly outside of the genes, which implies that Evolution did not prioritize the Neanderthal ancestry and tried to push it away from the most functional elements of our genome, which are protein-coding genes. Therefore for some mysterious reasons interbreeding with Neanderthals did not improve our fitness and Evolution tried to correct for this. They talk a lot in newspapers and magazines about a very cool fact of interbreeding between Modern Humans and Neanderthals, but they never mention that this was not beneficial for Modern Humans, do they?


In this post, we have learnt about a huge potential of Machine / Deep Learning and Natural Language Processing (NLP) for Evolutionary Biology and Ancient DNA research areas, which is still not fully utilized. Genomics data provides one of the main sources of information in Evolutionary Biology and comprises millions and billions of short DNA fragments that can and should be analyzed with Deep Learning. Here we demonstrated how to prepare Genomics data for NLP and analyze it with Bag Of Words and Word Embedding. We trained a model capable of predicting whether a genomic sequence was inherited from Neanderthals. Using the model we constructed lists of genes that were likely inherited from Neanderthals and discovered a lack of archaic ancestry in our genes which suggests that interbreeding with Neanderthals was evolutionary, not beneficial for modern humans.

Important to mention that I am currently working on various types of neural network architectures with long memories across the sequence such as RNNs / LSTMs, CNNs / multichannel CNNs with application to Ancient Genomics and am going to come back to this topic in my later posts.

As usual, let me know in the comments your favorite area in Life Sciences which you would like to address within the Deep Learning framework. Follow me at Medium Nikolay Oskolkov, in Twitter @NikolayOskolkov and connect in Linkedin. The complete Jupyter notebook can be found on my github. I plan to write the next post about How to predict population size back in time with Deep Learning, stay tuned.

Artificial Intelligence vs. Machine Learning vs. Deep Learning

Artificial Intelligence vs. Machine Learning vs. Deep Learning

Learn the Difference between the most popular Buzzwords in today's tech. World — AI, Machine Learning and Deep Learning

In this article, we are going to discuss we difference between Artificial Intelligence, Machine Learning, and Deep Learning.

Furthermore, we will address the question of why Deep Learning as a young emerging field is far superior to traditional Machine Learning.

Artificial Intelligence, Machine Learning, and Deep Learning are popular buzzwords that everyone seems to use nowadays.

But still, there is a big misconception among many people about the meaning of these terms.

In the worst case, one may think that these terms describe the same thing — which is simply false.

A large number of companies claim nowadays to incorporate some kind of “ Artificial Intelligence” (AI) in their applications or services.

But artificial intelligence is only a broader term that describes applications when a machine mimics “ cognitive “ functions that humans associate with other human minds, such as “learning” and “problem-solving”.

On a lower level, an AI can be only a programmed rule that determines the machine to behave in a certain way in certain situations. So basically Artificial Intelligence can be nothing more than just a bunch of if-else statements.

An if-else statement is a simple rule explicitly programmed by a human. Consider a very abstract, simple example of a robot who is moving on a road. A possible programmed rule for that robot could look as follows:

Instead, when speaking of Artificial Intelligence it's only worthwhile to consider two different approaches: Machine Learning and Deep Learning. Both are subfields of Artificial Intelligence

Machine Learning vs Deep Learning

Now that we now better understand what Artificial Intelligence means we can take a closer look at Machine Learning and Deep Learning and make a clearer distinguishment between these two.

Machine Learning incorporates “ classical” algorithms for various kinds of tasks such as clustering, regression or classification. Machine Learning algorithms must be trained on data. The more data you provide to your algorithm, the better it gets.

The “training” part of a Machine Learning model means that this model tries to optimize along a certain dimension. In other words, the Machine Learning models try to minimize the error between their predictions and the actual ground truth values.

For this we must define a so-called error function, also called a loss-function or an objective function … because after all the model has an objective. This objective could be for example classification of data into different categories (e.g. cat and dog pictures) or prediction of the expected price of a stock in the near future.

When someone says they are working with a machine-learning algorithm, you can get to the gist of its value by asking: What’s the objective function?

At this point, you may ask: How do we minimize the error?

One way would be to compare the prediction of the model with the ground truth value and adjust the parameters of the model in a way so that next time, the error between these two values is smaller. This is repeated again and again and again.

Thousands and millions of times, until the parameters of the model that determine the predictions are so good, that the difference between the predictions of the model and the ground truth labels are as small as possible.

In short machine learning models are optimization algorithms. If you tune them right, they minimize their error by guessing and guessing and guessing again.

Machine Learning is old…

Machine Learning is a pretty old field and incorporates methods and algorithms that have been around for dozens of years, some of them since as early as the sixties.

Some known methods of classification and prediction are the Naive Bayes Classifier and the Support Vector Machines. In addition to the classification, there are also clustering algorithms such as the well-known K-Means and tree-based clustering. To reduce the dimensionality of data to gain more insights about it’ nature methods such as Principal component analysis and tSNE are used.

Deep Learning — The next big Thing

Deep Learning, on the other hand, is a very young field of Artificial Intelligence that is powered by artificial neural networks.

It can be viewed again as a subfield of Machine Learning since Deep Learning algorithms also require data in order to learn to solve tasks. Although methods of Deep Learning are able to perform the same tasks as classic Machine Learning algorithms, it is not the other way round.

Artificial neural networks have unique capabilities that enable Deep Learning models to solve tasks that Machine Learning models could never solve.

All recent advances in intelligence are due to Deep Learning. Without Deep Learning we would not have self-driving cars, chatbots or personal assistants like Alexa and Siri. Google Translate app would remain primitive and Netflix would have no idea which movies or TV series we like or dislike.

We can even go so far as to say that the new industrial revolution is driven by artificial neural networks and Deep Learning. This is the best and closest approach to true machine intelligence we have so far. The reason is that Deep Learning has two major advantages over Machine Learning.

Why is Deep Learning better than Machine Learning?

The first advantage is the needlessness of Feature Extraction. What do I mean by this?

Well if you want to use a Machine Learning model to determine whether a given picture shows a car or not, we as humans, must first program the unique features of a car (shape, size, windows, wheels etc.) into the algorithm. This way the algorithm would know what to look after in the given pictures.

In the case of a Deep Learning model, is step is completely unnecessary. The model would recognize all the unique characteristics of a car by itself and make correct predictions.

In fact, the needlessness of feature extraction applies to any other task for a deep learning model. You simply give the neural network the raw data, the rest is done by the model. While for a machine learning model, you would need to perform additional steps, such as the already mentioned extraction of the features of the given data.

The second huge advantage of Deep Learning and a key part in understanding why it’s becoming so popular is that it’s powered by massive amounts of data. The “Big Data Era” of technology will provide huge amounts of opportunities for new innovations in deep learning. To quote Andrew Ng, the chief scientist of China’s major search engine Baidu and one of the leaders of the Google Brain Project:

The analogy to deep learning is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms.

Deep Learning models tend to increase their accuracy with the increasing amount of training data, where’s traditional machine learning models such as SVM and Naive Bayes classifier stop improving after a saturation point.

Learn Python Programming

Learn Python Programming

Learn Python Programming

Learn Python Programming

Learn Python Programming and increase your python programming skills with Coder Kovid.

Python is the highest growing programming language in this era. You can use Python to do everything like, web development, software development, cognitive development, machine learning, artificial intelligence, etc. You should learn python programming and increase your skills of programming.

In this course of learn python programming you don't need any prior programming knowledge. Every beginner can start with.

Basic knowledge
No prior knowledge needed to learn this course
What will you learn
Write Basic Syntax of Python Programming
Create Basic Real World Application
Program in a fluent manner
Get Familiar in Programming Environment

How to Perform Face Detection with Deep Learning in Keras

How to Perform Face Detection with Deep Learning in Keras

Perform Face Detection with Deep Learning in Keras .If you’re a regular user of Google Photos, you may have noticed how the application automatically extracts and groups faces of people from the photos that you back up to the cloud.

A photo application such as Google’s achieves this through the detection of faces of humans (and pets too!) in your photos and by then grouping similar faces together. Detection and then classification of faces in images is a common task in deep learning with neural networks.

In the first step of this tutorial, we’ll use a pre-trained MTCNN model in Keras to detect faces in images. Once we’ve extracted the faces from an image, we’ll compute a similarity score between these faces to find if they belong to the same person.


Before you start with detecting and recognizing faces, you need to set up your development environment. First, you need to “read” images through Python before doing any processing on them. We’ll use the plotting library matplotlib to read and manipulate images. Install the latest version through the installer pip:

pip3 install matplotlib

To use any implementation of a CNN algorithm, you need to install keras. Download and install the latest version using the command below:

pip3 install keras

The algorithm that we’ll use for face detection is MTCNN (Multi-Task Convoluted Neural Networks), based on the paper Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks (Zhang et al., 2016). An implementation of the MTCNN algorithm for TensorFlow in Python3.4 is available as a package. Run the following command to install the package through pip:

pip3 install mtcnn

To compare faces after extracting them from images, we’ll use the VGGFace2 algorithm developed by the Visual Geometry Group at the University of Oxford. A TensorFlow-based Keras implementation of the VGG algorithm is available as a package for you to install:

pip3 install keras_vggface

While you may feel the need to build and train your own model, you’d need a huge training dataset and vast processing power. Since this tutorial focuses on the utility of these models, it uses existing, trained models by experts in the field.

Now that you’ve successfully installed the prerequisites, let’s jump right into the tutorial!

Step 1: Face Detection with the MTCNN Model

The objectives in this step are as follows:

  • retrieve images hosted externally to a local server
  • read images through matplotlib‘s imread() function
  • detect and explore faces through the MTCNN algorithm
  • extract faces from an image.

1.1 Store External Images

You may often be doing an analysis from images hosted on external servers. For this example, we’ll use two images of Lee Iacocca, the father of the Mustang, hosted on the BBC and The Detroit News sites.

To temporarily store the images locally for our analysis, we’ll retrieve each from its URL and write it to a local file. Let’s define a function store_image for this purpose:

import urllib.request

def store_image(url, local_file_name):
  with urllib.request.urlopen(url) as resource:
    with open(local_file_name, 'wb') as f:

You can now simply call the function with the URL and the local file in which you’d like to store the image:


After successfully retrieving the images, let’s detect faces in them.

1.2 Detect Faces in an Image

For this purpose, we’ll make two imports — matplotlib for reading images, and mtcnn for detecting faces within the images:

from matplotlib import pyplot as plt
from mtcnn.mtcnn import MTCNN

Use the imread() function to read an image:

image = plt.imread('iacocca_1.jpg')

Next, initialize an MTCNN() object into the detector variable and use the .detect_faces() method to detect the faces in an image. Let’s see what it returns:

detector = MTCNN()

faces = detector.detect_faces(image)
for face in faces:

For every face, a Python dictionary is returned, which contains three keys. The box key contains the boundary of the face within the image. It has four values: x- and y- coordinates of the top left vertex, width, and height of the rectangle containing the face. The other keys are confidence and keypoints. The keypoints key contains a dictionary containing the features of a face that were detected, along with their coordinates:

{'box': [160, 40, 35, 44], 'confidence': 0.9999798536300659, 'keypoints': {'left_eye': (172, 57), 'right_eye': (188, 57), 'nose': (182, 64), 'mouth_left': (173, 73), 'mouth_right': (187, 73)}}

1.3 Highlight Faces in an Image

Now that we’ve successfully detected a face, let’s draw a rectangle over it to highlight the face within the image to verify if the detection was correct.

To draw a rectangle, import the Rectangle object from matplotlib.patches:

from matplotlib.patches import Rectangle

Let’s define a function highlight_faces to first display the image and then draw rectangles over faces that were detected. First, read the image through imread() and plot it through imshow(). For each face that was detected, draw a rectangle using the Rectangle() class.

Finally, display the image and the rectangles using the .show() method. If you’re using Jupyter notebooks, you may use the %matplotlib inline magic command to show plots inline:

def highlight_faces(image_path, faces):
  # display image
    image = plt.imread(image_path)

    ax = plt.gca()

    # for each face, draw a rectangle based on coordinates
    for face in faces:
        x, y, width, height = face['box']
        face_border = Rectangle((x, y), width, height,
                          fill=False, color='red')

Let’s now display the image and the detected face using the highlight_faces() function:

highlight_faces('iacocca_1.jpg', faces)

![Detected face in an image of Lee Iacocca]( face in an image of Lee Iacocca. Source: [BBC](

Let’s display the second image and the face(s) detected in it:

image = plt.imread('iacocca_2.jpg')
faces = detector.detect_faces(image)

highlight_faces('iacocca_2.jpg', faces)

In these two images, you can see that the MTCNN algorithm correctly detects faces. Let’s now extract this face from the image to perform further analysis on it.

1.4 Extract Face for Further Analysis

At this point, you know the coordinates of the faces from the detector. Extracting the faces is a fairly easy task using list indices. However, the VGGFace2 algorithm that we use needs the faces to be resized to 224 x 224 pixels. We’ll use the PIL library to resize the images.

The function extract_face_from_image() extracts all faces from an image:

from numpy import asarray
from PIL import Image

def extract_face_from_image(image_path, required_size=(224, 224)):
  # load image and detect faces
    image = plt.imread(image_path)
    detector = MTCNN()
    faces = detector.detect_faces(image)

    face_images = []

    for face in faces:
        # extract the bounding box from the requested face
        x1, y1, width, height = face['box']
        x2, y2 = x1 + width, y1 + height

        # extract the face
        face_boundary = image[y1:y2, x1:x2]

        # resize pixels to the model size
        face_image = Image.fromarray(face_boundary)
        face_image = face_image.resize(required_size)
        face_array = asarray(face_image)

    return face_images

extracted_face = extract_face_from_image('iacocca_1.jpg')

# Display the first face from the extracted faces

Here is how the extracted face looks from the first image.

Step 2: Face Recognition with VGGFace2 Model

In this section, let’s first test the model on the two images of Lee Iacocca that we’ve retrieved. Then, we’ll move on to compare faces from images of the starting eleven of the Chelsea football team in 2018 and 2019. You’ll then be able to assess if the algorithm identifies faces of common players between the images.

2.1 Compare Two Faces

In this section, you need to import three modules: VGGFace to prepare the extracted faces to be used in the face recognition models, and the cosine function from SciPy to compute the distance between two faces:

from keras_vggface.utils import preprocess_input
from keras_vggface.vggface import VGGFace
from scipy.spatial.distance import cosine

Let’s define a function that takes the extracted faces as inputs and returns the computed model scores. The model returns a vector, which represents the features of a face:

def get_model_scores(faces):
    samples = asarray(faces, 'float32')

    # prepare the data for the model
    samples = preprocess_input(samples, version=2)

    # create a vggface model object
    model = VGGFace(model='resnet50',
      input_shape=(224, 224, 3),

    # perform prediction
    return model.predict(samples)

faces = [extract_face_from_image(image_path)
         for image_path in ['iacocca_1.jpg', 'iacocca_2.jpg']]

model_scores = get_model_scores(faces)

Since the model scores for each face are vectors, we need to find the similarity between the scores of two faces. We can typically use a Euclidean or Cosine function to calculate the similarity.

Vector representation of faces are suited to the cosine similarity. Here’s a detailed comparison between cosine and Euclidean distances with an example.

The cosine() function computes the cosine distance between two vectors. The lower this number, the better match your faces are. In our case, we’ll put the threshold at 0.4 distance. This threshold is debatable and would vary with your use case. You should set this threshold based on case studies on your dataset:

if cosine(model_scores[0], model_scores[1]) <= 0.4:
  print("Faces Matched")

In this case, the two faces of Lee Iacocca matched.

Faces Matched

2.2 Compare Multiple Faces in Two Images

Let’s put the model to good use in this section of the tutorial. We’ll compare the faces in two images of starting elevens of the Chelsea Football Club in a Europa League match vs Slavia Prague in the 2018–19 season and the UEFA Super Cup match vs Liverpool in the 2019–20 season. While many of the players feature in both match day squads, let’s see if the algorithm is able to detect all common players.

First, let’s retrieve the resources from the URLs, detect the faces in each image and highlight them:


image = plt.imread('chelsea_1.jpg')
faces_staring_xi = detector.detect_faces(image)

highlight_faces('chelsea_1.jpg', faces_staring_xi)


image = plt.imread('chelsea_2.jpg')
faces = detector.detect_faces(image)

highlight_faces('chelsea_2.jpg', faces)

Before we proceed further, here are the starting elevens from both matches:

We have eight players who are common to both starting XIs and who ideally should be matched by the algorithm.

Let’s first compute scores:

slavia_faces = extract_face_from_image('chelsea_1.jpg')
liverpool_faces = extract_face_from_image('chelsea_2.jpg')

model_scores_starting_xi_slavia = get_model_scores(slavia_faces)
model_scores_starting_xi_liverpool = get_model_scores(liverpool_faces)
for idx, face_score_1 in enumerate(model_scores_starting_xi_slavia):
  for idy, face_score_2 in enumerate(model_scores_starting_xi_liverpool):
    score = cosine(face_score_1, face_score_2)
    if score <= 0.4:
      # Printing the IDs of faces and score
      print(idx, idy, score)
      # Displaying each matched pair of faces

Here’s the list of pairs of faces that the algorithm matched. Notice that it has been able to match all eight pairs of faces.

While we were successfully able to match each face in our images, I’d like to take a step back to discuss ramifications of the scores. As discussed earlier, there’s no universal threshold that would match two images together. You may need to re-define these thresholds with new data coming into the analysis. For instance, even Google Photos takes your inputs when it’s unable to programmatically determine the best threshold for a pair.

The best way forward is to carefully assess cases when matching different types of faces. Emotions of the faces and their angles play a role in determining the precision too. In our use case, notice how I have deliberately used photos of starting elevens as players are staring right into the camera! You can try matching the starting eleven faces with those of a trophy celebration and I’m pretty sure the accuracy would drop.


In this tutorial, we first detected faces in images using the MTCNN model and highlighted them in the images to determine if the model worked correctly. Next, we used the VGGFace2 algorithm to extract features from faces in the form of a vector and matched different faces to group them together.