16 REAL DATA SCIENCE AND MACHINE LEARNING INTERVIEW QUESTIONS- Ah the dreaded machine learning interview. You feel like you know everythingâ€¦ until youâ€™re tested on it! But it doesnâ€™t have to be this way...đź‘Źđź‘Źđź‘Źđź‘Źđź‘Ź
Ah the dreaded machine learning interview. You feel like you know everythingâ€¦ until youâ€™re tested on it! But it doesnâ€™t have to be this way...
Over the past few months Iâ€™ve interviewed with many companies for entry-level roles involving data science and machine learning. To give you a bit of perspective, I was in graduate school in the last few months of my masters in machine learning and computer vision with most of my previous experience being research/academic, but with eight months at an early stage startup (unrelated to ML). The roles included work in data science, general machine learning, and specializations in natural language processing or computer vision. I interviewed with big companies like Amazon, Tesla, Samsung, Uber, Huawei, but also with many startups ranging from early-stage to well established and funded.
Today Iâ€™m going to share with you all of the interview questions I was asked and how to approach them. Many of the questions were quite common and expected theory, but many others were quite creative and curious. Iâ€™m going to simply list the most common ones since thereâ€™s many resources about them online and go more in depth into some of the less common and trickier ones. I hope in reading this post that you can get great at machine learning interviews and land your dream job!
Many of the questions were quite common and expected theory, but many others were quite creative and curious.
16 DATA SCIENCE & MACHINE LEARNING INTERVIEW QUESTIONS**1. WHAT IS DATA NORMALIZATION AND WHY DO WE NEED IT? **
I felt this one would be important to highlight. Data normalization is a very important preprocessing step, used to rescale values to fit in a specific range to assure better convergence during backpropagation. In general, it boils down to subtracting the mean of each data point and dividing by its standard deviation. If we donâ€™t do this then some of the features (those with high magnitude) will be weighted more in the cost function (if a higher-magnitude feature changes by 1%, then that change is pretty big, but for smaller features itâ€™s quite insignificant). The data normalization makes all features weighted equally.
2. EXPLAIN DIMENSIONALITY REDUCTION, WHERE ITâ€™S USED, AND ITS BENEFITS?
Dimensionality reduction is the process of reducing the number of feature variables under consideration by obtaining a set of principal variables which are basically the important features. Importance of a feature depends on how much the feature variable contributes to the information representation of the data and depends on which technique you decide to use. Deciding which technique to use comes down to trial-and-error and preference. Itâ€™s common to start with a linear technique and move to non-linear techniques when results suggest inadequate fit. Benefits of dimensionality reduction for a data set may be:
**3. HOW DO YOU HANDLE MISSING OR CORRUPTED DATA IN A DATASET? **
You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
**4. EXPLAIN THIS CLUSTERING ALGORITHM? **
I wrote a popular article on the The 5 Clustering Algorithms Data Scientists Need to Know explaining all of them in detail with some great visualizations.
5. HOW WOULD YOU GO ABOUT DOING AN EXPLORATORY DATA ANALYSIS (EDA)?
The goal of an EDA is to gather some insights from the data before applying your predictive model i.e gain some information. Basically, you want to do your EDA in a coarse to fine manner. We start by gaining some high-level global insights. Check out some imbalanced classes. Look at mean and variance of each class. Check out the first few rows to see what itâ€™s all about. Run a pandas df.info()
to see which features are continuous, categorical, their type (int, float, string). Next, drop unnecessary columns that wonâ€™t be useful in analysis and prediction. These can simply be columns that look useless, oneâ€™s where many rows have the same value (i.e it doesnâ€™t give us much information), or itâ€™s missing a lot of values. We can also fill in missing values with the most common value in that column, or the median. Now we can start making some basic visualizations. Start with high-level stuff. Do some bar plots for features that are categorical and have a small number of groups. Bar plots of the final classes. Look at the most â€śgeneral featuresâ€ť. Create some visualizations about these individual features to try and gain some basic insights. Now we can start to get more specific. Create visualizations between features, two or three at a time. How are features related to each other? You can also do a PCA to see which features contain the most information. Group some features together as well to see their relationships. For example, what happens to the classes when A = 0 and B = 0? How about A = 1 and B = 0? Compare different features. For example, if feature A can be either â€śfemaleâ€ť or â€śmaleâ€ť then we can plot feature A against which cabin they stayed in to see if males and females stay in different cabins. Beyond bar, scatter, and other basic plots, we can do a PDF/CDF, overlaid plots, etc. Look at some statistics like distribution, p-value, etc. Finally itâ€™s time to build the ML model. Start with easier stuff like Naive Bayes and linear regression. If you see that those suck or the data is highly non-linear, go with polynomial regression, decision trees, or SVMs. The features can be selected based on their importance from the EDA. If you have lots of data you can use a neural network. Check ROC curve. Precision, Recall.
6. HOW DO YOU KNOW WHICH MACHINE LEARNING MODEL YOU SHOULD USE?
While one should always keep the â€śno free lunch theoremâ€ť in mind, there are some general guidelines. I wrote an article on how to select the proper regression model here. This cheatsheet is also fantastic!
7. WHY DO WE USE CONVOLUTIONS FOR IMAGES RATHER THAN JUST FC LAYERS?
This one was pretty interesting since itâ€™s not something companies usually ask. As you would expect, I got this question from a company focused on computer vision. This answer has two parts to it. Firstly, convolutions preserve, encode, and actually use the spatial information from the image. If we used only FC layers we would have no relative spatial information. Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation in-variance, since each convolution kernel acts as itâ€™s own filter/feature detector.
8. WHAT MAKES CNNS TRANSLATION INVARIANT?
As explained above, each convolution kernel acts as its own filter/feature detector. So letâ€™s say youâ€™re doing object detection, it doesnâ€™t matter where in the image the object is since weâ€™re going to apply the convolution in a sliding window fashion across the entire image anyways.
9. WHY DO WE HAVE MAX-POOLING IN CLASSIFICATION CNNS?
Again as you would expect this is for a role in computer vision. Max-pooling in a CNN allows you to reduce computation since your feature maps are smaller after the pooling. You donâ€™t lose too much semantic information since youâ€™re taking the maximum activation. Thereâ€™s also a theory that max-pooling contributes a bit to giving CNNs more translation in-variance. Check out this great video from Andrew Ng on the benefits of max-pooling.
10. WHY DO SEGMENTATION CNNS TYPICALLY HAVE AN ENCODER-DECODER STYLE / STRUCTURE?
The encoder CNN can basically be thought of as a feature extraction network, while the decoder uses that information to predict the image segments by â€śdecodingâ€ť the features and upscaling to the original image size.
11. WHAT IS THE SIGNIFICANCE OF RESIDUAL NETWORKS?
The main thing that residual connections did was allow for direct feature access from previous layers. This makes information propagation throughout the network much easier. One very interesting paper about this shows how using local skip connections gives the network a type of ensemble multi-path structure, giving features multiple paths to propagate throughout the network.
12. WHAT IS BATCH NORMALIZATION AND WHY DOES IT WORK?
Training Deep Neural Networks is complicated by the fact that the distribution of each layerâ€™s inputs changes during training, as the parameters of the previous layers change. The idea is then to normalize the inputs of each layer in such a way that they have a mean output activation of zero and standard deviation of one. This is done for each individual mini-batch at each layer i.e compute the mean and variance of that mini-batch alone, then normalize. This is analogous to how the inputs to networks are standardized. How does this help? We know that normalizing the inputs to a network helps it learn. But a network is just a series of layers, where the output of one layer becomes the input to the next. That means we can think of any layer in a neural network as the first layer of a smaller subsequent network. Thought of as a series of neural networks feeding into each other, we normalize the output of one layer before applying the activation function, and then feed it into the following layer (sub-network).
13. HOW WOULD YOU HANDLE AN IMBALANCED DATASET?
I have an article about this! Check out #3 :)
14. WHY WOULD YOU USE MANY SMALL CONVOLUTIONAL KERNELS SUCH AS 3X3 RATHER THAN A FEW LARGE ONES?
This is very well explained in the VGGNet paper. There are two reasons: First, you can use several smaller kernels rather than few large ones to get the same receptive field and capture more spatial context, but with the smaller kernels you are using less parameters and computations. Secondly, because with smaller kernels you will be using more filters, youâ€™ll be able to use more activation functions and thus have a more discriminative mapping function being learned by your CNN.
15. DO YOU HAVE ANY OTHER PROJECTS THAT WOULD BE RELATED HERE?
Here youâ€™ll really draw connections between your research and their business. Is there anything you did or any skills you learned that could possibly connect back to their business or the role you are applying for? It doesnâ€™t have to be 100% exact, just somehow related such that you can show that you will be able to directly add lots of value.
16. EXPLAIN YOUR CURRENT MASTERS RESEARCH? WHAT WORKED? WHAT DIDNâ€™T? FUTURE DIRECTIONS?
Same as the last question!
**ADDITIONAL DATA SCIENCE INTERVIEW QUESTIONS: **
There you have it! All of the interview questions I got when apply for roles in data science and machine learning. I hope you enjoyed this post and learned something new and useful!
This complete Machine Learning full course video covers all the topics that you need to know to become a master in the field of Machine Learning.
Machine Learning Full Course | Learn Machine Learning | Machine Learning Tutorial
It covers all the basics of Machine Learning (01:46), the different types of Machine Learning (18:32), and the various applications of Machine Learning used in different industries (04:54:48).This video will help you learn different Machine Learning algorithms in Python. Linear Regression, Logistic Regression (23:38), K Means Clustering (01:26:20), Decision Tree (02:15:15), and Support Vector Machines (03:48:31) are some of the important algorithms you will understand with a hands-on demo. Finally, you will see the essential skills required to become a Machine Learning Engineer (04:59:46) and come across a few important Machine Learning interview questions (05:09:03). Now, let's get started with Machine Learning.
Below topics are explained in this Machine Learning course for beginners:
Basics of Machine Learning - 01:46
Why Machine Learning - 09:18
What is Machine Learning - 13:25
Types of Machine Learning - 18:32
Supervised Learning - 18:44
Reinforcement Learning - 21:06
Supervised VS Unsupervised - 22:26
Linear Regression - 23:38
Introduction to Machine Learning - 25:08
Application of Linear Regression - 26:40
Understanding Linear Regression - 27:19
Regression Equation - 28:00
Multiple Linear Regression - 35:57
Logistic Regression - 55:45
What is Logistic Regression - 56:04
What is Linear Regression - 59:35
Comparing Linear & Logistic Regression - 01:05:28
What is K-Means Clustering - 01:26:20
How does K-Means Clustering work - 01:38:00
What is Decision Tree - 02:15:15
How does Decision Tree work - 02:25:15Â
Random Forest Tutorial - 02:39:56
Why Random Forest - 02:41:52
What is Random Forest - 02:43:21
How does Decision Tree work- 02:52:02
K-Nearest Neighbors Algorithm Tutorial - 03:22:02
Why KNN - 03:24:11
What is KNN - 03:24:24
How do we choose 'K' - 03:25:38
When do we use KNN - 03:27:37
Applications of Support Vector Machine - 03:48:31
Why Support Vector Machine - 03:48:55
What Support Vector Machine - 03:50:34
Advantages of Support Vector Machine - 03:54:54
What is Naive Bayes - 04:13:06
Where is Naive Bayes used - 04:17:45
Top 10 Application of Machine Learning - 04:54:48
How to become a Machine Learning EngineerÂ - 04:59:46
Machine Learning Interview Questions - 05:09:03
Machine learning problems can generally be divided into three types. Classification and regression, which are known as supervised learning, and unsupervised learning which in the context of machine learning applications often refers to clustering.
Machine learning problems can generally be divided into three types. Classification and regression, which are known as supervised learning, and unsupervised learning which in the context of machine learning applications often refers to clustering.
In the following article, I am going to give a brief introduction to each of these three problems and will include a walkthrough in the popular python library scikit-learn.
Before I start Iâ€™ll give a brief explanation for the meaning behind the terms supervised and unsupervised learning.
Supervised Learning: In supervised learning, you have a known set of inputs (features) and a known set of outputs (labels). Traditionally these are known as X and y. The goal of the algorithm is to learn the mapping function that maps the input to the output. So that when given new examples of X the machine can correctly predict the corresponding y labels.
Unsupervised Learning: In unsupervised learning, you only have a set of inputs (X) and no corresponding labels (y). The goal of the algorithm is to find previously unknown patterns in the data. Quite often these algorithms are used to find meaningful clusters of similar samples of X so in effect finding the categories intrinsic to the data.
In classification, the outputs (y) are categories. These can be binary, for example, if we were classifying spam email vs not spam email. They can also be multiple categories such as classifying species of flowers, this is known as multiclass classification.
Letâ€™s walk through a simple example of classification using scikit-learn. If you donâ€™t already have this installed it can be installed either via pip or conda as outlined here.
Scikit-learn has a number of datasets that can be directly accessed via the library. For ease in this article, I will be using these example datasets throughout. To illustrate classification I will use the wine dataset which is a multiclass classification problem. In the dataset, the inputs (X) consist of 13 features relating to various properties of each wine type. The known outputs (y) are wine types which in the dataset have been given a number 0, 1 or 2.
The imports I am using for all the code in this article are shown below.
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn import linear_model
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from yellowbrick.cluster import SilhouetteVisualizer
In the below code I am downloading the data and converting to a pandas data frame.
wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df['TARGET'] = pd.Series(wine.target)
The next stage in a supervised learning problem is to split the data into test and train sets. The train set can be used by the algorithm to learn the mapping between inputs and outputs, and then the reserved test set can be used to evaluate how well the model has learned this mapping. In the below code I am using the scikit-learn model_selection function train_test_split
to do this.
X_w = wine_df.drop(['TARGET'], axis=1)
y_w = wine_df['TARGET']
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_w, y_w, test_size=0.2)
In the next step, we need to choose the algorithm that will be best suited to learn the mapping in your chosen dataset. In scikit-learn there are many different algorithms to choose from, all of which use different functions and methods to learn the mapping, you can view the full list here.
To determine the best model I am running the following code. I am training the model using a selection of algorithms and obtaining the F1-score for each one. The F1 score is a good indicator of the overall accuracy of a classifier. I have written a detailed description of the various metrics that can be used to evaluate a classifier here.
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="rbf", C=0.025, probability=True),
NuSVC(probability=True),
DecisionTreeClassifier(),
RandomForestClassifier(),
AdaBoostClassifier(),
GradientBoostingClassifier()
]
for classifier in classifiers:
model = classifier
model.fit(X_train_w, y_train_w)
y_pred_w = model.predict(X_test_w)
print(classifier)
print("model score: %.3f" % f1_score(y_test_w, y_pred_w, average='weighted'))
A perfect F1 score would be 1.0, therefore, the closer the number is to 1.0 the better the model performance. The results above suggest that the Random Forest Classifier is the best model for this dataset.
In regression, the outputs (y) are continuous values rather than categories. An example of regression would be predicting how many sales a store may make next month, or what the future price of your house might be.
Again to illustrate regression I will use a dataset from scikit-learn known as the boston housing dataset. This consists of 13 features (X) which are various properties of a house such as the number of rooms, the age and crime rate for the location. The output (y) is the price of the house.
I am loading the data using the code below and splitting it into test and train sets using the same method I used for the wine dataset.
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['TARGET'] = pd.Series(boston.target)
X_b = boston_df.drop(['TARGET'], axis=1)
y_b = boston_df['TARGET']
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b, y_b, test_size=0.2)
We can use this cheat sheet to see the available algorithms suited to regression problems in scikit-learn. We will use similar code to the classification problem to loop through a selection and print out the scores for each.
There are a number of different metrics used to evaluate regression models. These are all essentially error metrics and measure the difference between the actual and predicted values achieved by the model. I have used the root mean squared error (RMSE). For this metric, the closer to zero the value is the better the performance of the model. This article gives a really good explanation of error metrics for regression problems.
regressors = [
linear_model.Lasso(alpha=0.1),
linear_model.LinearRegression(),
ElasticNetCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
l1_ratio=0.5, max_iter=1000, n_alphas=100, n_jobs=None,
normalize=False, positive=False, precompute='auto', random_state=0,
selection='cyclic', tol=0.0001, verbose=0),
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
tol=0.001, verbose=False),
linear_model.Ridge(alpha=.5)
]
for regressor in regressors:
model = regressor
model.fit(X_train_b, y_train_b)
y_pred_b = model.predict(X_test_b)
print(regressor)
print("mean squared error: %.3f" % sqrt(mean_squared_error(y_test_b, y_pred_b)))
The RMSE score suggests that either the linear regression and ridge regression algorithms perform best for this dataset.
There are a number of different types of unsupervised learning but for simplicity here I am going to focus on the clustering methods. There are many different algorithms for clustering all of which use slightly different techniques to find clusters of inputs.
Probably one of the most widely used methods is Kmeans. This algorithm performs an iterative process whereby a specified number of randomly generated means are initiated. A distance metric, Euclidean distance is calculated for each data point from the centroids, thus creating clusters of similar values. The centroid of each cluster then becomes the new mean and this process is repeated until the optimum result has been achieved.
Letâ€™s use the wine dataset we used in the classification task, with the y labels removed, and see how well the k-means algorithm can identify the wine types from the inputs.
As we are only using the inputs for this model I am splitting the data into test and train using a slightly different method.
np.random.seed(0)
msk = np.random.rand(len(X_w)) < 0.8
train_w = X_w[msk]
test_w = X_w[~msk]
As Kmeans is reliant on the distance metric to determine the clusters it is usually necessary to perform feature scaling (ensuring that all features have the same scale) before training the model. In the below code I am using the MinMaxScaler to scale the features so that all values fall between 0 and 1.
x = train_w.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
X_scaled = pd.DataFrame(x_scaled,columns=train_w.columns)
With K-means you have to specify the number of clusters the algorithm should use. So one of the first steps is to identify the optimum number of clusters. This is achieved by iterating through a number of values of k and plotting the results on a chart. This is known as the Elbow method as it typically produces a plot with a curve that looks a little like the curve of your elbow. The yellowbrick library (which is a great library for visualising scikit-learn models and can be pip installed) has a really nice plot for this. The code below produces this visualisation.
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,8))
visualizer.fit(X_scaled)
visualizer.show()
Ordinarily, we wouldnâ€™t already know how many categories we have in a dataset where we are using a clustering technique. However, in this case, we know that there are three wine types in the data â€” the curve has correctly selected three as the optimum number of clusters to use in the model.
The next step is to initialise the K-means algorithm and fit the model to the training data and evaluate how effectively the algorithm has clustered the data.
One method used for this is known as the silhouette score. This measures the consistency of values within the clusters. Or in other words how similar to each other the values in each cluster are, and how much separation there is between the clusters. The silhouette score is calculated for each value and will range from -1 to +1. These values are then plotted to form a silhouette plot. Again yellowbrick provides a simple way to construct this type of plot. The code below creates this visualisation for the wine dataset.
model = KMeans(3, random_state=42)
visualizer = SilhouetteVisualizer(model, colors='yellowbrick')
visualizer.fit(X_scaled)
visualizer.show()
A silhouette plot can be interpreted in the following way:
The plot for the wine data set above shows that cluster 0 may not be as consistent as the others due to most data points being below the average score and a few data points having a score below 0.
Silhouette scores can be particularly useful in comparing one algorithm against another or different values of k.
In this post, I wanted to give a brief introduction to each of the three types of machine learning. There are many other steps involved in all of these processes including feature engineering, data processing and hyperparameter optimisation to determine both the best data preprocessing techniques and the best models to use.
Thanks for reading!