16 REAL DATA SCIENCE AND MACHINE LEARNING INTERVIEW QUESTIONS
Ah the dreaded machine learning interview. You feel like you know everything… until you’re tested on it! But it doesn’t have to be this way...
Over the past few months I’ve interviewed with many companies for entry-level roles involving data science and machine learning. To give you a bit of perspective, I was in the last few months of my master’s in machine learning and computer vision, with most of my previous experience being research/academic, plus eight months at an early-stage startup (unrelated to ML). The roles included work in data science, general machine learning, and specializations in natural language processing or computer vision. I interviewed with big companies like Amazon, Tesla, Samsung, Uber, and Huawei, but also with many startups ranging from early-stage to well established and funded.
Today I’m going to share with you all of the interview questions I was asked and how to approach them. Many of the questions were quite common and expected theory, but many others were quite creative and curious. I’m going to simply list the most common ones, since there are many resources about them online, and go more in depth into some of the less common and trickier ones. I hope that in reading this post you can get great at machine learning interviews and land your dream job!
16 DATA SCIENCE & MACHINE LEARNING INTERVIEW QUESTIONS
**1. WHAT IS DATA NORMALIZATION AND WHY DO WE NEED IT?**
I felt this one would be important to highlight. Data normalization is a very important preprocessing step, used to rescale values to fit in a specific range to ensure better convergence during backpropagation. In general, it boils down to subtracting each feature’s mean from every data point and dividing by that feature’s standard deviation. If we don’t do this, then some of the features (those with high magnitude) will be weighted more in the cost function (if a higher-magnitude feature changes by 1%, that change is pretty big in absolute terms, but for smaller features it’s quite insignificant). Data normalization puts all features on a comparable scale so that they are weighted equally.
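As a rough illustration of the standardization described above, here is a minimal NumPy sketch. The function name `standardize`, the small `eps` guard against division by zero, and the toy income/age data are all assumptions made for this example.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Z-score each column (feature) of X using that column's mean and std."""
    mean = X.mean(axis=0)          # per-feature mean
    std = X.std(axis=0)            # per-feature standard deviation
    return (X - mean) / (std + eps)

# Hypothetical data: column 0 is income in dollars, column 1 is age in years.
X = np.array([[50_000.0, 25.0],
              [60_000.0, 35.0],
              [70_000.0, 45.0]])

X_scaled = standardize(X)
print(X_scaled.mean(axis=0))  # both features now centered near 0
print(X_scaled.std(axis=0))   # both features now have a spread near 1
```

After scaling, the income column no longer dominates the age column simply because its raw numbers are larger.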
2. EXPLAIN DIMENSIONALITY REDUCTION, WHERE IT’S USED, AND ITS BENEFITS?
Dimensionality reduction is the process of reducing the number of feature variables under consideration by obtaining a set of principal variables which are basically the important features. Importance of a feature depends on how much the feature variable contributes to the information representation of the data and depends on which technique you decide to use. Deciding which technique to use comes down to trial-and-error and preference. It’s common to start with a linear technique and move to non-linear techniques when results suggest inadequate fit. Benefits of dimensionality reduction for a data set may be:
**3. HOW DO YOU HANDLE MISSING OR CORRUPTED DATA IN A DATASET?**
You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
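A short sketch of the three pandas methods mentioned above. The toy DataFrame and its column names are made up for illustration, and filling with the median or with "unknown" is just one possible choice of placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Lima", None, "Oslo"],
})

missing_per_column = df.isnull().sum()   # how many values are missing per column
rows_dropped = df.dropna()               # drop every row that has a missing value
cols_dropped = df.dropna(axis=1)         # drop every column that has a missing value

# Replace missing values with a per-column placeholder instead of dropping.
filled = df.fillna({"age": df["age"].median(), "city": "unknown"})

print(missing_per_column)
print(filled)
```

Dropping is simplest but throws data away; here it would remove half of the rows, which is why filling is often preferred on small datasets.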
**4. EXPLAIN THIS CLUSTERING ALGORITHM?**
I wrote a popular article on The 5 Clustering Algorithms Data Scientists Need to Know explaining all of them in detail with some great visualizations.
5. HOW WOULD YOU GO ABOUT DOING AN EXPLORATORY DATA ANALYSIS (EDA)?
The goal of an EDA is to gather some insights from the data before applying your predictive model, i.e. gain some information. Basically, you want to do your EDA in a coarse-to-fine manner.
We start by gaining some high-level global insights. Check for imbalanced classes. Look at the mean and variance of each class. Check out the first few rows to see what it’s all about. Run a pandas df.info() to see which features are continuous or categorical and what their types are (int, float, string). Next, drop unnecessary columns that won’t be useful in analysis and prediction. These can simply be columns that look useless, ones where many rows have the same value (i.e. they don’t give us much information), or ones that are missing a lot of values. We can also fill in missing values with the most common value in that column, or the median.
Now we can start making some basic visualizations. Start with high-level stuff. Do some bar plots for features that are categorical and have a small number of groups. Do bar plots of the final classes. Look at the most “general features”. Create some visualizations of these individual features to try and gain some basic insights.
Now we can start to get more specific. Create visualizations between features, two or three at a time. How are features related to each other? You can also do a PCA to see which features contain the most information. Group some features together as well to see their relationships. For example, what happens to the classes when A = 0 and B = 0? How about A = 1 and B = 0? Compare different features. For example, if feature A can be either “female” or “male”, then we can plot feature A against which cabin they stayed in to see if males and females stay in different cabins. Beyond bar, scatter, and other basic plots, we can do a PDF/CDF, overlaid plots, etc. Look at some statistics like distribution, p-value, etc.
Finally it’s time to build the ML model. Start with easier stuff like Naive Bayes and linear regression. If you see that those perform poorly or the data is highly non-linear, go with polynomial regression, decision trees, or SVMs. The features can be selected based on their importance from the EDA. If you have lots of data you can use a neural network. Evaluate with the ROC curve, precision, and recall.
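The first, coarse steps of that workflow might look like the sketch below. The toy passenger table and its column names (`sex`, `cabin_class`, `fare`, `survived`) are invented for illustration; plotting is left out so the example stays self-contained.

```python
import pandas as pd

# Hypothetical toy dataset; a real EDA would load this from a file.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male", "female"],
    "cabin_class": [1, 3, 3, 2, 3, 1],
    "fare": [80.0, 8.0, None, 30.0, 7.5, 90.0],
    "survived": [1, 0, 0, 1, 0, 1],
})

# Coarse look: column types, missing counts, first rows.
df.info()
print(df.head())

# Are the classes imbalanced? What do mean and variance look like per class?
class_balance = df["survived"].value_counts(normalize=True)
per_class_stats = df.groupby("survived")["fare"].agg(["mean", "var"])

# Fill the missing fare with the column median.
df["fare"] = df["fare"].fillna(df["fare"].median())

# Finer look: how does one categorical feature relate to another?
sex_by_cabin = pd.crosstab(df["sex"], df["cabin_class"])
print(class_balance, per_class_stats, sex_by_cabin, sep="\n\n")
```

The cross-tabulation is the tabular counterpart of the “plot feature A against cabin” idea; the same table can be passed to a bar plot once a plotting library is available.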
6. HOW DO YOU KNOW WHICH MACHINE LEARNING MODEL YOU SHOULD USE?
While one should always keep the “no free lunch theorem” in mind, there are some general guidelines. I wrote an article on how to select the proper regression model here. This cheatsheet is also fantastic!
7. WHY DO WE USE CONVOLUTIONS FOR IMAGES RATHER THAN JUST FC LAYERS?
This one was pretty interesting since it’s not something companies usually ask. As you would expect, I got this question from a company focused on computer vision. This answer has two parts to it. Firstly, convolutions preserve, encode, and actually use the spatial information from the image. If we used only FC layers we would have no relative spatial information. Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.
8. WHAT MAKES CNNS TRANSLATION INVARIANT?
As explained above, each convolution kernel acts as its own filter/feature detector. So let’s say you’re doing object detection: it doesn’t matter where in the image the object is, since we’re going to apply the convolution in a sliding-window fashion across the entire image anyway.
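The sliding-window behaviour can be checked with a small NumPy sketch. The helper names, the 2x2 “object”, and the all-ones kernel are assumptions chosen to keep the example tiny; a real CNN learns its kernels.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def place_object(top, left, size=8):
    """Blank image with a 2x2 bright blob (the 'object') at the given position."""
    img = np.zeros((size, size))
    img[top:top + 2, left:left + 2] = 1.0
    return img

kernel = np.ones((2, 2))  # hypothetical detector tuned to the blob

resp_a = correlate2d_valid(place_object(1, 1), kernel)
resp_b = correlate2d_valid(place_object(4, 5), kernel)

peak_a = np.unravel_index(resp_a.argmax(), resp_a.shape)
peak_b = np.unravel_index(resp_b.argmax(), resp_b.shape)
print(peak_a, resp_a.max())
print(peak_b, resp_b.max())
```

The same kernel fires with the same strength wherever the object sits; only the location of the peak moves with it. Taking a maximum over the response map then gives an output that does not depend on the object’s position.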
9. WHY DO WE HAVE MAX-POOLING IN CLASSIFICATION CNNS?
Again, as you would expect, this is for a role in computer vision. Max-pooling in a CNN allows you to reduce computation since your feature maps are smaller after the pooling. You don’t lose too much semantic information since you’re taking the maximum activation. There’s also a theory that max-pooling contributes a bit to giving CNNs more translation invariance. Check out this great video from Andrew Ng on the benefits of max-pooling.
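To make the size reduction concrete, here is a minimal sketch of 2x2 max-pooling with stride 2 in NumPy. The function name and the example feature map are assumptions; the reshape trick only works when height and width are even.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max-pooling with stride 2 on a 2D map with even height and width."""
    h, w = feature_map.shape
    # Split the map into non-overlapping 2x2 blocks, then take each block's max.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]], dtype=float)

pooled = max_pool_2x2(fmap)
print(pooled)  # 4x4 map reduced to 2x2, keeping the strongest activation per block
```

The pooled map has a quarter of the original entries, so every later layer does roughly a quarter of the work on it.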
10. WHY DO SEGMENTATION CNNS TYPICALLY HAVE AN ENCODER-DECODER STYLE / STRUCTURE?
The encoder CNN can basically be thought of as a feature extraction network, while the decoder uses that information to predict the image segments by “decoding” the features and upscaling to the original image size.
11. WHAT IS THE SIGNIFICANCE OF RESIDUAL NETWORKS?
The main thing that residual connections did was allow for direct feature access from previous layers. This makes information propagation throughout the network much easier. One very interesting paper about this shows how using local skip connections gives the network a type of ensemble multi-path structure, giving features multiple paths to propagate throughout the network.
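A skip connection can be sketched in a few lines of NumPy. This is a simplified block of the form y = x + F(x) with two dense layers standing in for F; the weights, shapes, and function names are assumptions for illustration, not the architecture from any particular paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x), where F is two linear layers with a ReLU in between."""
    fx = relu(x @ W1) @ W2
    return x + fx  # the skip connection adds the input back unchanged

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # batch of 4 feature vectors
W1 = rng.normal(size=(8, 8)) * 0.1
W2 = np.zeros((8, 8))                # with F(x) = 0 the block is the identity

y = residual_block(x, W1, W2)
print(np.allclose(y, x))
```

Even when the learned branch contributes nothing, the input features pass straight through, which is the direct feature access described above.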
12. WHAT IS BATCH NORMALIZATION AND WHY DOES IT WORK?
Training deep neural networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. The idea is then to normalize the inputs of each layer so that they have a mean of zero and a standard deviation of one. This is done for each individual mini-batch at each layer, i.e. compute the mean and variance of that mini-batch alone, then normalize. This is analogous to how the inputs to networks are standardized. How does this help? We know that normalizing the inputs to a network helps it learn. But a network is just a series of layers, where the output of one layer becomes the input to the next. That means we can think of any layer in a neural network as the first layer of a smaller subsequent network. Thought of as a series of neural networks feeding into each other, we normalize the output of one layer before applying the activation function, and then feed it into the following layer (sub-network).
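The per-mini-batch computation can be sketched as follows. This covers only the training-time forward pass; `gamma` and `beta` are the learned scale and shift that batch normalization applies after normalizing, and the function name, `eps` value, and random data are assumptions for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                       # per-feature mean of this batch
    var = x.var(axis=0)                         # per-feature variance of this batch
    x_hat = (x - mean) / np.sqrt(var + eps)     # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Hypothetical pre-activation outputs: 32 examples, 4 units, far from standardized.
pre_activations = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

out = batch_norm_forward(pre_activations, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))
print(out.std(axis=0))
```

At inference time the batch statistics are usually replaced by running averages collected during training, which this sketch leaves out.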
13. HOW WOULD YOU HANDLE AN IMBALANCED DATASET?
I have an article about this! Check out #3 :)
14. WHY WOULD YOU USE MANY SMALL CONVOLUTIONAL KERNELS SUCH AS 3X3 RATHER THAN A FEW LARGE ONES?
This is very well explained in the VGGNet paper. There are two reasons. First, you can use several smaller kernels rather than a few large ones to get the same receptive field and capture more spatial context, but with the smaller kernels you are using fewer parameters and computations. Secondly, because with smaller kernels you will be stacking more layers of filters, you’ll be able to use more activation functions and thus have a more discriminative mapping function being learned by your CNN.
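The parameter argument is easy to verify with a little arithmetic. The sketch below compares three stacked 3x3 layers with a single 7x7 layer; the helper names and the choice of 64 channels are assumptions, and biases are ignored.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in one conv layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels

def receptive_field(num_layers, kernel_size):
    """Receptive field of stacked stride-1 convolutions with equal kernel size."""
    return 1 + num_layers * (kernel_size - 1)

channels = 64  # hypothetical channel count, kept the same in and out

stacked_3x3 = 3 * conv_params(3, channels, channels)
single_7x7 = conv_params(7, channels, channels)

print(receptive_field(3, 3), stacked_3x3)  # receptive field and weights, 3 x (3x3)
print(receptive_field(1, 7), single_7x7)   # receptive field and weights, 1 x (7x7)
```

Both options see a 7x7 window of the input, but the stacked version needs 27 weights per channel pair against 49, and it applies three non-linearities instead of one.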
15. DO YOU HAVE ANY OTHER PROJECTS THAT WOULD BE RELATED HERE?
Here you’ll really draw connections between your research and their business. Is there anything you did or any skills you learned that could possibly connect back to their business or the role you are applying for? It doesn’t have to be 100% exact, just somehow related such that you can show that you will be able to directly add lots of value.
16. EXPLAIN YOUR CURRENT MASTERS RESEARCH? WHAT WORKED? WHAT DIDN’T? FUTURE DIRECTIONS?
Same as the last question!
**ADDITIONAL DATA SCIENCE INTERVIEW QUESTIONS:**
There you have it! All of the interview questions I got when applying for roles in data science and machine learning. I hope you enjoyed this post and learned something new and useful!
This complete Machine Learning full course video covers all the topics that you need to know to become a master in the field of Machine Learning.
Machine Learning Full Course | Learn Machine Learning | Machine Learning Tutorial
It covers all the basics of Machine Learning (01:46), the different types of Machine Learning (18:32), and the various applications of Machine Learning used in different industries (04:54:48). This video will help you learn different Machine Learning algorithms in Python. Linear Regression, Logistic Regression (23:38), K Means Clustering (01:26:20), Decision Tree (02:15:15), and Support Vector Machines (03:48:31) are some of the important algorithms you will understand with a hands-on demo. Finally, you will see the essential skills required to become a Machine Learning Engineer (04:59:46) and come across a few important Machine Learning interview questions (05:09:03). Now, let's get started with Machine Learning.
Below topics are explained in this Machine Learning course for beginners:
Basics of Machine Learning - 01:46
Why Machine Learning - 09:18
What is Machine Learning - 13:25
Types of Machine Learning - 18:32
Supervised Learning - 18:44
Reinforcement Learning - 21:06
Supervised VS Unsupervised - 22:26
Linear Regression - 23:38
Introduction to Machine Learning - 25:08
Application of Linear Regression - 26:40
Understanding Linear Regression - 27:19
Regression Equation - 28:00
Multiple Linear Regression - 35:57
Logistic Regression - 55:45
What is Logistic Regression - 56:04
What is Linear Regression - 59:35
Comparing Linear & Logistic Regression - 01:05:28
What is K-Means Clustering - 01:26:20
How does K-Means Clustering work - 01:38:00
What is Decision Tree - 02:15:15
How does Decision Tree work - 02:25:15
Random Forest Tutorial - 02:39:56
Why Random Forest - 02:41:52
What is Random Forest - 02:43:21
How does Decision Tree work - 02:52:02
K-Nearest Neighbors Algorithm Tutorial - 03:22:02
Why KNN - 03:24:11
What is KNN - 03:24:24
How do we choose 'K' - 03:25:38
When do we use KNN - 03:27:37
Applications of Support Vector Machine - 03:48:31
Why Support Vector Machine - 03:48:55
What is Support Vector Machine - 03:50:34
Advantages of Support Vector Machine - 03:54:54
What is Naive Bayes - 04:13:06
Where is Naive Bayes used - 04:17:45
Top 10 Application of Machine Learning - 04:54:48
How to become a Machine Learning Engineer - 04:59:46
Machine Learning Interview Questions - 05:09:03
This Machine Learning tutorial for beginners will enable you to learn Machine Learning algorithms with Python examples. Become a pro in Machine Learning.
Mastering a Machine Learning course can readily advance one's career, which is why studying a Machine Learning tutorial matters so much for a student.
Taking part in a machine learning course and studying the Machine Learning tutorial can help one carve out a new niche.
Machine Learning (ML) is one of the fastest-growing technologies today. There are many ML frameworks for building a successful app, so as a developer you might be confused about which one to use. Here we have curated the top 5 machine learning frameworks that put cutting-edge technology in your hands.
Thanks to machine learning frameworks, mobile phones and tablets are now powerful enough to run software that learns and reacts in real time. ML is a complex discipline, but implementing ML models is far less daunting and difficult than it used to be. A model now improves its performance automatically over time through interactions and experience and, most importantly, by acquiring useful data about the tasks it is given.
ML is considered a subset of Artificial Intelligence (AI): the scientific study of statistical models and algorithms that help a computing system accomplish designated tasks efficiently. As a mobile app developer, keep the following in mind when choosing a machine learning framework:
The framework should be performance-oriented
It should be quick to learn and to code in
It should support parallelization so that computation can be distributed
It should provide facilities for creating models, along with developer-friendly tools
Let’s learn about the top five machine learning frameworks to make the right choice for your next ML application development project. Before we dive deeper into these mentioned frameworks, know the different types of ML frameworks that are available on the web. Here are some ML frameworks:
Linear algebra tools
Now, let’s have an insight into ML frameworks that will help you in selecting the right framework for your ML application.
Don’t Miss Out on These 5 Machine Learning Frameworks of 2019
#1 TensorFlow
TensorFlow is an open-source software library for dataflow programming across a range of tasks. The framework is based on computational graphs, which are essentially networks of nodes. Each node represents a mathematical operation, from something simple to something as complex as multivariate analysis. This framework is said to be the best among all the ML libraries, as it supports complicated tasks and algorithms such as regression, classification, and neural networks.
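The computational-graph idea can be illustrated with a toy example. This is a minimal pure-Python sketch and not TensorFlow’s API: the `Node` class and the `placeholder`, `add`, and `multiply` helpers are hypothetical names invented here to show how operations become nodes that are evaluated on demand.

```python
class Node:
    """One operation in a toy graph: applies `op` to the values of its inputs."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def evaluate(self, feed):
        # Evaluate upstream nodes first, then apply this node's operation.
        args = [node.evaluate(feed) for node in self.inputs]
        return self.op(feed, *args)

def placeholder(name):
    """Leaf node whose value is looked up in the feed dictionary."""
    return Node(lambda feed: feed[name])

def add(a, b):
    return Node(lambda feed, x, y: x + y, a, b)

def multiply(a, b):
    return Node(lambda feed, x, y: x * y, a, b)

# Build the graph for y = w * x + b; nothing is computed yet.
x = placeholder("x")
w = placeholder("w")
b = placeholder("b")
y = add(multiply(w, x), b)

# Only now are values fed in and the graph evaluated.
result = y.evaluate({"x": 3.0, "w": 2.0, "b": 1.0})
print(result)
```

Separating graph construction from evaluation is what lets a framework analyze the whole computation before running it, for example to differentiate it automatically or place parts of it on different devices.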
Learning TensorFlow through its Python API takes some extra effort. Working with the framework’s n-dimensional arrays becomes much easier once you are comfortable with Python and its libraries.
The main benefit of this framework is flexibility. Migration to newer versions is possible but not automatic. It runs on GPUs, CPUs, servers, desktops, and mobile devices. It provides automatic differentiation and good performance. Major companies such as Airbus, Twitter, and IBM have used TensorFlow in innovative ways.
#2 FireBase ML Kit
The Firebase machine learning framework is a library that provides highly accurate, pre-trained deep models with minimal code and effort. We at Space-O Technologies use this machine learning technology for image classification and object detection. The Firebase framework offers models both locally and on Google Cloud.
This is one of our ML tutorials to help you understand the Firebase framework. First, we collected photos of an empty glass, a half-full glass, and a full glass of water and fed them to the machine learning algorithms. This helped the machine search and analyze according to the nature, behavior, and patterns of the object placed in front of it.
The first photo we targeted was meant to test recognition of an empty glass. The app analyzed it and searched for the correct answer; we had provided it with a set of empty-glass images before the experiment.
The second photo was of a half-full glass. The core of a machine learning app is to gather data and manage it according to its analysis. The app recognized the image accurately because of the sample images of the glass it had been given beforehand.
The last one was the recognition of a full glass.
Note: For correct recognition, there has to be 1 label that carries at least 100 images of a particular object.
#3 CAFFE (Convolutional Architecture for Fast Feature Embedding)
The CAFFE framework is one of the fastest ways to apply deep neural networks. It is best known for its Model Zoo, a collection of pre-trained models capable of performing a wide variety of tasks. Image classification, machine vision, and recommender systems are some of the tasks this ML library handles easily.
This framework is written mainly in C++. It runs on multiple kinds of hardware and can switch between CPU and GPU with a single flag. It also offers well-organized MATLAB and Python interfaces.
In machine learning app development, it is mainly used for academic research projects and startup prototypes. It is well suited to both research experiments and industry deployment. The framework can process 60 million images per day on a single NVIDIA K40 GPU.
#4 Apache Spark
Apache Spark is a cluster-computing framework with APIs in languages such as Java, Scala, R, and Python. Spark’s machine learning library, MLlib, is considered foundational to Spark’s success. Building MLlib on top of Spark makes it possible to tackle distinct needs with a single tool instead of many disjointed ones.
The advantages of such an ML library are a gentler learning curve and less complex development and production environments, which ultimately shorten the time needed to deliver high-performing models. The key benefit of MLlib is that it allows data scientists to solve multiple data problems in addition to their machine learning problems.
Spark can also handle graph computations (via GraphX), streaming (real-time calculations), and real-time interactive query processing with Spark SQL and DataFrames. Data professionals can focus on solving the data problems instead of learning and maintaining a different tool for each scenario.
#5 Scikit-learn
Scikit-learn is said to be one of the greatest feats of the Python community. This machine learning framework handles data mining efficiently and supports many practical tasks. It is built on foundations like SciPy, NumPy, and matplotlib. The framework is known for its supervised and unsupervised learning algorithms as well as cross-validation. Scikit-learn is largely written in Python, with some core algorithms in Cython for performance.
The framework can work on multiple tasks without compromising on speed. Notable organizations using this framework include Spotify, Evernote, AWeber, and Inria.
With machine learning frameworks, building ML-powered iOS and Android apps has become quite an easy process. Thanks to this technology trend, the variety of available data has grown, computational processing has become cheaper and more powerful, and data storage has become affordable. So if you are an app developer or have an idea for a machine learning app, this niche is worth diving into.
Author Bio: This blog is written with the help of Jigar Mistry, who has over 13 years of experience in the web and mobile app development industry. He has guided the development of over 200 mobile apps and has special expertise in different mobile app categories such as Uber-like apps, Health and Fitness apps, On-Demand apps, and Machine Learning apps. So, we took his help to write this complete guide on machine learning technology and machine learning app development.