Niko Smith

1623377815

A/B Testing "Real-Life" Example: A Step by Step Walkthrough | Data Science Interview

This is a comprehensive walkthrough of an A/B testing example. I will go through the details of designing A/B testing experiments, running them, and interpreting the results. The idea is inspired by the book Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. I’m sure it will be extremely helpful for data science interview preparation.

Timestamps

  • 0:00 Intro
  • 1:00 Background
  • 2:02 Prerequisites
  • 5:50 Experiment design
  • 11:08 Result to decision
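
For the "Result to decision" step, one common way to check whether a difference in conversion rates between control and treatment is statistically significant is a two-proportion z-test. The snippet below is a minimal sketch under that assumption and is not necessarily the method used in the video; the function name and the visitor and conversion counts are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Hypothetical counts: control (A) vs. treatment (B)
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A p-value below the significance level chosen during experiment design (0.05 is a common choice) would suggest the difference is unlikely to be due to chance alone; whether to launch still depends on practical significance and cost.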

#testing #data-science #interview-questions

How Data Scientists Build Machine Learning Models in Real Life

The web is already flooded by data science and machine learning related resources nowadays. There are numerous blogs, websites, YouTube videos, and forums that are providing useful information regarding data science related topics. Now it has become tedious to choose the right material for any data science quest.

When I started my journey in data science a few years back, I faced the same dilemma. One thing I observed is that most of these resources are incomplete: you have to go through a number of them to get exhaustive information.

I also saw a lack of real-life perspective in the articles written on machine learning models. So I thought of writing a post on the overall picture of building a machine learning model for any real-life use case.

To perform any data science project, a data scientist needs to go through several steps. Broadly these steps can be presented as:

  1. Formulation of the data science problem from the given business problem
  2. Data source exploration and data collection
  3. Exploration of the variables (EDA)
  4. Model building
  5. Model evaluation
  6. Model deployment

Steps 1 and 2 depend on the context of the problem. Step 6 depends more on the business requirement and the available infrastructure. And steps 2, 3, 4, and 5 are the sole responsibilities of a data scientist.

In this post, I shall discuss how to build a classification model end-to-end. I shall take you through the entire journey of a data scientist in any project that requires building a classification model. I shall try to organize this post in such a way that it can be readily adapted for similar situations.

I used a Random Forest model to describe the methods. Even if you use any other classifier, the execution will be pretty similar.
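
As a rough illustration of the model building and model evaluation steps, here is a minimal sketch using scikit-learn's `RandomForestClassifier`. It is not the author's code: the synthetic dataset from `make_classification` stands in for real business data, and the split ratio and hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real, already-cleaned dataset (assumption)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Hold out 20% of the rows for evaluation, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Model building
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Model evaluation on the held-out rows
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))
```

Swapping in another classifier usually means changing only the line that constructs `model`, which is why the rest of the workflow carries over.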

#data-science #real-life-experiences #machine-learning #predictive-modeling #classification

5 Things I Wish I Knew About Real-Life AI

It was at an incredible startup that I kick-started my industrial career in Data Science 6 years ago. That was my first industry job after 4 years in academia and research. When I decided to switch career paths, I simply thought that it would only be a domain shift. In academia, my focus was on biomedical data science, and at the startup my focus would be on using data science for building automation. So I thought that the shift from academia to industry would only change the domain; **after all, both are data science**, right? Boy, was I naive!

Looking back, it is true that the technical skillset in academia and industry is very similar, but the mindset is vastly different. Yes, both academia and industry careers are data science focused, but the goals are drastically different. The goal of academia is to create a state-of-the-art solution to address novel problems, with the ultimate hope of getting a paper published, but the goal of industry is to **generate revenue** and have happy customers, with the ultimate hope of having functioning software.

Data Science is like a knife that could be used to cut cake and steak. If you’ve been cutting steak your whole life, you will probably end up breaking the plate the first time you cut a cake. And if you have been cutting cake your whole life, cutting your first piece of steak will be quite a frustrating task. So, if you are in this transitioning state, the following five points might prepare you for what to expect and save you a significant amount of frustration.

1. It is all about the delivered value, not the method

Let me break the news: no one cares how deep your neural network is or whether you are using the latest transformer, and no one cares whether you are using a decision tree or a gradient boosting machine. Customers care about the accuracy, stability and usability of the AI system you create, regardless of the method you use for your machine learning model.

#data-science #product #big-data #artificial-intelligence #machine-learning #data analysis

George Koelpin

1603511400

Quora Question Pairs Similarity: Tackling a Real-Life NLP Problem

In this article, I will be walking you through the process of solving a real-life NLP problem. The article will be divided into parts, which are as follows:

  1. Problem background and data source.
  2. Exploring the data set and its format.
  3. Data cleaning and preprocessing.
  4. Featurization.
  5. Visualizing features and removing redundant ones.
  6. Vectorizing textual features.
  7. Applying machine learning for classification.

Before we go into the details, I want to address a few things. I will try my best to keep my explanations simple and short as my goal here is not to be confusing (duh.) but to make anyone who reads this article, be they seasoned data scientists or someone who has limited touch with the subject, understand what is being done, and why.

That being said, it will help you to understand the code (which is commented quite a bit, so it should be easy to understand) if you are familiar with Python syntax.

You can find the Python code here (also linked at the end of the article). Everything I discuss here can be found in the code. You can also view the files on my GitHub. Alrighty then, let’s jump right into it.

Part 1: Problem background and data source

The data is from a Kaggle competition held in 2017. The competition was organized by Quora and had a first place prize of $12,500. Quora is a question-and-answer website which receives millions of questions, not all of which are new and unique. Quite a few of them have already been asked on Quora and have rich answers.

If duplicates were allowed, they would degrade the quality of answers, negatively affecting the experience of the person asking the question, the person answering it, and the person searching the web for an answer (imagine searching Google for a question and finding 3 results from Quora instead of 1). This problem, however, is not unique to Quora, and many organizations have similar issues (for example, Stack Overflow).

Ideally, what would happen is that once a question is asked, Quora would use some “technique” to find a subset of its existing question database such that this subset contains questions which are “similar” to or about the same topic as the new question being asked. Once this subset has been identified, Quora would employ a machine learning technique to then determine if a duplicate question exists in this selected subset. If yes, it would notify the questioner and point them to it; otherwise, it would create the question.

Our job in this project is to develop the machine learning part. We are not concerned with how Quora generates the subset; assume that it is there and works properly.

The process can be summarized as follows:

Step 1: Receive a request for posting a new question by a user.

Step 2: Analyze the question on a high level and generate a subset of similar questions that already exist in the database.

Step 3: Pair the new question with all the questions in the subset and apply machine learning to determine if any of the pairs is a duplicate.

Step 4: Depending on the results of step 3, take appropriate action.

We are only concerned with step 3 in this project.
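
To make step 3 concrete, here is a minimal sketch of pairing a new question with each candidate from the subset and scoring every pair. The Jaccard token-overlap score is only a placeholder for the trained classifier the article goes on to build, and the function names, the threshold, and the example questions are all hypothetical.

```python
def tokenize(question):
    """Lowercase a question and split it into a set of word tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in question.lower())
    return set(cleaned.split())


def jaccard_score(q1, q2):
    """Placeholder scorer: shared tokens divided by all distinct tokens."""
    a, b = tokenize(q1), tokenize(q2)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(new_question, candidates, threshold=0.6):
    """Pair the new question with each candidate and keep likely duplicates."""
    pairs = [(new_question, candidate) for candidate in candidates]
    scored = [(candidate, jaccard_score(q, candidate)) for q, candidate in pairs]
    return [(candidate, score) for candidate, score in scored if score >= threshold]


# Hypothetical subset returned by step 2
candidates = [
    "How do I learn Python quickly?",
    "What is the capital of France?",
]
matches = find_duplicates("How can I learn Python quickly?", candidates)
print(matches)
```

In a real pipeline, `jaccard_score` would be replaced by the probability output of a classifier trained on labeled question pairs, and step 4 would act on whether `matches` is empty.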

#data-science #python-data-science #machine-learning #nlp #classification

3 Ways to Get Real-Life Data Science Experience Before Your First Job

Introduction

Getting my first data science job was hard.

It’s especially hard to break into data science when companies typically require a Master’s degree and a minimum of 2–3 years of experience. That being said, there are a number of great resources that I came across that I want to share with you.

In this article, I’m going to give you three ways to get practical data science experience on your own. By completing these projects, you’ll develop a strong understanding of SQL, Pandas, and machine learning modeling.

  1. First, I’m going to provide you with real-life SQL case studies in which you’re given a business problem and are required to query databases to diagnose the problem and formulate a solution.
  2. Second, I’m going to provide you with dozens of practice problems for Pandas, a library in Python meant for data manipulation and analysis. This will help you develop the skills that are required for data wrangling and data cleaning.
  3. Lastly, I’m going to provide you with a variety of machine learning problems where you can develop a machine learning model to make predictions. By doing so, you’ll learn how to approach a machine learning problem, as well as the fundamental steps required to develop a machine learning model from start to finish.

With that said, let’s dive into it!


1. SQL Case Studies

If you want to be a data scientist, you have to have strong SQL skills. Mode provides three practical SQL case studies that simulate real-life business problems, as well as an online SQL editor where you can write and run queries.

_To open Mode’s SQL editor, go to this link and click on the hyperlink where it says ‘Open another window to Mode’._

Learning SQL

If you’re new to SQL, I would first start with Mode’s SQL tutorials where you can learn basic, intermediate, and advanced SQL techniques. Feel free to skip this if you already have a good understanding of SQL.

Case Study 1: Investigating a Drop in User Engagement

Link to the case.

The objective of this case is to determine the cause for a drop in user engagement for Yammer’s project. Before diving into the data, you should read the overview of what Yammer does here. There are 4 tables that you should work with.

The link to the case will provide you with much more detail pertaining to the problem, the data, and the questions that should be answered.
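
A typical first step in an investigation like this is to count distinct active users per week and look for where the drop begins. The sketch below runs such a query through Python's built-in `sqlite3` module. The `events` table, its columns, and the rows are invented for illustration and do not reflect the actual schema of the four tables in the case; Mode's own SQL dialect may also differ from SQLite.

```python
import sqlite3

# In-memory database with a hypothetical events table (not the case's schema)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, event_type TEXT, occurred_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "engagement", "2014-07-28"),
        (2, "engagement", "2014-07-29"),
        (3, "engagement", "2014-07-30"),
        (1, "engagement", "2014-08-04"),
        (2, "signup_flow", "2014-08-05"),
    ],
)

# Weekly active users: distinct users with at least one engagement event.
# strftime('%W') buckets dates into Monday-based weeks of the year.
query = """
    SELECT strftime('%Y-%W', occurred_at) AS week,
           COUNT(DISTINCT user_id)        AS active_users
    FROM events
    WHERE event_type = 'engagement'
    GROUP BY week
    ORDER BY week
"""
weekly_active = conn.execute(query).fetchall()
for week, active_users in weekly_active:
    print(week, active_users)
```

Once the week of the drop is located, the same pattern can be repeated with extra grouping columns (for example device or region, if the data has them) to narrow down the cause.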

Check out how I approached this case study here if you’d like guidance.

#technology #work #machine-learning #data-science #artificial-intelligence