Another Machine Learning Walk-Through and a Challenge

Another Machine Learning Walk-Through and a Challenge

In this article, I’ll present another machine learning walk-through in Python and also leave you with a challenge: try to develop a better solution (some helpful tips are included)!

In this article, I’ll present another machine learning walk-through in Python and also leave you with a challenge: try to develop a better solution (some helpful tips are included)!

Don’t just read about machine learning — practice it!

After spending considerable time and money on courses, books, and videos, I’ve arrived at one conclusion: the most effective way to learn data science is by doing data science projects. Reading, listening, and taking notes is valuable, but it’s not until you work through a problem that concepts solidify from abstractions into tools you feel confident using.

In this article, I’ll present another machine learning walk-through in Python and also leave you with a challenge: try to develop a better solution (some helpful tips are included)! The complete Jupyter Notebook for this project can be run on Kaggle — no download required — or accessed on GitHub.

Problem Statement

The New York City Taxi Fare prediction challenge, currently running on Kaggle, is a *supervised regression *machine learning task. Given pickup and dropoff locations, the pickup timestamp, and the passenger count, the objective is to predict the fare of the taxi ride. Like most Kaggle competitions, this problem isn’t 100% reflective of those in industry, but it does present a realistic dataset and task on which we can hone our machine learning skills.

To solve this problem, we’ll follow a standard data science pipeline plan of attack:

  1. Understand the problem and data
  2. Data exploration / data cleaning
  3. Feature engineering / feature selection
  4. Model evaluation and selection
  5. Model optimization
  6. Interpretation of results and predictions

This outline may seem to present a linear path from start to finish, but data science is a highly non-linear process where steps are repeated or completed out of order. As we gain familiarity with the data, we often want to go back and revisit our past decisions or take a new approach.

While the final Jupyter Notebook may present a cohesive story, the development process is very messy, involving rewriting code and changing earlier decisions.
Throughout the article, I’ll point out a number of areas in which I think an enterprising data scientist — you — could improve on my solution. I have labeled these *potential improvements *because as a largely empirical field, there are no guarantees in machine learning.

New York City Taxi Fare Prediction Challenge (Link)

Getting Started

The Taxi Fare dataset is relatively large at 55 million training rows, but simple to understand, with only 6 features. The fare_amount is the target, the continuous value we’ll train a model to predict:

Training Data

Throughout the notebook, I used only a sample of 5,000,000 rows to make the calculations quicker. My first recommendation thus is:

  • Potential improvement 1: use more data for training the model

It’s not assured that larger quantities of data will help, but empirical studies have found that generally, as the amount of data used for training a model increases, performance increases. A model trained on more data can better learn the actual signals, especially in a high-dimensional problem with a large number of features (this is not a high-dimensional dataset so there could be limited returns to using more data).

While the sheer size of the dataset may be intimidating, frameworks such as Dask allow you to handle even massive datasets on a personal laptop. Moreover, learning how to set-up and use cloud computing, such as Amazon ECS, is a vital skill once the data exceeds the capability of your machine.

Fortunately, it doesn’t take much research to understand this data: most of us have taken taxi rides before and we know that taxis charge based on miles driven. Therefore for feature engineering, we’ll want to find a way to represent the distance traveled based on the information we are given. We can also read notebooks from other data scientists or read through the competition discussion for ideas about how to solve the problem.

Data Exploration and Cleaning

Although Kaggle data is usually cleaner than real-world data, this dataset still has a few problems, namely anomalies in several of the features. I like to carry out data cleaning as part of the exploration process, correcting anomalies or data errors as I find them. For this problem, we can spot outliers by looking at statistics of the data using df.describe().

Statistical description of training data

The anomalies in passenger_count, the coordinates, and the fare_amount were addressed through a combination of domain knowledge and looking at the distribution of the data. For example, reading about taxi fares in NYC, we see that the minimum fare amount is $2.50 which means we should exclude some of the rides based on the fare. For the coordinates, we can look at the distribution and exclude values that fall well outside the norm. Once we’ve identified outliers, we can remove them using code like the following:

# Remove latitude and longtiude outliers
data = data.loc[data['pickup_latitude'].between(40, 42)]
data = data.loc[data['pickup_longitude'].between(-75, -72)]
data = data.loc[data['dropoff_latitude'].between(40, 42)]
data = data.loc[data['dropoff_longitude'].between(-75, -72)]

Once we clean the data, we can get to the fun part: visualization. Below is a plot of the pickup and dropoff locations on top of NYC colored by the binned fare (binning is a way of turning a continuous variable into a discrete one).

Pickups and Dropoffs plotted on New York City

We also want to take a look at the target variable. Following is an Empirical Cumulative Distribution Function (ECDF) plot of the Target variable, the fare amount. The ECDF can be a better visualization choice than a histogram for one variable because it doesn’t have artifacts from binning.

ECDF of Target

Besides being interesting to look at, plots can help us identify anomalies, relationships, or ideas for new features. In the maps, the color represents the fare, and we can see that the fares starting or ending at the airport (bottom right) tend to be among the most expensive. Going back to the domain knowledge, we read the standard fare for rides to JFK airport is $45, so if we could find a way to identify airport rides, then we’d know the fare accurately.

While I didn’t go that far in this notebook, using domain knowledge for data cleaning and feature engineering is extremely valuable. My second recommendation for improvement is:

  • Potential improvement 1: use more data for training the model

This can be done with domain knowledge (such as a map), or statistical methods (such as z-scores). One interesting approach to this problem is in this notebook, where the author removed rides that began or ended in the water.

The inclusion / exclusion of outliers can have a significant effect on model performance. However, like most problems in machine learning, there’s no standard approach (here’s one statistical method in Python you could try).

Feature Engineering

Feature engineering is the process of creating new features — predictor variables — out of an existing dataset. Because a machine learning model can only learn from the features it is given, this is the most important step of the machine learning pipeline.

For datasets with multiple tables and relationships between the tables, we’ll probably want to use automated feature engineering, but because this problem has a relatively small number of columns and only one table, we can hand-build a few high-value features.

For example, since we know that the cost of a taxi ride is proportional to the distance, we’ll want to use the start and stop points to try and find the distance traveled. One rough approximation of distance is the absolute value of the difference between the start and end latitudes and longitudes.

# Absolute difference in latitude and longitude
data['abs_lat_diff'] = (data['dropoff_latitude'] - data['pickup_latitude']).abs()
data['abs_lon_diff'] = (data['dropoff_longitude'] - data['pickup_longitude']).abs()

Features don’t have to be complex to be useful! Below is a plot of these new features colored by the binned fare.

Absolute Longitude vs Absolute Latitude

What these features give us is a *relative *measure of distance because they are calculated in terms of latitude and longitude and not an actual metric. These features are useful for comparison, but if we want a measurement in kilometers, we can apply the Haversine formula between the start and end of the trip, which calculates the Great Circle distance. This is still an approximation because it gives distance along a line drawn on the spherical surface of the Earth (I’m told the Earth is a sphere) connecting the two points, and clearly, taxis do not travel along straight lines. (See notebook for details).

Haversine distance by Fare (binned)

The other major source of features for this problem are time based. Given a date and time, there are numerous new variables we can extract. Constructing time features is a common task, and in the notebook I’ve included a useful function that builds a dozen features from a single timestamp.

Fare amount colored by time of day

Although I built almost 20 features in this project, there are still more to be found. The tough part about feature engineering is you never know when you have fully exhausted all the options. My next recommendation is:

  • Potential improvement 1: use more data for training the model

Feature engineering also involves problem expertise or applying algorithms that automatically build features for you. After building features, you’ll often have to apply feature selection to find the most relevant ones.

Once you have clean data and a set of features, you start testing models. Even though feature engineering comes before modeling on the outline, I often return to this step again and again over the course of a project.

Model Evaluation and Selection

A good first choice of model for establishing a baseline on a regression task is a simple linear regression. Moreover, if we look at the Pearson correlation of the features with the fare amount for this problem, we find several very strong linear relationships as shown below.

Pearson Correlations of Features with Target

Based on the strength of the linear relationships between some of the features and the target, we can expect a linear model to do reasonably well. While ensemble models and deep neural networks get all the attention, there’s no reason to use an overly complex model if a simple, interpretable model can achieve nearly the same performance. Nonetheless, it still makes sense to try different models, especially because they are easy to build with Scikit-Learn.

The starting model, a linear regression trained on only three features (the abs location differences and the passenger_count) achieved a validation root mean squared error (RMSE) of $5.32 and a mean absolute percentage error of 28.6%. The benefit to a simple linear regression is that we can inspect the coefficients and find for example that an increase in one passenger raises the fare by $0.02 according to the model.

# Linear Regression learned parameters
Intercept 5.0819
abs_lat_diff coef:  113.6661 	
abs_lon_diff coef: 163.8758
passenger_count coef: 0.0204

For Kaggle competitions, we can evaluate a model using both a validation set — here I used 1,000,000 examples — and by submitting test predictions to the competition. This allows us to compare our model to other data scientists — the linear regression places about 600/800. Ideally, we want to use the test set only once to get an estimate of how well our model will do on new data and perform any optimization using a validation set (or cross validation). The problem with Kaggle is that the leaderboard can encourage competitors to build complex models over-optimized to the testing data.

We also want to compare our model to a naive baseline that uses no machine learning, which in the case of regression can be guessing the mean value of the target on the training set. This results in an RMSE of $9.35 which gives us confidence machine learning is applicable to the problem.

Even training the linear regression on additional features does not result in a great leaderboard score and the next step is to try a more complex model. My next choice is usually the random forest, which is where I turned in this problem. The random forest is a more flexible model than the linear regression which means it has a reduced bias — it can fit the training data better. The random forest also generally has low variance meaning it can generalize to new data. For this problem, the random forest outperforms the linear regression, achieving a $4.20 validation RMSE on the same feature set.

Test Set Prediction Distributions

The reason a random forest typically outperforms a linear regression is because it has more flexibility — lower bias — and it has reduced variance because it combines together the predictions of many decision trees. A linear regression is a simple method and as such has a high bias — it assumes the data is linear. A linear regression can also be highly influenced by outliers because it solves for the fit with the lowest sum of squared errors.

The choice of model (and hyperparameters) represents the bias-variance tradeoff in machine learning: a model with high bias cannot learn even the training data accurately while a model with high variance essentially memorizes the training data and cannot generalize to new examples. Because the goal of machine learning is to generalize to new data, we want a model with both low bias and low variance.

The best model on one problem won’t necessarily be the best model on all problems, so it’s important to investigate several models spanning the range of complexity. Every model should be evaluated using the validation data and the best performing model can then be optimized in model tuning. I selected the random forest because of the validation results, and I’d encourage you to try out a few other models (or even combine models in an ensemble).

Model Optimization

In a machine learning problem, we have a few approaches for improving performance:

  1. Understand the problem and data
  2. Data exploration / data cleaning
  3. Feature engineering / feature selection
  4. Model evaluation and selection
  5. Model optimization
  6. Interpretation of results and predictions

There are still gains to be made from 1. and 2. (that’s part of the challenge), but I also wanted to provide a framework for optimizing the selected model.

Model optimization is the process of finding the best hyperparameters for a model on a given dataset. Because the best values of the hyperparameters depend on the data, this has to be done again for each new problem.

While the final Jupyter Notebook may present a cohesive story, the development process is very messy, involving rewriting code and changing earlier decisions.
There are a number of methods for optimization, ranging from manual tuning to automated hyperparameter tuning, but in practice, random search works well and is simple to implement. In the notebook, I provide code for running random search for model optimization. To make the computation times reasonable, I again sampled the data and only ran 50 iterations. Even this takes a considerable amount of time because the hyperparameters are evaluated using 3-fold cross validation. This means on each iteration, the model is trained with a selected combination of hyperparameters 3 times!

The best parameters were 
{'n_estimators': 41, 'min_samples_split': 2, 'max_leaf_nodes': 49, 'max_features': 0.5, 'max_depth': 22, 'bootstrap': True}
with a negative mae of -2.0216735083205952

I also tried out a number of different features and found the best model used only 12 of the 27 features. This makes sense because many of the features are highly correlated and hence are not necessary.

Heatmap of Correlation of Subset of Features

After running the random search and choosing the features, the final random forest model achieved an RMSE of 3.38 which represents a percentage error of 19.0%. This is a 66% reduction in the error from the naive baseline, and a 30% reduction in error from the first linear model. This performance illustrates a critical point in machine learning:

While the final Jupyter Notebook may present a cohesive story, the development process is very messy, involving rewriting code and changing earlier decisions.
Although I ran 50 iterations of random search, the hyperparameter values have probably not been fully optimized. My next recommendation is:

  • Potential improvement 1: use more data for training the model

The returns from this will probably be less than from feature engineering, but it’s possible there are still performance gains to be found. If you are feeling up to the task, you can also try out automated model tuning using a tool such as Hyperopt (I’ve written a guide which can be found here.)

Interpret Model and Make Predictions

While the random forest is more complex than the linear regression, it’s not a complete black box. A random forest is made up of an ensemble of decision trees which by themselves are very intuitive flow-chart-like models. We can even inspect individual trees in the forest to get a sense of how they make decisions. Another method for peering into the black box of the random forest is by examining the feature importances. The technical details aren’t that important at the moment, but we can use the relative values to determine which features are considered relevant to the model.

Feature Importances from training on all features

The most important feature by far is the Euclidean distance of the taxi ride, followed by the pickup_Elapsed , one of the time variables. Given that we made both of these features, we should be confident that our feature engineering went to good use! We could also feature importances for feature engineering or selection since we do not need to keep all of the variables.

Finally, we can take a look at the model predictions both on the validation data and on the test data. Because we have the validation answers, we can compute the error of the predictions, and we can examine the testing predictions for extreme values (there were several in the linear regression). Below is a plot of the validation predictions for the final model.

Random Forest validation predictions and true values.

Model interpretation is still a relatively new field, but there are some promising methods for examining a model. While the primary goal of machine learning is making accurate predictions on new data, it’s also equally important to know *why *the model is accurate and if it can teach us anything about the problem.


In this article and accompanying Jupyter Notebook, I presented a complete machine learning walk-through on a realistic dataset. We implemented the solution in Python code, and touched on many key concepts in machine learning. Machine learning is not a magical art, but rather a craft that can be honed through repetition. There is nothing stopping anyone from learning how to solve real-world problems using machine learning, and the most effective method for becoming adept is to work through projects.

In the spirit of learn by doing, my challenge to you is to improve upon my best model in the notebook and I’ve left you with a few recommendations that I believe will improve performance. There’s no one right answer in data science, and I look forward to seeing what everyone can come up with! If you take on the challenge, please leave a link to your notebook in the comments. Hopefully this article and notebook have given you the start necessary to get out there and solve this problem or others. And, when you do need help, don’t hesitate to ask because the data science community is always supportive.

Machine Learning, Data Science and Deep Learning with Python

Machine Learning, Data Science and Deep Learning with Python

Complete hands-on Machine Learning tutorial with Data Science, Tensorflow, Artificial Intelligence, and Neural Networks. Introducing Tensorflow, Using Tensorflow, Introducing Keras, Using Keras, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Learning Deep Learning, Machine Learning with Neural Networks, Deep Learning Tutorial with Python

Machine Learning, Data Science and Deep Learning with Python

Complete hands-on Machine Learning tutorial with Data Science, Tensorflow, Artificial Intelligence, and Neural Networks

Explore the full course on Udemy (special discount included in the link):

In less than 3 hours, you can understand the theory behind modern artificial intelligence, and apply it with several hands-on examples. This is machine learning on steroids! Find out why everyone’s so excited about it and how it really works – and what modern AI can and cannot really do.

In this course, we will cover:
• Deep Learning Pre-requistes (gradient descent, autodiff, softmax)
• The History of Artificial Neural Networks
• Deep Learning in the Tensorflow Playground
• Deep Learning Details
• Introducing Tensorflow
• Using Tensorflow
• Introducing Keras
• Using Keras to Predict Political Parties
• Convolutional Neural Networks (CNNs)
• Using CNNs for Handwriting Recognition
• Recurrent Neural Networks (RNNs)
• Using a RNN for Sentiment Analysis
• The Ethics of Deep Learning
• Learning More about Deep Learning

At the end, you will have a final challenge to create your own deep learning / machine learning system to predict whether real mammogram results are benign or malignant, using your own artificial neural network you have learned to code from scratch with Python.

Separate the reality of modern AI from the hype – by learning about deep learning, well, deeply. You will need some familiarity with Python and linear algebra to follow along, but if you have that experience, you will find that neural networks are not as complicated as they sound. And how they actually work is quite elegant!

This is hands-on tutorial with real code you can download, study, and run yourself.

Best Python Libraries For Data Science & Machine Learning

Best Python Libraries For Data Science & Machine Learning

Best Python Libraries For Data Science & Machine Learning | Data Science Python Libraries

This video will focus on the top Python libraries that you should know to master Data Science and Machine Learning. Here’s a list of topics that are covered in this session:

  • Introduction To Data Science And Machine Learning
  • Why Use Python For Data Science And Machine Learning?
  • Python Libraries for Data Science And Machine Learning
  • Python libraries for Statistics
  • Python libraries for Visualization
  • Python libraries for Machine Learning
  • Python libraries for Deep Learning
  • Python libraries for Natural Language Processing

Thanks for reading

If you liked this post, share it with all of your programming buddies!

Follow us on Facebook | Twitter

Further reading about Python

Complete Python Bootcamp: Go from zero to hero in Python 3

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python and Django Full Stack Web Developer Bootcamp

Complete Python Masterclass

Python Tutorial - Python GUI Programming - Python GUI Examples (Tkinter Tutorial)

Computer Vision Using OpenCV

OpenCV Python Tutorial - Computer Vision With OpenCV In Python

Python Tutorial: Image processing with Python (Using OpenCV)

A guide to Face Detection in Python

Machine Learning Tutorial - Image Processing using Python, OpenCV, Keras and TensorFlow

PyTorch Tutorial for Beginners

The Pandas Library for Python

Introduction To Data Analytics With Pandas

Python Programming for Data Science and Machine Learning

Python Programming for Data Science and Machine Learning

This article provides an overview of Python and its application to Data Science and Machine Learning and why it is important.

Originally published by Chris Kambala  at

Python is a general-purpose, high-level, object-oriented, and easy to learn programming language. It was created by Guido van Rossum who is known as the godfather of Python.

Python is a popular programming language because of its simplicity, ease of use, open source licensing, and accessibility — the foundation of its renowned community, which provides great support and help in creating tons of packages, tutorials, and sample programs.

Python can be used to develop a wide variety of applications — ranging from Web, Desktop GUI based programs/applications to science and mathematics programs, and Machine learning and other big data computing systems.

Let’s explore the use of Python in Machine Learning, Data Science, and Data Engineering.

Machine Learning

Machine learning is a relatively new and evolving system development paradigm that has quickly become a mandatory requirement for companies and programmers to understand and use. See our previous article on Machine Learning for the background. Due to the complex, scientific computing nature of machine learning applications, Python is considered the most suitable programming language. This is because of its extensive and mature collection of mathematics and statistics libraries, extensibility, ease of use and wide adoption within the scientific community. As a result, Python has become the recommended programming language for machine learning systems development.

Data Science

Data science combines cutting edge computer and storage technologies with data representation and transformation algorithms and scientific methodology to develop solutions for a variety of complex data analysis problems encompassing raw and structured data in any format. A Data Scientist possesses knowledge of solutions to various classes of data-oriented problems and expertise in applying the necessary algorithms, statistics, and mathematic models, to create the required solutions. Python is recognized among the most effective and popular tools for solving data science related problems.

Data Engineering

Data Engineers build the foundations for Data Science and Machine Learning systems and solutions. Data Engineers are technology experts who start with the requirements identified by the data scientist. These requirements drive the development of data platforms that leverage complex data extraction, loading, and transformation to deliver structured datasets that allow the Data Scientist to focus on solving the business problem. Again, Python is an essential tool in the Data Engineer’s toolbox — one that is used every day to architect and operate the big data infrastructure that is leveraged by the data scientist.

Use Cases for Python, Data Science, and Machine Learning

Here are some example Data Science and Machine Learning applications that leverage Python.

  • Netflix uses data science to understand user viewing pattern and behavioral drivers. This, in turn, helps Netflix to understand user likes/dislikes and predict and suggest relevant items to view.
  • Amazon, Walmart, and Target are heavily using data science, data mining and machine learning to understand users preference and shopping behavior. This assists in both predicting demands to drive inventory management and to suggest relevant products to online users or via email marketing.
  • Spotify uses data science and machine learning to make music recommendations to its users.
  • Spam programs are making use of data science and machine learning algorithm(s) to detect and prevent spam emails.

This article provided an overview of Python and its application to Data Science and Machine Learning and why it is important.

Originally published by Chris Kambala  at


Thanks for reading :heart: If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter

Learn More

☞ Jupyter Notebook for Data Science

☞ Data Science, Deep Learning, & Machine Learning with Python

☞ Deep Learning A-Z™: Hands-On Artificial Neural Networks

☞ Machine Learning A-Z™: Hands-On Python & R In Data Science

☞ Python for Data Science and Machine Learning Bootcamp

☞ Machine Learning, Data Science and Deep Learning with Python

☞ [2019] Machine Learning Classification Bootcamp in Python

☞ Introduction to Machine Learning & Deep Learning in Python

☞ Machine Learning Career Guide – Technical Interview

☞ Machine Learning Guide: Learn Machine Learning Algorithms

☞ Machine Learning Basics: Building Regression Model in Python

☞ Machine Learning using Python - A Beginner’s Guide