Mercedes Green Manufacturing: Kaggle Competition

As part of my continuing data analysis learning journey, I thought of trying out a past Kaggle competition to test my skills and knowledge so far. While going through the datasets, I came across the Mercedes Green Manufacturing Kaggle competition, which was conducted in 2017.

Coming from the automotive domain, I thought this could be a good dataset on which to apply my data analysis skills, and on reading the competition description I could relate to the problem even more closely. The competition asks: given a set of anonymous categorical and binary variables, can you predict the time a car will take to complete its testing?

As an engineer from this domain, I can fully appreciate the importance of such a model. I know how time-consuming vehicle testing can be. The process consists of building a prototype car, instrumenting it, and then running the required tests. The major bottleneck in car testing occurs during the instrumentation phase, which requires disassembling the car, fitting the required recording instruments, and then reassembling it.

Another bottleneck during testing is the availability of testing equipment, such as the drive cells required to run the tests.

All these factors result in wasted man-hours and increased development time in the vehicle development program, which adds unplanned overhead cost to the company. Hence, a model that can predict how much time a car will take to complete a test will help plan and manage cost and resources better.

#stacking #mercedes-benz #xgboost #ensemble-learning #automotive #deep-learning

Vern Greenholt

Kaggle Beginner Competitions Can Be Cheated

The purpose of this article is to warn new Kagglers before they waste their time trying to achieve an impossible score. Some Kagglers got maximum accuracy with one click. Before we discuss how they did it and why, let's briefly introduce the Kaggle scoring model to understand why somebody would even try to cheat.

Kaggle Progression System

Kaggle is a portal where data scientists, machine learning experts, and analysts can challenge their skills, share knowledge, and take part in various competitions. It is open to every level of experience, from complete newbie to grandmaster. You can use open datasets to broaden your knowledge, gain kudos/swag, and even win money.

[Image: Some of the available competitions. (Image by author)]

Winning competitions, taking part in discussions, and sharing your ideas result in medals. Medals are presented on your profile along with all your achievements.

#data-science #beginner #kaggle-competition #competition #kaggle

Angela Dickens

Detailed Solution to Mercedes Benz Green Manufacturing Competition

Table of Contents:

  1. Business Problem
  2. Problem Statement
  3. Data Preparation
  4. Exploratory Data Analysis
  5. Feature Engineering
  6. Data Preprocessing
  7. Feature Selection
  8. Modeling
  9. Summary
  10. Predictions on Test Data
  11. Conclusion and Future Work
  12. References

1. Business Problem:

Vehicle testing is an important aspect of the automobile manufacturing process. Every vehicle must meet a certain standard before it is delivered to the customer. Mercedes-Benz offers a wide range of vehicles with different customizations, and each vehicle must undergo testing to ensure that it satisfies the safety requirements and meets the emission norms. Because of the customization, each model requires a different test stand configuration, and since there are many models, a large number of tests need to be conducted. More tests mean more time spent on the test stand, increasing costs for Mercedes-Benz and generating more carbon dioxide, a greenhouse gas.

The Mercedes-Benz Green Manufacturing competition hosted on Kaggle aims to optimize the vehicle testing process by developing a machine learning model that can predict the time a vehicle spends on the test stand. The ultimate goal is to reduce the time spent on the test stand, which in turn reduces carbon dioxide emissions during the testing phase. The dataset for this study is provided by Mercedes-Benz and can be downloaded from this link.

2. Problem Statement:

The task is to develop a machine learning model that can predict the time a car will spend on the test bench based on its configuration, where the configuration is the set of customization options and features selected for that particular vehicle. The motivation is that an accurate model makes it possible to reduce the total time spent on testing by scheduling cars with similar testing configurations to run successively. This is a supervised regression task, since it involves predicting a continuous target variable from a set of independent variables by learning from labelled training data.

The evaluation metric is R-squared, also known as the coefficient of determination. It quantifies the fraction of the variation in the target variable that is explained by the features. The R-squared value typically lies between 0 and 1 (it can be negative for a model that performs worse than simply predicting the mean). The best possible value is 1, which indicates that all the variation in the target variable is explained by the input features.
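As a quick illustration, here is a minimal sketch of how R-squared is computed (using scikit-learn and made-up numbers, not the competition data):

import numpy as np
from sklearn.metrics import r2_score

#Hypothetical true testing times (seconds) and model predictions
y_true = np.array([96.2, 101.5, 88.7, 110.3, 99.0])
y_pred = np.array([95.0, 103.2, 90.1, 108.5, 100.4])

#R-squared = 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)       #manual computation
print(r2_score(y_true, y_pred))  #same value via scikit-learn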

3. Data Preparation:

Two datasets are provided by Mercedes-Benz for this competition, namely train.csv and test.csv. The file train.csv is the labelled dataset on which the machine learning model is developed, and test.csv is the dataset on which predictions are to be made. Both contain 377 features that represent the vehicle configuration during the testing phase. The features have names such as ‘X0’, ‘X1’, ‘X2’, and so on, and there is an ‘ID’ feature that represents the ID assigned to each vehicle test. The features are anonymous and have no stated physical meaning; the data description says they are configuration options such as suspension setting, adaptive cruise control, all-wheel drive, and a number of other options that together define a car model. A subset of the training data is shown in the image below.

import pandas as pd

#Load the training dataset and inspect the first rows
data = pd.read_csv('train.csv')
data.head()

[Image: data.head() output showing a subset of the training data]

There are 377 features, of which 368 are binary, 8 are categorical, and 1 (ID) is continuous. The target variable y is a continuous value representing the time taken by the vehicle for testing, in seconds. There are no missing values in the dataset. The train.csv file is split into a training and a validation set; the code below shows this operation.

from sklearn.model_selection import train_test_split

#Separate the dependent and independent features
X = data.drop(columns=['y'])
Y = data['y']

#Split the dataset into 80% train and 20% validation
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=25)

#Concatenate X_train and y_train back into a single frame for EDA
train_data = pd.merge(X_train, y_train.to_frame(), left_index=True, right_index=True)
train_data.head()

[Image: train_data.head() output]

Exploratory Data Analysis (EDA) is performed on the train_data frame created in the code above.
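As a quick sanity check on the feature-type breakdown stated above, the column types can be counted directly (a sketch; treating 'object' dtype columns as categorical is an assumption about how pandas reads this file):

#Count feature types: object columns are categorical, 0/1 integer columns are binary
feature_cols = [c for c in data.columns if c not in ('ID', 'y')]
categorical = [c for c in feature_cols if data[c].dtype == 'object']
binary = [c for c in feature_cols
          if data[c].dtype != 'object' and set(data[c].unique()) <= {0, 1}]
print('categorical:', len(categorical), 'binary:', len(binary))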

4. Exploratory Data Analysis:

4.1. Analyze target/dependent variable:

The image below contains the histogram and box-plot of the target variable.

[Image: histogram and box-plot of the target variable y]

The target variable has a mean of about 100 seconds. From the box-plot, points with target values above 137.5 can be inferred to be outliers. For this competition, points above 150 are classified as outliers and removed from the training set, as shown in the sketch below.
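In code, this outlier removal is a simple filter on the train_data frame created earlier (a sketch; the 150-second cutoff is the choice described above):

#Drop rows whose testing time exceeds the 150 s outlier cutoff
print('before:', train_data.shape)
train_data = train_data[train_data['y'] <= 150].reset_index(drop=True)
print('after:', train_data.shape)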

4.2. Analyze categorical variables:

There are 8 categorical features, namely X0, X1, X2, X3, X4, X5, X6, and X8. For each of these features, a histogram of the counts of its unique values and a box-plot of y per unique value are plotted (a plotting sketch follows below).
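A sketch of how such plots can be generated with seaborn (assuming the train_data frame from earlier; the styling of the original figures may differ):

import matplotlib.pyplot as plt
import seaborn as sns

cat_features = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
for col in cat_features:
    fig, axes = plt.subplots(1, 2, figsize=(16, 4))
    order = train_data[col].value_counts().index
    #Left: counts of each unique value
    sns.countplot(x=col, data=train_data, order=order, ax=axes[0])
    #Right: distribution of testing time y per unique value
    sns.boxplot(x=col, y='y', data=train_data, order=order, ax=axes[1])
    plt.show()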

4.2.1. X0 feature:

[Image: value counts and box-plots of y for X0 categories]

Observations from the above plots:

  1. aa, ab, g and ac occur only once.
  2. The box-plots of z, y, t, o, f, n, s, al, m, ai, e, ba, aq, am, u, i, ad and b are nearly the same; the mean of these categories is roughly 93.
  3. The box-plots of ak, x, j, w, af, at, a, ax, i, au, as, r and c are nearly the same; the mean of these categories is roughly 110.
  4. Thus there appears to be a grouping among the categories of X0 (the sketch below quantifies this).
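One way to quantify this grouping is to compare the mean of y per X0 category (a sketch on train_data):

#Mean testing time per X0 category; the two clusters around
#roughly 93 s and 110 s noted above should be visible here
x0_means = train_data.groupby('X0')['y'].agg(['mean', 'count'])
print(x0_means.sort_values('mean'))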

4.2.2. X1 feature:

[Image: value counts and box-plots of y for X1 categories]

Observations from the above plots:

  1. Most of the categories of X1 have a mean of 100.
  2. The y category of X1 is clearly separated from the rest of the categories.

4.2.3. X2 feature:

[Image: value counts and box-plots of y for X2 categories]

Observations from the above plots:

  1. The ae category dominates X2; 39% of X2 values are ae.
  2. Similar to X0, there appears to be a grouping in X2, although less pronounced than in X0.
  3. Most categories of X2 have a mean close to 97.

4.2.4. X3 feature:

[Image: value counts and box-plots of y for X3 categories]

Observations from the above plots:

  1. The c category dominates X3; 46% of X3 values are c.
  2. Almost all the categories of X3 have a mean of 100.
  3. There appears to be little variation in the dependent variable y across the categories of X3; the box-plots for most categories match.

4.2.5. X4 feature:

[Image: value counts and box-plots of y for X4 categories]

Observations from the above plots:

  1. The d category dominates X4; 99.9% of X4 values are d.
  2. This feature should be dropped, as there is almost no variance in it (see the sketch below).
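A sketch of how such near-constant features can be detected and dropped (the 99% dominance threshold is an illustrative choice, not taken from the competition, and may also flag very sparse binary features):

#Drop features whose most frequent value covers almost all rows
drop_cols = [c for c in train_data.columns
             if c not in ('ID', 'y')
             and train_data[c].value_counts(normalize=True).iloc[0] > 0.99]
print('near-constant features:', drop_cols)  #should include X4
train_data = train_data.drop(columns=drop_cols)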

4.2.6. X5 feature:

[Image: value counts and box-plots of y for X5 categories]

Observations from the above plots:

  1. x, h, g, y and u occur very rarely in the data.
  2. The mean of all categories of X5 is close to 98.
  3. There appears to be little variation in the dependent variable y across the categories of X5; the box-plots for most categories match.

#deep-learning #machine-learning #kaggle #stacking #top-5

Lessons From My First Kaggle Competition

How I chose my first Kaggle competition to enter and what I learned from doing it.

A little background

I find starting out in a new area of programming a somewhat daunting experience. I have been programming for the past 8 years, but have only recently developed a keen interest in data science. I want to share my experience to encourage you to take the plunge too!

I started out dipping my toe in the ocean of this vast topic with a couple of the Kaggle mini-courses. I didn't need to learn how to write Python, but I needed to equip myself with the tools to do the programming that I wanted. First up was Intro to Machine Learning, which seemed like a good place to start. As part of this course you contribute to an in-course competition, but even after completing it I didn't feel prepared for a public competition. Cue Intermediate Machine Learning, where I learned to use a new model and how to think more deeply about a data problem.

#2020-sep-tutorials #overviews #competition #data-science #kaggle