_You can find all the Python scripts related to this project on my GitHub page. If you are interested, the scripts used for data cleaning and data visualization in this study are in the same repository. The project is also deployed using Django on Heroku._ View Deployment
A linear regression model assumes that, for any fixed value of X, Y is normally distributed. In this problem, the target value (Price) is not normally distributed; it is right-skewed.
To solve this problem, a log transformation is applied to the target variable because of its skewed distribution, and the inverse function must then be applied to the predicted values to get the actual predicted price.
Because the model is trained on the log-transformed target, the RMSLE is calculated to check the error, and the R2 score is also calculated to evaluate how well the model fits.
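As a rough illustration of the transform, its inverse, and the two metrics, here is a minimal sketch. The prices and the "predictions" are made-up numbers, and the use of `np.log1p` / `np.expm1` (rather than a plain `np.log` / `np.exp`) is an assumption, not something taken from the project's scripts.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination (R2 score)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical, right-skewed prices (not taken from the dataset)
prices = np.array([1500.0, 4200.0, 9800.0, 15000.0, 42000.0])

# 1) Train on the log-transformed target
log_prices = np.log1p(prices)

# 2) Stand-in for model output in log space (hypothetical small errors)
log_preds = log_prices + np.array([0.05, -0.02, 0.01, -0.04, 0.03])

# 3) Apply the inverse transform to get predictions back in price units
pred_prices = np.expm1(log_preds)

print("RMSLE:", rmsle(prices, pred_prices))
print("R2   :", r2(prices, pred_prices))
```

Note that the RMSLE on the original prices is the same as the RMSE computed directly on the log-transformed values, which is why it pairs naturally with a log-transformed target.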
The dataset used in this project was downloaded from Kaggle.
The first step is to remove irrelevant features such as ‘URL’, ‘region_url’, ‘vin’, ‘image_url’, ‘description’, ‘county’, and ‘state’ from the dataset.
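With pandas, this step could look roughly like the sketch below. The tiny DataFrame is a hypothetical stand-in for the Kaggle file, and the lowercase spelling of the column names is an assumption; `errors="ignore"` keeps the call from failing if a name differs.

```python
import pandas as pd

# Hypothetical stand-in for the raw dataset (one row, a few columns)
df = pd.DataFrame({
    "url": ["https://example.org/1"],
    "region_url": ["https://example.org"],
    "vin": ["XXXXXXXXXXXXXXXXX"],
    "image_url": ["https://example.org/1.jpg"],
    "description": ["some text"],
    "county": [None],
    "state": ["ca"],
    "price": [8500],
    "year": [2012],
})

# Columns named in the text; exact casing is assumed
drop_cols = ["url", "region_url", "vin", "image_url",
             "description", "county", "state"]

# errors="ignore" skips any listed column that is not present
df = df.drop(columns=drop_cols, errors="ignore")
print(df.columns.tolist())
```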
As a next step, check missing values for each feature.
Showing missing values (Image By Panwar Abhash Anil)
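A per-feature count of missing values like the one in the figure can be produced along these lines. The sample data is hypothetical and only serves to make the sketch runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some missing entries
df = pd.DataFrame({
    "price": [8500, 12000, 4300, 15900],
    "year": [2012, np.nan, 2008, 2016],
    "odometer": [np.nan, 65000, np.nan, 30000],
    "fuel": ["gas", "gas", None, "diesel"],
})

# Count and percentage of missing values per feature
missing = df.isnull().sum()
percent = (df.isnull().mean() * 100).round(2)

summary = (
    pd.DataFrame({"missing": missing, "percent": percent})
    .sort_values("missing", ascending=False)
)
print(summary)
```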
Next, the missing values were filled using an appropriate method.
To fill the missing values, the IterativeImputer method is used with several different estimators, and the MSE of each estimator is calculated using cross_val_score.
MSE with Different Imputation Methods (Image By Panwar Abhash Anil)
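The comparison described above can be sketched as follows. This is not the project's script: the data is synthetic (`make_regression` with values knocked out at random), and the list of candidate estimators, the downstream `BayesianRidge` model, and the settings `max_iter=5` and `cv=3` are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with roughly 20% of the feature values removed
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.2] = np.nan

# Candidate estimators used inside IterativeImputer (assumed list)
estimators = {
    "BayesianRidge": BayesianRidge(),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_features="sqrt", random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(n_estimators=10, random_state=0),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
}

mse = {}
for name, est in estimators.items():
    # Impute, then fit a simple downstream regressor on the imputed features
    pipe = make_pipeline(
        IterativeImputer(estimator=est, max_iter=5, random_state=0),
        BayesianRidge(),
    )
    scores = cross_val_score(
        pipe, X_missing, y, scoring="neg_mean_squared_error", cv=3
    )
    mse[name] = -scores.mean()  # flip sign: scikit-learn returns negative MSE

best = min(mse, key=mse.get)
print(mse)
print("Lowest MSE:", best)
```

Which estimator comes out best depends on the data, so the ranking from this toy example should not be read as confirming the result on the real dataset.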
From the above figure, we can conclude that the _ExtraTreesRegressor_ estimator is the best choice for the imputation method used to fill the missing values.
Missing values after filling (Image By Panwar Abhash Anil)
Finally, after dealing with the missing values, there are zero null values.
**Outliers:** The interquartile range (IQR) method is used to remove the outliers from the data.
Box Plot of price showing outliers (Image By Panwar Abhash Anil)
Box Plot of Odometer showing outliers (Image By Panwar Abhash Anil)
Box Plot & Histogram of the year (Image By Panwar Abhash Anil)
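The IQR rule mentioned above can be sketched as a small helper. The multiplier `k=1.5` is the conventional choice and an assumption here, as are the helper name `remove_outliers_iqr` and the sample prices.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Hypothetical prices with one extreme value
df = pd.DataFrame({"price": [4000, 5000, 5500, 6000, 6500, 7000, 250000]})

clean = remove_outliers_iqr(df, "price")
print(df.shape, "->", clean.shape)
```

The same helper could be applied in turn to other numeric columns such as the odometer reading.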
Finally, the shape of the dataset was (435849, 25) before processing and (374136, 18) after. In total, 61,713 rows and 7 columns were removed.