Kaggle House Prices Prediction with Linear Regression and Gradient Boosting. This notebook achieved a score of 0.12 (RMSLE) and placed within the top 25% in the Kaggle House Prices competition.

**My [Kaggle Notebook Link is here](https://www.kaggle.com/paulrohan2020/eda-and-simple-linear-regression-for-house-price)**

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
import math
```

The evaluation criteria for this Kaggle Competition is RMSLE — “Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)”

The Root Mean Squared Log Error (RMSLE) can be defined with a slight modification of sklearn's **mean_squared_log_error** function, which is itself a modification of the familiar Mean Squared Error (MSE) metric.

```
def root_mean_squared_log_error(y_validations, y_predicted):
    if len(y_predicted) != len(y_validations):
        raise ValueError('mismatch in number of data points between y_validations and y_predicted')
    # Take the log of both arrays, then compute a plain RMSE on the logged values
    y_predict_modified = [math.log(i) for i in y_predicted]
    y_validations_modified = [math.log(i) for i in y_validations]
    return mean_squared_error(y_validations_modified, y_predict_modified, squared=False)

df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
df_train.head()
```
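As a quick sanity check, the helper can be exercised on a few made-up sale prices (the numbers below are illustrative only, not from the competition data). This sketch uses `math.sqrt` around `mean_squared_error` rather than the `squared=False` keyword, since newer sklearn releases have removed that parameter:

```python
import math
from sklearn.metrics import mean_squared_error

def rmsle(y_true, y_pred):
    # Same idea as the helper above: log both sides, then take the RMSE.
    log_true = [math.log(v) for v in y_true]
    log_pred = [math.log(v) for v in y_pred]
    return math.sqrt(mean_squared_error(log_true, log_pred))

# Made-up prices: each prediction is off by roughly 5-10%
y_true = [100_000, 200_000, 300_000]
y_pred = [110_000, 190_000, 330_000]
print(round(rmsle(y_true, y_pred), 4))
```

Because the metric works on log-ratios, a 10% error on a cheap house and a 10% error on an expensive house contribute equally, which is exactly the behaviour the competition's evaluation description calls for.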

`df_test.head()`

First, I get a basic description of the data to develop a quick, easy-to-understand feel for it. The describe() function returns summary statistics for each numeric column, excluding NaN values.

`df_train.describe()`
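Because describe() silently drops NaNs, it is worth pairing it with an explicit missing-value count. A minimal sketch on a tiny synthetic frame (the values below are invented, though LotArea and SalePrice are real columns in this dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train, with deliberate gaps
df = pd.DataFrame({
    "LotArea":   [8450, 9600, np.nan, 11250],
    "SalePrice": [208500, 181500, 223500, np.nan],
})

print(df.describe())      # the "count" row reflects non-NaN values only
print(df.isnull().sum())  # explicit per-column NaN counts
```

On the real training frame, `df_train.isnull().sum()` gives the same per-column view and helps decide which columns need imputation before modelling.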


Practice your Data Science skills in Python by working through the hands-on, interactive projects I have posted for you.