Welcome back! In my previous post I wrote an EDA (Exploratory Data Analysis) of the Titanic Survival dataset. Check it out if you haven’t already. In this article I would like to focus on how to create a machine learning model that can predict whether a Titanic passenger survived based on their attributes, e.g. gender, title, age and more.

Before going any further, I also want you to know that this project is inspired by this article: https://towardsdatascience.com/kaggle-titanic-machine-learning-model-top-7-fa4523b7c40. I implement several of the feature engineering techniques explained in that article, with some modifications for the sake of simplicity. Now let’s do this :)

Note: the full code is available at the end of this article.


As always, the very first thing I do is import all required modules and load the dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
df = pd.read_csv('train.csv')

Feature engineering 1: SibSp & Parch

Now let’s start the feature engineering with the _SibSp_ and _Parch_ columns. According to the dataset details, these two columns represent the number of siblings/spouses and the number of parents/children aboard the Titanic, respectively. The idea here is to create a new column called FamilySize whose value is derived from these two columns.
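To make the idea concrete, here is a minimal sketch of how such a column could be built. It assumes FamilySize is the sum of SibSp and Parch plus one for the passenger themselves, which is a common convention but an assumption here. The small `sample` DataFrame is a made-up stand-in for `df`, so the snippet runs without train.csv.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic training data (values are made up).
sample = pd.DataFrame({
    "SibSp": [1, 0, 3],   # siblings/spouses aboard
    "Parch": [0, 0, 2],   # parents/children aboard
})

# Assumption: family size = siblings/spouses + parents/children + the passenger.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

print(sample)
```

The same line should work on the real data by replacing `sample` with `df`, provided the SibSp and Parch columns have no missing values.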


Titanic Survival Dataset Part 2/2: Logistic Regression