Titanic – Surviving the Disaster


The Titanic classification problem is a widely recognized and classic challenge in the field of machine learning. It revolves around predicting the survival outcomes of passengers aboard the ill-fated RMS Titanic during its maiden voyage in April 1912. This dataset is a favorite among data scientists and machine learning enthusiasts due to its real-world historical context and inherent complexities.

The task involves developing a predictive model that assigns each passenger to one of two classes, "survived" or "did not survive," based on a set of input features such as age, gender, class, and more. The Titanic dataset, often used for training and testing models, presents numerous opportunities to explore data preprocessing, feature engineering, and model selection, making it an excellent learning resource for those looking to hone their data science and machine learning skills.

Beyond its educational value, the Titanic classification problem highlights the ethical implications of decision-making in life-and-death scenarios and showcases the significance of feature selection and model accuracy in a broader context. This challenge continues to be a cornerstone in the journey of aspiring data scientists, illustrating the power of machine learning in understanding historical events and improving predictive analytics.

1. Exploratory Data Analysis

First, I examined the Titanic dataset to understand its features, their correlations, and their relevance to the task of predicting whether a passenger survived.

The main insights I gained from analyzing the dataset are as follows (a short pandas sketch for checking them appears after the list):

  • Female passengers had a much better chance of survival compared to male passengers.
  • First-class passengers were more likely to survive than those in the other two classes, with the majority of fatalities occurring in third class.
  • Age doesn’t appear to significantly impact a passenger’s likelihood of survival, except for babies under 1 year, all of whom survived.
  • Passengers traveling with 1 or 2 relatives had the best chances of survival, while larger family sizes were associated with lower survival rates.
  • This relationship between family size and survival may be explained by the fact that larger families often traveled in third class, where the chances of survival were smaller, as previously established.
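As a minimal pandas sketch of how these survival rates can be checked (column names are those of the Kaggle training set; the file path is an assumption, and in a Kaggle notebook it would be "/kaggle/input/titanic/train.csv"):

```python
import pandas as pd

# Load the Kaggle training set (path assumed).
df = pd.read_csv("train.csv")

# Survival rate by sex: female passengers survived far more often.
print(df.groupby("Sex")["Survived"].mean())

# Survival rate by passenger class: first class fared best.
print(df.groupby("Pclass")["Survived"].mean())

# Family size = siblings/spouses + parents/children + the passenger;
# small families show the highest survival rates.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
print(df.groupby("FamilySize")["Survived"].mean())
```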

Click here to see the code on Kaggle.

2. Feature Engineering and Model Training

To address the classification problem, I worked through several steps: data preprocessing (in particular, handling the many entries with missing data), feature work (feature encoding, feature scaling, creating new features, and selecting the relevant ones), and experimenting with different classification algorithms and their hyperparameters; a sketch of such a preprocessing pipeline follows the model list below. The models I trained included:

  • Logistic Regression
  • K-Nearest Neighbors
  • Naive Bayes
  • Decision Tree
  • Random Forest
  • Linear SVM, Kernel SVM
  • CatBoost
  • XGBoost
  • Artificial Neural Network (ANN)
  • Ensemble Voting Classifier
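To make the preprocessing and feature work concrete, here is a minimal sketch of such a pipeline; the specific imputation strategies, engineered features, and selected columns are illustrative assumptions, not my exact Kaggle code:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")

# Impute missing values: median age, most frequent embarkation port.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# New features: family size and a flag for traveling alone.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Encode categoricals: binary-map Sex, one-hot encode Embarked.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True, dtype=int)

# Scale the continuous features so distance-based models (KNN, SVM)
# are not dominated by the larger Fare values.
scaler = StandardScaler()
df[["Age", "Fare"]] = scaler.fit_transform(df[["Age", "Fare"]])

# Select a relevant subset of features for training.
features = ["Pclass", "Sex", "Age", "Fare", "FamilySize", "IsAlone",
            "Embarked_Q", "Embarked_S"]
X, y = df[features], df["Survived"]
```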

My best results came from an Ensemble Voting Classifier and from XGBoost tuned with grid search, earning me a top-12% ranking on the Kaggle competition leaderboard. You can find the code on my project’s GitHub page.
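As a hedged sketch of how these two best-performing approaches can be wired up with scikit-learn and xgboost (the estimator choices, the hyperparameter grid, and the X/y names from the preprocessing sketch above are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier

# Soft-voting ensemble: averages the predicted probabilities of a few
# of the individual models from the list above.
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
    ],
    voting="soft",
)
print("Voting CV accuracy:", cross_val_score(voting, X, y, cv=5).mean())

# Grid search over a small XGBoost hyperparameter grid
# (the grid values are placeholders, not the ones I actually used).
param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"), param_grid, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_, "CV accuracy:", search.best_score_)
```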

Image by NoName_13.
