Portfolio

In 2023, I completed training in Data Analysis and Machine Learning with Python. The curriculum covered every stage of the ML pipeline: Exploratory Data Analysis and Data Visualization, Data Preprocessing, Feature Engineering, and Model Training and Scoring, applied to tasks such as regression, classification, clustering, and time series forecasting. I also studied the fundamentals of Natural Language Processing, Artificial Neural Networks (ANN), and Convolutional Neural Networks (CNN), while gaining proficiency with Python libraries such as NumPy, Pandas, Scikit-learn, TensorFlow, Keras, and spaCy. I am still at an early stage of this journey and look forward to exploring the field in greater depth.

Over the past few months, I have undertaken three projects that have served as valuable learning experiences.

Topic Modeling of News with Latent Dirichlet Allocation

This project originated during my early days of learning Python, more than a year ago. I developed a module to scrape headlines from various German news RSS feeds, with the goal of building a database of newspaper headlines for data analysis. The resulting dataset contains more than 70,000 entries.
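A minimal sketch of what such a scraper can look like, using feedparser and pandas; the feed URLs and field names below are illustrative assumptions, not the exact configuration of my module:

```python
# Illustrative RSS headline scraper; the real module uses a larger feed list.
import feedparser
import pandas as pd

FEEDS = {
    "tagesschau": "https://www.tagesschau.de/xml/rss2/",
    "spiegel": "https://www.spiegel.de/schlagzeilen/index.rss",
}

def fetch_headlines(feeds: dict) -> pd.DataFrame:
    """Collect title, summary, and publication date from each feed."""
    rows = []
    for source, url in feeds.items():
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            rows.append({
                "source": source,
                "title": entry.get("title", ""),
                "summary": entry.get("summary", ""),
                "published": entry.get("published", ""),
            })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    df = fetch_headlines(FEEDS)
    print(df.head())
```

Running such a script on a schedule and appending the results to a database is how a headline collection can grow into tens of thousands of rows over time.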

The project’s main aim is to cluster this dataset into dominant news topics based on the headlines’ content. I created a Machine Learning pipeline for data cleaning, natural language processing, and feature engineering, then generated a document-term matrix as input for Latent Dirichlet Allocation and predicted topics for both the training corpus and new data.
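The core of this step can be sketched with scikit-learn; the column name, number of topics, and vectorizer settings below are illustrative assumptions rather than my exact pipeline:

```python
# Sketch of the topic-modeling step, assuming a pandas DataFrame `df`
# with a preprocessed text column "headline_clean".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Build the document-term matrix from the cleaned headlines.
vectorizer = CountVectorizer(max_df=0.95, min_df=5)
dtm = vectorizer.fit_transform(df["headline_clean"])

# Fit LDA and assign each headline to its most probable topic.
lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(dtm)
df["topic"] = doc_topics.argmax(axis=1)

# Inspect the top words per topic.
terms = vectorizer.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top = terms[weights.argsort()[-10:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")

# New documents are scored with the already-fitted vectorizer and model.
new_dtm = vectorizer.transform(["Neue Regierung stellt Haushaltsplan vor"])
print(lda.transform(new_dtm).argmax(axis=1))
```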

Try the Demo App to see the process in action!

Titanic – Surviving the Disaster

The Titanic dataset is a popular playground for studying classification algorithms. The historical record of the ship’s passengers serves as the foundation for training a machine learning model that predicts survival as accurately as possible from features such as gender, age, and passenger class. I participated in the Kaggle competition and trained several models to improve my results. I am currently ranked in the top 12% of the competition leaderboard; my best submission, an ensemble Voting Classifier, achieved a score of 0.785.
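A minimal sketch of a soft-voting ensemble on preprocessed Titanic features; the base estimators and hyperparameters shown here are illustrative and not the exact configuration behind the 0.785 score:

```python
# Sketch of an ensemble voting classifier, assuming a preprocessed
# feature matrix X and target vector y (survived / not survived).
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    voting="soft",  # average predicted probabilities across models
)

scores = cross_val_score(voting_clf, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Soft voting averages the class probabilities of the individual models, which often smooths out the weaknesses of any single estimator.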

Possum Regression

This project focuses on regression techniques for predicting both continuous and categorical target variables, using Abram Beyer’s dataset hosted on Kaggle. The dataset contains measurements of mountain brushtail possums trapped at various locations across Australia. My work covers Exploratory Data Analysis, the prediction of head length and ear conch measurements, and the classification of the possum population using several regression models.
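A minimal sketch of the regression and classification steps; the feature selection and model choices are illustrative assumptions, and the column names (hdlngth, skullw, totlngth, taill, footlgth, Pop) follow the published dataset:

```python
# Sketch of predicting head length (regression) and population (classification)
# from the Kaggle possum dataset, assuming a local possum.csv file.
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, accuracy_score

df = pd.read_csv("possum.csv").dropna()
features = ["skullw", "totlngth", "taill", "footlgth"]

# Regression: predict head length from other body measurements.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["hdlngth"], test_size=0.2, random_state=42
)
reg = LinearRegression().fit(X_train, y_train)
print("R^2:", r2_score(y_test, reg.predict(X_test)))

# Classification: predict the population (Victoria vs. other) from the same features.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Pop"], test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```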

Image by Freepik