Machine Learning Pipeline for Topic Modeling¶

The dataset was scraped from various RSS feeds between June 2022 and September 2023 and serves as the foundation for a data science and machine learning project. The project covers exploratory data analysis, topic modeling, and practice with fundamental techniques.

The dataset is available in two formats: CSV text files and a PostgreSQL database. It includes the following columns:

  • id: a unique identifier for each entry in the database
  • date: content from the RSS pubdate-tag
  • title: content from the RSS title-tag
  • description: content from the RSS description-tag
  • author: content from the RSS author-tag
  • category: content from the RSS category-tag
  • copyright: content from the RSS copyright-tag
  • url: the URL of the RSS feed
  • text: a legacy field for storing additional textual content
  • source: the name of the news magazine hosting the RSS feed (including Focus, Zeit, Tagesschau, Stern, Welt, taz, and ZDF heute)

The pipeline performs the following tasks: loading data from a PostgreSQL database, basic data preprocessing, natural language processing, and feature engineering. The data is not split into training and test sets because topic modeling is unsupervised and no target labels are available.

Imports¶

In [1]:
# data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# loading data from postgresql database 
import sqlalchemy as sql

from datetime import datetime

# saving the pipeline
import joblib

# from scikit-learn
from sklearn.pipeline import Pipeline

# from feature-engine
from feature_engine.imputation import CategoricalImputer, AddMissingIndicator, DropMissingData
from feature_engine.encoding import RareLabelEncoder
from feature_engine.selection import DropFeatures

# from preprocessors
from preprocessors import preprocessors as pp

Load the data from database¶

The dataset contains entries from June 2022 onward. The model is trained and tested with data from 01.06.2022 to 30.09.2023. Data from 01.10.2023 onward is treated as new data and used only for prediction.
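The date cutoff described above can be illustrated on a small in-memory DataFrame. This is only a sketch: the three rows are made-up stand-ins for the `headlines` table, and `train_part` / `new_part` are hypothetical names that do not appear elsewhere in this notebook.

```python
from datetime import datetime

import pandas as pd

# Toy stand-in for the headlines table (values are illustrative only)
headlines = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-06-01 00:13:42",
        "2023-09-30 23:59:59",
        "2023-10-01 00:00:00",
    ]),
    "title": ["first entry", "last training entry", "new entry"],
})

start_date = datetime(2022, 6, 1, 0, 0, 0)
end_date = datetime(2023, 9, 30, 23, 59, 59)

# Rows inside the window are used for fitting (between is inclusive on both ends)
in_window = headlines["date"].between(start_date, end_date)
train_part = headlines[in_window]

# Rows after the cutoff are only used for prediction
new_part = headlines[headlines["date"] > end_date]

print(len(train_part), len(new_part))
```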

In [2]:
# connect to db
engine = sql.create_engine('postgresql+psycopg2://news:news@localhost:5432/news')
con = engine.connect()

start_date = datetime(2022, 6, 1, 0, 0, 0)
end_date = datetime(2023, 9, 30, 23, 59, 59)

with con:
    
    # query data for model training and testing
    query = sql.text("""
        SELECT *
        FROM headlines
        WHERE (date >= :start_date
        AND date <= :end_date)
        ORDER BY date ASC
        """)
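    # pass the bound parameters as a dictionary (SQLAlchemy 2.0 no longer accepts them as keyword arguments)
    result = con.execute(query, {"start_date": start_date, "end_date": end_date})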
    train = pd.DataFrame(result.fetchall(), columns=result.keys())
In [3]:
train.head()
Out[3]:
| | id | date | title | description | author | category | copyright | url | text | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 71650 | 2022-06-01 00:13:42 | Preise: Grüne halten Senkung der Spritsteuer f... | Heute tritt die Steuersenkung auf Kraftstoffe ... | None | Steuersenkung, Bundestag, Katharina Dröge, Spr... | None | https://www.stern.de/politik/deutschland/preis... | None | stern |
| 1 | 71649 | 2022-06-01 01:55:03 | Biden warnt Putin: USA liefern moderne Raketen... | Die USA rüsten die Ukraine mit fortschrittlich... | None | Ukraine, USA, Joe Biden, Russland, Raketensyst... | None | https://www.stern.de/politik/ausland/biden-war... | None | stern |
| 2 | 71648 | 2022-06-01 02:04:08 | Soziale Medien: FDP-Politiker Kuhle: Internet-... | Eine «ZDF Magazin Royale»-Recherche beschäftig... | None | Konstantin Kuhle, FDP, Straftat, Berlin, ZDF, ... | None | https://www.stern.de/politik/deutschland/sozia... | None | stern |
| 3 | 71675 | 2022-06-01 02:26:58 | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Rund zwei von drei Mädchen und Jungen in der U... | None | None | None | https://www.tagesschau.de/newsticker/liveblog-... | None | Tagesschau |
| 4 | 71647 | 2022-06-01 02:31:43 | Finanzen: Dänemark stimmt über EU-Verteidigung... | Vorbehalt verteidigen oder Verteidigung ohne V... | None | Dänemark, EU, Volksabstimmung, Finanzen, Ukrai... | None | https://www.stern.de/politik/ausland/finanzen-... | None | stern |
In [4]:
print(train.shape)
(75461, 10)
In [5]:
# replace None with NaN
train = train.fillna(value=np.nan)
train.head()
Out[5]:
| | id | date | title | description | author | category | copyright | url | text | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 71650 | 2022-06-01 00:13:42 | Preise: Grüne halten Senkung der Spritsteuer f... | Heute tritt die Steuersenkung auf Kraftstoffe ... | NaN | Steuersenkung, Bundestag, Katharina Dröge, Spr... | NaN | https://www.stern.de/politik/deutschland/preis... | NaN | stern |
| 1 | 71649 | 2022-06-01 01:55:03 | Biden warnt Putin: USA liefern moderne Raketen... | Die USA rüsten die Ukraine mit fortschrittlich... | NaN | Ukraine, USA, Joe Biden, Russland, Raketensyst... | NaN | https://www.stern.de/politik/ausland/biden-war... | NaN | stern |
| 2 | 71648 | 2022-06-01 02:04:08 | Soziale Medien: FDP-Politiker Kuhle: Internet-... | Eine «ZDF Magazin Royale»-Recherche beschäftig... | NaN | Konstantin Kuhle, FDP, Straftat, Berlin, ZDF, ... | NaN | https://www.stern.de/politik/deutschland/sozia... | NaN | stern |
| 3 | 71675 | 2022-06-01 02:26:58 | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Rund zwei von drei Mädchen und Jungen in der U... | NaN | NaN | NaN | https://www.tagesschau.de/newsticker/liveblog-... | NaN | Tagesschau |
| 4 | 71647 | 2022-06-01 02:31:43 | Finanzen: Dänemark stimmt über EU-Verteidigung... | Vorbehalt verteidigen oder Verteidigung ohne V... | NaN | Dänemark, EU, Volksabstimmung, Finanzen, Ukrai... | NaN | https://www.stern.de/politik/ausland/finanzen-... | NaN | stern |
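The parameterized date-range query used in this section can be tried without a running PostgreSQL server. The following is a minimal sketch that assumes an in-memory SQLite database as a stand-in: the three inserted rows are invented, and `pd.read_sql_query` is used here in place of the SQLAlchemy connection from the cell above.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the PostgreSQL "news" database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE headlines (id INTEGER, date TEXT, title TEXT)")
con.executemany(
    "INSERT INTO headlines VALUES (?, ?, ?)",
    [
        (1, "2022-05-31 23:59:59", "too early"),
        (2, "2022-06-01 00:13:42", "in range"),
        (3, "2023-10-01 00:00:00", "too late"),
    ],
)

query = """
    SELECT *
    FROM headlines
    WHERE date >= :start_date
      AND date <= :end_date
    ORDER BY date ASC
"""
params = {"start_date": "2022-06-01 00:00:00", "end_date": "2023-09-30 23:59:59"}

# Bound parameters keep the values out of the SQL string itself
sample = pd.read_sql_query(query, con, params=params)
con.close()

print(sample)
```

ISO-formatted timestamps sort correctly as strings, which is why the text comparison works in this SQLite stand-in. In PostgreSQL the `date` column would typically be a real timestamp type.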

Save raw data for train to csv¶

In [6]:
train.to_csv('../data/00_train_no_split_raw.csv')
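Note that `to_csv` writes the row index as an extra, unnamed first column unless `index=False` is passed. The raw export above keeps the index, whereas the preprocessed export later in this notebook passes `index=False`. A small sketch of the difference, using a made-up two-row DataFrame and an in-memory buffer instead of a file:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"title": ["a", "b"], "source": ["stern", "Tagesschau"]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(StringIO(df.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

When reading the raw file back, passing `index_col=0` to `pd.read_csv` is one way to restore the index instead of getting an `Unnamed: 0` column.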

Configuration¶

In [7]:
# variables with duplicates
VARS_WITH_DUPLICATES = ['title', 'description']

# features to drop
DROP_FEATURES = ['id', 'copyright', 'author', 'url']

# variables with NA in train set that will be filled with 'Missing' value
VARS_WITH_NA_MISSING = ['source', 'category']

# variables with frequent values in train set
VARS_WITH_FREQUENT = ['category']

# variables to be combined 
VARS_TO_COMBINE = ('title_description_text', ['title', 'description', 'text'])

# features used for topic modeling (one model is trained per feature)
NLP_FEATURES = ['title', 'title_description_text']

Feature Engineering on train_test¶

Drop duplicates from train_test¶

In [8]:
print(len(train.index))
75461
In [9]:
# Count the number of duplicate rows based on the specified subset of columns
duplicate_count = train.duplicated(subset=VARS_WITH_DUPLICATES).sum()
print("Number of duplicate rows:", duplicate_count)
Number of duplicate rows: 7947
In [10]:
train = pp.drop_duplicates(train, VARS_WITH_DUPLICATES)
In [11]:
print(train.duplicated(subset=VARS_WITH_DUPLICATES).sum())
print(len(train.index))
0
67514
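The counts are consistent with keeping the first row of each duplicate group: 75461 - 7947 = 67514. The source of `pp.drop_duplicates` is not shown in this notebook, so the function below is only a hypothetical stand-in that assumes it wraps `DataFrame.drop_duplicates`. The toy data is made up.

```python
import pandas as pd


def drop_duplicates(df: pd.DataFrame, subset: list) -> pd.DataFrame:
    """Hypothetical stand-in for pp.drop_duplicates: keep the first row of each duplicate group."""
    return df.drop_duplicates(subset=subset, keep="first").reset_index(drop=True)


toy = pd.DataFrame({
    "title": ["A", "A", "B", "A"],
    "description": ["x", "x", "y", "z"],
    "source": ["stern", "Welt", "taz", "Zeit"],
})

# Only the second row repeats an earlier (title, description) pair
deduped = drop_duplicates(toy, ["title", "description"])
print(len(toy), "->", len(deduped))
```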

Pipeline¶

Set up and train the Pipeline¶

In [12]:
# set up the Pipeline
topic_pipe = Pipeline([
    
    # ===== CREATION OF NEW FEATURES =====
    ('concat_features_encoder', pp.ConcatStringFeatureEncoder(
        new_col='title_description_text', ref_col_list=['title', 'description', 'text'])),
    
    # ===== IMPUTATION =====
    # impute 'title' with values from description and text
    ('missing_title_imputation', pp.MultiReferenceImputer(col='title', ref_col_list=['description', 'text'])),
    
    # impute categorical variables with the string 'Missing'
    ('missing_imputation',  CategoricalImputer(imputation_method='missing', variables=VARS_WITH_NA_MISSING)),
    
    # ===== DROPPING observations with NA ===== 
    ('drop_missing_title', DropMissingData(variables=['title'])),
    
    # ===== DROPPING features =====
    ('drop_features', DropFeatures(features_to_drop=DROP_FEATURES)),
    
    # ===== ENCODING =====
    # encode rare labels
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=VARS_WITH_FREQUENT, replace_with='Other')),
    
    # ===== CREATION OF NEW FEATURES =====
    ('nlp_feature_encoder', pp.NLPFeatureEncoder(col_list=NLP_FEATURES))
    
])
In [13]:
# train the pipeline
topic_pipe.fit(train)
Out[13]:
Pipeline(steps=[('concat_features_encoder',
                 ConcatStringFeatureEncoder(new_col='title_description_text',
                                            ref_col_list=['title',
                                                          'description',
                                                          'text'])),
                ('missing_title_imputation',
                 MultiReferenceImputer(col='title',
                                       ref_col_list=['description', 'text'])),
                ('missing_imputation',
                 CategoricalImputer(variables=['source', 'category'])),
                ('drop_missing_title', DropMissingData(variables=['title'])),
                ('drop_features',
                 DropFeatures(features_to_drop=['id', 'copyright', 'author',
                                                'url'])),
                ('rare_label_encoder',
                 RareLabelEncoder(n_categories=1, replace_with='Other',
                                  tol=0.01, variables=['category'])),
                ('nlp_feature_encoder',
                 NLPFeatureEncoder(col_list=['title',
                                             'title_description_text']))])
In [14]:
train = topic_pipe.transform(train)
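The custom steps come from the local `preprocessors` module, whose source is not shown in this notebook. As a rough sketch of how such steps could be written as scikit-learn compatible transformers, assume they simply concatenate text columns and fill missing titles from reference columns. Only the class names and constructor arguments are taken from the pipeline above. The class bodies and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ConcatStringFeatureEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: join several text columns into one new column."""

    def __init__(self, new_col, ref_col_list):
        self.new_col = new_col
        self.ref_col_list = ref_col_list

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = X.copy()
        # Join the non-missing parts of each row with a single space
        X[self.new_col] = X[self.ref_col_list].apply(
            lambda row: " ".join(str(v) for v in row if pd.notna(v)), axis=1
        )
        return X


class MultiReferenceImputer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: fill a column from the first non-missing reference column."""

    def __init__(self, col, ref_col_list):
        self.col = col
        self.ref_col_list = ref_col_list

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = X.copy()
        for ref in self.ref_col_list:
            X[self.col] = X[self.col].fillna(X[ref])
        return X


toy = pd.DataFrame({
    "title": ["Headline", np.nan],
    "description": ["Teaser", "Only a teaser"],
    "text": [np.nan, np.nan],
})
toy = ConcatStringFeatureEncoder(
    "title_description_text", ["title", "description", "text"]
).transform(toy)
toy = MultiReferenceImputer("title", ["description", "text"]).transform(toy)
print(toy[["title", "title_description_text"]])
```

Deriving from `BaseEstimator` and `TransformerMixin` is what lets such classes be used as `Pipeline` steps and printed with their parameters, as in the output of the fitted pipeline.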

Evaluate the training set¶

In [15]:
train.head()
Out[15]:
| | date | title | description | category | text | source | title_description_text | title_cleaned | title_description_text_cleaned |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-06-01 00:13:42 | Preise: Grüne halten Senkung der Spritsteuer f... | Heute tritt die Steuersenkung auf Kraftstoffe ... | Other | NaN | stern | Preise: Grüne halten Senkung der Spritsteuer f... | Preis grüne halten Senkung Spritsteuer falsch ... | Preis grüne halten Senkung Spritsteuer falsch ... |
| 1 | 2022-06-01 01:55:03 | Biden warnt Putin: USA liefern moderne Raketen... | Die USA rüsten die Ukraine mit fortschrittlich... | Other | NaN | stern | Biden warnt Putin: USA liefern moderne Raketen... | Biden warnen Putin USA liefern modern Raketens... | Biden warnen Putin USA liefern modern Raketens... |
| 2 | 2022-06-01 02:04:08 | Soziale Medien: FDP-Politiker Kuhle: Internet-... | Eine «ZDF Magazin Royale»-Recherche beschäftig... | Other | NaN | stern | Soziale Medien: FDP-Politiker Kuhle: Internet-... | sozial Medium FDP-Politiker Kuhle Internet-Str... | sozial Medium FDP-Politiker Kuhle Internet-Str... |
| 3 | 2022-06-01 02:26:58 | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Rund zwei von drei Mädchen und Jungen in der U... | Missing | NaN | Tagesschau | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Liveblog ukrainisch Kind vertreiben | Liveblog ukrainisch Kind vertreiben rund Mädch... |
| 4 | 2022-06-01 02:31:43 | Finanzen: Dänemark stimmt über EU-Verteidigung... | Vorbehalt verteidigen oder Verteidigung ohne V... | Other | NaN | stern | Finanzen: Dänemark stimmt über EU-Verteidigung... | Finanz Dänemark stimmen EU-Verteidigungsvorbehalt | Finanz Dänemark stimmen EU-Verteidigungsvorbeh... |
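The `*_cleaned` columns look like the result of removing punctuation and stop words and reducing words to their lemmas. The implementation of `pp.NLPFeatureEncoder` is not shown here, so the snippet below is only a simplified stand-in: it removes punctuation and a handful of German stop words and does not lemmatize. The stop-word list, the function name `clean_text`, and the example sentence are all made up for illustration.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a full German list
GERMAN_STOP_WORDS = {"der", "die", "das", "den", "und", "von", "mit", "über", "für", "auf", "in"}


def clean_text(text: str) -> str:
    """Keep word-like tokens (letters with inner hyphens) and drop stop words. No lemmatization."""
    # [^\W\d_] matches any Unicode letter, so umlauts are kept
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)
    kept = [t for t in tokens if t.lower() not in GERMAN_STOP_WORDS]
    return " ".join(kept)


cleaned = clean_text("Beispiel: Die Regierung berät über den EU-Haushalt")
print(cleaned)
```

A lemmatizer would additionally map inflected forms to their base form, as in the table above where "warnt" becomes "warnen".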
In [16]:
train[train['title'].isnull()]
Out[16]:
date title description category text source title_description_text title_cleaned title_description_text_cleaned
In [17]:
train.isnull().sum()
Out[17]:
date                                  0
title                                 0
description                        3834
category                              0
text                              67513
source                                0
title_description_text                0
title_cleaned                         0
title_description_text_cleaned        0
dtype: int64
In [18]:
train['category'].value_counts()
Out[18]:
category
Other            27545
Missing          25696
News              4023
Ausland           3727
Deutschland       3342
Ukraine-Krise     1244
Wirtschaft        1122
Politik            814
Name: count, dtype: int64
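With `tol=0.01`, categories that make up less than 1% of the rows are merged into `Other`. The rarest named category left, Politik, covers about 1.2% of the rows. The grouping idea can be sketched in plain pandas. This is a simplified illustration rather than feature-engine's implementation, and the function name and toy data are assumptions.

```python
import pandas as pd


def group_rare_labels(s: pd.Series, tol: float = 0.01, replace_with: str = "Other") -> pd.Series:
    """Replace categories whose relative frequency is below `tol` with `replace_with`."""
    freq = s.value_counts(normalize=True)
    frequent = freq[freq >= tol].index
    # Keep frequent labels, replace everything else
    return s.where(s.isin(frequent), replace_with)


# 200 toy rows: two frequent categories and two that each appear once (0.5%)
categories = pd.Series(["News"] * 120 + ["Ausland"] * 78 + ["Sport", "Wetter"])
grouped = group_rare_labels(categories, tol=0.01)
print(grouped.value_counts())
```

In the pipeline, the imputation step runs before the encoder, so `Missing` is treated like any other category and stays separate because it is frequent.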

Save preprocessed data for train to csv¶

In [19]:
train.to_csv('../data/01_train_nosplit_preprocessed.csv', index=False)

Save Pipeline¶

In [20]:
## Save Pipeline
joblib.dump(topic_pipe, 'topic_pipe_nosplit.joblib') 
Out[20]:
['topic_pipe_nosplit.joblib']
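A saved pipeline can later be restored with `joblib.load` and applied to new data. The sketch below shows the round trip with a toy pipeline and a temporary file, both of which are assumptions for illustration. For `topic_pipe` itself, the custom classes from the `preprocessors` module must be importable in the session that loads the file, because unpickling looks the classes up by module path.

```python
import os
import tempfile

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy pipeline standing in for topic_pipe (a single identity step)
toy_pipe = Pipeline([("identity", FunctionTransformer())])
toy_pipe.fit([[1.0], [2.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipe.joblib")
    joblib.dump(toy_pipe, path)

    # Later, for example in a prediction notebook
    restored = joblib.load(path)

# The restored pipeline is used like the original one
result = restored.transform([[3.0]])
print(result)
```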