The dataset provided here was scraped from various RSS feeds between June 2022 and September 2023 and serves as the foundation for a data science and machine learning project. The project covers exploratory data analysis, topic modeling, and practice with fundamental techniques.
The dataset is available in two formats: CSV text files and a PostgreSQL database. It includes the following columns: id, date, title, description, author, category, copyright, url, text, and source.
The pipeline is designed to perform the following tasks: loading data from a PostgreSQL database, basic data preprocessing, natural language processing, and feature engineering. In this process, we do not split the data into a training and test set because we are conducting unsupervised learning, and no target labels are provided.
# data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# loading data from postgresql database
import sqlalchemy as sql
from datetime import datetime
# saving the pipeline
import joblib
# from scikit-learn
from sklearn.pipeline import Pipeline
# from feature-engine
from feature_engine.imputation import CategoricalImputer, AddMissingIndicator, DropMissingData
from feature_engine.encoding import RareLabelEncoder
from feature_engine.selection import DropFeatures
# from preprocessors
from preprocessors import preprocessors as pp
The dataset contains entries from June 2022 onward. The model will be trained and tested with data from 01.06.2022 to 30.09.2023. Data from 01.10.2023 onward will be treated as new data and used only for prediction.
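The date window described above can be sketched in pandas as a boolean mask. This is only an illustration on a made-up four-row frame; the names `TRAIN_START`, `TRAIN_END`, and `demo` are assumptions and not part of the project code, which applies the same window in SQL.

```python
import pandas as pd

# Window boundaries taken from the text: 01.06.2022 to 30.09.2023 (inclusive).
TRAIN_START = pd.Timestamp("2022-06-01 00:00:00")
TRAIN_END = pd.Timestamp("2023-09-30 23:59:59")

# Made-up rows that sit just outside and just inside the window.
demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-05-31 23:59:59",
        "2022-06-01 00:00:00",
        "2023-09-30 23:59:59",
        "2023-10-01 00:00:00",
    ]),
    "title": ["too early", "first", "last", "new"],
})

# Series.between includes both boundaries by default.
in_window = demo["date"].between(TRAIN_START, TRAIN_END)
train_demo = demo[in_window]

# Everything after the window is treated as new data for prediction only.
new_demo = demo[demo["date"] > TRAIN_END]

print(train_demo["title"].tolist())
print(new_demo["title"].tolist())
```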
# connect to db
engine = sql.create_engine('postgresql+psycopg2://news:news@localhost:5432/news')
con = engine.connect()
start_date = datetime(2022, 6, 1, 0, 0, 0)
end_date = datetime(2023, 9, 30, 23, 59, 59)
with con:
    # query data for model training and testing
    query = sql.text("""
        SELECT *
        FROM headlines
        WHERE (date >= :start_date
        AND date <= :end_date)
        ORDER BY date ASC
    """)
    # pass bind parameters as a dict; the keyword-argument form was removed in SQLAlchemy 2.0
    result = con.execute(query, {'start_date': start_date, 'end_date': end_date})
    train = pd.DataFrame(result.fetchall(), columns=result.keys())
train.head()
| | id | date | title | description | author | category | copyright | url | text | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 71650 | 2022-06-01 00:13:42 | Preise: Grüne halten Senkung der Spritsteuer f... | Heute tritt die Steuersenkung auf Kraftstoffe ... | None | Steuersenkung, Bundestag, Katharina Dröge, Spr... | None | https://www.stern.de/politik/deutschland/preis... | None | stern |
| 1 | 71649 | 2022-06-01 01:55:03 | Biden warnt Putin: USA liefern moderne Raketen... | Die USA rüsten die Ukraine mit fortschrittlich... | None | Ukraine, USA, Joe Biden, Russland, Raketensyst... | None | https://www.stern.de/politik/ausland/biden-war... | None | stern |
| 2 | 71648 | 2022-06-01 02:04:08 | Soziale Medien: FDP-Politiker Kuhle: Internet-... | Eine «ZDF Magazin Royale»-Recherche beschäftig... | None | Konstantin Kuhle, FDP, Straftat, Berlin, ZDF, ... | None | https://www.stern.de/politik/deutschland/sozia... | None | stern |
| 3 | 71675 | 2022-06-01 02:26:58 | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Rund zwei von drei Mädchen und Jungen in der U... | None | None | None | https://www.tagesschau.de/newsticker/liveblog-... | None | Tagesschau |
| 4 | 71647 | 2022-06-01 02:31:43 | Finanzen: Dänemark stimmt über EU-Verteidigung... | Vorbehalt verteidigen oder Verteidigung ohne V... | None | Dänemark, EU, Volksabstimmung, Finanzen, Ukrai... | None | https://www.stern.de/politik/ausland/finanzen-... | None | stern |
print(train.shape)
(75461, 10)
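The parametrized date-range query can be tried without a PostgreSQL server. The sketch below uses an in-memory SQLite database as a stand-in and loads the result directly with `pandas.read_sql_query`. The three-column `headlines` table and its rows are made up for illustration, and dates are stored as ISO strings so that string comparison matches chronological order.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the PostgreSQL instance (assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE headlines (id INTEGER, date TEXT, title TEXT)")
con.executemany(
    "INSERT INTO headlines VALUES (?, ?, ?)",
    [
        (1, "2022-05-31 23:00:00", "before"),
        (2, "2022-06-01 00:13:42", "inside"),
        (3, "2023-10-01 08:00:00", "after"),
    ],
)

# Same shape as the notebook query: named placeholders, inclusive bounds.
query = """
    SELECT *
    FROM headlines
    WHERE date >= :start_date
      AND date <= :end_date
    ORDER BY date ASC
"""
params = {
    "start_date": "2022-06-01 00:00:00",
    "end_date": "2023-09-30 23:59:59",
}

# read_sql_query builds the DataFrame and parses the date column in one step.
demo = pd.read_sql_query(query, con, params=params, parse_dates=["date"])
con.close()

print(demo)
```

Binding the dates as parameters rather than formatting them into the SQL string leaves quoting and type handling to the database driver.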
# replace None with NaN
train = train.fillna(value=np.nan)
train.head()
| | id | date | title | description | author | category | copyright | url | text | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 71650 | 2022-06-01 00:13:42 | Preise: Grüne halten Senkung der Spritsteuer f... | Heute tritt die Steuersenkung auf Kraftstoffe ... | NaN | Steuersenkung, Bundestag, Katharina Dröge, Spr... | NaN | https://www.stern.de/politik/deutschland/preis... | NaN | stern |
| 1 | 71649 | 2022-06-01 01:55:03 | Biden warnt Putin: USA liefern moderne Raketen... | Die USA rüsten die Ukraine mit fortschrittlich... | NaN | Ukraine, USA, Joe Biden, Russland, Raketensyst... | NaN | https://www.stern.de/politik/ausland/biden-war... | NaN | stern |
| 2 | 71648 | 2022-06-01 02:04:08 | Soziale Medien: FDP-Politiker Kuhle: Internet-... | Eine «ZDF Magazin Royale»-Recherche beschäftig... | NaN | Konstantin Kuhle, FDP, Straftat, Berlin, ZDF, ... | NaN | https://www.stern.de/politik/deutschland/sozia... | NaN | stern |
| 3 | 71675 | 2022-06-01 02:26:58 | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Rund zwei von drei Mädchen und Jungen in der U... | NaN | NaN | NaN | https://www.tagesschau.de/newsticker/liveblog-... | NaN | Tagesschau |
| 4 | 71647 | 2022-06-01 02:31:43 | Finanzen: Dänemark stimmt über EU-Verteidigung... | Vorbehalt verteidigen oder Verteidigung ohne V... | NaN | Dänemark, EU, Volksabstimmung, Finanzen, Ukrai... | NaN | https://www.stern.de/politik/ausland/finanzen-... | NaN | stern |
train.to_csv('../data/00_train_no_split_raw.csv')
# variables with duplicates
VARS_WITH_DUPLICATES = ['title', 'description']
# features to drop
DROP_FEATURES = ['id', 'copyright', 'author', 'url']
# variables with NA in train set that will be filled with 'Missing' value
VARS_WITH_NA_MISSING = ['source', 'category']
# variables with frequent values in train set
VARS_WITH_FREQUENT = ['category']
# variables to be combined
VARS_TO_COMBINE = ('title_description_text', ['title', 'description', 'text'])
# features used for topic modelling (one model is trained per feature)
NLP_FEATURES = ['title', 'title_description_text']
print(len(train.index))
75461
# Count the number of duplicate rows based on the specified subset of columns
duplicate_count = train.duplicated(subset=VARS_WITH_DUPLICATES).sum()
print("Number of duplicate rows:", duplicate_count)
Number of duplicate rows: 7947
train = pp.drop_duplicates(train, VARS_WITH_DUPLICATES)
print(train.duplicated(subset=VARS_WITH_DUPLICATES).sum())
print(len(train.index))
0
67514
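The counts are consistent: 75461 rows minus 7947 duplicates leaves 67514. The implementation of `pp.drop_duplicates` is not shown here, so the following is a minimal sketch of what such a helper might do, assuming it keeps the first occurrence of each duplicated `title`/`description` pair. The function name `drop_duplicate_rows` and the demo frame are hypothetical.

```python
import pandas as pd


def drop_duplicate_rows(df, subset):
    """Keep the first row of each group sharing the same values in `subset`."""
    return df.drop_duplicates(subset=subset, keep="first")


# Made-up frame: rows 0 and 1 share title and description, row 3 differs in description.
demo = pd.DataFrame({
    "title": ["A", "A", "B", "A"],
    "description": ["x", "x", "y", "z"],
    "source": ["s1", "s2", "s1", "s1"],
})

subset = ["title", "description"]

# duplicated() flags every repeat after the first occurrence.
n_duplicates = int(demo.duplicated(subset=subset).sum())
deduped = drop_duplicate_rows(demo, subset)

print(n_duplicates, len(demo), len(deduped))
```

Because only `title` and `description` are compared, the same headline published by two sources counts as a duplicate even when the other columns differ.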
# set up the Pipeline
topic_pipe = Pipeline([
    # ===== CREATION OF NEW FEATURES =====
    ('concat_features_encoder', pp.ConcatStringFeatureEncoder(
        new_col='title_description_text', ref_col_list=['title', 'description', 'text'])),
    # ===== IMPUTATION =====
    # impute 'title' with values from description and text
    ('missing_title_imputation', pp.MultiReferenceImputer(col='title', ref_col_list=['description', 'text'])),
    # impute text variables with string missing
    ('missing_imputation', CategoricalImputer(imputation_method='missing', variables=VARS_WITH_NA_MISSING)),
    # ===== DROPPING observations with NA =====
    ('drop_missing_title', DropMissingData(variables=['title'])),
    # ===== DROPPING features =====
    ('drop_features', DropFeatures(features_to_drop=DROP_FEATURES)),
    # ===== ENCODING =====
    # encode rare labels
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=VARS_WITH_FREQUENT, replace_with='Other')),
    # ===== CREATION OF NEW FEATURES =====
    ('nlp_feature_encoder', pp.NLPFeatureEncoder(col_list=NLP_FEATURES))
])
# train the pipeline
topic_pipe.fit(train)
Pipeline(steps=[('concat_features_encoder',
                 ConcatStringFeatureEncoder(new_col='title_description_text',
                                            ref_col_list=['title',
                                                          'description',
                                                          'text'])),
                ('missing_title_imputation',
                 MultiReferenceImputer(col='title',
                                       ref_col_list=['description', 'text'])),
                ('missing_imputation',
                 CategoricalImputer(variables=['source', 'category'])),
                ('drop_missing_title', DropMissingData(variables=['title'])),
                ('drop_features',
                 DropFeatures(features_to_drop=['id', 'copyright', 'author',
                                                'url'])),
                ('rare_label_encoder',
                 RareLabelEncoder(n_categories=1, replace_with='Other',
                                  tol=0.01, variables=['category'])),
                ('nlp_feature_encoder',
                 NLPFeatureEncoder(col_list=['title',
                                             'title_description_text']))])
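The `preprocessors` module behind `pp` is not included here. As a rough sketch of how two of its steps could be written as scikit-learn compatible transformers, the classes below concatenate string columns and fill a column from the first available reference column. The class names `ConcatStringColumns` and `FirstAvailableImputer`, the space separator, and the fill order are assumptions, not the project's actual implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ConcatStringColumns(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.ConcatStringFeatureEncoder."""

    def __init__(self, new_col, ref_col_list):
        self.new_col = new_col
        self.ref_col_list = ref_col_list

    def fit(self, X, y=None):
        # Nothing is learned from the data.
        return self

    def transform(self, X):
        X = X.copy()
        # Join the non-missing parts of each row with a single space.
        X[self.new_col] = X[self.ref_col_list].apply(
            lambda row: " ".join(str(v) for v in row if pd.notna(v)), axis=1
        )
        return X


class FirstAvailableImputer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.MultiReferenceImputer."""

    def __init__(self, col, ref_col_list):
        self.col = col
        self.ref_col_list = ref_col_list

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # Fill remaining gaps from each reference column in the given order.
        for ref in self.ref_col_list:
            X[self.col] = X[self.col].fillna(X[ref])
        return X


demo = pd.DataFrame({
    "title": ["A", None],
    "description": ["b", "c"],
    "text": [None, None],
})

demo_pipe = Pipeline([
    ("concat", ConcatStringColumns("title_description_text", ["title", "description", "text"])),
    ("impute_title", FirstAvailableImputer("title", ["description", "text"])),
])
out = demo_pipe.fit_transform(demo)
print(out)
```

Copying the frame inside `transform` keeps the caller's data unchanged, which makes the steps safe to rerun in a notebook.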
train = topic_pipe.transform(train)
train.head()
| | date | title | description | category | text | source | title_description_text | title_cleaned | title_description_text_cleaned |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-06-01 00:13:42 | Preise: Grüne halten Senkung der Spritsteuer f... | Heute tritt die Steuersenkung auf Kraftstoffe ... | Other | NaN | stern | Preise: Grüne halten Senkung der Spritsteuer f... | Preis grüne halten Senkung Spritsteuer falsch ... | Preis grüne halten Senkung Spritsteuer falsch ... |
| 1 | 2022-06-01 01:55:03 | Biden warnt Putin: USA liefern moderne Raketen... | Die USA rüsten die Ukraine mit fortschrittlich... | Other | NaN | stern | Biden warnt Putin: USA liefern moderne Raketen... | Biden warnen Putin USA liefern modern Raketens... | Biden warnen Putin USA liefern modern Raketens... |
| 2 | 2022-06-01 02:04:08 | Soziale Medien: FDP-Politiker Kuhle: Internet-... | Eine «ZDF Magazin Royale»-Recherche beschäftig... | Other | NaN | stern | Soziale Medien: FDP-Politiker Kuhle: Internet-... | sozial Medium FDP-Politiker Kuhle Internet-Str... | sozial Medium FDP-Politiker Kuhle Internet-Str... |
| 3 | 2022-06-01 02:26:58 | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Rund zwei von drei Mädchen und Jungen in der U... | Missing | NaN | Tagesschau | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Liveblog ukrainisch Kind vertreiben | Liveblog ukrainisch Kind vertreiben rund Mädch... |
| 4 | 2022-06-01 02:31:43 | Finanzen: Dänemark stimmt über EU-Verteidigung... | Vorbehalt verteidigen oder Verteidigung ohne V... | Other | NaN | stern | Finanzen: Dänemark stimmt über EU-Verteidigung... | Finanz Dänemark stimmen EU-Verteidigungsvorbehalt | Finanz Dänemark stimmen EU-Verteidigungsvorbeh... |
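The `*_cleaned` columns look like the result of punctuation removal, stop word removal, and lemmatization. The code of `pp.NLPFeatureEncoder` is not shown, so the sketch below covers only a simplified part of that idea: tokenizing with a regular expression and dropping stop words, with no lemmatization. The tiny stop word set, the token pattern, and the example sentence are made up for illustration.

```python
import re

import pandas as pd

# Tiny illustrative stop word set (assumption); a real list would be much longer.
GERMAN_STOP_WORDS = {"der", "die", "das", "ein", "eine", "und", "oder", "von", "für"}

# Runs of letters, optionally joined by hyphens, so compounds such as
# "FDP-Politiker" stay in one piece; digits and punctuation are dropped.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def clean_text(text):
    """Remove punctuation and stop words; this sketch does not lemmatize."""
    tokens = TOKEN_PATTERN.findall(text)
    kept = [token for token in tokens if token.lower() not in GERMAN_STOP_WORDS]
    return " ".join(kept)


titles = pd.Series(["Die Regierung plant eine Reform der Steuer: ++ Liveblog ++"])
titles_cleaned = titles.apply(clean_text)

print(titles_cleaned.iloc[0])
```

Reducing words to their base form, as in "warnt" becoming "warnen" in the table above, would additionally require a German language model, which this sketch leaves out.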
train[train['title'].isnull()]
| | date | title | description | category | text | source | title_description_text | title_cleaned | title_description_text_cleaned |
|---|---|---|---|---|---|---|---|---|---|
train.isnull().sum()
date                                  0
title                                 0
description                        3834
category                              0
text                              67513
source                                0
title_description_text                0
title_cleaned                         0
title_description_text_cleaned        0
dtype: int64
train['category'].value_counts()
category
Other            27545
Missing          25696
News              4023
Ausland           3727
Deutschland       3342
Ukraine-Krise     1244
Wirtschaft        1122
Politik            814
Name: count, dtype: int64
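The `Other` bucket comes from the rare-label step, which was configured with `tol=0.01`. The idea can be sketched in plain pandas: labels whose relative frequency falls below the tolerance are replaced by a single label. The helper `group_rare_labels` and the demo data are hypothetical, and the exact boundary handling inside feature-engine may differ from the `>=` used here.

```python
import pandas as pd


def group_rare_labels(series, tol=0.01, replace_with="Other"):
    """Replace labels with a relative frequency below `tol` by `replace_with`."""
    freq = series.value_counts(normalize=True)
    frequent = freq[freq >= tol].index
    # Keep values that are frequent enough, replace everything else.
    return series.where(series.isin(frequent), replace_with)


# Made-up category column: "Sport" covers only 0.5% of the 1000 rows.
demo = pd.Series(["News"] * 600 + ["Ausland"] * 395 + ["Sport"] * 5)
grouped = group_rare_labels(demo, tol=0.01)

print(grouped.value_counts())
```

In the pipeline, missing categories are filled with `Missing` before this step, so `Missing` is treated as a regular label and survives because it is frequent.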
train.to_csv('../data/01_train_nosplit_preprocessed.csv', index=False)
## Save Pipeline
joblib.dump(topic_pipe, 'topic_pipe_nosplit.joblib')
['topic_pipe_nosplit.joblib']
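A saved pipeline can later be restored with `joblib.load` and applied to new data. The round trip is sketched here with a small stand-in pipeline and a temporary file, since the real `topic_pipe` depends on the custom `preprocessors` module; the names `demo_pipe` and `demo_pipe.joblib` are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline; the real one contains the custom preprocessing steps.
demo_pipe = Pipeline([("scale", StandardScaler())])
X_fit = np.array([[1.0], [2.0], [3.0]])
demo_pipe.fit(X_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipe.joblib")
    joblib.dump(demo_pipe, path)
    # Restore the fitted pipeline from disk.
    restored = joblib.load(path)

# The restored pipeline should transform new data exactly like the original.
X_new = np.array([[4.0]])
same_output = np.allclose(demo_pipe.transform(X_new), restored.transform(X_new))
print(same_output)
```

joblib relies on pickle, which stores custom classes by reference, so the module that defines them (here `preprocessors`) must be importable wherever the file is loaded.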