Topic Modeling Of News With Latent Dirichlet Allocation


This project employs Latent Dirichlet Allocation (LDA) to perform topic modeling on a corpus of news headlines sourced from German RSS feeds. The study covers data scraping, preprocessing, feature engineering, LDA model training, evaluation, and application to new data. The primary objective is to uncover latent topics within the news headlines, offering insights into the underlying themes of the German news landscape. The result is a reusable framework for ongoing headline classification, relevant to data-driven journalism and information retrieval.

The project is implemented in Python, using scikit-learn, spaCy, pandas, NumPy, wordcloud, and pyLDAvis. You can find the code on my project’s GitHub page.

1. Input dataset

The dataset provided here was scraped from the RSS feeds of seven German news magazines: Focus, Zeit, Tagesschau, Stern, Welt, taz, and ZDF heute. This data collection began in June 2022 and is continually updated. The training corpus includes all entries up to September 30, 2023, with any newer entries being considered as new data for evaluation. The dataset is available in two formats: CSV text files and a PostgreSQL database.

The following information is stored:

  • id: a unique identifier for each entry in the database
  • date: content from the RSS pubDate tag
  • title: content from the RSS title tag
  • description: content from the RSS description tag
  • author: content from the RSS author tag
  • category: content from the RSS category tag
  • copyright: content from the RSS copyright tag
  • url: the URL of the RSS feed
  • text: a legacy field for storing additional textual content
  • source: the name of the news magazine

The dataset contains many missing values; in particular, the author, category, copyright, and text fields are often empty.
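To get a quick overview of this sparsity, the missing values can be counted per column. A minimal sketch, assuming the CSV export is stored in a hypothetical file named news_rss.csv:

```python
import pandas as pd

# Hypothetical file name -- adjust to the actual CSV export of the dataset.
df = pd.read_csv("news_rss.csv", parse_dates=["date"])

# Count missing values per column to see which fields are sparse.
print(df.isna().sum().sort_values(ascending=False))
```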

2. Preprocessing pipeline

My approach to deriving topics from features of the training corpus is twofold:

  1. Using only the content from the title.
  2. Using the content from the title, description, and text.

These inputs are used to train separate Latent Dirichlet Allocation (LDA) models in order to determine which one is more effective at clustering the corpus into meaningful topics. None of the other dataset features is relevant for this purpose, except for category, which can be used for later comparison and evaluation.
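As a minimal sketch of this feature engineering, assuming the DataFrame df loaded above, the two input variants can be built like this (missing values are replaced with empty strings):

```python
# Replace missing values in the relevant text columns with empty strings.
text_cols = ["title", "description", "text"]
df[text_cols] = df[text_cols].fillna("")

# Variant 1: the title only.
feature_title = df["title"]

# Variant 2: title, description, and text combined into one string.
feature_full = (df["title"] + " " + df["description"] + " " + df["text"]).str.strip()
```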

To perform LDA on the corpus, the first step is to clean and tokenize the textual input data.

The preprocessing pipeline is designed to execute all the necessary steps for data cleaning, feature engineering, and text preprocessing to create two different input features, each represented as a set of tokenized terms. To score new data, the same preprocessing steps must be applied to the input data. I have developed a pipeline that can be saved and loaded for this purpose.

Preprocessing pipeline for input data
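The full implementation is in the repository; the following is only a minimal sketch of the idea, assuming the German spaCy model de_core_news_sm and a hypothetical transformer class GermanTextPreprocessor. It lowercases, lemmatizes, and drops stop words and non-alphabetic tokens, and it can be persisted with joblib so that new data is preprocessed identically:

```python
import joblib
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

class GermanTextPreprocessor(BaseEstimator, TransformerMixin):
    """Lowercases, lemmatizes, and removes stop words and non-alphabetic tokens."""

    def __init__(self, model="de_core_news_sm"):
        self.model = model

    def fit(self, X, y=None):
        # Load the German spaCy model; parser and NER are not needed here.
        self.nlp_ = spacy.load(self.model, disable=["parser", "ner"])
        return self

    def transform(self, X):
        # Return one string of space-joined lemmas per document.
        return [
            " ".join(
                tok.lemma_.lower()
                for tok in doc
                if tok.is_alpha and not tok.is_stop
            )
            for doc in self.nlp_.pipe(X)
        ]

# Fit on the title feature, persist, and later reload to score new data.
prep = GermanTextPreprocessor().fit(feature_title)
tokens_title = prep.transform(feature_title)
joblib.dump(prep, "preprocessing_pipeline.joblib")
```

The transformer returns space-joined lemmas rather than raw token sets, so the vectorizer in the next step can consume its output directly.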

3. Model training

Latent Dirichlet Allocation requires two main inputs: the Document-Term Matrix (DTM) of the corpus, which holds the frequency of each term across the collection of documents, and the number of topics to extract. In the initial step, the DTM was therefore generated from the preprocessed title as well as from the preprocessed combination of title, description, and text. For each of the two resulting DTMs, I trained models assuming n = 4, 6, 7, and 8 topics to compare the results. In total, this yields 8 different LDA models, each clustering the corpus into n topics defined by their most frequent words. When the input is transformed with a model, each entry is assigned a topic number. The interpretation of these topics and the assignment of meaningful names must be done by a human in a manual step.

Training of LDA model with 7 topics
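Again only a minimal sketch, assuming scikit-learn and the preprocessed titles tokens_title from the pipeline above; all LDA hyperparameters except the topic count are left at their defaults:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Build the Document-Term Matrix from the preprocessed titles.
vectorizer = CountVectorizer(max_df=0.95, min_df=2)
dtm = vectorizer.fit_transform(tokens_title)

# Train one LDA model per assumed topic count.
models = {}
for n_topics in (4, 6, 7, 8):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    models[n_topics] = lda.fit(dtm)

# Show the ten most frequent words per topic for the 7-topic model.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(models[7].components_):
    top_terms = (terms[i] for i in weights.argsort()[::-1][:10])
    print(f"Topic {topic_idx}:", ", ".join(top_terms))

# Assign each headline the topic with the highest probability.
topic_numbers = models[7].transform(dtm).argmax(axis=1)
```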

4. Discussing the results

The work on this section is still in progress, so please visit again later.

5. Scoring of new data

The work on this section is still in progress, so please visit again later. Meanwhile, try the Demo App to see the process in action.
