Introduction to Topic Modeling in Python

03:30 PM - 04:25 PM on August 16, 2015, Room 704

Christine Doig

Audience level:
novice
Watch:
http://youtu.be/kzmf6mOCm5M

Description

Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of text. This talk will introduce topic modeling and one of its most widely used algorithms, LDA (Latent Dirichlet Allocation). Attendees will learn how to use Python to analyze the content of their text documents. The talk will walk through the full topic modeling pipeline: from different ways of tokenizing your documents, to using the Python library gensim, to visualizing your results and understanding how to evaluate them.

Abstract

Research papers, newspaper articles, blogs, website content: text data is everywhere, and it frequently comes without labels that could help us classify it and extract meaningful understanding of its content. To address this problem, an unsupervised Machine Learning method called Topic Modeling was developed to analyze large collections of unlabeled text documents. Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of text.

Topic Modeling encompasses a variety of algorithms: LSI, LDA, the Hierarchical Dirichlet Process, Dynamic Topic Models, and more. This talk will introduce some simple ideas for understanding the theory behind LDA (Latent Dirichlet Allocation) but, most importantly, will cover which Python libraries exist for Topic Modeling and how to use them. Gensim (https://radimrehurek.com/gensim/) is one of the most popular Python libraries for Topic Modeling and the main library this talk will present, although other alternatives will also be discussed.
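To give a feel for what LDA does under the hood, here is a minimal, purely illustrative collapsed Gibbs sampler written in plain Python. This is not gensim's implementation or the talk's actual code; the function name `toy_lda` and its parameters are invented for this sketch, and a real model would need many more iterations and documents.

```python
import random
from collections import defaultdict

def toy_lda(docs, k=2, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on lists of tokens (illustrative only)."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    ndk = [[0] * k for _ in docs]               # document-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # total tokens per topic
    z = []                                      # topic assignment per token
    # Random initialization of every token's topic
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            ndk[d][t] += 1
            nkw[t][w] += 1
            nk[t] += 1
        z.append(zd)
    # Gibbs sweeps: resample each token's topic from its full conditional
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw
```

In practice you would hand this job to gensim's `LdaModel` rather than rolling your own sampler, but the counts above are exactly the sufficient statistics that implementation-grade samplers maintain.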

Preprocessing your data before feeding it to an LDA algorithm is also an important task for a Data Scientist doing Topic Modeling. We will introduce some NLP (Natural Language Processing) concepts to understand different ways of tokenizing our documents (lemmatization, stemming, entity recognition) and how the resulting tokens can represent our text as a bag of words.
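A minimal sketch of that preprocessing step, using only the standard library: tokenize, drop stopwords, crudely stem, and count. The `naive_stem` function and the tiny stopword list are made up for illustration; in a real pipeline you would use a proper stemmer or lemmatizer (e.g. from NLTK or spaCy) and a full stopword list.

```python
import re
from collections import Counter

# Deliberately tiny stopword list for the sketch
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def tokenize(text):
    """Lowercase, split on non-letter characters, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer like Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def bag_of_words(text):
    """Represent a document as token counts: the input format LDA expects."""
    return Counter(naive_stem(t) for t in tokenize(text))
```

Note how stemming collapses "models" and "modeling" into a single token: that vocabulary reduction is exactly why these steps matter before fitting a topic model.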

It is also important to understand how to interpret your model and how to evaluate your results. To better interpret your model, we'll make use of interactive visualizations. Evaluation, on the other hand, is harder for unsupervised models than for supervised ones, since there are no ground-truth categories to compare against. Several techniques exist in the literature to measure the goodness of a topic model; we'll present some of them in two categories: human-in-the-loop and metric-driven evaluations.
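As one example of a metric-driven evaluation, here is a simplified pure-Python sketch of UMass topic coherence, which scores a topic's top words by how often they co-occur in documents. This is an illustrative stand-in, not gensim's implementation (gensim ships a `CoherenceModel` class for this), and it assumes `topic_words` is ordered from most to least probable.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """Simplified UMass coherence: sum over word pairs (wi, wj), i < j, of
    log((D(wi, wj) + 1) / D(wj)), where D counts documents containing the word(s)."""
    doc_sets = [set(d) for d in docs]
    def D(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        score += math.log((D(wi, wj) + 1) / (D(wj) or 1))  # guard against D(wj) == 0
    return score
```

Higher is better: a topic whose words actually appear together in documents scores above one whose words never co-occur, which is what makes coherence a useful automatic sanity check alongside human inspection.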

In summary, this talk will consist of the following sections: Topic Modeling and its role in Machine Learning, the LDA algorithm and its implementations in Python libraries, preprocessing of text data and tokenization methods, and interpretation and evaluation of results. Finally, a full end-to-end example will be demonstrated and analyzed.