Topic Modeling in Google Colab

Chapter 14
  • Aleksandra Bordunos 
    Author
Topic modeling plays an important role among automated text analysis methods. It helps structure large volumes of data and uncover hidden connections within a collected set of texts. This chapter examines four key stages of topic modeling using Google Colab: creating a document corpus, preprocessing data, choosing a model, and analyzing results.
/01

Introduction to Topic Modeling

Topic modeling is a method that automatically identifies topics and hidden relationships in a collected set of texts.
In this context, topics are groups of keywords and key phrases (n-grams). The researcher’s task is to assign a meaningful name to each group. In many cases, topic modeling implies soft clustering: the same word may belong to different topics, but with different probability weights.
Not only words but also texts receive a probabilistic distribution across topics. This is why beginner researchers especially value the method: the extracted topics speed up teamwork, since they help distribute texts evenly among team members for additional analysis and provide direction for interpretation and reporting.
Here are a few examples of how topic modeling is used:
  • Identify trends and emerging themes in letters, publications, patents, social media posts, and blogs;
  • Select target communities and experts based on the themes of their publications and posts;
  • Group video speeches and audio episodes into categories to make searching easier.

Ask a neural network…

  • Researcher
    sample prompt
    Continue this list of examples by asking a neural network about possible goals for applying topic modeling in your own research field.
  • Perplexity
    response fragment
    1. Crisis communication management: monitor emerging crises in real time from different sources, prepare communication plans for potential crises identified via topic modeling.
    2. Media monitoring: track how different media outlets report on diplomatic events and how they interpret them.
    3. Stakeholder analysis: identify key actors and their positions on particular diplomatic issues.
    4. Trend analysis: study trends over time to predict future diplomatic challenges or opportunities.
    5. Cultural aspects: understand cultural narratives that influence diplomatic relations between countries.

    Show full answer
/02

Connection to Text Mining

Topic modeling is part of a broader family of methods for intelligent analysis of unstructured texts known as text mining. Other methods in this family often include sentiment analysis and network analysis.
Sentiment analysis (also called opinion mining) is a method for identifying the emotional tone of a text. For example, it can be used to determine how a target audience feels about a phenomenon, service, or product.
Network analysis helps build a network model that shows key actors and links between them as a graph.
The term "text mining" refers to the metaphor of extracting valuable "resources" (knowledge). It belongs to the AI field of Natural Language Processing (NLP) and relies on machine learning approaches to automatically extract meaningful information from large text corpora.
The text-mining process includes four key stages, which, applied to topic modeling, can be summarized with the acronym TEMA.
Text mining stages (TEMA):
  • Texts form a document corpus used to discover topics
  • Erasing removes what is not needed (preprocessing)
  • Model choice depends on the task and data specifics
  • Analysis includes visualization and interpretation
The similarity of steps across different text-mining methods makes it easier to combine multiple approaches in one study; it does not, however, replace the value of deep reading of full texts when the sample size allows it.
/03

Four Stages of Topic Modeling

/ STAGE 1: SAMPLING

Building the sample for further analysis

A document corpus is the set of texts that forms the final sample.
A corpus may include:
  • texts scraped from social networks;
  • answers to open-ended survey questions;
  • transcripts of audio/video materials;
  • standalone documents such as articles;
  • chat exports from social networks saved in a uniform format (e.g., .txt) and organized in a folder.
Even if all posts are stored in a single .csv or .xls file, the unit of analysis may be each post in a separate cell. You can treat one file as one unit of analysis (k = 1) or split each text into segments, depending on the research goals. The content of the sample is determined by the research goal. For inspiration, see examples of goals from published articles (shown in the table below).

Goals and examples from published articles:

Goal: understanding public opinion about a studied issue
  1. Title: "Restoring reputation through digital diplomacy…"
     Aim: identify which narratives appeared at different stages of a crisis
     Sample: k = 12,935
  2. Title: "Analytical framework for evaluating digital diplomacy using network analysis and topic modeling: comparing South Korea and Japan"
     Aim: identify the consequences of violent protest tactics
     Sample: k = 9,659,770

Goal: identifying dominant themes in discussion
  1. Title: "Analytical framework for evaluating digital diplomacy using network analysis and topic modeling: comparing South Korea and Japan"
     Aim: identify fans'/tourists' attitudes toward Korea and …
     Sample: KOCIS posts (k = 16,962) and JNTO posts (k = 8,278)

Source: adjusted result of a Perplexity query
Note: In surveys, we use n=x for respondents. Here we analyze texts, so we use k=x, where x is the number of analyzed units. The research goal determines which data sources are most appropriate within the chosen context, following the 4 DATA principles.
4 DATA principles
  • Diversification
    To deepen analysis, expand observations across different sources, time periods, and demographic groups to capture multiple facets of the phenomenon.
  • Availability
    Data access is legally allowed and technically possible; in some cases written consent or ethics committee approval may be required.
  • Trustworthiness
    Synthetic data can be used for training, but for analysis and interpretation, human-generated data is preferred. Special attention is paid to source reliability.
  • Adequacy of a sample size
    There is no consensus on a minimum number of observations for topic modeling. Sufficiency can be justified by representativeness and access constraints.
Tips for building a corpus
The result of the first stage is a corpus of texts collected into a file or folder of a known format. For Python code, you must know the file location/path and extension. Also, all texts should be in one language (e.g., Russian only).
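As a minimal sketch of this first-stage result, the snippet below collects a folder of .txt files into a corpus where one file is one unit of analysis. The folder and file names here are hypothetical (a temporary directory is created just so the sketch runs on its own); in a real study you would point `Path(...)` at the folder holding your exported chats or documents.

```python
import tempfile
from pathlib import Path

# For the sketch, create a throwaway folder with two sample "posts".
# In practice, replace this with the path to your own corpus folder.
folder = Path(tempfile.mkdtemp())
(folder / "post_001.txt").write_text("First exported message", encoding="utf-8")
(folder / "post_002.txt").write_text("Second exported message", encoding="utf-8")

# The corpus-building step itself: one file = one unit of analysis,
# so k equals the number of files found.
corpus = {p.name: p.read_text(encoding="utf-8") for p in sorted(folder.glob("*.txt"))}
k = len(corpus)
print(f"k = {k} documents loaded")
```

Knowing the folder path and the uniform .txt extension up front, as recommended above, is exactly what makes this loop possible.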
/ STAGE 2: PREPROCESSING

Data preprocessing

Preprocessing solves a set of tasks summarized by the acronym CLeAN.
Preprocessing tasks (CLeAN):
  • Cleaning text: check for noise and remove punctuation, numbers, markup, random symbols, images, etc.; remove stop words (function words such as conjunctions, prepositions, pronouns)
  • Lemmatization or stemming: reduce words to their base form or root
  • Adjusting register: convert words to lowercase (even names and places)
  • N-grams: decide whether the unit of analysis can also be a combination of 2−3 words, not only a single word
  • Tokenization: split the text into small units, tokens
Each step requires decisions. For example, you decide what counts as "noise" in your corpus and which libraries to use. Stop-word lists can be loaded from libraries and then expanded manually (for instance, adding query keywords that appear in nearly every text).
You also need to choose whether to normalize words as lemmas or stems—depending on your research goals and data. Additionally, you may create n-grams (bigrams, trigrams) to better capture meaning, especially for stable expressions like "digital diplomacy."
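The decisions above can be sketched as a tiny pure-Python pass over one sentence. The stop-word list here is a hand-made placeholder (in practice you would load one from NLTK and extend it manually, as described above), and the lemmatization/stemming step is deliberately omitted since it needs an external resource.

```python
import re

# Illustrative stop-word list; in practice load one from a library and extend it.
STOP_WORDS = {"the", "a", "of", "and", "in", "on"}

def clean_text(raw: str) -> list[str]:
    """A minimal CLeAN pass: cleaning, case adjustment, tokenization, stop words."""
    text = re.sub(r"[^a-zA-Z\s]", " ", raw)   # Cleaning: strip punctuation, digits, markup
    tokens = text.lower().split()             # Adjust register + naive tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("The 2 pillars of Digital Diplomacy!"))
# → ['pillars', 'digital', 'diplomacy']
```

Each line corresponds to one CLeAN decision you would otherwise delegate to a library call.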
After preprocessing, texts are transformed into a machine-readable form.
Approaches to representing texts
  • Word frequency measure
    BoW (Bag of Words): counts how often words/phrases appear. Word order and document origin are largely ignored.
  • Word importance measure
    TF-IDF (Term Frequency-Inverse Document Frequency): identifies words frequent in one subset and rare elsewhere—useful for highlighting distinctive terms.
  • Embeddings
    Vector representations of words/phrases capturing semantic relationships (e.g., Word2Vec, GloVe).
These approaches are not mutually exclusive and often appear sequentially: BoW as a base step, TF-IDF for key term selection, embeddings for semantic depth.
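To make the BoW-to-TF-IDF contrast concrete, here is a small hand-computed sketch on a toy, made-up corpus (real code would use a library such as Gensim or Scikit-Learn; this only mirrors the arithmetic).

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (illustrative content).
docs = [["digital", "diplomacy", "crisis"],
        ["digital", "diplomacy", "tourism"],
        ["crisis", "communication"]]

# BoW: raw word counts per document; word order is ignored.
bow = [Counter(doc) for doc in docs]

# TF-IDF: term frequency weighted by how rare the term is across documents.
def tf_idf(term, doc_counts, all_docs):
    tf = doc_counts[term] / sum(doc_counts.values())
    df = sum(1 for d in all_docs if term in d)
    return tf * math.log(len(all_docs) / df)

score_common = tf_idf("digital", bow[0], docs)             # in 2 of 3 documents
score_distinctive = tf_idf("communication", bow[2], docs)  # in only 1 document
print(score_distinctive > score_common)  # distinctive terms get higher weight
```

This is exactly why TF-IDF is used for key-term selection on top of a BoW base: the widespread word "digital" is down-weighted, while the rarer "communication" stands out.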

Working in Google Colab

Preprocessing is performed in an Integrated Development Environment (IDE). Here we use Google Colab, a platform that provides free collaborative access.
Google Colab lets you run Python in the browser without installing software. Work happens in a Jupyter Notebook that can be shared with the whole research team for real-time collaboration. It offers powerful compute resources and supports a wide range of Python libraries, speeding up text-mining tasks.
  • important
    In the free version of Google Colab, data are not preserved permanently, so you must export results to your device or cloud storage.
A further advantage is integration with GitHub, supporting sharing, collaboration, transparency, and reproducibility. It also helps you find and adapt ready-made notebooks. Several popular GitHub resources are listed at the end of the chapter. Alternative platforms (example result from a neural network query):

Ask a neural network…

  • Researcher
    prompt example
Make a list of alternatives to Google Colab, for example, datacamp.com, VSCode, PyCharm (JetBrains). Rewrite it in a uniform format: the name of the instrument, the link to it, and a brief description.
  • Perplexity
    1. DataCamp: datacamp.com — an interactive learning platform for data analysis and ML.
    2. Kaggle Kernels: kaggle.com — cloud notebooks for Python with datasets and ML tools.
    3. Yandex DataSphere: yandex.cloud/ru/services/datasphere — cloud analytics and ML platform for large-scale data.
    4. Deepnote: deepnote.com — collaborative data science notebooks in the cloud.
    5. Paperspace: paperspace.com — cloud platform with GPU access for ML development.
    6. Amazon SageMaker: aws.amazon.com/sagemaker — Amazon’s platform for building/training/deploying ML models.
    7. Lightning Studio: lightning.ai/studio — ML platform from the PyTorch Lightning ecosystem.
    8. Visual Studio Code (VSCode): code.visualstudio.com — extensible code editor with team features.
    9. PyCharm (incl. Community Edition): jetbrains.com/pycharm/download/ — Python IDE with strong tooling.
    10. Anaconda: anaconda.com — scientific computing platform with environments and package management.
If you are working with Google Colab for the first time, start by reading the instructions for creating notebooks and learning the syntax. If you are already an experienced user, you probably know how to connect GPT and other plugins and activate magic commands; if not, ask a neural network about this functionality.
Work in Google Colab begins with installing the required libraries. To work with textual and tabular data, you can install the pandas library. It is also important to specify the language of the loaded dictionaries, for example, Russian.
For preliminary text processing, you will need the NLTK (Natural Language Toolkit) library, in particular the ability to add custom stopwords, remove noise, as well as perform lemmatization and tokenization (punkt). In addition to NLTK, there is a wide range of other libraries that can be combined within a single study — Gensim, SpaCy, Polyglot, Scikit-Learn, NetworkX. As you gain experience interacting with them and evaluating the results, you will also be able to adjust your preferences.
One of the advantages of Google Colab is that tasks can be performed non-linearly, placing them in different code cells. It is important to install libraries once at the very beginning, while creating n-grams and converting texts into a numerical format will likely need to be done several times.
After the first run, you may have doubts about the interpretability of the model or other indicators of its quality. For example, you may decide to expand the list of stopwords. Or later, at the stage of choosing the number of topics, you may not need to repeat text preprocessing. In this regard, it is recommended to write code for these tasks in separate blocks so as not to restart other processes.

Ask a neural network…

  • Researcher
    prompt example
Describe the advantages and limitations of using libraries for text mining in Google Colab, using NLTK, Gensim, SpaCy, Polyglot, Scikit-Learn, NetworkX as examples. Propose the most appropriate context of application for each example.
  • Perplexity
    response fragment
    1. NLTK (Natural Language Toolkit)
    Advantages:
    • A comprehensive set of tools for natural language processing: tokenization, lemmatization, sentiment analysis, etc.
    • Support for many languages and the availability of ready-made text corpora and dictionaries (e.g., WordNet).
    • Good documentation and many tutorials, making the library accessible for beginners.
    Limitations:
    • Lower performance compared to more modern libraries such as SpaCy.
    • Complexity of use for more advanced tasks without prior configuration.
    Context of application: NLTK is ideal for educational projects and initial text analysis, such as sentiment analysis of reviews or creating simple chatbots.

    Show full answer
To use bigrams, trigrams, and other n-grams, you can turn to the Phrases model and the corpora module from the Gensim library. In the code, you may encounter and adjust the following parameters:
  • min_count=5: means that n-grams that occur fewer than 5 times in the entire text corpus will be ignored.
  • threshold=10: the association threshold between words in n-grams. The higher this parameter, the more "connected" the words must be to form n-grams.
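The interplay of these two parameters can be imitated in plain Python on a toy token stream. The scorer below only mirrors the idea (Gensim's actual scoring function differs in its details), and the toy values MIN_COUNT=2, THRESHOLD=0.5 are chosen because the chapter's min_count=5, threshold=10 assume a real-sized corpus.

```python
from collections import Counter

# Toy token stream where "digital diplomacy" is a stable expression.
tokens = ["digital", "diplomacy", "soft", "power", "digital", "diplomacy",
          "digital", "diplomacy", "public", "opinion", "digital", "diplomacy",
          "digital", "diplomacy"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# Toy values; in the chapter's code, min_count=5 and threshold=10 are used.
MIN_COUNT, THRESHOLD = 2, 0.5
vocab_size = len(unigrams)

def score(pair):
    """Simplified association score: frequent pairs of not-too-frequent words win."""
    a, b = pair
    return (bigrams[pair] - MIN_COUNT) / (unigrams[a] * unigrams[b]) * vocab_size

accepted = [p for p in bigrams
            if bigrams[p] >= MIN_COUNT and score(p) >= THRESHOLD]
print(accepted)  # → [('digital', 'diplomacy')]
```

Raising THRESHOLD demands more strongly associated words; raising MIN_COUNT discards pairs that are simply too rare to judge.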
As an alternative, you can remove words that are too rare or too frequent:
dictionary.filter_extremes(no_below=5, no_above=0.5)
Using Corpora, an n-gram dictionary and a text corpus are created, and TF-IDF is calculated:
# Imports (Gensim and pandas are assumed to be installed in the notebook)
from gensim.models import Phrases
from gensim import corpora, models

# Creating bigrams
bigram_model = Phrases(data['cleaned_abstract'], min_count=5, threshold=10)
data['bigrams'] = data['cleaned_abstract'].apply(lambda tokens: bigram_model[tokens])

# Creating dictionary and corpus
dictionary = corpora.Dictionary(data['bigrams'])
corpus = [dictionary.doc2bow(text) for text in data['bigrams']]

# Calculating TF-IDF
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
The result of the second stage is a fully prepared text corpus for topic modeling. To write code in Python, it will be important to clarify: the environment for which the code needs to be written (for example, Google Colab), the programming language — Python, the specific list of tasks, and the libraries you have chosen to accomplish them.
/ STAGES 3 AND 4: MODELING AND ANALYSIS

Choosing an appropriate model

At the final stages, the best probabilistic topic model is selected. To make it easier to compare candidate models, the modeling and analysis stages are performed together.

LDA approach to topic modeling

The best-known approach to training such a model is called Latent Dirichlet Allocation (LDA). The LDA mechanism works as follows: all words are initially assigned random probability values (the seed can be fixed manually, for example random_state=42, which improves reproducibility). Then, with each pass over all texts, these values are refined. The number of passes can also be adjusted: for small corpora you can set up to about 100 passes, while with a lot of data fewer (5−10) are enough. The more passes, the longer the analysis takes. The number of topics is configured in a similar way.
Both the NLTK library and the LDA approach appeared back in the early 2000s, which explains their widespread adoption, including among researchers. However, the number of alternatives has grown over the years, expanding the range of possibilities and perspectives. Among other popular approaches is BERTopic, including its adaptation for the Russian language, ruBERT; these are used for longer texts. For short texts, approaches such as Non-negative Matrix Factorization (NMF), Structural Topic Modeling (STM), Top2Vec, and many others are used, some of which can be explored using the links in the appendix.
In the LDA approach, the number of topics is set manually. To make it easier to decide on the number of topics, you can compare the values of a number of parameters for each model, for example:
  • Perplexity
    A measure of the model’s ability to predict words in a document: the lower the value, the better
  • Log Likelihood
    A measure of how well the model fits the data: the higher the value, the better
  • Coherence Score
    A measure of the semantic coherence of topics: the higher the value, the better
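The first two criteria are tightly linked: perplexity is derived from the per-token log likelihood, so "higher log likelihood is better" and "lower perplexity is better" are two views of the same fit. A toy illustration, with made-up log-likelihood values for two hypothetical models:

```python
import math

# Hypothetical log-likelihood values for two candidate models on the same corpus.
token_count = 1000
log_likelihood_a = -5200.0   # better fit (higher value)
log_likelihood_b = -6900.0   # worse fit (lower value)

# Perplexity = exp(-log_likelihood / token_count), so a higher
# log likelihood always corresponds to a lower perplexity.
perplexity_a = math.exp(-log_likelihood_a / token_count)
perplexity_b = math.exp(-log_likelihood_b / token_count)
print(perplexity_a < perplexity_b)  # True: the better model has lower perplexity
```

Coherence, by contrast, is computed from word co-occurrence within topics and can disagree with both, which is why all three are worth plotting side by side.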
The list of model quality criteria is regularly expanding; you can learn about well-known alternatives from current research. To find the optimal number of topics, you can compute several quality measures for multiple models at once, specifying the initial and maximum number of topics as well as the step at which to report results:
limit=10, start=2, step=1
# Imports (Gensim, pandas, and matplotlib are assumed to be installed)
from gensim import models
from gensim.models import CoherenceModel
import pandas as pd
import matplotlib.pyplot as plt

# Function to compute coherence, perplexity, and log likelihood for LDA models
def compute_metrics(dictionary, corpus, texts, limit=10, start=2, step=1):
    coherence_values = []
    perplexity_values = []
    log_likelihood_values = []
    model_list = []
    num_topics_range = []
    for num_topics in range(start, limit, step):
        model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=150)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
        perplexity_values.append(-model.log_perplexity(corpus))
        log_likelihood_values.append(model.bound(corpus))
        num_topics_range.append(num_topics)
    return model_list, coherence_values, perplexity_values, log_likelihood_values, num_topics_range

# Parameters for selecting the number of topics
start = 2
limit = 6  # maximum number of topics you want to test
step = 1
if limit < start:
   print("Error: limit must be greater than or equal to start")
   exit()
if step <= 0:
   print("Error: step must be greater than 0")
   exit()

# Computing metrics for different values of num_topics
model_list, coherence_values, perplexity_values, log_likelihood_values, num_topics_range = compute_metrics(dictionary, corpus_tfidf, data['bigrams'], limit, start, step)

# Creating a table with results
metrics_df = pd.DataFrame({
   'Num Topics': num_topics_range,
   'Coherence': coherence_values,
   'Perplexity': perplexity_values,
   'Log Likelihood': log_likelihood_values
})
print(metrics_df)

# Displaying results
plt.figure(figsize=(12, 6))
plt.subplot(1, 3, 1)
plt.plot(num_topics_range, coherence_values, label="Coherence")
plt.xlabel("Number of topics")
plt.ylabel("Coherence")
plt.legend(loc='best')
plt.subplot(1, 3, 2)
plt.plot(num_topics_range, perplexity_values, label="Perplexity")
plt.xlabel("Number of topics")
plt.ylabel("Perplexity")
plt.legend(loc='best')
plt.subplot(1, 3, 3)
plt.plot(num_topics_range, log_likelihood_values, label="Log Likelihood")
plt.xlabel("Number of topics")
plt.ylabel("Log Likelihood")
plt.legend(loc='best')
plt.tight_layout()
plt.show()

# Outputting optimal numbers of topics according to all metrics
# (lower perplexity is better, so min() is used for that metric)
optimal_num_topics_coherence = num_topics_range[coherence_values.index(max(coherence_values))]
optimal_num_topics_perplexity = num_topics_range[perplexity_values.index(min(perplexity_values))]
optimal_num_topics_log_likelihood = num_topics_range[log_likelihood_values.index(max(log_likelihood_values))]
print(f'Optimal number of topics by coherence: {optimal_num_topics_coherence}')
print(f'Optimal number of topics by perplexity: {optimal_num_topics_perplexity}')
print(f'Optimal number of topics by log likelihood: {optimal_num_topics_log_likelihood}')

Synthetic data

Most of the tools discussed in this chapter are actively used to train large language models (LLMs). This requires very large datasets, which usually significantly exceed the sample compiled for a single study. To overcome this limitation, some researchers decide to first increase the number of texts through data augmentation. For these purposes, new texts are created based on existing ones, for example by using synonyms.
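A minimal sketch of synonym-based augmentation follows. The synonym table here is hypothetical, invented for the example; in practice it would come from a thesaurus resource such as WordNet, and a fixed random seed keeps the generated variants reproducible.

```python
import random

# Hypothetical synonym table; a thesaurus or WordNet would supply this in practice.
SYNONYMS = {"conflict": ["dispute", "confrontation"], "talks": ["negotiations"]}

def augment(tokens, rng):
    """Create a synthetic variant of a text by swapping words for random synonyms."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

rng = random.Random(42)  # fixed seed keeps the augmentation reproducible
original = ["peace", "talks", "after", "conflict"]
variant = augment(original, rng)
print(variant)
```

Each call produces a new text based on an existing one, which is exactly how the corpus is enlarged before training.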
When choosing the final model, researchers also rely on their own expertise: how logical the proposed distribution is and how well it corresponds to the studied context. If interpretation proves difficult, the calculations can be repeated even without changing the parameters; the results may already differ. This low model stability is another distinction of this approach from others, and it can be improved with synthetic data. Additional methods for improving model stability are covered in the supplementary materials at the end of the chapter.

Ask a neural network…

  • Researcher
    prompt example
    What are the advantages of data augmentation for topic modeling?
  • Perplexity
    response fragment
    1. Increasing data volume: augmentation makes it possible to artificially increase the size of the training set, which is especially important when the original data is limited. Creating new synthetic examples based on existing texts helps provide a broader dataset for training models, which in turn improves their ability to generalize and adapt to new data.
    2. Improving model robustness: models trained on augmented data become more resistant to noise and various distortions, making them more reliable in real-world conditions.
    3. Data diversity: augmentation creates many variations of the original data, allowing the model to see different realizations of the same object or phenomenon.

Model visualization

After each model, the results are visualized using topic distribution plots, word clouds, charts with model quality metrics, and heatmaps that show relationships between topics. For visualization in Google Colab, special libraries are used, for example pyLDAvis.
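Before any plotting library, the simplest "visualization" is just ranking the words inside each topic. The topic-word probabilities below are made-up stand-ins for what a fitted model would return; pyLDAvis renders the same information interactively.

```python
# Hypothetical topic-word probabilities, standing in for a fitted model's output.
topic_word_probs = {
    0: {"diplomacy": 0.31, "summit": 0.12, "visa": 0.04},
    1: {"tourism": 0.27, "festival": 0.15, "visa": 0.09},
}

def top_words(topic, n=2):
    """Return the n most probable words of a topic, in descending order."""
    ranked = sorted(topic_word_probs[topic].items(), key=lambda kv: -kv[1])
    return [word for word, _ in ranked[:n]]

for t in topic_word_probs:
    print(f"Topic {t}: {', '.join(top_words(t))}")
```

Note that "visa" appears in both topics with different weights, which is the soft clustering described at the start of the chapter.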
The result of the third and fourth stages is a model with the best interpretability, robustness, and quality indicators. Selecting the preferred model fixes the final set of keywords and phrases, for each of which the probability of belonging to each topic is calculated. Based on these results, all documents are also grouped into the identified clusters. Both the model and the dictionary can be saved to a local device if desired.
/04

Example study on training data

Goal: to identify key topics in the open Telegram channel of the Russian Ministry of Foreign Affairs.
/ STEP 1

Building the sample

To create the document corpus, the Telegram bot 2CSV was used. Due to the large number of messages in the selected channel, parsing took longer than usual and the file turned out to be so large that free online CSV-to-Excel converters could not be used. As a result, the code was created not for .xls but for .csv. The language is Russian, and instead of manual preprocessing, an additional line appeared in the code to skip cells with incomplete information.
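The "skip cells with incomplete information" step can be sketched with the standard csv module. The column names and rows below are hypothetical, imitating a fragment of such an export; `io.StringIO` stands in for opening the real .csv file.

```python
import csv
import io

# Hypothetical fragment of the exported .csv: one row is missing its text field.
raw = "date,text\n2024-01-05,Official statement\n2024-01-06,\n2024-01-07,Press briefing\n"

kept = []
for row in csv.DictReader(io.StringIO(raw)):
    if not row["text"]:   # skip cells with incomplete information
        continue
    kept.append(row["text"])
print(len(kept))  # → 2 complete messages survive
```

Reading row by row like this also sidesteps the converter problem: a .csv of any size streams through without being loaded into a spreadsheet.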
/ STEP 2

Data preprocessing

In subsequent cells, data visualization and result saving are provided. At this stage, it becomes clearer which additional stopwords can be added, what the topic names might be, and what the structure of the response to the research questions should be. In the case of teamwork, topics can be distributed among participants, and within their own task blocks, they can be correlated with the text content to clarify details and compose the narrative.
We recommend going through the entire path independently — finding an open Telegram channel that contains answers to your research questions, creating a document corpus, and conducting analysis based on the code proposed in the notebook.

Practicum
