A collective textbook-guide by Colaboratoria

Data visualization using Python

Chapter 13

Denis Vakarchuk
Author
Alexander Borovinsky
Author

Data visualization is a powerful tool for political scientists who study international relations. High-quality graphical presentation of data not only makes analytical results easier to communicate, but also helps reveal hidden patterns. This chapter covers three data visualization tools in the Python programming language.

1. Preparing for Data Visualization in Python
2. The plot() Function in pandas
3. The Matplotlib library
4. The Seaborn library

/01

Preparing for Data Visualization in Python

In previous chapters, the authors discussed the use of artificial intelligence methods for data analysis in detail. However, despite major progress in this area, AI still produces inaccuracies and sometimes outright errors. To understand analytical processes and results more deeply, researchers need the skills to analyze data independently.

In professional analytical work, it is important to present data in a way that is understandable even to people without deep subject-matter expertise. For that purpose, data visualization methods using the Python programming language are applied.

It is assumed that Anaconda is already installed on your computer. To begin, open Jupyter Notebook and run the following command:

import pandas as pd

Pandas is a specialized library for working with tabular data, offering a wide range of built-in functions. These functions are needed for data preprocessing and analysis.

Python’s capabilities will be demonstrated using a dataset on state voting in the UN General Assembly (UNGA) from 1946 to 2019. These data come from work by a research team led by Erik Voeten, a prominent scholar of voting behavior in international organizations.

The dataset includes three tables (dataframes):

First table (un_votes)

Contains the voting history of each country. Each row reflects a specific country’s participation in a vote on a particular issue.

Second table (un_roll_calls)

Contains information about each vote, including the date of the roll call, a description, and the resolution under discussion.

Third table (un_roll_call_issues)

Contains the categorization of each vote into one of six broad issue areas:
- "Colonialism"
- "Arms control and disarmament"
- "Economic development"
- "Human rights"
- "Palestinian conflict"
- "Nuclear weapons and nuclear material"

Data preprocessing

Below are several examples of data visualization. First, we need to preprocess the data:

# Load the data, selecting only the needed columns
un_roll_call_issues = pd.read_csv('un_roll_call_issues.csv', usecols=['rcid', 'issue']) 
un_roll_calls = pd.read_csv('un_roll_calls.csv', usecols=['rcid', 'session', 'date'])
un_votes = pd.read_csv('un_votes.csv', usecols=['rcid', 'country', 'vote'])

# Merge the dataframes in a single chained call
# First, un_roll_call_issues with un_roll_calls
# Then merge the result with un_votes
UN = un_roll_call_issues.merge(un_roll_calls, on='rcid') \
    .merge(un_votes, on='rcid')
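
The chained merge above can be tried on a toy example first. The sketch below uses three small invented dataframes (the names issues, roll_calls and votes and all values are assumptions for illustration, not the real UN data). It shows that merge() performs an inner join by default, so a roll call that has no issue label drops out of the result.

```python
import pandas as pd

# Toy stand-ins for the three tables (all values are invented for illustration)
issues = pd.DataFrame({"rcid": [1, 2], "issue": ["Human rights", "Colonialism"]})
roll_calls = pd.DataFrame({
    "rcid": [1, 2, 3],
    "session": [1, 1, 2],
    "date": ["1946-01-01", "1946-01-02", "1947-01-01"],
})
votes = pd.DataFrame({
    "rcid": [1, 1, 2, 3],
    "country": ["France", "India", "France", "India"],
    "vote": ["yes", "no", "abstain", "yes"],
})

# Same chaining pattern as in the chapter; how='inner' is the default,
# so only rcid values present in every table survive
merged = issues.merge(roll_calls, on="rcid").merge(votes, on="rcid")

print(merged.shape)                      # (3, 6)
print(sorted(merged["rcid"].unique()))   # [1, 2]
```

Roll call 3 has no row in the toy issues table, so its vote is absent from merged. If unlabeled roll calls should be kept, passing how='left' or how='outer' to merge() is one option.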

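As an example, we will select only the members of the Group of Twenty (hereafter G20). Note that the list contains 19 countries: the G20's remaining members, such as the European Union, are organizations rather than states and do not cast votes in the UNGA.

g20_members = [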
   "Argentina", "Australia", "Brazil", "Canada", "China",
   "France", "Germany", "India", "Indonesia", "Italy",
   "Japan", "Mexico", "Russia", "Saudi Arabia", "South Africa",
   "South Korea", "Turkey", "United Kingdom", "United States"
]
 
G20 = UN.query("country in @g20_members")
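
Filtering by name only works when the spelling in the list matches the spelling in the data exactly. The sketch below, on an invented mini-table (the names votes and members are assumptions for illustration), shows the boolean-mask equivalent of the query() call and a quick check for list entries that never occur in the data.

```python
import pandas as pd

# Hypothetical mini-table; in the chapter this role is played by the merged UN dataframe
votes = pd.DataFrame({
    "country": ["France", "India", "Russia", "Norway"],
    "vote": ["yes", "no", "abstain", "yes"],
})
members = ["France", "India", "Russia", "South Korea"]

# Boolean-mask equivalent of votes.query("country in @members")
subset = votes[votes["country"].isin(members)]

# Names from the list that never occur in the data (a possible spelling mismatch)
missing = sorted(set(members) - set(votes["country"]))
print(missing)  # ['South Korea']
```

An empty missing list suggests that every name in the list was matched at least once.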

First, let’s see which issue areas G20 countries voted on in UNGA from 1946 to 2019, and how often:

G20.groupby('issue').agg({'rcid':'nunique'}).reset_index()

Frequency of discussion of broad issue areas by G20 countries in the UN General Assembly
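
The aggregation uses nunique rather than a plain row count because every roll call appears once per voting country. A minimal sketch on invented data (the dataframe toy is an assumption for illustration) makes the difference visible:

```python
import pandas as pd

# Toy data: two roll calls, each voted on by three countries (invented values)
toy = pd.DataFrame({
    "rcid": [1, 1, 1, 2, 2, 2],
    "issue": ["Human rights"] * 6,
    "country": ["France", "India", "Japan"] * 2,
})

# 'count' counts rows (country-level votes); 'nunique' counts distinct roll calls
by_issue = toy.groupby("issue")["rcid"].agg(["count", "nunique"])
print(by_issue)
```

Here count returns 6 (individual votes) while nunique returns 2 (roll calls), which is the quantity the table above reports.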

At this stage, we can proceed to visualization. There are three basic tools for visualization in Python:

the plot() function in pandas

the matplotlib library

the seaborn library

This chapter covers how these three tools work.

/02

The plot() Function in pandas

First, we need to understand how information will be shown to the reader. Based on the table above, we can visualize the distribution of votes by issue area:

G20.groupby('issue').agg({'rcid':'nunique'}).reset_index().plot()

Distribution of agenda frequencies for G20 in UNGA (version 1)

Before drawing conclusions, it is worth considering how a first-time viewer perceives the figure. A key issue is that the plot does not clearly display the agenda topics because each agenda appears as a numeric index. To fix this, we specify the x-axis labels:

G20.groupby('issue').agg({'rcid':'nunique'}).reset_index().plot(
     x='issue',
     rot=45, figsize=(10,5)
)

Distribution of agenda frequencies for G20 in UNGA (version 2)

Now the numeric indices are replaced with issue-area labels. To prevent label overlap, we used figsize to increase the figure size and rot=45 to rotate the labels for readability. Next, we add axis labels:

G20.groupby('issue').agg({'rcid':'nunique'}).reset_index().plot(
    x='issue', 
    rot=45, figsize=(10,5), 
    xlabel='Broad issue area', 
    ylabel='Number of roll calls', 
    legend=False
)

Distribution of agenda frequencies for G20 in UNGA (version 3)

However, there is another problem that can cause serious misinterpretation. In its current form, the chart could be used to hide certain information. A quick look might suggest that G20 most often voted on "Arms control and disarmament," "Human rights," and the "Palestinian conflict," while "Economic development" was discussed much less. But this conclusion can be misleading because the y-axis does not start at zero. To fix this, we set the y-axis limits:

G20.groupby('issue').agg({'rcid':'nunique'}).reset_index().plot(
     x='issue', 
     rot=45, figsize=(10,5), 
     xlabel='Broad issue area',
     ylabel='Number of roll calls', 
     legend=False, 
     ylim=(0,1100)
)

Distribution of agenda frequencies for G20 in UNGA (version 4)
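
Hard-coding the upper limit (1100) works for this dataset but has to be updated whenever the data change. A possible alternative, sketched here on invented counts (the dataframe counts is an assumption for illustration), is to fix only the lower limit through the matplotlib Axes object that plot() returns:

```python
import pandas as pd

# Invented counts standing in for the aggregated G20 table
counts = pd.DataFrame({
    "issue": ["Colonialism", "Human rights", "Economic development"],
    "rcid": [900, 1000, 750],
})

# DataFrame.plot returns a matplotlib Axes object
ax = counts.plot(x="issue", legend=False, rot=45)

# Pin only the lower limit; the upper limit keeps the value matplotlib chose
ax.set_ylim(bottom=0)
print(ax.get_ylim()[0])  # 0.0
```

The same ax object can be used for further adjustments, such as ax.set_xlabel() or ax.set_ylabel().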

important

It is not enough to simply visualize data. You should first define the research goal and then choose the appropriate graph type.

A line chart is not suitable here: it is generally used for time series, and connecting unordered categories with a line suggests a trend that does not exist. A better solution is to switch to a bar chart using kind='bar':

G20.groupby('issue').agg({'rcid':'nunique'}).reset_index().plot(
    x='issue', 
    rot=45, figsize=(10,5), 
    xlabel='Broad issue area', 
    ylabel='Number of roll calls', 
    legend=False, kind='bar'
)

Distribution of agenda frequencies for G20 in UNGA (version 5)
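
A bar chart of categories is often easier to read when the bars are ordered by height. The sketch below, on invented counts (the dataframe counts is an assumption for illustration), sorts the table with sort_values() before plotting. This is one common design choice, not a requirement.

```python
import pandas as pd

# Invented counts standing in for the aggregated G20 table
counts = pd.DataFrame({
    "issue": ["Colonialism", "Human rights", "Economic development"],
    "rcid": [900, 1000, 750],
})

# Sort from most to least frequent, then draw the bars in that order
counts_sorted = counts.sort_values("rcid", ascending=False)
ax = counts_sorted.plot(
    x="issue", y="rcid", kind="bar",
    legend=False, rot=45,
    xlabel="Broad issue area", ylabel="Number of roll calls",
)

# Bar heights follow the sorted order
heights = [bar.get_height() for bar in ax.patches]
print(heights)
```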

/03

The Matplotlib library

Now let’s examine which G20 country most often abstains. We filter the data for abstentions and plot using matplotlib.

# Import matplotlib
import matplotlib.pyplot as plt

# Filter the data and store it in a new variable
G20_abstain = G20.query('vote == "abstain"')

# Count abstentions by country
abstain_counts = G20_abstain['country'].value_counts()

# Create a pie chart
plt.figure(figsize=(10, 8))
plt.pie(
    abstain_counts, 
    labels=abstain_counts.index,
    autopct='%1.1f%%', 
    startangle=140
)
plt.title('Share of countries by number of abstentions', pad=20)

# Keep the pie circular
plt.axis('equal')

# Show the chart
plt.show()

Distribution of countries by share of "abstain" votes

Note that pandas visualization allows you to chain plotting directly after data manipulation; under the hood, plot() draws with matplotlib. Calling matplotlib directly, by contrast, typically means passing it preprocessed data, but it is much more flexible and offers a wider range of customization options.

The resulting chart shows that six out of nineteen countries occupy more than half of the pie. In other words, abstentions are concentrated in a minority of G20 members, while many of the others abstain relatively rarely in UNGA voting.
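
The pie chart compares raw abstention counts, which also depend on how many votes a country took part in. A complementary view is the abstention rate, meaning the share of a country's own votes that were abstentions. The sketch below computes it on invented data (the dataframe toy and its values are assumptions for illustration):

```python
import pandas as pd

# Invented votes: France votes four times, India twice; each abstains once
toy = pd.DataFrame({
    "country": ["France"] * 4 + ["India"] * 2,
    "vote": ["abstain", "yes", "yes", "no", "abstain", "yes"],
})

# Raw counts of abstentions (what the pie chart is built from)
abstain_counts = toy.query('vote == "abstain"')["country"].value_counts()

# Share of each country's votes that were abstentions
abstain_rate = (
    toy["vote"].eq("abstain")
    .groupby(toy["country"])
    .mean()
    .sort_values(ascending=False)
)
print(abstain_rate)
```

In this toy example both countries abstain once, yet India's rate (0.5) is twice France's (0.25), so the two measures can rank countries differently.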

/04

The Seaborn library

Some visualization tools reveal patterns that are hard to see in large datasets. One of these is the scatterplot, which can show on which agendas, and in which periods, countries tended to abstain. To illustrate, we will visualize abstentions of the USSR/Russian Federation in UNGA with a scatterplot. While pandas can build a similar plot, achieving year-by-year point distribution with color coding by broad issue area would require relatively complex code. Seaborn is better suited for this task.

import seaborn as sns

Before plotting, we need preprocessing. In the original table, dates are strings. We convert them to datetime and extract the year:

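# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
G20_abstain = G20_abstain.copy()

G20_abstain['date'] = pd.to_datetime(G20_abstain['date'])
G20_abstain['year'] = G20_abstain['date'].dt.year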
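
The conversion above assumes that every date string can be parsed. A cautious variant, sketched on invented values (the dataframe raw is an assumption for illustration), passes errors="coerce" so that unparsable entries become missing values that can be counted before plotting:

```python
import pandas as pd

# Invented date strings; the last one is deliberately malformed
raw = pd.DataFrame({"date": ["1946-01-01", "1991-12-20", "unknown"]})

# errors="coerce" turns unparsable strings into NaT instead of raising an error
parsed = pd.to_datetime(raw["date"], errors="coerce")
n_bad = int(parsed.isna().sum())

# Years of the rows that parsed successfully
valid_years = parsed.dropna().dt.year.tolist()
print(n_bad, valid_years)  # 1 [1946, 1991]
```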

Now identify when the country abstained and which periods had minimal abstentions:

# Filter for the country of interest
Russia_abstain = G20_abstain.query('country == "Russia"')

# Group by year and agenda; count votes
votes_by_year_issue = Russia_abstain.groupby(['year', 'issue']).size()\
    .reset_index(name='count')

Then plot the scatterplot:

# Scatterplot
plt.figure(figsize=(12, 8))
sns.scatterplot(
    data=votes_by_year_issue, 
    x='year', y='count', 
    hue='issue', 
    palette='viridis', 
    s=100
)

# Formatting
plt.xlabel('Year', fontsize=14)
plt.ylabel('Number of votes', fontsize=14)
plt.legend(title='Agenda', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()

# Show the scatterplot
plt.show()

Distribution of USSR/Russia "abstain" votes by broad agendas in UNGA

To interpret this plot, we can focus on its peak, that is, the point with the highest count. Here, the peak corresponds to the early 1990s, when the country most often abstained on the "Palestinian conflict" agenda. The plot also indicates that after the collapse of the Soviet Union, disarmament and nuclear issues often led Russia to abstain in UNGA voting. At the same time, on "Economic development," Russia more often cast a substantive vote (Yes/No) rather than abstaining.
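
The peak can also be located in code rather than read off the plot. The sketch below uses an invented table with the same columns as votes_by_year_issue (the issue labels and counts are placeholders, not results from the UN data) and finds the highest point with idxmax(), both overall and per issue area.

```python
import pandas as pd

# Invented table shaped like votes_by_year_issue (placeholder labels and counts)
toy = pd.DataFrame({
    "year": [1990, 1991, 1991, 1992],
    "issue": ["Issue A", "Issue B", "Issue A", "Issue C"],
    "count": [3, 9, 4, 2],
})

# Row with the single highest count
peak = toy.loc[toy["count"].idxmax()]
print(peak["year"], peak["issue"], peak["count"])  # 1991 Issue B 9

# Row with the highest count within each issue area
peak_per_issue = toy.loc[toy.groupby("issue")["count"].idxmax()]
print(peak_per_issue)
```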

This chapter is not exhaustive. Many important aspects of visualization remain outside its scope, including advanced visualization methods and integration with BI platforms. Still, mastering these Python tools for graphical data representation provides a foundation for further professional growth in quantitative analysis of political processes.