- /
- Projects/
- Data and Twilight Imperium, Part 1: Introduction, Cleaning, and Graphing Game Completion/
Data and Twilight Imperium, Part 1: Introduction, Cleaning, and Graphing Game Completion
Table of Contents
I’ve been quite impressed with how the Twilight Imperium community handles their match data. No one would blame them for not keeping track of the torrents of data generated by such a complex game, but not only have they risen to the task, but the analysts in the community have managed to wrangle and present it in a compelling way that can inform the play-by-play decisions of avid players.
Taking a bit of inspiration from them, I thought I might try my hand at working with the data. My goals from this ongoing series are:
- take the same data and recreate its presentation on my own to the best of my ability.
- Discover new insights into the data on my own.
- Improve my facility with Jupyter Notebooks and Python’s Pandas and Matplotlib libraries.
What is Twilight Imperium? #
Twilight Imperium is an enormous board game of war, strategy, and diplomacy. Taking place in a colorful sci-fi setting, each player controls a unique faction and competes to be the first to reach ten (or sometimes fourteen) victory points. The game is famous for its complexity and length of play; the map of the galaxy is generated randomly, then players each choose one of twenty-five factions with completely unique abilities and play styles. A ‘quick’ game will only last four hours, but it’s not unusual for games to last ten hours or more, especially offline.
The game was first released in 1997 as a passion project of Christian T. Petersen, the founder and then-owner of Fantasy Flight Games (FFG). Although the game has always had a die-hard cult following since the very beginning, the game reached new heights of popularity with the release of its fourth edition (TI4) in August 2017, followed three years later with its first and so-far only expansion, Prophecy of Kings. A volunteer-created port of the game into Tabletop Simulator (TTS) made it easier to host tournaments and pickup games, as well as to help potential new players try out TI4 without making a daunting $100+ investment. Because of this, it’s no surprise that FFG themselves semi-officially condone the port, which they probably see as effective word-of-mouth advertising for a difficult product to sell. Podcasts and Youtube channels have sprung up out of the TI4 community, and FFG has been slowly but steadily expanding the product line with tie-in novels and a roll-and-write version of the game (think Yahtzee, but much, much more complicated).
The Data #
As a consequence of online TTS games, fans have been able to collect tons and tons of data from each match. Data analysts among them have collated these data into a dashboard with useful and captivating visuals that update with every game played.
Here’s a description of the data from the dashboard:
A snapshot of the board state is saved at the start of every round (and at the end of the game when the roundplusfinal column = ‘final’). In this table, each row represents one of these snapshots of one of the players. A 6 player game that goes for 5 rounds will have 36 rows total (one snapshot per round per player, plus another snapshot at the final end state of the game). The setupTimestamp field is used to differentiate between different matches.
I downloaded the data at exactly 3:31pm on July 5th, 2023. The resulting CSV file is 320.3 MB and contains over 286,290 rows and twenty-seven columns. The oldest data were added on February 28, 2021. Since one game can cover multiple rows and each game is uniquely identified with the setupTimestamp column, a .nunique() method shows that this dataset represents 9,082 games.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
df = pd.read_csv('TI4 Tabletop Simulator Stats Dashboard.csv')
print(df.info())
print(df.tail(10))
df.nunique()
/var/folders/8p/sbpv06dx3b576tlsdyc4cs9c0000gn/T/ipykernel_77367/363263826.py:4: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('TI4 Tabletop Simulator Stats Dashboard.csv')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286290 entries, 0 to 286289
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 286290 non-null object
1 setupTimestamp 286290 non-null float64
2 Timestamp 286290 non-null float64
3 mapString 65737 non-null object
4 isHomebrewMatch 286290 non-null bool
5 isPOK 286290 non-null bool
6 roundPlusFinal 286290 non-null object
7 color 286290 non-null object
8 active 286233 non-null object
9 score 286290 non-null int64
10 custodiansPoints 286239 non-null float64
11 Winning Faction 286290 non-null object
12 isPOK.1 286290 non-null bool
13 Player Count 286290 non-null int64
14 factionName 286171 non-null object
15 factionNameWinner 286290 non-null object
16 objectives 210235 non-null object
17 Public_Objectives_I 286260 non-null object
18 Public_Objectives_II 245634 non-null object
19 leaders 286290 non-null object
20 technologies 286290 non-null object
21 hexSummaryImageURLHyperlink 261850 non-null object
22 strategyCards 277754 non-null object
23 strategyCards2 45440 non-null object
24 isSpeaker 286290 non-null bool
25 factionsInGame 284465 non-null object
26 gameDuration 286290 non-null float64
dtypes: bool(4), float64(4), int64(2), object(17)
memory usage: 51.3+ MB
None
Date setupTimestamp Timestamp mapString
286280 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN \
286281 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
286282 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
286283 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
286284 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
286285 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
286286 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
286287 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
286288 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
286289 Feb 28, 2021, 12:11:30 AM 1.614471e+09 1.614471e+09 NaN
isHomebrewMatch isPOK roundPlusFinal color active score ...
286280 False True final Red true 8 ... \
286281 False True final Yellow false 7 ...
286282 False True 3 Red unknown 2 ...
286283 False True final Purple true 8 ...
286284 False True final Blue true 10 ...
286285 False True 3 Yellow unknown 3 ...
286286 False True 2 Blue unknown 1 ...
286287 False True 1 Yellow unknown 0 ...
286288 False True 1 Blue unknown 0 ...
286289 False True 1 White unknown 0 ...
Public_Objectives_I
286280 ['Engineer a Marvel', 'Populate the Outer Rim'... \
286281 ['Engineer a Marvel', 'Populate the Outer Rim'...
286282 ['Engineer a Marvel', 'Populate the Outer Rim'...
286283 ['Engineer a Marvel', 'Populate the Outer Rim'...
286284 ['Engineer a Marvel', 'Populate the Outer Rim'...
286285 ['Engineer a Marvel', 'Populate the Outer Rim'...
286286 ['Engineer a Marvel', 'Populate the Outer Rim'...
286287 ['Engineer a Marvel', 'Populate the Outer Rim'...
286288 ['Engineer a Marvel', 'Populate the Outer Rim'...
286289 ['Engineer a Marvel', 'Populate the Outer Rim'...
Public_Objectives_II leaders
286280 ['Found a Golden Age'] {"hero":"purged","commander":"unlocked"} \
286281 ['Found a Golden Age'] {"hero":"unlocked","commander":"unlocked"}
286282 ['Found a Golden Age'] {"hero":"locked","commander":"unlocked"}
286283 ['Found a Golden Age'] {"hero":"purged","commander":"unlocked"}
286284 ['Found a Golden Age'] {"hero":"locked","commander":"unlocked"}
286285 ['Found a Golden Age'] {"hero":"locked","commander":"locked"}
286286 ['Found a Golden Age'] {"hero":"locked","commander":"unlocked"}
286287 ['Found a Golden Age'] {"hero":"locked","commander":"locked"}
286288 ['Found a Golden Age'] {"hero":"locked","commander":"locked"}
286289 ['Found a Golden Age'] {"hero":"locked","commander":"locked"}
technologies
286280 ["Plasma Scoring","Prototype War Sun II","Dest... \
286281 ["Antimass Deflectors","Sarween Tools","Dark E...
286282 3
286283 ["Antimass Deflectors","Neural Motivator","Gra...
286284 ["Antimass Deflectors","Sarween Tools","Dark E...
286285 3
286286 3
286287 1
286288 2
286289 2
hexSummaryImageURLHyperlink strategyCards strategyCards2 isSpeaker
286280 NaN Warfare NaN False \
286281 NaN Politics NaN False
286282 NaN Technology NaN False
286283 NaN Diplomacy NaN False
286284 NaN Imperial NaN False
286285 NaN Leadership NaN False
286286 NaN Construction NaN False
286287 NaN Warfare NaN False
286288 NaN Trade NaN True
286289 NaN Construction NaN False
factionsInGame gameDuration
286280 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286281 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286282 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286283 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286284 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286285 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286286 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286287 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286288 Titans of Ul, Emirates of Hacan, Federation of... 7.5
286289 Titans of Ul, Emirates of Hacan, Federation of... 7.5
[10 rows x 27 columns]
Date 9082
setupTimestamp 9084
Timestamp 9084
mapString 1204
isHomebrewMatch 2
isPOK 2
roundPlusFinal 13
color 12
active 3
score 16
custodiansPoints 8
Winning Faction 94
isPOK.1 2
Player Count 7
factionName 128
factionNameWinner 94
objectives 104076
Public_Objectives_I 9171
Public_Objectives_II 1718
leaders 38
technologies 44489
hexSummaryImageURLHyperlink 46272
strategyCards 19
strategyCards2 16
isSpeaker 2
factionsInGame 16992
gameDuration 2592
dtype: int64
Cleaning #
As one might expect from such copious amounts of data from such a complex game, there are more null values than I think most data analysts would be comfortable with. This might reflect an issue with how data is collected from TTS games or even player knowledge of how to properly send it. For now, though, the only column that we should be careful with is the mapString
data, which describe board states at each round for every game.
As we can see, column 21 (hexSummaryImageURLHyperlink
) has mixed data types. Normally, this would need to be corrected, but considering that this data stores a hyperlink to help display board states, it might be best to ignore it for now. Of more concern is that the date column holds strings instead of the datetime
datatype. Let’s fix that:
df['Date'].head
<bound method NDFrame.head of 0 Jul 1, 2023, 3:09:34 PM
1 Jul 1, 2023, 3:09:34 PM
2 Jul 1, 2023, 3:09:34 PM
3 Jul 1, 2023, 3:09:34 PM
4 Jul 1, 2023, 3:09:34 PM
...
286285 Feb 28, 2021, 12:11:30 AM
286286 Feb 28, 2021, 12:11:30 AM
286287 Feb 28, 2021, 12:11:30 AM
286288 Feb 28, 2021, 12:11:30 AM
286289 Feb 28, 2021, 12:11:30 AM
Name: Date, Length: 286290, dtype: object>
df['DateFixed'] = pd.to_datetime(df['Date'], format='%b %d, %Y, %I:%M:%S %p')
df['DateFixed'].head
<bound method NDFrame.head of 0 2023-07-01 15:09:34
1 2023-07-01 15:09:34
2 2023-07-01 15:09:34
3 2023-07-01 15:09:34
4 2023-07-01 15:09:34
...
286285 2021-02-28 00:11:30
286286 2021-02-28 00:11:30
286287 2021-02-28 00:11:30
286288 2021-02-28 00:11:30
286289 2021-02-28 00:11:30
Name: DateFixed, Length: 286290, dtype: datetime64[ns]>
print(df['DateFixed'].info())
<class 'pandas.core.series.Series'>
RangeIndex: 286290 entries, 0 to 286289
Series name: DateFixed
Non-Null Count Dtype
-------------- -----
286290 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 2.2 MB
None
Analyzing and Visualizing the Data #
Let’s try to recreate one of the first charts on the dashboard: Match Completion Rates. The site describes the data thusly:
This is the percentage of submitted matches that are considered complete. A game is considered incomplete if noone has reached the VP target (e.g. 10 or 14 VP), and can occur if players leave the match without updating the score tracker to reflect the final scores. Placing a command token on the final notch of the score tracker will trigger a stats update.
It’s no surprise that a long game like Twilight Imperium would have less than a 70% completion rate. However, as mentioned in the description above, some of the low completion rates might be inaccurate simply because players don’t know how to properly use the TTS mod to report data.
First, since each observation is represents one player per round, we need to aggregate each game by the unique setupTimestamp
series. We’ll need to add a Boolean series to determine whether or not someone reached the final score of 10 or 14. The simplest way to do this would be to use the Winning Faction
series.
# Creating a new data frame from columns from the original dataframe.
completionRate = df[['DateFixed','Winning Faction','setupTimestamp']].copy()
# Adding a column with Boolean data types to indicate whether the game is complete or not.
completionRate['Complete'] = completionRate['Winning Faction'] != 'Incomplete Match'
# Aggregating the data by setupTimestamp
completionRate1 = completionRate.groupby('setupTimestamp', as_index=False).first()
print(completionRate1.info())
print(completionRate1.head())
print(completionRate1.tail())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9084 entries, 0 to 9083
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 setupTimestamp 9084 non-null float64
1 DateFixed 9084 non-null datetime64[ns]
2 Winning Faction 9084 non-null object
3 Complete 9084 non-null bool
dtypes: bool(1), datetime64[ns](1), float64(1), object(1)
memory usage: 221.9+ KB
None
setupTimestamp DateFixed Winning Faction Complete
0 1.614471e+09 2021-02-28 00:11:30 Emirates of Hacan True
1 1.614476e+09 2021-02-28 01:31:56 Incomplete Match False
2 1.614493e+09 2021-02-28 06:16:36 Incomplete Match False
3 1.614499e+09 2021-02-28 07:50:19 Titans of Ul True
4 1.614499e+09 2021-02-28 07:57:52 Incomplete Match False
setupTimestamp DateFixed Winning Faction Complete
9079 1.688170e+09 2023-06-30 23:58:38 Incomplete Match False
9080 1.688184e+09 2023-07-01 04:00:50 Incomplete Match False
9081 1.688216e+09 2023-07-01 12:50:10 Incomplete Match False
9082 1.688223e+09 2023-07-01 14:50:35 Incomplete Match False
9083 1.688224e+09 2023-07-01 15:09:34 Incomplete Match False
The data frame has been reduced to 9,084 rows, which matches the current number of games in the database. Seeing as there are no null values, I think this data is ready for analysis.
Let’s go back to the original chart from the dashboard:
The chart’s x-axis represents time, the left y-axis represents completed game count, and the right y-axis represents the completion rate itself. It also seems that the x-axis is delimited by month. Keeping that in mind, let’s try to recreate that visualization with Pandas and Matplotlib:
# Grouping the results by month.
grouper = pd.Grouper(key='DateFixed', freq='M')
CompleteCount = completionRate1.groupby(grouper)['Complete'].sum().reset_index()
CompleteRate = completionRate1.groupby(grouper)['Complete'].apply(lambda x: np.sum(x)/len(x))
# Setting up the axes, labels and colors.
fig, ax1 = plt.subplots(facecolor='#2d2f38')
ax1.plot(CompleteCount['DateFixed'], CompleteCount['Complete'], color='#0d67d1')
ax1.set_yticks([0,100,200,300,400])
ax1.set_yticklabels([0,100,200,300,400], color='#e2e2e4')
ax2 = ax1.twinx()
ax2.plot(CompleteCount['DateFixed'], CompleteRate, color='#0aaec2')
ax2.set_yticks([0.0,0.20,0.4,0.6,0.8,1.0])
ax2.set_yticklabels(['0%','20%','40%','60%','80%','100%'], color='#e2e2e4')
ax1.set_facecolor('#2d2f38')
ax1.set_xticks(['2021-02','2021-05','2021-08','2021-11','2022-02','2022-05','2022-08','2022-11','2023-02','2023-05'])
ax1.set_xticklabels(['Feb 2021','May 2021','Aug 2021','Nov 2021','Feb 2022','May 2022','Aug 2022','Nov 2022','Feb 2022','May 2022'], color='#e2e2e4')
for tick in ax1.xaxis.get_major_ticks()[1::2]:
tick.set_pad(15)
# Setting up the legend
legend1 = mlines.Line2D([], [], color='#0d67d1', label='Complete Game Count')
legend2 = mlines.Line2D([], [], color='#0aaec2', label='Complete Game %')
ax1.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower left', fontsize = 'small', facecolor = '#2d2f38', labelcolor = '#e2e2e4', edgecolor = '#2d2f38', ncols=2, handles=[legend1,legend2])
# Show Plot
plt.show()
Conclusion #
There’s a few things missing from the chart to make it complete, notably the annotations for each point plotted. I’m satisfied for now, though. It was really neat digging into the documentation of Matplotlib to get everything working just right. I also think that this is a chart that can be used to make decisions or at least guide future inquiry. For example, I’d be interested to see if the declines in complete game percentage track with influxes of new players, who typically aren’t aware of how to correctly report their data. I look forward to working with this dataset in the future and uncovering more insights into this intriguing game.