My fun attempt to predict the 2022 FIFA World Cup
View the Project on GitHub morales-felix/Qatar-2022-FIFA-World-Cup-simulation
Want to explore the Jupyter notebook? Click here to clone the repository and run it locally. Or, click the link below my picture.
I’m a soccer fan, so this project was both fun and a step towards a future personal project. The real motivation, though, was a World Cup bracket challenge with my coworkers. Was I the only one who built a simulation for it? Did I win? Read on 😉
This simulation uses Elo ratings from https://eloratings.net to measure team strength and update it after each simulated game. The Elo implementation is based on FiveThirtyEight’s NFL forecasting game (https://github.com/morales-felix/nfl-elo-game).
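If Elo is new to you, the core idea fits in a few lines. This is a minimal sketch rather than the code in `src/world_cup_simulator`: it uses the standard Elo expected-score formula, and the update constant `k=20` and the function names are assumptions picked only for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of team A against team B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=20.0):
    """Return both new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is a hypothetical update constant, not the value used in this project.
    """
    shift = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + shift, rating_b - shift


# Pre-tournament ratings from the roster used in this post
argentina, saudi_arabia = 2143.0, 1635.0

print(f"Expected score for Argentina: {expected_score(argentina, saudi_arabia):.3f}")

# An upset (Argentina loses) moves rating points to the underdog
new_arg, new_sau = update_ratings(argentina, saudi_arabia, score_a=0.0)
print(f"After the upset: Argentina {new_arg:.1f}, Saudi Arabia {new_sau:.1f}")
```

The update is zero-sum: whatever one team gains, the other loses, and a surprising result moves more points than an expected one.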
import numpy as np
import pandas as pd
import csv
from tqdm import tqdm
from joblib import Parallel, delayed
from src.world_cup_simulator import *
Since I want to simulate the group stage many times to generate a distribution of outcomes, I will use the joblib module to parallelize the work so that it finishes in a reasonable amount of time. joblib dispatches calls to a function, so the group-stage simulation has to be wrapped in a function that returns its results.
def run_group_stage_simulation(n, j):
    """
    Run n simulations of the group stage of the World Cup and return
    the roster with each team's average points for batch j
    """
    teams_pd = pd.read_csv("data/roster.csv")
    for i in range(n):
        games = read_games("data/matches.csv")
        # Start every simulation from the original ratings and zero points
        teams = {}
        with open("data/roster.csv") as roster_file:
            for row in csv.DictReader(roster_file):
                teams[row['team']] = {
                    'name': row['team'],
                    'rating': float(row['rating']),
                    'points': 0
                }
        simulate_group_stage(
            games,
            teams,
            ternary=True
        )
        # Store the points from this simulation in their own column
        collector = []
        for key in teams.keys():
            collector.append(
                {"team": key,
                 f"simulation{i+1}": teams[key]['points']}
            )
        temp = pd.DataFrame(collector)
        teams_pd = pd.merge(teams_pd, temp)
    # Average the n simulation columns, then drop them
    sim_cols = [
        a for a in teams_pd.columns if "simulation" in a]
    teams_pd[
        f"avg_pts_{j+1}"
    ] = teams_pd[sim_cols].mean(axis=1)
    not_sim = [
        b for b in teams_pd.columns if "simulation" not in b]
    simulation_result = teams_pd[not_sim]
    return simulation_result
The gist is to read from two files: one defining the match schedule, the other listing the teams and their relative strengths (given by Elo ratings prior to the start of the event).
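To make the two inputs concrete, here is a small sketch that reads both from in-memory strings instead of `data/roster.csv` and `data/matches.csv`. The roster columns (`group`, `team`, `rating`) are the ones used in this post; the match columns and the `load_teams` / `load_games` helpers are assumptions, since `read_games` lives in `src/world_cup_simulator` and is not shown here.

```python
import csv
import io

# Stand-ins for the two data files. The match columns are an assumption.
ROSTER_CSV = """group,team,rating
A,Netherlands,2040
A,Ecuador,1833
A,Senegal,1687
A,Qatar,1680
"""

MATCHES_CSV = """home_team,away_team
Qatar,Ecuador
Senegal,Netherlands
"""


def load_teams(handle):
    """Build the team dictionary keyed by team name, with points starting at 0."""
    return {
        row["team"]: {"name": row["team"], "rating": float(row["rating"]), "points": 0}
        for row in csv.DictReader(handle)
    }


def load_games(handle):
    """Hypothetical stand-in for read_games: one dict per scheduled match."""
    return list(csv.DictReader(handle))


teams = load_teams(io.StringIO(ROSTER_CSV))
games = load_games(io.StringIO(MATCHES_CSV))

print(teams["Netherlands"])
print(games[0])
```

Keeping teams in a dictionary keyed by name makes it cheap to look up and mutate a rating or point total while looping over the schedule.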
# Reads in the matches and teams as dictionaries and proceeds with that data type
n = 100 # How many simulations to run
m = 100 # How many simulation results to collect
roster_pd = Parallel(n_jobs=5)(
    delayed(run_group_stage_simulation)(n, j) for j in tqdm(range(m)))
# Merge the m batch results into one table, one avg_pts column per batch
roster = roster_pd[0]
for t in tqdm(range(1, m)):
    roster = pd.merge(roster, roster_pd[t])
sim_cols = [i for i in roster.columns if "avg_pts" in i]
roster['avg_sim_pts'] = roster[sim_cols].mean(axis=1)
roster['99%CI_low'] = roster[sim_cols] \
    .quantile(q=0.005, axis=1)
roster['99%CI_high'] = roster[sim_cols] \
    .quantile(q=0.995, axis=1)
not_sim = [
    j for j in roster.columns if "avg_pts" not in j]
# Groups in alphabetical order, teams within a group by descending average points
roster[not_sim].sort_values(
    by=[
        'group',
        'avg_sim_pts'
    ],
    ascending=[True, False]
)
group | team | rating | avg_sim_pts | 99%CI_low | 99%CI_high |
---|---|---|---|---|---|
A | Netherlands | 2040 | 6.1407 | 5.71475 | 6.51505 |
A | Ecuador | 1833 | 3.7859 | 3.36475 | 4.29565 |
A | Senegal | 1687 | 3.2100 | 2.65980 | 3.65030 |
A | Qatar | 1680 | 2.6475 | 2.24495 | 3.01000 |
B | England | 1920 | 5.2230 | 4.65495 | 5.70555 |
B | Wales | 1790 | 3.7854 | 3.30000 | 4.20000 |
B | Iran | 1797 | 3.4933 | 3.08475 | 4.11515 |
B | United States | 1798 | 3.4533 | 2.93475 | 3.85515 |
C | Argentina | 2143 | 6.8043 | 6.49495 | 7.11525 |
C | Poland | 1814 | 4.5490 | 3.96960 | 4.98515 |
C | Mexico | 1809 | 2.9092 | 2.50990 | 3.40505 |
C | Saudi Arabia | 1635 | 1.7660 | 1.49000 | 2.12515 |
D | France | 2005 | 5.3226 | 4.83840 | 5.58515 |
D | Denmark | 1971 | 4.9913 | 4.63485 | 5.42505 |
D | Australia | 1719 | 2.7047 | 2.37990 | 3.16080 |
D | Tunisia | 1707 | 2.7007 | 2.20375 | 3.12565 |
E | Spain | 2048 | 5.6514 | 5.20355 | 5.99030 |
E | Germany | 1936 | 4.2981 | 3.92465 | 4.88505 |
E | Japan | 1787 | 4.0986 | 2.81990 | 3.64515 |
E | Costa Rica | 1743 | 1.5796 | 2.14970 | 2.84565 |
F | Belgium | 2007 | 6.1354 | 5.60495 | 6.47515 |
F | Croatia | 1927 | 4.5970 | 4.16465 | 5.04515 |
F | Morocco | 1766 | 2.7974 | 2.41970 | 3.27040 |
F | Canada | 1776 | 2.5637 | 2.14485 | 3.02010 |
G | Brazil | 2169 | 6.1151 | 5.73445 | 6.52030 |
G | Switzerland | 1902 | 4.2062 | 3.69000 | 4.61535 |
G | Serbia | 1898 | 2.9597 | 2.56495 | 3.30020 |
G | Cameroon | 1610 | 2.3326 | 2.05970 | 2.52010 |
H | Portugal | 2006 | 5.8483 | 5.43495 | 6.32020 |
H | Uruguay | 1936 | 4.2981 | 3.81435 | 4.74545 |
H | South Korea | 1786 | 4.0986 | 3.66465 | 4.49030 |
H | Ghana | 1567 | 1.5796 | 1.37990 | 1.84515 |
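The last three columns of the table come from averaging within batches and then looking at the spread across batches. The sketch below reproduces that idea with the standard library only. The toy `simulate_group_points` function and its fixed win/draw probabilities are assumptions standing in for the Elo-based `simulate_group_stage`, and `statistics.quantiles` is used in place of the pandas `quantile` call.

```python
import random
import statistics

rng = random.Random(2022)  # fixed seed so the sketch is reproducible


def simulate_group_points(p_win=0.5, p_draw=0.25):
    """Toy stand-in for one simulated group stage: points from three games.

    The fixed win/draw probabilities are illustrative, not Elo-derived.
    """
    points = 0
    for _ in range(3):
        u = rng.random()
        if u < p_win:
            points += 3
        elif u < p_win + p_draw:
            points += 1
    return points


n = 100  # simulations averaged within one batch
m = 100  # number of batches

batch_means = [
    statistics.mean(simulate_group_points() for _ in range(n)) for _ in range(m)
]

avg_sim_pts = statistics.mean(batch_means)
# 199 cut points in 0.5% steps: the first is the 0.5th percentile, the last the 99.5th
cuts = statistics.quantiles(batch_means, n=200, method="inclusive")
ci_low, ci_high = cuts[0], cuts[-1]

print(f"avg_sim_pts={avg_sim_pts:.4f}, 99% interval=({ci_low:.4f}, {ci_high:.4f})")
```

Note that the interval describes how much the batch averages vary, not how many points a team might score in a single tournament, so it narrows as `n` grows.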
You can see that:
Here is where the fun begins.
# Now, doing the Monte Carlo simulations
n = 10000
playoff_results_teams = []
playoff_results_stage = []
for i in tqdm(range(n)):
    overall_result_teams = dict()
    overall_result_stage = dict()
    games = read_games("data/playoff_matches.csv")
    # Start every simulation from the same ratings
    teams = {}
    with open("data/playoff_roster.csv") as roster_file:
        for row in csv.DictReader(roster_file):
            teams[row['team']] = {
                'name': row['team'],
                'rating': float(row['rating'])
            }
    simulate_playoffs(games, teams, ternary=True)
    playoff_pd = pd.DataFrame(games)
    # This is for collecting results of simulations per team
    for key in teams.keys():
        overall_result_teams[key] = collect_playoff_results(
            key,
            playoff_pd
        )
    playoff_results_teams.append(overall_result_teams)
    # Now, collecting results from stage-perspective
    overall_result_stage['whole_bracket'] = playoff_pd['advances'].to_list()
    overall_result_stage['Quarterfinals'] = playoff_pd.loc[playoff_pd['stage'] == 'eigths_finals', 'advances'].to_list()
    overall_result_stage['Semifinals'] = playoff_pd.loc[playoff_pd['stage'] == 'quarterfinals', 'advances'].to_list()
    overall_result_stage['Final'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'advances'].to_list()
    overall_result_stage['third_place_match'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'loses'].to_list()
    overall_result_stage['fourth_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'loses'].to_list()[0]
    overall_result_stage['third_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'advances'].to_list()[0]
    overall_result_stage['second_place'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'loses'].to_list()[0]
    overall_result_stage['Champion'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'advances'].to_list()[0]
    # Matchups for rows 8 to 15 of the bracket, which depend on earlier results
    for match_id in range(8, 16):
        overall_result_stage[f'match{match_id}'] = list(
            playoff_pd.loc[match_id, ['home_team', 'away_team']])
    playoff_results_stage.append(overall_result_stage)
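`simulate_playoffs` is defined in `src/world_cup_simulator` and is not shown in this post. To give a rough idea of what a knockout simulation can look like, here is a simplified sketch under stated assumptions: each match is decided by a single random draw against the standard Elo win probability, ratings are not updated between rounds, and whatever `ternary=True` controls is left out. The function names and the four-team bracket are made up for illustration.

```python
import random


def win_probability(rating_a, rating_b):
    """Standard Elo expected score, used here as A's chance to advance."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def play_round(teams, ratings, rng):
    """Pair teams in bracket order (0 vs 1, 2 vs 3, ...) and return the winners."""
    winners = []
    for home, away in zip(teams[0::2], teams[1::2]):
        p_home = win_probability(ratings[home], ratings[away])
        winners.append(home if rng.random() < p_home else away)
    return winners


def simulate_bracket(teams, ratings, rng):
    """Play rounds until one team is left; the team count must be a power of two."""
    remaining = list(teams)
    while len(remaining) > 1:
        remaining = play_round(remaining, ratings, rng)
    return remaining[0]


# A four-team slice of the bracket, using the pre-tournament ratings from this post
ratings = {"Netherlands": 2040, "United States": 1798, "Argentina": 2143, "Australia": 1719}
bracket = ["Netherlands", "United States", "Argentina", "Australia"]

rng = random.Random(0)
titles = {team: 0 for team in bracket}
for _ in range(10_000):
    titles[simulate_bracket(bracket, ratings, rng)] += 1

print(titles)
```

Because each round only needs the list of survivors from the previous one, the same loop works for 4, 8, or 16 teams as long as the list is in bracket order.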
Since I ran 10k simulations, I will now take a look at which results are the most common for Argentina.
results_teams = pd.DataFrame(playoff_results_teams)
results_teams['Argentina'].value_counts()
Argentina | Simulation count |
---|---|
Quarterfinals | 4317 |
Champion | 2127 |
Third_place | 1209 |
Round_of_16 | 932 |
Second_place | 905 |
Fourth_place | 510 |
I’m not showing it here, but other very strong teams had outcome counts similar to Argentina’s. For weaker teams, the most common outcome was elimination in the Round of 16 (which makes sense).
Now, the real question is:
results_stage = pd.DataFrame(playoff_results_stage)
results_stage['Champion'].value_counts()
Team | Champion count |
---|---|
Argentina | 2127 |
Brazil | 1860 |
Netherlands | 1547 |
France | 859 |
England | 787 |
Portugal | 627 |
Spain | 555 |
Croatia | 454 |
Switzerland | 283 |
Japan | 231 |
Morocco | 218 |
United States | 208 |
Poland | 88 |
Senegal | 78 |
Australia | 46 |
South Korea | 32 |
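These counts turn into title probabilities by dividing by the number of simulations. The sketch below does that for the top three teams and adds a binomial standard error, which assumes the 10,000 simulated tournaments are independent draws. It only captures sampling noise from the simulation, not how well Elo describes the teams; the helper name `title_probability` is an assumption.

```python
import math


def title_probability(count, n_simulations):
    """Return the simulated title probability and its binomial standard error."""
    p = count / n_simulations
    std_err = math.sqrt(p * (1 - p) / n_simulations)
    return p, std_err


n_simulations = 10_000
# Champion counts for the top three teams, copied from the table above
champion_counts = {"Argentina": 2127, "Brazil": 1860, "Netherlands": 1547}

for team, count in champion_counts.items():
    p, std_err = title_probability(count, n_simulations)
    print(f"{team}: {p:.1%} of simulations, standard error {std_err:.2%}")
```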
Run after the group stage was completed, my simulations predicted Argentina as the most likely World Cup winner.
But it so happens that I have rooted for Argentina since my pre-teen years, and I’ve been conditioned to so much disappointment that I just couldn’t believe Argentina could win this World Cup, especially after that defeat against Saudi Arabia. So I ended up not following these results when filling out my bracket.
Needless to say, seeing Argentina win was one of the happiest moments of my life (next to my country, Panama, qualifying for the World Cup in 2018).
Lesson learned: I should have trusted my simulations and gained some bragging rights with the Bracket challenge! 😭