In this part we prepare the data: clean it, verify that it is consistent, and then run it through our Poisson model.
Suggestion: if you only want to see the results, you can go directly to section 4, Results; if you want to know more about the development of the model and the code, see sections 2 and 3.
2. Cleaning Data
For this part we will use the pandas Python library to clean the two datasets obtained in the web-scraping process, keeping only the information relevant to our model.
import pandas as pd
df = pd.read_csv("american_cup_fixture.csv")
df2 = pd.read_csv("american_cup_historical_data.csv")
#Cleaning df fixture
df["home"]=df["home"].str.strip()
df["away"]=df["away"].str.strip()
df["score"]=df["score"].str.strip()
df.replace({"Concacaf 5": "Canadá", "Concacaf 6": "Costa Rica"},inplace=True)
#Cleaning df2 historical
# Inspect rows with missing values before dropping them:
#df2[df2["home"].isnull()]
df2.dropna(inplace=True)
df2.shape
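The same strip/replace/dropna steps can be sketched on a small toy frame (the rows below are hypothetical, invented only to illustrate the cleaning pipeline, not taken from the scraped files):

```python
import pandas as pd

# Toy fixture data mimicking the raw scrape (hypothetical rows)
raw = pd.DataFrame({
    "home": ["  Argentina ", "Concacaf 5", None],
    "away": [" Chile", "Concacaf 6", "Perú"],
    "score": ["2-1 ", "0-0", "1-1"],
})

# Remove stray whitespace left over from scraping
for col in ["home", "away", "score"]:
    raw[col] = raw[col].str.strip()

# Replace qualifier placeholders with the qualified teams
raw.replace({"Concacaf 5": "Canadá", "Concacaf 6": "Costa Rica"}, inplace=True)

# Drop rows with missing values, as in the post
raw.dropna(inplace=True)
print(raw)
```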
"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""
3. Data modeling (Poisson Distribution Model)
In a future entry we will delve deeper into why the Poisson distribution was selected. In this entry we will focus on the development of the model; our k (goals) meets the four conditions of the distribution:
- k is the number of times an event occurs in an interval and k can take values 0, 1, 2, … .
- The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.
- The average rate at which events occur is independent of any occurrences. For simplicity, this is usually assumed to be constant, but may in practice vary with time.
- Two events cannot occur at exactly the same instant; instead, at each very small sub-interval, either exactly one event occurs, or no event occurs.
If these conditions are true, then k is a Poisson random variable, and the distribution of k is a Poisson distribution.
We will use three Python modules for this part of the model: pandas, pickle (from the standard library), and SciPy.
import pandas as pd
import pickle
from scipy.stats import poisson
with open("dict_table", "rb") as f:
    df = pickle.load(f)
df2 = pd.read_csv("clean_american_cup_historical_data.csv")
df3 = pd.read_csv("clean_american_cup_fixture.csv")
df_home = df2[['HomeTeam', 'HomeGoals', 'AwayGoals']]
df_away = df2[['AwayTeam', 'HomeGoals', 'AwayGoals']]
df_home = df_home.rename(columns={'HomeTeam':'Team', 'HomeGoals': 'GoalsScored', 'AwayGoals': 'GoalsConceded'})
df_away = df_away.rename(columns={'AwayTeam':'Team', 'HomeGoals': 'GoalsConceded', 'AwayGoals': 'GoalsScored'})
df_team_strength = pd.concat([df_home, df_away], ignore_index=True).groupby(['Team']).mean()
df_team_strength
Team | GoalsScored | GoalsConceded |
---|---|---|
Argentina | 2.322917 | 0.875000 |
Bolivia | 0.930435 | 2.417391 |
Brasil | 2.243094 | 1.033149 |
Catar | 0.666667 | 1.666667 |
Chile | 1.580110 | 1.662983 |
Colombia | 1.171875 | 1.562500 |
Costa Rica | 1.000000 | 1.823529 |
Cuba | 0.000000 | 4.000000 |
Ecuador | 1.055556 | 2.587302 |
Estados Unidos | 1.000000 | 1.611111 |
Haití | 0.500000 | 3.000000 |
Honduras | 1.166667 | 0.833333 |
Jamaica | 0.000000 | 1.500000 |
Japón | 1.000000 | 2.500000 |
México | 1.375000 | 1.291667 |
Panamá | 2.000000 | 2.500000 |
Paraguay | 1.493976 | 1.698795 |
Perú | 1.419355 | 1.574194 |
Trinidad y Tobago | 0.000000 | 1.000000 |
Uruguay | 1.948980 | 1.056122 |
Venezuela | 0.742857 | 2.571429 |
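These averages are what feed the Poisson rates. For a given match, a team's expected goals (λ) is its scoring average multiplied by the opponent's conceding average. As a worked example using the table values for a hypothetical Argentina vs Bolivia match:

```python
# Averages taken from the team-strength table above
arg_scored, arg_conceded = 2.322917, 0.875000
bol_scored, bol_conceded = 0.930435, 2.417391

# Expected goals: own scoring average * opponent's conceding average
lamb_home = arg_scored * bol_conceded  # Argentina's expected goals
lamb_away = bol_scored * arg_conceded  # Bolivia's expected goals
print(round(lamb_home, 3), round(lamb_away, 3))  # → 5.615 0.814
```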
We then create our point prediction function.
#Function predict_points
def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        # expected goals: own scoring average * opponent's conceding average
        lamb_home = df_team_strength.at[home, 'GoalsScored'] * df_team_strength.at[away, 'GoalsConceded']
        lamb_away = df_team_strength.at[away, 'GoalsScored'] * df_team_strength.at[home, 'GoalsConceded']
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(0, 11):  # number of goals, home team
            for y in range(0, 11):  # number of goals, away team
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p
        points_home = 3 * prob_home + prob_draw
        points_away = 3 * prob_away + prob_draw
        return (points_home, points_away)
    else:
        return (0, 0)
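To try the function in isolation, the sketch below rebuilds a two-team stand-in for `df_team_strength` (values from the table above) and calls the same `predict_points` logic; this is a self-contained demonstration, not the full pipeline:

```python
import pandas as pd
from scipy.stats import poisson

# Two-team stand-in for df_team_strength, using values from the table
df_team_strength = pd.DataFrame(
    {"GoalsScored": [2.322917, 0.930435], "GoalsConceded": [0.875000, 2.417391]},
    index=["Argentina", "Bolivia"],
)

def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        lamb_home = df_team_strength.at[home, "GoalsScored"] * df_team_strength.at[away, "GoalsConceded"]
        lamb_away = df_team_strength.at[away, "GoalsScored"] * df_team_strength.at[home, "GoalsConceded"]
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(11):          # goals for the home team
            for y in range(11):      # goals for the away team
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p
        return 3 * prob_home + prob_draw, 3 * prob_away + prob_draw
    return 0, 0

print(predict_points("Argentina", "Bolivia"))
print(predict_points("Argentina", "Unknown"))  # team not in the table → (0, 0)
```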
4. Results
Once our model has been run for the group phase, where only the top two teams in each of the 4 groups advance to the next round, we obtain the following.
Group Stage
Knockout Stage
For this section we define a function to obtain the winner of each knockout-phase match.
def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        if points_home > points_away:
            winner = home
        else:
            winner = away
        df_fixture_updated.loc[index, 'winner'] = winner
    return df_fixture_updated
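A quick way to see `get_winner` in action is with a toy fixture and a stub in place of the model-based `predict_points` (the fixture rows and the fixed strengths below are hypothetical, chosen only to show how the winner column is filled in):

```python
import pandas as pd

# Hypothetical knockout fixture
fixture = pd.DataFrame({"home": ["Argentina", "Brasil"],
                        "away": ["Chile", "Uruguay"]})

def predict_points(home, away):
    # stub standing in for the Poisson-based function; fixed values for illustration
    strength = {"Argentina": 2.3, "Brasil": 2.2, "Chile": 1.6, "Uruguay": 1.9}
    return strength[home], strength[away]

def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        points_home, points_away = predict_points(row["home"], row["away"])
        winner = row["home"] if points_home > points_away else row["away"]
        df_fixture_updated.loc[index, "winner"] = winner
    return df_fixture_updated

print(get_winner(fixture))
```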
and then we get the results.
5. Conclusions
When all four conditions are met, this model is highly effective. As we will see in future posts, it is used in many fields: science, administration, finance, biology, physics, and more. Since it is a probability distribution model, a complementary analysis of variables of interest to the teams (current form, physical condition, injuries and illnesses, home advantage, etc.) would strengthen our analysis; we will explore this in future entries.