Who Will Win The Copa América 2024? (2/2)


In this part we prepare and clean the data, check that it is consistent, and then run it through our Poisson model.

Suggestion: if you only want to see the results, go directly to section 4, Results. If you want to know more about how the model and the code were developed, see sections 2 and 3.

2. Cleaning Data


For this part we will use the pandas Python library to clean the two datasets obtained in the web scraping process, keeping only the information relevant to our model.

import pandas as pd

df = pd.read_csv("american_cup_fixture.csv")
df2 = pd.read_csv("american_cup_historical_data.csv")

#Cleaning df fixture

df["home"]=df["home"].str.strip()
df["away"]=df["away"].str.strip()
df["score"]=df["score"].str.strip()

df.replace({"Concacaf 5": "Canadá", "Concacaf 6": "Costa Rica"},inplace=True)

#Cleaning df2 historical
#df2[df2["home"].isnull()]
df2.dropna(inplace=True)
df2.shape

"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""

3. Data modeling (Poisson Distribution Model)

In a future entry we will delve deeper into why the Poisson distribution model was selected. In this entry we focus on developing the model. Our variable k (the number of goals a team scores in a match) meets the four conditions of the model:

  • k is the number of times an event occurs in an interval and k can take values 0, 1, 2, … .
  • The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.
  • The average rate at which events occur is independent of any occurrences. For simplicity, this is usually assumed to be constant, but may in practice vary with time.
  • Two events cannot occur at exactly the same instant; instead, at each very small sub-interval, either exactly one event occurs, or no event occurs.

If these conditions are true, then k is a Poisson random variable, and the distribution of k is a Poisson distribution.
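
For reference, the Poisson probability of observing exactly k events when the average rate is λ is P(k) = λ^k · e^(−λ) / k!. The snippet below is a minimal sketch of that formula using only the standard library. `poisson_pmf` is a hypothetical helper name, and the rate of 1.5 goals per match is an arbitrary illustrative value, not a figure taken from the data. It should give the same values as `scipy.stats.poisson.pmf`, which the model code uses.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

lam = 1.5  # illustrative average goals per match (assumed value)

# Probabilities for 0 to 10 goals
probs = [poisson_pmf(k, lam) for k in range(11)]

for k, p in enumerate(probs[:5]):
    print(f"P(goals = {k}) = {p:.3f}")
```

With a rate this low, almost all of the probability mass sits on 0 to 10 goals, which is why capping the scorelines at 10 loses very little.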


We will use three Python libraries for this part of the model: pandas, pickle and SciPy.

import pandas as pd
import pickle
from scipy.stats import poisson

# Open the pickle file in a context manager so the handle is closed after loading
with open("dict_table", "rb") as f:
    df = pickle.load(f)
df2 = pd.read_csv("clean_american_cup_historical_data.csv")
df3 = pd.read_csv("clean_american_cup_fixture.csv")

df_home = df2[['HomeTeam', 'HomeGoals', 'AwayGoals']]
df_away = df2[['AwayTeam', 'HomeGoals', 'AwayGoals']]

df_home = df_home.rename(columns={'HomeTeam':'Team', 'HomeGoals': 'GoalsScored', 'AwayGoals': 'GoalsConceded'})
df_away = df_away.rename(columns={'AwayTeam':'Team', 'HomeGoals': 'GoalsConceded', 'AwayGoals': 'GoalsScored'})

df_team_strength = pd.concat([df_home, df_away], ignore_index=True).groupby(['Team']).mean()
df_team_strength
Team               GoalsScored  GoalsConceded
Argentina             2.322917       0.875000
Bolivia               0.930435       2.417391
Brasil                2.243094       1.033149
Catar                 0.666667       1.666667
Chile                 1.580110       1.662983
Colombia              1.171875       1.562500
Costa Rica            1.000000       1.823529
Cuba                  0.000000       4.000000
Ecuador               1.055556       2.587302
Estados Unidos        1.000000       1.611111
Haití                 0.500000       3.000000
Honduras              1.166667       0.833333
Jamaica               0.000000       1.500000
Japón                 1.000000       2.500000
México                1.375000       1.291667
Panamá                2.000000       2.500000
Paraguay              1.493976       1.698795
Perú                  1.419355       1.574194
Trinidad y Tobago     0.000000       1.000000
Uruguay               1.948980       1.056122
Venezuela             0.742857       2.571429

Df Team Strength


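We then create our points prediction function. For each match it returns the expected points of both teams, computed as 3 × P(win) + 1 × P(draw).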

#Function predict_points

def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        # goals_scored * goals_conceded
        lamb_home = df_team_strength.at[home,'GoalsScored'] * df_team_strength.at[away,'GoalsConceded']
        lamb_away = df_team_strength.at[away,'GoalsScored'] * df_team_strength.at[home,'GoalsConceded']
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(0,11): #number of goals home team
            for y in range(0, 11): #number of goals away team
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p
        
        points_home = 3 * prob_home + prob_draw
        points_away = 3 * prob_away + prob_draw
        return (points_home, points_away)
    else:
        return (0, 0)
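
To make the calculation concrete, the sketch below repeats the same steps for a single match-up, using the Argentina and Chile rows from the team strength table above. The pairing is chosen purely for illustration, and `poisson_pmf` is a hypothetical standard-library stand-in for `poisson.pmf`, so the snippet runs without SciPy.

```python
import math

def poisson_pmf(k, lam):
    """Stand-in for scipy.stats.poisson.pmf (hypothetical helper)."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

# Averages copied from the team strength table
strength = {
    "Argentina": {"GoalsScored": 2.322917, "GoalsConceded": 0.875000},
    "Chile": {"GoalsScored": 1.580110, "GoalsConceded": 1.662983},
}

home, away = "Argentina", "Chile"

# Expected goals: own scoring average times the opponent's conceding average
lamb_home = strength[home]["GoalsScored"] * strength[away]["GoalsConceded"]
lamb_away = strength[away]["GoalsScored"] * strength[home]["GoalsConceded"]

prob_home = prob_draw = prob_away = 0.0
for x in range(11):          # goals by the home team
    for y in range(11):      # goals by the away team
        p = poisson_pmf(x, lamb_home) * poisson_pmf(y, lamb_away)
        if x == y:
            prob_draw += p
        elif x > y:
            prob_home += p
        else:
            prob_away += p

points_home = 3 * prob_home + prob_draw
points_away = 3 * prob_away + prob_draw
print(f"lambda home = {lamb_home:.3f}, lambda away = {lamb_away:.3f}")
print(f"expected points: {home} {points_home:.2f}, {away} {points_away:.2f}")
```

Because scorelines are capped at 10 goals per team, the three probabilities add up to slightly less than 1. The gap grows as the expected goals get larger.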

4. Results

After running the model for the group stage (four groups, with only the top two teams in each group advancing to the next round), we obtain the following.
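
The group-stage code is not shown in this post, so the following is only a rough sketch of how such a ranking could be built, not the implementation used here. It adds up the expected points of every match in a group and keeps the top two teams. The names `rank_group` and `toy_predict`, the teams A to D and their strengths are all hypothetical; in practice `predict_points` from section 3 would be passed in place of `toy_predict`.

```python
from collections import defaultdict

def rank_group(fixtures, predict):
    """Sum expected points per team and return (team, points) pairs, best first."""
    table = defaultdict(float)
    for home, away in fixtures:
        points_home, points_away = predict(home, away)
        table[home] += points_home
        table[away] += points_away
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

def toy_predict(home, away):
    """Hypothetical stand-in for predict_points: splits 3 points by relative strength."""
    strength = {"A": 3.0, "B": 2.0, "C": 1.0, "D": 0.5}  # made-up values
    total = strength[home] + strength[away]
    return (3 * strength[home] / total, 3 * strength[away] / total)

# Round robin for one hypothetical group of four teams
fixtures = [("A", "B"), ("C", "D"), ("A", "C"), ("B", "D"), ("A", "D"), ("B", "C")]

ranking = rank_group(fixtures, toy_predict)
qualified = [team for team, _ in ranking[:2]]  # top two advance
print(ranking)
print(qualified)
```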

Group Stage

Knockout Stage

For this section we define a function that picks the winner of each knockout match: the team with the higher expected points advances.

def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        if points_home > points_away:
            winner = home
        else:
            winner = away
        df_fixture_updated.loc[index, 'winner'] = winner
    return df_fixture_updated


And then we get the results.

5. Conclusions


This model is effective when all four conditions are met. As we will see in future posts, the Poisson distribution is used in many fields, including administration, finance, biology and physics. Because it is a purely probabilistic model, it would benefit from a complementary analysis of other variables of interest (each team's current form, physical condition, injuries and illnesses, home advantage, etc.), which we can explore in future entries.