In this part we prepare the data: clean it, verify that it is consistent, and then run it through our Poisson model.
Suggestion: if you only want to see the results, you can go directly to section 4, Results; if you want to know more about the development of the model and the code, see sections 2 and 3.
2. Cleaning Data
For this part we will use the pandas Python library to clean the two datasets obtained in the web-scraping process, keeping only the information relevant to our model.
import pandas as pd
df = pd.read_csv("american_cup_fixture.csv")
df2 = pd.read_csv("american_cup_historical_data.csv")
#Cleaning df fixture
df["home"]=df["home"].str.strip()
df["away"]=df["away"].str.strip()
df["score"]=df["score"].str.strip()
df.replace({"Concacaf 5": "Canadá", "Concacaf 6": "Costa Rica"},inplace=True)
#Cleaning df2 historical
# Inspect rows with missing values before dropping them:
#df2[df2["home"].isnull()]
df2.dropna(inplace=True)
df2.shape
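The same strip/replace/dropna steps can be sketched on a small toy frame (the rows below are hypothetical, invented only to illustrate the cleaning pipeline, not taken from the scraped files):

```python
import pandas as pd

# Toy fixture data mimicking the raw scrape (hypothetical rows)
raw = pd.DataFrame({
    "home": ["  Argentina ", "Concacaf 5", None],
    "away": [" Chile", "Concacaf 6", "Perú"],
    "score": ["2-1 ", "0-0", "1-1"],
})

# Remove stray whitespace left over from scraping
for col in ["home", "away", "score"]:
    raw[col] = raw[col].str.strip()

# Replace qualifier placeholders with the qualified teams
raw.replace({"Concacaf 5": "Canadá", "Concacaf 6": "Costa Rica"}, inplace=True)

# Drop rows with missing values, as in the post
raw.dropna(inplace=True)
print(raw)
```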
"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""
3. Data modeling (Poisson Distribution Model)
In a future entry we will delve deeper into why the Poisson distribution was selected. In this entry we will focus on the development of the model; our k (goals) meets the four conditions of the distribution:
- k is the number of times an event occurs in an interval and k can take values 0, 1, 2, … .
- The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.
- The average rate at which events occur is independent of any occurrences. For simplicity, this is usually assumed to be constant, but may in practice vary with time.
- Two events cannot occur at exactly the same instant; instead, at each very small sub-interval, either exactly one event occurs, or no event occurs.
If these conditions are true, then k is a Poisson random variable, and the distribution of k is a Poisson distribution.
We will use three Python modules for this part of the model: pandas, pickle (from the standard library), and SciPy.
import pandas as pd
import pickle
from scipy.stats import poisson
with open("dict_table", "rb") as f:
    df = pickle.load(f)
df2 = pd.read_csv("clean_american_cup_historical_data.csv")
df3 = pd.read_csv("clean_american_cup_fixture.csv")
df_home = df2[['HomeTeam', 'HomeGoals', 'AwayGoals']]
df_away = df2[['AwayTeam', 'HomeGoals', 'AwayGoals']]
df_home = df_home.rename(columns={'HomeTeam':'Team', 'HomeGoals': 'GoalsScored', 'AwayGoals': 'GoalsConceded'})
df_away = df_away.rename(columns={'AwayTeam':'Team', 'HomeGoals': 'GoalsConceded', 'AwayGoals': 'GoalsScored'})
df_team_strength = pd.concat([df_home, df_away], ignore_index=True).groupby(['Team']).mean()
df_team_strength
Team | GoalsScored | GoalsConceded |
---|---|---|
Argentina | 2.322917 | 0.875000 |
Bolivia | 0.930435 | 2.417391 |
Brasil | 2.243094 | 1.033149 |
Catar | 0.666667 | 1.666667 |
Chile | 1.580110 | 1.662983 |
Colombia | 1.171875 | 1.562500 |
Costa Rica | 1.000000 | 1.823529 |
Cuba | 0.000000 | 4.000000 |
Ecuador | 1.055556 | 2.587302 |
Estados Unidos | 1.000000 | 1.611111 |
Haití | 0.500000 | 3.000000 |
Honduras | 1.166667 | 0.833333 |
Jamaica | 0.000000 | 1.500000 |
Japón | 1.000000 | 2.500000 |
México | 1.375000 | 1.291667 |
Panamá | 2.000000 | 2.500000 |
Paraguay | 1.493976 | 1.698795 |
Perú | 1.419355 | 1.574194 |
Trinidad y Tobago | 0.000000 | 1.000000 |
Uruguay | 1.948980 | 1.056122 |
Venezuela | 0.742857 | 2.571429 |
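These averages are what feed the Poisson rates. For a given match, a team's expected goals (λ) is its scoring average multiplied by the opponent's conceding average. As a worked example using the table values for a hypothetical Argentina vs Bolivia match:

```python
# Averages taken from the team-strength table above
arg_scored, arg_conceded = 2.322917, 0.875000
bol_scored, bol_conceded = 0.930435, 2.417391

# Expected goals: own scoring average * opponent's conceding average
lamb_home = arg_scored * bol_conceded  # Argentina's expected goals
lamb_away = bol_scored * arg_conceded  # Bolivia's expected goals
print(round(lamb_home, 3), round(lamb_away, 3))  # → 5.615 0.814
```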
We then create our point prediction function.
#Function predict_points
def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        # expected goals: own scoring average * opponent's conceding average
        lamb_home = df_team_strength.at[home, 'GoalsScored'] * df_team_strength.at[away, 'GoalsConceded']
        lamb_away = df_team_strength.at[away, 'GoalsScored'] * df_team_strength.at[home, 'GoalsConceded']
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(0, 11):  # number of goals, home team
            for y in range(0, 11):  # number of goals, away team
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p
        points_home = 3 * prob_home + prob_draw
        points_away = 3 * prob_away + prob_draw
        return (points_home, points_away)
    else:
        return (0, 0)
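To try the function in isolation, the sketch below rebuilds a two-team stand-in for `df_team_strength` (values from the table above) and calls the same `predict_points` logic; this is a self-contained demonstration, not the full pipeline:

```python
import pandas as pd
from scipy.stats import poisson

# Two-team stand-in for df_team_strength, using values from the table
df_team_strength = pd.DataFrame(
    {"GoalsScored": [2.322917, 0.930435], "GoalsConceded": [0.875000, 2.417391]},
    index=["Argentina", "Bolivia"],
)

def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        lamb_home = df_team_strength.at[home, "GoalsScored"] * df_team_strength.at[away, "GoalsConceded"]
        lamb_away = df_team_strength.at[away, "GoalsScored"] * df_team_strength.at[home, "GoalsConceded"]
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(11):          # goals for the home team
            for y in range(11):      # goals for the away team
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p
        return 3 * prob_home + prob_draw, 3 * prob_away + prob_draw
    return 0, 0

print(predict_points("Argentina", "Bolivia"))
print(predict_points("Argentina", "Unknown"))  # team not in the table → (0, 0)
```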
4. Results
Once our model has been run for the group phase, where only the top two teams in each of the 4 groups advance to the next round, we obtain the following.
Group Stage
Knockout Stage
For this section we define a function to obtain the winner of each knockout-phase match.
def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        if points_home > points_away:
            winner = home
        else:
            winner = away
        df_fixture_updated.loc[index, 'winner'] = winner
    return df_fixture_updated
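A quick way to see `get_winner` in action is with a toy fixture and a stub in place of the model-based `predict_points` (the fixture rows and the fixed strengths below are hypothetical, chosen only to show how the winner column is filled in):

```python
import pandas as pd

# Hypothetical knockout fixture
fixture = pd.DataFrame({"home": ["Argentina", "Brasil"],
                        "away": ["Chile", "Uruguay"]})

def predict_points(home, away):
    # stub standing in for the Poisson-based function; fixed values for illustration
    strength = {"Argentina": 2.3, "Brasil": 2.2, "Chile": 1.6, "Uruguay": 1.9}
    return strength[home], strength[away]

def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        points_home, points_away = predict_points(row["home"], row["away"])
        winner = row["home"] if points_home > points_away else row["away"]
        df_fixture_updated.loc[index, "winner"] = winner
    return df_fixture_updated

print(get_winner(fixture))
```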
and then we get the results.
5. Conclusions
When all four conditions are met, this model is highly effective. As we will see in future posts, it is used in many fields: science, administration, finance, biology, physics, and more. Since it is a probability distribution model, a complementary analysis of variables of interest to the teams (current form, physical condition, injuries and illnesses, home advantage, etc.) would strengthen our analysis; we will explore this in future entries.