Next summer the Copa América will be played in the United States. I created this project to use historical data to model which team has the greatest probability of winning the tournament and how the matches might turn out.
This project is divided into three parts:
1. Data extraction (web scraping).
2. Preprocessing and data cleaning.
3. Data modeling (Poisson model).
1. Data extraction (web scraping)
In this first part, we will use four Python libraries to extract the four group tables (A–D) hosted on Wikipedia. For the two places not yet decided, we have assumed Canada and Costa Rica will qualify.
Pos | Team | Pld | W | D | L | GF | GA | GD | Pts | Qualification |
---|---|---|---|---|---|---|---|---|---|---|
1 | Argentina | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Advance to knockout stage |
2 | Peru | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | Chile | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
4 | CONCACAF 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
import pandas as pd
from string import ascii_uppercase as alfabeto
import pickle
import numpy as np

# read_html returns a list of DataFrames; keep the editions table,
# which is the one with an "Edición" column
tablas = pd.read_html("https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica")
yrs = next(t for t in tablas if "Edición" in t.columns)
yrs_c = list(yrs.pop("Edición"))

# The 2024 page lists the four group tables at indices 6, 13, 20 and 27
df1 = pd.read_html("https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica_2024")
dict_tablas = {}
for letra, i in zip(alfabeto, range(6, 28, 7)):
    df = df1[i]
    df.pop("Dif.")  # drop the goal-difference column
    dict_tablas[f'Grupo {letra}'] = df
"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""
1.2 Historical data extraction
To extract the results of every match played since 1916 by all participating teams, we will use three Python libraries, including BeautifulSoup, to scrape the Wikipedia pages for each edition.
2 July 1916
Uruguay | 4–0 | Chile
---|---|---
Piendibene 44′, 75′, Gradín 55′, 70′ | |
Gimnasia y Esgrima, Buenos Aires
Referee: Hugo Gronda (Argentina)
from bs4 import BeautifulSoup
import requests
import pandas as pd
years = ['1916', '1917', '1919', '1920', '1921', '1922', '1923', '1924', '1925',
         '1926', '1927', '1929', '1935', '1937', '1939', '1941', '1942', '1945',
         '1946', '1947', '1949', '1953', '1955', '1956', '1957', '1963', '1967',
         '1975', '1979', '1983', '1987', '1989', '1991', '1993', '1995', '1997',
         '1999', '2001', '2004', '2007', '2011', '2015', '2016', '2019', '2021']
def get_matches(year):
    web = f'https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica_{year}'
    response = requests.get(web)
    content = response.text
    soup = BeautifulSoup(content, 'lxml')
    # Each match box on the page is a collapsible table with a fixed column layout
    matches = soup.find_all('table', class_='collapsible autocollapse vevent plainlist', width=True)
    home = []
    score = []
    away = []
    for match in matches:
        home.append(match.find('td', width="24%").get_text())
        score.append(match.find('td', width="12%").get_text())
        away.append(match.find('td', width="22%").get_text())
    # Collect the three columns into one DataFrame for this edition
    return pd.DataFrame({'home': home, 'score': score, 'away': away})
"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""
Results
In this first part we obtained the data needed to preprocess and clean the information and to start building our distribution model.
We got:
1. The initial group-stage standings tables for the upcoming tournament.
2. All match results from 1916 through the most recent tournament (2021).
3. The match schedule (fixture) for the upcoming tournament.
You can consult the files created in this part (CSV files, the tables dictionary & partial code) on my GitHub.
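Looking ahead to part 3, the Poisson model mentioned above can be sketched as follows. The idea is to treat each team's goal count as an independent Poisson variable; the goal rates used here are illustrative placeholders, not values fitted from the historical data.

```python
import math

def poisson_pmf(k, lam):
    """Probability of scoring exactly k goals given an expected goal rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def match_outcome_probs(lam_home, lam_away, max_goals=15):
    """Win/draw/loss probabilities, assuming independent Poisson goal counts
    and truncating each team's score at max_goals (the tail beyond is negligible)."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# Illustrative rates only: a stronger team expected to score 1.8 goals
# against a team expected to score 0.9.
w, d, l = match_outcome_probs(1.8, 0.9)
```

In the modeling part, the rates would be estimated from the 1916–2021 results gathered here (e.g. from each team's average goals scored and conceded), and the same win/draw/loss calculation then drives the tournament simulation.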