Who Will Win the Copa América 2024? (1/2)

Next summer, the Copa América will be played in the United States. I created this project to model, using historical data, which team has the highest probability of winning the tournament and how the matches are likely to turn out.


This project is divided into three parts:

1. Data extraction (web scraping).

2. Preprocessing and data cleaning.

3. Data modeling (Poisson model).

1. Data extraction (web scraping)


In this first part, we use four Python libraries to extract the four group tables (A-D) hosted on Wikipedia. For the two places that were still open, we assumed Canada and Costa Rica.

Pos  Team        Pld  W  D  L  GF  GA  GD  Pts  Qualification
1    Argentina   0    0  0  0  0   0   0   0    Advance to knockout stage
2    Peru        0    0  0  0  0   0   0   0
3    Chile       0    0  0  0  0   0   0   0
4    CONCACAF 5  0    0  0  0  0   0   0   0

Group A

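import pandas as pd
from string import ascii_uppercase as alfabeto
import pickle
import numpy as np

# read_html returns a list of DataFrames, so the editions table must be selected first.
# Assumption: the first table containing the text "Edición" is the one that lists every edition.
yrs = pd.read_html("https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica", match="Edición")[0]
yrs_c = list(yrs.pop("Edición"))

df1 = pd.read_html("https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica_2024")

# zip pairs the letters A-D with table positions 6, 13, 20 and 27 in df1.
dict_tablas = {}
for letra, i in zip(alfabeto, range(6, 28, 7)):
    df = df1[i]
    df.pop("Dif.")  # drop the goal-difference column
    dict_tablas[f'Grupo {letra}'] = df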

"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""

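The group tables end up in a dictionary, and the snippet above imports pickle, so one plausible way to keep dict_tablas between runs is to serialize it to disk. The following is a minimal sketch rather than the author's code: the helper names save_tables and load_tables, and any file name passed to them, are hypothetical.

```python
import pickle


def save_tables(tables, path):
    """Serialize a dict of group tables to a file (hypothetical helper)."""
    with open(path, "wb") as fh:
        pickle.dump(tables, fh)


def load_tables(path):
    """Load a dict previously written by save_tables (hypothetical helper)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

With the dictionary built above, this could be called as save_tables(dict_tablas, "dict_tablas.pkl") and read back later with load_tables("dict_tablas.pkl"). Pickle files should only be loaded from sources you trust.
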
1.2 Historical data extraction


To extract the results of every match played since 1916 by all participating teams, we use three Python libraries, including BeautifulSoup, to scrape Wikipedia.

2 July 1916

Uruguay 4–0 Chile
Piendibene 44′, 75′
Gradín 55′, 70′

Gimnasia y Esgrima, Buenos Aires

Referee: Hugo Gronda (Argentina)

The first match, as an example of the information to scrape.

from bs4 import BeautifulSoup
import requests
import pandas as pd

years = ['1916','1917','1919','1920','1921','1922','1923','1924','1925','1926','1927','1929','1935','1937','1939','1941','1942','1945','1946','1947','1949','1953','1955','1956','1957','1963','1967','1975','1979','1983','1987','1989','1991','1993','1995','1997','1999','2001','2004','2007','2011','2015','2016','2019','2021']

def get_matches(year):
    web = f'https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica_{year}'
    response = requests.get(web)
    content = response.text
    soup = BeautifulSoup(content, 'lxml')
    matches = soup.find_all('table', class_='collapsible autocollapse vevent plainlist',width=True)

    home = []
    score = []
    away = []

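    for match in matches:
        # Home team, score and away team sit in cells identified by their width.
        home.append(match.find('td',width= "24%").get_text())
        score.append(match.find('td',width= "12%").get_text())
        away.append(match.find('td',width= "22%").get_text())

    # Assumed completion (the published snippet stops at the loop): one row per match.
    df_matches = pd.DataFrame({'home': home, 'score': score, 'away': away})
    df_matches['year'] = year
    return df_matches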

"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""

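Each entry collected in score is the raw text of a table cell, such as "4–0" in the example match, possibly with surrounding whitespace. Before any modeling, that text has to become two integers. Here is a minimal sketch of that step, assuming the goals are separated by an en dash, a hyphen, or a colon; parse_score is a hypothetical helper and not part of the original code.

```python
import re

# Assumed separators between the two goal counts: en dash, colon, or hyphen.
SCORE_RE = re.compile(r"^\s*(\d+)\s*[–:-]\s*(\d+)")


def parse_score(text):
    """Return (home_goals, away_goals), or None if text does not start with a score."""
    match = SCORE_RE.match(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Returning None instead of raising lets unusual cells (for example, a match that was not played) be inspected and handled separately during cleaning.
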
Results


In this first part, we obtained the data needed to preprocess and clean the information and to start building our Poisson distribution model.


We got:

1. The initial group standings tables for the tournament.

2. All match results from 1916 through the most recent tournament in 2021.

3. The match schedule (fixture) for the upcoming tournament.

You can find the files created in this part (CSV, dict table, and partial code) on my GitHub.

https://github.com/AlfredoRmz31/Who-Wins-American-Cup-2024