Next summer the Copa América will be played in the United States. I created this project to use historical data to model which team has the greatest probability of winning the tournament and how the matches might turn out.
This project is divided into three parts:
1. Data extraction (web scraping).
2. Preprocessing and data cleaning.
3. Data modeling (Poisson model).
1. Data extraction (web scraping)
In this first part, we will use four Python libraries to extract the four group tables (A–D) hosted on Wikipedia. For the two places not yet decided, we have assumed Canada and Costa Rica will qualify.
Pos | Team | Pld | W | D | L | GF | GA | GD | Pts | Qualification |
---|---|---|---|---|---|---|---|---|---|---|
1 | Argentina | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Advance to knockout stage |
2 | Peru | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | Chile | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
4 | CONCACAF 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
import pandas as pd
from string import ascii_uppercase as alfabeto
import pickle
import numpy as np

# read_html returns a list of DataFrames; keep the editions table,
# which is the one with an "Edición" column
tablas = pd.read_html("https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica")
yrs = next(t for t in tablas if "Edición" in t.columns)
yrs_c = list(yrs.pop("Edición"))

# The 2024 page lists the four group tables at indices 6, 13, 20 and 27
df1 = pd.read_html("https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica_2024")
dict_tablas = {}
for letra, i in zip(alfabeto, range(6, 28, 7)):
    df = df1[i]
    df.pop("Dif.")  # drop the goal-difference column
    dict_tablas[f'Grupo {letra}'] = df
"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""
1.2 Historical data extraction
To extract the results of every match played since 1916 by all participating teams, we will use three Python libraries, including BeautifulSoup, to scrape the Wikipedia pages for each edition.
2 July 1916
Uruguay | 4–0 | Chile
---|---|---
Piendibene 44′, 75′, Gradín 55′, 70′ | |
Gimnasia y Esgrima, Buenos Aires
Referee: Hugo Gronda (Argentina)
from bs4 import BeautifulSoup
import requests
import pandas as pd
years = ['1916', '1917', '1919', '1920', '1921', '1922', '1923', '1924', '1925',
         '1926', '1927', '1929', '1935', '1937', '1939', '1941', '1942', '1945',
         '1946', '1947', '1949', '1953', '1955', '1956', '1957', '1963', '1967',
         '1975', '1979', '1983', '1987', '1989', '1991', '1993', '1995', '1997',
         '1999', '2001', '2004', '2007', '2011', '2015', '2016', '2019', '2021']
def get_matches(year):
    web = f'https://es.wikipedia.org/wiki/Copa_Am%C3%A9rica_{year}'
    response = requests.get(web)
    content = response.text
    soup = BeautifulSoup(content, 'lxml')
    # Each match box on the page is a collapsible table with a fixed column layout
    matches = soup.find_all('table', class_='collapsible autocollapse vevent plainlist', width=True)
    home = []
    score = []
    away = []
    for match in matches:
        home.append(match.find('td', width="24%").get_text())
        score.append(match.find('td', width="12%").get_text())
        away.append(match.find('td', width="22%").get_text())
    # Collect the three columns into one DataFrame for this edition
    return pd.DataFrame({'home': home, 'score': score, 'away': away})
"""
If you are interested in the complete code, send me a message and I will gladly share it.
"""
Results
In this first part we obtained the data needed to preprocess and clean the information and to start building our distribution model.
We got:
1. The initial group-stage standings tables for the upcoming tournament.
2. All match results from 1916 through the most recent tournament (2021).
3. The match schedule (fixture) for the upcoming tournament.
You can consult the files created in this part (CSV files, the tables dictionary & partial code) on my GitHub.
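Looking ahead to part 3, the Poisson model mentioned above can be sketched as follows. The idea is to treat each team's goal count as an independent Poisson variable; the goal rates used here are illustrative placeholders, not values fitted from the historical data.

```python
import math

def poisson_pmf(k, lam):
    """Probability of scoring exactly k goals given an expected goal rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def match_outcome_probs(lam_home, lam_away, max_goals=15):
    """Win/draw/loss probabilities, assuming independent Poisson goal counts
    and truncating each team's score at max_goals (the tail beyond is negligible)."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# Illustrative rates only: a stronger team expected to score 1.8 goals
# against a team expected to score 0.9.
w, d, l = match_outcome_probs(1.8, 0.9)
```

In the modeling part, the rates would be estimated from the 1916–2021 results gathered here (e.g. from each team's average goals scored and conceded), and the same win/draw/loss calculation then drives the tournament simulation.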