
Football Statistics Analysis

Predicting Football Game Point Totals Using Machine Learning

By: Saleem Khan

[Image: football players — photo by Joe Calomeni from Pexels]

Background

With the NFL conference championships now complete, I thought it would be timely to take a closer look at some football statistics ahead of the Super Bowl. I'm not a huge football fan, but analyzing this data is a great way to get a better feel for the game and the statistics that drive it.

Problem Statement

The objective of this exercise is to determine which team statistics have the greatest impact on the total points scored during an NFL game. This problem statement is deliberately specific because I want to be able to compare my model's output to the Las Vegas over/under betting line (see below). I am merely using this line as a comparison benchmark for the machine learning model I will produce.

Note: I have no aspirations of becoming a professional sports gambler! ;-)

[Image: NFL betting lines]

At the standard -110 odds, a bettor risks $110 to win $100, so a bettor needs to win at least 110/210 ≈ 52.4% of their bets just to break even. Therefore any model I come up with will need to beat that threshold in order to be profitable. I will be focusing on the over/under betting line above since this is probably the easiest number to predict.

This sort of analysis can help with more than just betting; there are multiple constituents who can benefit from it:

  • NFL Teams: By analyzing point totals for the entire league and comparing their individual point totals, teams can better adjust their player roster to improve scoring.
  • NFL Players: Players can get a better sense of how best to tackle (pun intended) their opponents in the coming weeks.
  • Fans: Fans can better enjoy the game and have hours of commentary by knowing where their favorite teams excel or need improvement.

Data Selection & Web Scraping

[Image: data analysis — photo by Markus Spiske (temporausch.com) from Pexels]

As I started this analysis I had to find a reliable source for statistics on NFL games. The best provider I could find that also allowed web scraping was Pro Football Reference. The site has an amazing depth of detail in the statistics it captures for each and every game.

Once I identified a provider, my next task was to determine a good way to get the data I needed. I wanted to look at three years' worth of data for all 32 NFL teams, which would include all 16 regular-season games as well as playoffs and Super Bowls. I decided to use Selenium with ChromeDriver (essentially an automated instance of Chrome, commonly used for web scraping) to load each page, and then a Python library called BeautifulSoup to pull the specific fields I needed.
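Here's a minimal sketch of that scraping setup, assuming headless Chrome; the exact URL and the "games" table id are illustrative, not necessarily the selectors I used:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup

    # Run Chrome headless so no browser window pops up.
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    # Load one season's schedule page and hand the HTML to BeautifulSoup.
    driver.get("https://www.pro-football-reference.com/years/2019/games.htm")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    # Pull the text out of every row of the games table.
    rows = []
    for tr in soup.select("table#games tbody tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)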

[Image: scraped dataset]

After a bit of coding I was able to gather 1,600 rows and 27 columns of data, where each row represents an individual game played. Each of the columns is a potential feature in my machine learning model. The output from my Python dataframe below gives you a sense of the type of data I will be analyzing.

[Image: dataframe preview]

Exploratory Data Analysis

Once the dataset was scraped and loaded into a pandas dataframe, I was able to do some exploratory analysis to get a better feel for the data. After looking at the columns I realized that many of the features would not be helpful in a machine learning model: items like week, time, and even team name are not logically predictive of a game's point total. I whittled my initial 27 features down to just 11. I also combined the Points Scored and Points Allowed fields into a new column called OverUnder, which is the dependent variable (the y) that the model will try to predict. The image below gives a general overview of the data going into my model.

[Image: summary statistics from describe()]

Note: Does something seem odd in the summary above? All of the offensive and defensive summary statistics are the same! This is because the dataset captures stats for both teams in each game: if the Patriots played the Jets, then the Patriots' defensive stats are the Jets' offensive stats and vice versa. This was actually a good way to check for errors in my data; if these summary statistics had not matched, something would have been wrong with the scrape.
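As a rough sketch, the pruning and target-creation step looks like this (the column names are my shorthand based on the description above, not necessarily the exact ones in the scraped data):

    # Drop fields that aren't logically predictive of the point total.
    df = df.drop(columns=["Week", "Day", "Time", "TeamName"], errors="ignore")

    # The target we want to predict: total points scored by both teams.
    df["OverUnder"] = df["PointsScored"] + df["PointsAllowed"]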

Next, let’s get a general birds-eye view of our data in order to see how it is shaped. I used a basic matplotlib histogram in order to visualize this data:

#let's get a visual of our dataset
import matplotlib.pyplot as plt

df.hist(figsize = (20, 20), layout=(6, 3), bins = 'auto')
plt.tight_layout()
plt.savefig('charts/histogram.png')
plt.show()

[Image: feature histograms]

Most of the features look roughly normally distributed, so we won't need to do much transformation or feature engineering before modeling.

Initial Correlation Review

After loading the data in and getting a sense of its shape and summary statistics, I now need to look at how each feature correlates with the OverUnder statistic we are trying to predict. The heatmap below shows how much each individual feature correlates with the point total as well as with the other features.

[Image: correlation heatmap]
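For reference, a heatmap like this takes only a couple of lines; a sketch using seaborn (one way to do it, assuming df is the cleaned dataframe):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Pairwise correlations between every feature, including OverUnder.
    corr = df.corr()

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
    plt.tight_layout()
    plt.show()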

One thing to note here is that offensive and defensive total yards are highly correlated with passing yards. This makes sense because total yards are the sum of passing and rushing yards. This is an example of multicollinearity, which can introduce statistical noise into my model. I may tweak the model later by removing the total-yards stat to see if this improves the R². Note: R² measures the proportion of the variability in our dependent variable (PointsTotal, or OverUnder in this case) that is explained by our independent variables.

Machine Learning Model

Now that we have a good sense of our dataset and of the multicollinearity (i.e. redundancy) within our features, we can proceed with creating our first model. I employed a basic train/validate/test split along with K-fold cross-validation.

  • Train, Validate, Test: We split the entire dataset into portions: 40% to train the model, 40% to validate it, and the last 20% to test it. This is a common approach in machine learning.
  • K-Fold: We split the data into k equal folds and fit the model k times, each time holding out a different fold for evaluation. This is a great way to check whether the model performs consistently across different chunks of the data.

I decided to try out three different models using the K-fold technique: basic linear regression, ridge regression, and lasso regression. Python makes it very easy to run these sorts of models quickly; often it only takes a few lines of code, as seen below.

    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    #Linear Regression
    lm = LinearRegression()

    #Ridge Regression - the alpha value is a hyperparameter that needs to be optimized for best fit.
    lm_ridge = Ridge(alpha=1)

    #Lasso Regression
    lm_lasso = Lasso(alpha=0.1)
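A sketch of how the K-fold evaluation can be wired up with scikit-learn, assuming X holds the 11 features and y is the OverUnder column:

    from sklearn.model_selection import KFold, cross_val_score

    # Five folds, shuffled once up front so each fold is a random chunk.
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for name, model in [("Linear", lm), ("Ridge", lm_ridge), ("Lasso", lm_lasso)]:
        r2 = cross_val_score(model, X, y, cv=kf, scoring="r2")
        mae = -cross_val_score(model, X, y, cv=kf, scoring="neg_mean_absolute_error")
        print(name, "R2:", r2.mean(), "MAE:", mae.mean())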

After running my data through these models I received very similar results for the R² metric:

Linear scores:  [0.5389065856889645, 0.5389065856889645, 0.5389065856889645, 0.5389065856889645, 0.5389065856889645]
Ridge scores:  [0.5389148915059347, 0.5389148915059347, 0.5389148915059347, 0.5389148915059347, 0.5389148915059347]
Lasso scores:  [0.5402006830177307, 0.5402006830177307, 0.5402006830177307, 0.5402006830177307, 0.5402006830177307]

We can read this as our independent variables (the game statistics) explaining about 54% of the variability in the over/under, or points total. Remember that we wanted to beat the Las Vegas break-even rate of 52.4%. At first glance it looks like we did, but not so fast: an R² of 54% is not the same thing as a 54% win rate. We also need to look at another metric, the mean absolute error (MAE), to get a sense of how far the model's predicted point totals deviated from the actuals in our test dataset.

Here are the mean absolute error calculations for all three models:

Linear MAE:  2.927044316714102
Ridge MAE:  3.8878462965670018
Lasso MAE:  3.877920193274435

We can see above that the basic linear regression model had the lowest MAE, meaning its average prediction error was about 3 points from the actual total. This isn't bad, but it's still not enough to beat Vegas; with this much error we would be hovering right around the 52.4% break-even rate.

[Image: linear regression model results]

Conclusion & Follow-Up

In the end I wasn't able to beat Vegas, but it was fun going through the process. I was able to collect the data I needed, perform some exploratory analysis, do basic feature engineering, and run several machine learning models. The last thing to do is look at the coefficients the model produced, to see how each feature influenced the overall point total.

Off_1stD : 0.56
Off_TotYd : 0.04
Off_PassY : 0.02
Off_RushY : 0.02
Def_1stD : 0.59
Def_TotYd : 0.04
Def_PassY : 0.02
Def_RushY : 0.02
HomeOrAway : -0.72

The list above tells us that for every total yard gained (passing plus rushing) we can expect the point total to rise by 0.04. Said another way, we can expect about 4 additional points for every 100 total yards gained by either team.

As follow-up, I would look at adding more data to see what would help create a better model. I've listed some additional items to look into in the future.

[Image: follow-up items]

New York Subway Traffic Analysis

NYC Subway Ridership Analysis

By: Saleem Khan

[Image: commuters — photo by mentatdgt from Pexels]

Note: This was a hypothetical client project for the Metis Data Science Bootcamp

Problem Statement

WomenTechWomenYes (WTWY) is a non-profit organization whose mission is to promote the participation of women in the technology sector. This group holds an annual fundraising gala in order to raise awareness and receive donations to promote their cause. The group uses a street team for outreach and promotion of their gala event.

WomenTechWomenYes (WTWY) holds its annual gala at the beginning of each summer. As a new and inclusive organization, the group tries to do double duty with the gala: to fill the event space with individuals passionate about increasing the participation of women in technology, and to concurrently build awareness and reach.

To promote the event, WTWY enlists a street team that is placed at entrances to New York City subway stations. The street team collects e-mail addresses, and those who sign up receive free tickets to the gala.

The organization asked for help in using MTA subway data to optimize the placement of its street teams. The goal is to maximize not only the number of attendees but also the number of attendees likely to contribute to the cause.

Data Selection

[Image: data analysis — photo by Kevin Ku from Pexels]

The first step in this analysis was to identify an appropriate dataset. The New York City MTA (Metropolitan Transportation Authority) has made ten years' worth of turnstile data available on its mta.info website. The data provides details on over 400 stations across the MTA and PATH systems, down to individual turnstiles, counters, and booths. The data is counter-based, meaning each turnstile's register increments every time a rider swipes; the counters also seem to reset arbitrarily once a certain limit is reached.

More information on the MTA dataset can be found here: MTA Turnstile Dataset

Exploratory Data Analysis

Once the data was identified, I did some basic analysis to get a better sense of what it represents. I pulled down data for all of 2019 for this analysis. Each row in the dataset represents a unique station, unit, turnstile, date, and time combination, along with entry and exit counter readings. There were over 10 million rows in this dataset.

[Image: MTA turnstile data sample]
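Since the counters are cumulative, the readings have to be differenced per turnstile before any aggregation. A sketch of that step, using the standard column names from the MTA turnstile files:

    import pandas as pd

    # A physical turnstile is identified by the (C/A, UNIT, SCP, STATION) combo.
    turnstile = ["C/A", "UNIT", "SCP", "STATION"]

    df = df.sort_values(turnstile + ["DATE", "TIME"])
    df["NET_ENTRIES"] = df.groupby(turnstile)["ENTRIES"].diff()

    # Counter resets show up as negative or absurdly large diffs; drop those.
    # The 10,000-per-interval cap is an arbitrary sanity threshold.
    df = df[(df["NET_ENTRIES"] >= 0) & (df["NET_ENTRIES"] < 10000)]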

The first thing I had to do to get a sense of where the street team should be deployed was to look at yearly aggregate entry counts for each unique subway station. The graph below shows the yearly entry totals for the top 20 subway stations.

[Image: yearly entry totals for the top 20 subway stations; x-axis in tens of millions]
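The yearly totals behind this chart come from a simple aggregation of the differenced counts; a sketch, assuming the NET_ENTRIES column from the step above:

    import matplotlib.pyplot as plt

    # Sum net entries per station for the year and take the 20 busiest.
    station_totals = (df.groupby("STATION")["NET_ENTRIES"]
                        .sum()
                        .sort_values(ascending=False))

    station_totals.head(20).plot(kind="barh", figsize=(10, 8))
    plt.show()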

Recommendation #1: The street team should be deployed to the top 20 stations by volume in NYC.

By far the busiest subway station in NYC is 34th St - Penn Station, which receives almost 30% more traffic than the second-busiest station. I wanted to look at whether there were any seasonality effects for this station. The plot below shows a daily view of ridership for all of 2019. Notice that ridership peaks during the spring and fall months, while the variation over the rest of the year is negligible.

[Image: daily ridership at 34th St - Penn Station for 2019]

Recommendation #2: The street team can be deployed at any time of year, since seasonality is not a major factor. Most of the street team should be deployed to the 34th Street station since it handles 30% more volume than the second-busiest station.

I then needed to look at the best days of the week to deploy the street team. I had a hunch that ridership would decrease on the weekend and increase during the work week, and the image below confirms this intuition: ridership ebbs and flows, with peak traffic between Tuesday and Thursday and a lull on Saturday and Sunday.

[Image: ridership by day of week]

Recommendation #3: Deploy street team resources during the work week, specifically Tuesday through Thursday.
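For reference, the day-of-week breakdown is one groupby away, again assuming the NET_ENTRIES column from earlier:

    # Total net entries by day of week, in calendar order.
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    df["DATE"] = pd.to_datetime(df["DATE"])
    day_totals = df.groupby(df["DATE"].dt.day_name())["NET_ENTRIES"].sum()
    day_totals.reindex(days).plot(kind="bar")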

The last question is whether there is a certain time of day that is ideal for reaching a bigger audience. Taking a closer look at the data, we do see a marked increase in traffic at specific times of day: peak hours appear to be 6-10 AM and 2-6 PM. The image below shows this visually.

[Image: ridership by time of day]

Recommendation #4 Deploy the street team during peak rush hour times of 6-10AM and 2-6PM.
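The time-of-day view can be sketched the same way, bucketing the audit readings by hour:

    # Total net entries by hour of the turnstile audit reading.
    df["HOUR"] = pd.to_datetime(df["TIME"], format="%H:%M:%S").dt.hour
    df.groupby("HOUR")["NET_ENTRIES"].sum().plot(kind="bar")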

Conclusion & Follow-Up

Analyzing the MTA data allowed me to arrive at some early conclusions about how best to deploy the WTWY street team resources. We have four specific recommendations that should put the organization on a good trajectory to maximize donations.

For future consideration I would need to look at the following to enhance this analysis:

  • Social Media / GPS Data - Include data from social media and GPS providers, which can count unique mobile devices travelling through a specific subway station. This would allow us to better target the female demographic that this group is interested in.
  • Tech Hubs - Find locations within NYC where there are tech hubs that would attract more of the demographic that this group is interested in.
  • Demographic/Income Data - Use US Census data to determine whether there are pockets of the NYC area with higher concentrations of women, and add income data to improve the likelihood of donations.