Data Science in sports is a huge deal! Even though most professional sports teams have been using them as a core part of their strategy, it was relatively recently that it got much more attention in the NFL…as soon as Seahawks dedicated a team to analytics! As they blew out records in virtually all positions so rapidly, rest of the NFL teams invested in the same. Now all NFL teams have a dedicated team crunching data 🙂 And it’s virtually impossible to get your hands in those datasets from the outside. Additionally, teams are paying big $$$ to buy data and analysis from 3rd party companies who make a living tracking sports related data.
With that prologue, I was able to get some free data, specifically, I found data on quarterbacks across all NFL teams for all of 2016 season (regular season) from https://www.kaggle.com/ as csv. The format was in this order:
qb, att, cmp, yds,ypa, td, int, lg, sack, loss, rate, game_points, home_away, year
and the site explained the columns’ definitions as follows respectively: Quarterback Name,Attempts,Completions,Yards,Yards per Attempt,Touchdowns,Interceptions,Longest Throw,Sacks,Loss of Yards,The NFL’s Quarterback Rating for the game,Total points scored in the game,Home or Away Game,Year
However, the Quarterback Rating has since been practically replaced by a newer method called Passer Rating. I believe it was first instituted by ESPN. The output of both of these formulas are not dramatically different but they are different! And passer rating actually takes into consideration the stats for non-QBs as well (remember that some receivers, TEs, even RBs can and do pass on trick plays!).
- For this exercise:
- I cleaned up the data,
- took a subset of some prominent QBs (not everyone in the dataset),
- used Passer Rating formula to compute new ratings,
- added a new score (my invention) called Efficiency (the point of Efficiency rating is so that I can compare completed/attempted ratios in percentage. I’ll get to it more later),
- Added Yards before Interception, and longest throw that was a TD (from a different dataset elsewhere)
- Added win-loss for each game (from a different dataset elsewhere).
Once the cleaned up data is in my Excel wb, the QB/Passer Rating formula would be as follows (in my layout):
Once converted to a pivot table, I get something as follows:
With this done, I’m ready to do visual and statistical analysis…
I created two interactive Pivot Charts (below) that shows me a quick overview of the subset of QBs I’m interested in and their key metrics I’m interested in:
Both of the above charts show the same, accurate information, but in different layouts…matter of preference really as to which one’s better. So, I’m sharing both versions.
Remember, these are BOTH interactive and filtering in either charts will update BOTH charts simultaneously. Below is one of the charts showing 9 QBs specifically selected and their stats only:
Similarly, the chart below compares two QBs head-to-head:
See anything interesting?
While Brady has higher number of wins for the team, more TDs, fewer interceptions, the interceptions he threw came in pretty quickly (Yds Before INT) overall. It’s NOT a per game stat, but the entire season’s. So, of all the 16 games, I considered the total INTs and total YDS and calculated how many yards was covered before any INT would occur when they’re normalized over 16 games.
A very effective way to visualize this type of data is shown right inside a table-chart hybrid where we can quickly visualize what’s good/bad (Red to Green gradients for Bad to Good for example), the mini-histograms right inside the cells to show relative magnitude, an so on. This is a powerful albeit colorful way to present a quick summary. Of course, it can be sorted and filtered in various ways (not just by QBs column).
We can also take a look at correlations and if (and how) one metric may affect another another. A summary of correlations is below:
What the above table shows us among other things is that there’s a Negative correlation between INT and WIN. So, more INTs, less WINs. But to what degree? Not extremely strong but definitely something that affects the outcome.
We further notice a Strong correlation between Efficiency and WIN (remember, I calculated Efficiency from original dataset in meta-processing). In other words, higher the passing efficiency (== higher completion rates over attempted), chances are very good that it’ll result in a WIN.
We can visual that strong correlation out of curiousity and see what it looks like…
Yep, they’re surely correlated with a strong R^2! (yes the dots are tiny footballs ;))
Hope it was a fun read. Check out rest of my blogs that span across various STEM topics (not just Excel). What will the new season stats look like???