Welcome to this post. Today, I want to share my process of retrieving some Mega Million Lottery winning data, curating them, and analyzing them to tell a story. The results may or may not make us super-rich 🙂 but I certainly had fun throughout the process. Disclaimer: I used Excel, but this is not a tutorial on any particular application including Excel.
Objective: Once I have the data all formatted correctly, I want to:
- Find out which numbers occur most in the core 5 numbers in Winning numbers column.
- Find out the frequencies of numbers in Mega Ball.
- See if there’s any connection between the Winning amounts and date of drawings. How about the day of the week?
- Are there any outliers in Jackpots?
Official Mega Millions Rules: Players may pick six numbers from two separate pools of numbers – five different numbers from 1 to 70 (the white balls) and one number from 1 to 25 (the gold Mega Ball). You win the jackpot by matching all six winning numbers in a drawing. The jackpots start at $40 million and grow by a minimum of $5 million per draw each time the jackpot rolls.
Sources: I downloaded 2 different datasets because one of them had all information EXCEPT the Jackpot or Winnings. It contained Date, Winning Numbers (all strewn together in 1 column as a string), Mega Ball pick. This data is from https://catalog.data.gov/
I found the Jackpot data from https://www.usamega.com/. However, the Jackpot figures were spelled out in English strings, where I want the numbers to only numeric. So, I had to do some tricky curations to extract the column into Millions USD.
Then I consolidated them into one worksheet as follows with 1779 days of drawings. Excerpt below.

To do any metrics on the winning numbers, I have to split the string of numbers into their own numbers in their own columns. So, we can apply formulas on each independently. So, I split the Winning Numbers column into 5 separate numbers and into their own columns as shown below.

One more thing I’m interested in is getting the day of the week from the full date column. Using Excel’s weekday() I get the day number associated with that date. So, by default in Excel, 1 is Sunday, 3 is Tuesday, 6 is Friday…so we get 6 and 3 repeating all throughout because those are exactly the days of the Mega Millions drawings per week.

Ok, now we can do some basic charting to get a high-level understanding of the data. Using a scatter plot, we get this the following visual…clearly a few data points are outliers! This is a great way to quickly identify those anomalies. I call them out specifically on top of the chart along with their date and winning numbers. You can easily tell that in Oct 23, 2018 was the biggest payout of 1.5+Billion$! Just earlier that week was a 1 Billion$! And even earlier was over half Billion$. That was a crazy month and year!

Let’s now see how numbers are drawn the most. The table below shows the results. We can do this using the concept of MODE in statistics. Hmm, so 2, 17, 31, 46, 52 are drawn the most based on our 1779 drawings since year 2012. And the Megaball that seems to come up most is 7. To illustrate further patterns, the minimum, maximum, and median values for each number + Megaball are also calculated below…

We can also visualize our finding with this chart: 
Ok, so far we have lots of good information. But I also want to know correlation between the date and drawing and the jackpot amount. And from the date, we can do the same comparison but using the day of the week to see if the day of the week of the drawing has any impact on the jackpot amount! We can use CORRELATION concept of statistics to get an idea.
Using CORREL function, we find a weak link between date and winning amount. It’s there, but it’s not strong enough to claim anything. R square value tells us the same consistent story since it’s not strong but there is a correlation.

When we do the same comparison using only the day of the week (instead of the full date), the link is even weaker. In fact, we can confidently say (95% confidence level) that there is NO correlation between which day the lottery is drawn and the amount of jackpot. The following table confirms that.

So, there you have it! Lots of different cool ways to look at mundane data all around us 🙂 I don’t enjoy hard-core statistics much (mainly because I’m not a data scientist with statistical background beyond college), but I love crunching and twisting numbers using common statistical concepts for real-life applications. And lot of these concepts can be grasped and applied even without mastery in statistics.
What other stories would you tell about this? Please comment.