Reporting weather by reading off the predictions is obviously a very different skill than creating the code and models that actually predicts the weather! Not just forecasting for the next 5 days, but actually PREDICTING for any number of period in the future! Is that even possible? Yes, it is albeit there are myriad of external factors that can affect the actual outcome, we can built a machine learning model by which we can predict weather into the future using computer science, leveraging Machine Learning (ML). In this post, I share the methods of doing just that.
Ultimately, I want to write code so I can train a proven ML algorithm and have it spit out the result. The model I’m interesting is a neural network algorithm called Neural Prophet. This AI (artifical intelligence) model is based on a Time-Series model, inspired by Facebook Prophet and AR-Net, built on PyTorch. It’s publicly available to developers with documentation here: https://neuralprophet.com/html/index.html
But before I dive into that, I want to start with Excel and see how I can utilize that. So, in this part 1, I’ll start with 69 years of historical weather data in Seattle, WA and gleaning all sorts of information about the data and even make some prediction using Excel’s built-in ML techniques. While, Excel’s predictions and algorithms are not as sophisticated as the neural network algorithms out in the cloud servers, we can do some very useful processing. Afterward, in Part 2, I’ll go feet first into the neural network using Python code with Neural Prophet and see what else we can do with that ML algorithm. Be sure to read this first completely, then move on to Part 2 for a complete understanding of the process.
The Seattle weather data 1948 to 2017 (from kaggle) looks like this:
DATE PRCP TMAX TMIN RAIN 1/1/1948 0.47 51 42 1/2/1948 0.59 45 36 1/3/1948 0.42 45 35 ...12/12/2017 0 46 32 12/13/2017 0 48 34 12/14/2017 0 50 36 A total of 25551 records
First, I wanted to quickly check if there are blank, NA, or missing data in any of the columns. Instead of applying filter for each header and sorting, I directly used COUNTBLANK() and found that column PRCP (for precipitation) has 3 rows that are marked “NA” (not blank but actually has the text “NA” meaning not-available), and so does RAIN column. So, we need to tackle those differently. Using Filter(), I extracted the dates of those exact rows which turn out to be: So, the columns and the data types I’ll be dealing with are: 6/2/1998, 6/3/1998, 9/5/2005
After a little bit of data cleanup, I also added Fluctuation column that shows the daily difference between high and low temps, and RAINconverted column which converts the TRUE and FALSE to numeric 1 or 0.
DATE PRCP TMAX TMIN RAIN Fluctutaion RAINconverted <datetime> <float><int><int><boolean><int><int>
We can get some basic information right away as below:
|Average max temp:||60|
|Average min temp:||45|
|Most common high temp||50|
|Most common low temp||42|
We can clearly see 29th July 2009 had the hottest day at 103° F. The lowest daily fluctutations occurred 3 times: 1971, 74, and 2014 all in January with only 1° fluctuation throughout the day! The days are extracted by matching the fluctuation column and extracting one or more as applicable matching rows using Filter() function. We also see the coldest day was in January 1950 with just 16°. We can also see the average temps over 69 years along with most typical temps (high and low) and clearly see that upper 40s to upper 50s are the most common temps. These are easily done via Mode() and Average().
Next we can calculate the variance, and therefore, the standard deviation which comes out to be 13. This tells us that the data spread from mean is 13° in the entire range.
Before we try to do predictions in Excel, I can get some more interesting information out of this. For example, I see that there’s a 0.515 correlation between PRCP and RAIN. While maybe obvious connection here, it’s very important to understand correlations and its strength because some datasets are not going to be that clear-cut.
|Total rainy days:||10,900|
|Total dry days:||14,648|
|Total rows in dataset:||25,551|
|Missing data for x rows:||3|
|Rainy days percentage:||0.426597785|
|Dry days percentage:||0.573284803|
Seattle had 57% dry days and had 42% rainy days over the 69 year period…which proves the statement “it always rains in Seattle” to be a myth. Let’s also visualize the data:
We get different views to the temp fluctuations above for both all of the data range and a subset if we want. The daily fluctuations (difference between high and low temps) have been interestingly decreasing in more recent times in a 30-year span (see the dotted line). Over 69 years though, we see the coldest temps have risen slightly since 1980s while the highest temps haven’t risen by that much…meaning a more narrow range in recent times.
Ok, this is really interesting, but can we predict weather for a date in the future in Excel? We can…logistic regression is one way, but without some add-in help, it is cumbersome. Without the add-ins, we can still do some forecasting and get very useable results.
I used FORECAST.ETS() function which is one of the statistical functions built into Excel. It can be used to calculate or predict a future value based on known values by using the AAA version of the Exponential Smoothing (ETS) algorithm. A powerful part of this flavor of forecasting is that it can take optionally take seasonality into account…which is exactly what I wanted so I included to take into account the seasonality factor using the known/historical values. With that in place, it predicted the following:
|High temp (predicted)||For jan 1, 2022|
|High temp (predicted)||For july 30, 2022|
|Low temp (predicted)||For jan 1, 2022|
|Low temp (predicted)||For july 30, 2022|
Overall, the predictions look reasonable. I checked the actual weather from Jan 2022, and it so happens that in 2022, the average January high temp was 45° …spot on! [jan 2022 seattle weather data: https://world-weather.info/forecast/usa/seattle/january-2022/]…although not necessarily for the exact day.
The lowest temp in January was 39 (not 24 as predicted but not that far off). The lowest temp for july 2022 (which is in the future at the time of writing, so we cannot compare yet with actual) seems to be off however…typically, we would expect the end of july low to be around 60-70. The high of 92 seems to be very realistic and has occurred in the past.
Remember however, no prediction will get us there with reality with so many external factors into the future…no algorithm has a crystal ball!
This prediction may not be as accurate as the cloud-based, more complex neural network algorithms. That’s where we will head next in the following post. After reading this, head over there http://flyingsalmon.net/?p=3923 to see how I’ve done it in Python.