Specifically, we’re going to start with the most basic data about US Presidents, and “play” with them to extract some more meaningful, interesting tidbits. Along the way, you’ll learn how to:
- Recognize, and deal with unprintable characters
- Shape raw (unruly string) data for analytics
- Find trends
- Make interactive charts and tables
- Use fuzzy logic and vlookup to determine horoscopes (or any other characteristics) of a POTUS
- Trigger new ideas of data slicing based on these learnings
Ok, let’s get started. (I’ll be updating the blog incrementally due to my time-constraints so keep checking for more info)
First, I copied and pasted the DOBs from this site: https://www.gettysburgbattlefieldtours.com/presidential-birthdays-calendar/
And that’s ALL THE DATA WE WILL NEED to get this exercise going.
The page snippet is below:
It’s very nicely laid out for a web page. I want to paste this into Excel so I can do some data manipulation. After pasting into Excel, I realized that I want to shape the data differently as each line is a full string that contains the name and DOB. I want to extract components of each line differently. Specifically, I want to set up the dataset so that I have the following specific fields:
the name of the president, DOB, Year of birth, month of birth, then based on them, the name of the week of birth, have a binary field for Weekend (Y/N), years since birth, zodiac sign.
In order to get the zodiac sign, I’ll need just the day and month (not the year…for western horoscope). And to add some pizzazz, why not actually show the zodiac symbols for each president too?
Along the way, there will be challenges and I’ll explain the nature of them and how to resolve them one by one…before we achieve the final visuals and interactivity.
If I can get from the above data page to my desired table below, we’re all set!
Challenges:
First, as I pasted the text from the website, I notice a character that’s in-between the name of the president and his birthday. The character is a unicode, not a simple dash. But what is it exactly? I want to know this so I can accurately split the text for each president to extract his name and birthday info.
It turns out it’s code is 150 which I discovered using the CODE function. And to verify that, I use CHAR function on 150 to turn it back into the glyph. Solved! Now I can start my first parsing.
To get the name part, I can now use LEFT and FIND and CHAR all nested. e.g. =LEFT(A6,FIND(“–”, A6)-1) where A6 is the raw string to parse and first param is either the pasted unicode char as string or CHAR(150) which in turn returns the same character as shown above.
Now to get the DOB is a little trickier. Some months are 1 digit (e.g. 8 for august) while some are 2 digits (e.g. 11 for November). The formula I came up with is: =TRIM(MID($A6, FIND(“–”, $A6)+1, FIND(“–”,A6)+1))
The output now should be something like “6/14/1946” or “10/1/1924”, etc.
Ok, next I need to get the year portion out of this. Looking at the pattern, it’s easy to see I need the last 4 digits from the right-side of the string. So, let’s use RIGHT function with 4 as the num_chars value.
But remember, I want to extract the month also (for zodiac and other fun perspectives). So, I can’t just extract the 1st digit only because sometimes the month is 2 digits. The way I need to do this is by looking for the first instance of the “/” delimiter character and extract whatever is left of it (1 or 2 digits as appropriate). This is accomplished by using FIND and wrapping it in LEFT. e.g. =LEFT(C6,FIND(“/”, C6)-1)
So far, we have these fields: POTUS (name), DOB, Year, Month.
Now for fun, I want to know what was the day of the week each were born! Was it monday or friday or…?
To do that, I can use the DOB field I already parsed and apply TEXT function with “dddd” formatting to extract the day of the week.
Next I want to know how many were born on a weekend vs weekend! (I know, it’s fun, isn’t it?). To get that, I find that there’s a WEEKDAY function in Excel that returns a numeric value for the day. Depending on how I set it up (e.g. Sunday to be day 1, or monday to be day 1) which is customizable, I can extract that number from the DOB field which is in mm/dd/yyyy format and then transform that into a binary Yes or No. So, I created the field named: Weekend?
In that field, I’ll simply put “Y” for yes and “N” for no. By checking if the weekday (in my configuration) is more than 5, then I know it falls on Saturday or Sunday…meaning it was weekend!
So my formula is: IF(WEEKDAY(VALUE(C6),2)>5, “Y”, “N”)
Next we need find the number of years passed since birth. For a living POTUS, that’s obviously their age and for those passed away, it’ll be indication of how old they would have been today, or how long ago they were born. All interesting facts.
To get that, there are several ways but the challenge is some of the date functions in Excel do NOT work for years earlier than 1900 (I discovered this by trial and error) and many of our presidents were born way before. So, I had to find a way around that limitation by using simpler formula such as YEAR(NOW())-D6 where D contains the parsed year as yyyy as already done above.
Now to set up for the zodiac, I need a format as in “month/day”. We have the month already, now I need the day. Fortunately, the DAY function comes in handy here to parse from C column (which contains the DOB as you’ve already noticed). UNFORTUNATELY, it doesn’t seem to work for years prior to 1900! So use some creativity here to extract the date from the string (instead of using date function) and append it to the month with “/” delimiter.
So, now we have everything we need to find the zodiac! So far, I have these in place (for all 44 presidents):
Now, I’d like to get each one’s zodiac. For this, I need a reference table/data that shows which signs belong to which months and the range of start/end of a sign. To make it more interesting, I will even use their visual symbols from Unicode character set. This will be done via VLOOKUP functions.
I’ll use fuzzy logic to match them. Read my next blog in this series to follow along.