This is an example of a Decision Tree model (a very useful and popular method) for making predictions using the power of Machine Learning…using just a normal PC.
Here we predict salary (the dependent variable) given various types of criteria (the different forks in a decision tree), e.g. what's the salary of a particular gender, of some age, with a specific education, in a certain occupation? This is a binary classification problem, as we're not predicting salary as a continuous value but rather whether someone's salary is greater than some number or not.
The original dataset was from: https://www.kaggle.com/datasets/ayessa/salary-prediction-classification
But I changed some header names to make them a bit more sensible and removed some columns for relevance.
Additionally, many category values had a leading SPACE character, which I removed in Excel (except for " ?", left in on purpose to demonstrate how to clean it up in Python). There were also several incorrect/misspelled countries (along with leading spaces), which I fixed with Find/Replace, using the =UNIQUE() formula to make them easy to identify. I also changed the salary threshold from 50K to 70K. The final, working, cleaned-up version is saved as salary.csv.
The salary (our target value or what we are going to predict) is a text range: “<=70K” or “>70K”.
The entire source code, the dataset used here, and the encoded inputs for ML created and exported in the code are all freely available in my Github repo here: https://github.com/flyingsalmon/DecisionTree/
However, instead of repeating the code here, I'll explain the code and the overall process of solving this problem via Machine Learning (ML). So, let's begin.
If we take a look at the dataset's dimensions using the dataframe.shape attribute, we see there are 32,561 rows and 11 columns. That's a good amount of data that must be investigated via code or some automation, and not just by eye-balling!
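For example, a minimal sketch of loading the data and checking its shape (assuming the cleaned file is named salary.csv, as described above):

import pandas as pd

df = pd.read_csv('salary.csv')  # load the cleaned dataset
print(df.shape)  # (32561, 11) -> rows, columns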
WARNING: The dataset has some unspecified cells with ? characters…we need to check for them below and drop those rows, especially as these columns contain labels and not numbers; therefore, we cannot interpolate or substitute those values mathematically.
We need to understand the data types in the dataset, which we do with the dataframe.dtypes attribute, and we get:
age int64
sector object
education object
education-num int64
marital-status object
occupation object
race object
sex object
hours-per-week int64
country object
salary object
So, we have age, education-num, and hours-per-week as integers, and the rest are string objects, meaning categorical values or labels.
Next, we check if there are any missing values using dataframe.isna(), and we can also get the sum in one shot: df.isna().sum() # where df is our dataframe, as shown in the code
We see there’s no missing data in any column. At this point, it may be tempting to start using the dataset…but that would be a mistake! THERE IS A PROBLEM!
Opening the csv file in Excel, we see there are some '?' in various columns! But it's not just '?': there's a preceding SPACE character before the '?'. So, it's actually " ?", and we need to remove all rows that have this string in any column. We do this by using pandas replace() to replace all the suspect values with null:
df = df.replace(" ?", np.nan)
np.nan is a constant defined in NumPy, so we imported that as well (import numpy as np).
And then drop all the rows with null values using:
df.dropna(axis=0, how='any', inplace=True)
Also drop the columns from the dataframe that we don't need:
df = df.drop(columns=['sector', 'education-num', 'marital-status', 'hours-per-week'])
So we are left with clean data with no funny or null characters, and only the columns we're going to use: ['age', 'education', 'occupation', 'race', 'sex', 'country', 'salary']
Note that, on purpose, the actual csv file is not modified; all these cleanups are done in memory, in dataframes. If we wanted to save the cleaned-up dataset, we could write the df to a csv file using pandas, or do the cleanup in Excel.
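For instance, a one-line sketch (the output filename here is just illustrative):

df.to_csv('salary_clean.csv', index=False)  # write the in-memory, cleaned dataframe to a new csv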
But we should do a couple more things before we get into ML-specific code. We see the 'salary' column has two types of values (strings): "<=70K" and ">70K".
These must be encoded, as the strings are of no use to the ML model. Same thing with the 'sex' column, where we have "Female" and "Male". The encoding process will ensure the strings all map to unique numeric values that can be used in the math computations for prediction.
We could do the encoding as part of sklearn modeling, or we can do it ourselves before training, as shown below. I decided to do this part by hand (and I show how to let sklearn do it as part of modeling for the rest of the columns). Doing this before modeling is useful when we know we want to map a small number of values to 0 or 1 by hand, as below.
cleanup_nums = {"salary": {"<=70K": 0, ">70K": 1}}
df = df.replace(cleanup_nums)
Now the salary column only contains 0 or 1. 0 if <=70K and 1 if >70K instead of “<=70K” and “>70K”.
Similarly we converted the column ‘sex’ to 0 (for male) or 1 (for female).
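For completeness, a sketch of that manual mapping (assuming the cleaned values are exactly "Male" and "Female"):

cleanup_sex = {"sex": {"Male": 0, "Female": 1}}
df = df.replace(cleanup_sex)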
Next, we can get into ML-specific prep and code.
We save the target column (which is ‘salary’) into a new dataframe called ‘target’ and remove it from our inputs dataframe called ‘inputs’.
Then we encode the remaining columns that are still labels using sklearn's LabelEncoder. The columns we need to encode are: 'education', 'occupation', 'race', 'country'. fit_transform() does the trick, and we create a new column with a "_n" suffix for each of them. Then we drop the original label columns, as the ML model won't need them, with
inputs = inputs.drop(['education', 'occupation', 'race', 'country'], axis='columns')
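Putting the target split and encoding steps together, a minimal sketch (the variable names follow the description above):

from sklearn.preprocessing import LabelEncoder

target = df['salary']                       # what we want to predict
inputs = df.drop('salary', axis='columns')  # the features

le = LabelEncoder()
for col in ['education', 'occupation', 'race', 'country']:
    inputs[col + '_n'] = le.fit_transform(inputs[col])  # add a numeric-coded column

inputs = inputs.drop(['education', 'occupation', 'race', 'country'], axis='columns')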
Now we have both clean input values and clean target values, and we can start modeling!
from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(inputs, target) # train the model
model.score(inputs, target) # accuracy on the training data
After the training is done, our score is 0.8660566275445926, i.e., about 86.6% prediction accuracy. Not bad.
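Note that this score is measured on the same rows we trained on. The original code stops here, but if we wanted an estimate of how the model does on unseen data, a common refinement (not in the original code) is a train/test split, e.g.:

from sklearn.model_selection import train_test_split
from sklearn import tree

# hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2, random_state=1)
eval_model = tree.DecisionTreeClassifier()  # a separate model just for this check
eval_model.fit(X_train, y_train)
print(eval_model.score(X_test, y_test))  # accuracy on held-out data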
Now we can predict by giving any combination of available features to see if that particular person will make >$70K annually or not. The answer we’ll get will be either 0 for no, or 1 for yes.
Let’s try a few queries and see what ML prediction tells us.
PREDICTION QUERY 1: In the United States, is the salary of an Adm-clerical professional, female, of race Asian-Pac-Islander, with a Bachelors degree > 70K?
To set up the query, we need to get the numeric codes for each of these params (features) from our encoded columns. To make this easy, we saved the encodings and their labels above as the salary_encoded.xlsx file for reference, so we can easily map the labels to their corresponding numeric values. We find that LabelEncoder produced the following mapping:
United States = 37 [country_n]
Adm-clerical = 2 [occupation_n]
Female = 1 (we already did this manually before modeling) [sex]
Asian-Pac-Islander = 2 [race_n]
Bachelors = 9 [education_n]
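As an aside, here is one possible way a reference file like salary_encoded.xlsx could have been produced (the repo's actual export code may differ; this sketch assumes the openpyxl package is installed and uses the in-memory df and inputs from above):

with pd.ExcelWriter('salary_encoded.xlsx') as writer:
    for col in ['education', 'occupation', 'race', 'country']:
        # pair each original label with its numeric code, one sheet per column
        mapping = (pd.DataFrame({col: df[col], col + '_n': inputs[col + '_n']})
                   .drop_duplicates()
                   .sort_values(col + '_n'))
        mapping.to_excel(writer, sheet_name=col, index=False)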
IMPORTANT: Pay attention to the order of the parameters in predict(). They must match the order of the encoded columns used as inputs, and the number of arguments must match the number of input columns.
To recap, the order of columns in inputs dataframe is: age sex education_n occupation_n race_n country_n
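We can confirm that order directly from the dataframe:

print(list(inputs.columns))  # ['age', 'sex', 'education_n', 'occupation_n', 'race_n', 'country_n']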
So, the first param we pass will be the code for age, then sex, and so forth. If we don't need a column for the query, we still have to pass something in its position: it's very important to set it to 0 or -1 for features we don't want to consider in the prediction. So our code for solving query 1 is:
model.predict([[-1,1,9,2,2,37]])
And the answer is: [0], meaning: No! That is, in the USA, the salary of an Adm-clerical professional, female, of race Asian-Pac-Islander, with a Bachelors degree is NOT > $70K, based on this large dataset.
Let's compare this prediction with the original dataset: open it in Excel, convert it to a table, and filter each column to the above criteria using only the label columns, and we see:
education occupation race sex country salary
Bachelors Other-service Asian-Pac-Islander Female United-States >70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Bachelors Adm-clerical Asian-Pac-Islander Female United-States <=70K
Which means only 1 person with those features made >70K while 9 others didn't (i.e., 90% didn't)…so the prediction is spot on!
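We could also run the same kind of cross-check in pandas instead of Excel. A sketch, using the cleaned, in-memory df from above (recall that sex and salary are already 0/1 there):

match = df[(df['education'] == 'Bachelors') &
           (df['occupation'] == 'Adm-clerical') &
           (df['race'] == 'Asian-Pac-Islander') &
           (df['sex'] == 1) &
           (df['country'] == 'United-States')]
print(match['salary'].value_counts())  # counts of 0 (<=70K) vs. 1 (>70K)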
Let’s try one more.
PREDICTION QUERY 2: In the USA, is the salary of an Exec-managerial professional, a white male, with a Masters degree > 70K?
Again, from 'salary_encoded.xlsx' we find the following codes:
United States = 37 [country_n]
Exec-managerial = 4 [occupation_n]
Male = 0 (we already did this manually before modeling) [sex]
White = 4 [race_n]
Masters = 12 [education_n]
So, our code is going to be:
model.predict([[-1,0,12,4,4,37]])
And the answer is: [1], meaning: Yes! That is, in the USA, the salary of a white male Exec-managerial professional with a Masters degree is greater than $70K.
To verify, if we filter the columns in the source dataset according to the features in the query, we see Execs with Masters degrees in the USA do (almost all) make >70K. Spot on!
And, let’s try one more variation.
PREDICTION QUERY 3: In Ireland, is the salary of a 50-year-old professional (any job) with Some-college > 70K? (any race, any sex)
From 'salary_encoded.xlsx' we find the following codes:
Ireland = 19 [country_n]
Age = 50 [age]
Some-college = 15 [education_n]
So, our code is going to be:
model.predict([[50,-1,15,-1,-1,19]])
And the answer is: [1], meaning: Yes! That is, in Ireland, a 50-year-old professional with Some-college can make over $70K (in any job, of any race or sex).
Looking at the dataset, we see half of the people in Ireland with some college make >70K and half make <=70K, and they're all over 40 years of age. The model predicts that a 50-year-old can make >70K, so that makes sense too.
I hope this gives you a sense of how to take a raw dataset all the way to solving complex problems, using large datasets and harnessing the power of machine learning and computing. In normal SQL or database queries, no matter how large the dataset is, your queries only extract the data matching the search criteria; here, with machine learning, we're predicting outcomes for feature combinations that may not even exist in the dataset, through a model learned from the data (in this case, a tree of decision rules)! That is the beauty of machine learning, and with Python libraries, we didn't have to do a single calculation by hand, really! I hope it was fun and educational for you. (Please give my book, although not related to this blog, some love so I can continue to use the revenues to create more content and share freely with the world.)
Full code, dataset and encoded inputs are all in my Github repo: https://github.com/flyingsalmon/DecisionTree/
Interested in creating programmable, cool electronic gadgets? Give my newest book on Arduino a try: Hello Arduino!