Machine Learning (Prediction with Dummy Variables)

I have already shared ways to use Python for machine learning to predict values with univariate and multivariate regression models. Be sure to read those posts first, as this one builds on the same concepts. In this post, we’ll again make predictions using a regression model, but this time we have categorical values to deal with. Categorical values cannot be used directly in a mathematical formula because they’re really labels, such as male/female, tall/short, car model names, zip codes, country names, color names, animals, etc.

Categorical values are of two types:
1) Nominal: The categories have no inherent order, so they cannot be placed on a numeric scale, e.g. man/woman, colors, zip codes, county and town names.
2) Ordinal: The categories have a natural order and can be placed on a numeric scale, e.g. high/medium/low ratings, satisfied/neutral/dissatisfied, old/young/infant.
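
The distinction matters when labels are turned into numbers. As a rough sketch (the rating and color labels, the 0/1/2 scale, and both helper functions are illustrative assumptions, not part of the car dataset), an ordinal value can be mapped to ranked integers, whereas a nominal value is better expressed as separate 0/1 indicators:

```python
# Ordinal: the categories have a natural order, so ranked integers are reasonable.
# The 0/1/2 scale is an illustrative assumption.
RATING_SCALE = {"low": 0, "medium": 1, "high": 2}

def encode_ordinal(labels, scale):
    """Map each ordered label to its rank."""
    return [scale[label] for label in labels]

def encode_nominal(labels):
    """Return the sorted categories and one 0/1 indicator per category for each label."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows

ratings = encode_ordinal(["high", "low", "medium"], RATING_SCALE)
categories, indicators = encode_nominal(["red", "blue", "red"])

print(ratings)      # [2, 0, 1]
print(categories)   # ['blue', 'red']
print(indicators)   # [[0, 1], [1, 0], [0, 1]]
```

The 0/1 indicators in the nominal case are exactly what the dummy variables later in this post are.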

In this example, we’ll predict the price of a specific model of used car based on features such as its mileage and age. We have a dataset containing 4 columns with headers: Model, Mileage, Price($), Age(yrs)

Here, Model is our nominal categorical value. We will use ‘dummy’ variables to represent it numerically with the pandas get_dummies() function. The dataset looks like this when loaded in Excel (it’s a CSV text file):

Model              Mileage  Price($)  Age(yrs)
Subaru Outback       61373     27878         5
Subaru Outback       59999     28576         3
Subaru Outback       10190     35999         2
Subaru Outback       60721     22576         7
Volvo V90            11093     56000         2
Volvo V90            32678     50813         4
Volvo V90            30552     55887         3
Volvo V90            33425     38980         4
Chrysler 300C        20550     26986         4
Chrysler Pacifica    44372     29987         5
Chrysler Pacifica    60466     25998         2
Chrysler 300C        51989     21991         6

We’ll be using the same libraries as with the other regression examples linked above. After we read the dataset (a CSV file)…

DATASOURCE ='carprices.csv'
import pandas as pd
df = pd.read_csv(DATASOURCE)

…we create the dummy variables with the statement:

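# dtype=int keeps the 0/1 values shown below; newer pandas versions default to True/False
dummies = pd.get_dummies(df['Model'], dtype=int)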

Now the dummies will look like this with model names in their own columns:

Chrysler 300C  Chrysler Pacifica  Subaru Outback  Volvo V90
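
To see what get_dummies() produces without needing the CSV file, here is a small sketch on a hand-built dataframe. The four rows are copied from the dataset above; passing dtype=int is my own choice so that the output is 0/1 rather than True/False on newer pandas versions.

```python
import pandas as pd

# One row per car model, with values copied from the dataset.
df = pd.DataFrame({
    "Model": ["Subaru Outback", "Volvo V90", "Chrysler 300C", "Chrysler Pacifica"],
    "Mileage": [61373, 11093, 20550, 44372],
})

# One 0/1 column per distinct model name, in alphabetical order.
dummies = pd.get_dummies(df["Model"], dtype=int)

print(list(dummies.columns))
print(dummies.sum(axis="columns").tolist())  # every row contains exactly one 1
```

Each row has a single 1, in the column named after that row's model, and 0 everywhere else.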

Next, we need to join the dummies with the pandas dataframe by:

merged = pd.concat([df,dummies],axis='columns')

The merged dataframe looks like this at this point in memory:

       Model  Mileage  ...  Subaru Outback  Volvo V90
0      Subaru Outback    61373  ...               1          0
...etc.

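The model names are now spread out as their own columns, so we drop the original Model column from our dataframe. We also drop one of the dummy columns (Volvo V90 here) to protect against what’s known as the Dummy Variable Trap: with all four dummies present, any one of them can be derived from the other three, so it is redundant. Finally, we remove (drop) the Price($) column (3rd in our dataset) from the features, as this is the dependent variable.

final = merged.drop(['Model','Volvo V90'],axis='columns')
X = final.drop('Price($)',axis='columns')

After dropping it, we save the dataframe in the variable X, which looks like this in memory: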

    Mileage  Age(yrs)  Chrysler 300C  Chrysler Pacifica  Subaru Outback
0     61373         5              0                  0               1
...etc.
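
The Dummy Variable Trap mentioned above can be checked numerically. The sketch below is an illustration with NumPy rather than part of the original workflow: the ages and the grouping of two cars per model are made-up inputs. It shows that keeping every dummy column makes the feature matrix (with its intercept column) rank-deficient, while dropping one dummy fixes that.

```python
import numpy as np

# Two cars per model (4 models), with illustrative ages in years.
age = np.array([5, 3, 2, 4, 4, 6, 5, 2], dtype=float).reshape(-1, 1)
model_index = np.array([0, 0, 1, 1, 2, 2, 3, 3])
dummies = np.eye(4)[model_index]      # one-hot rows, shape (8, 4)
intercept = np.ones((8, 1))           # the constant term a regression adds

# All four dummies kept: they sum to the intercept column, so one column is redundant.
full = np.hstack([intercept, age, dummies])
print(full.shape[1], np.linalg.matrix_rank(full))        # 6 5

# One dummy dropped (the last one stands in for the baseline model): no redundancy.
reduced = np.hstack([intercept, age, dummies[:, :-1]])
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 5 5
```

When the rank is lower than the number of columns, the regression coefficients are no longer uniquely determined, which is why one dummy column is left out.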

Notice how the dummy variables are assigned a 0 or 1 by pandas. In order to do the fitting, we need to set the y variable to the price (the dependent variable):

y = final['Price($)']

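Next, we create a linear regression model with scikit-learn and train it using its fit() method:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)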
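
Putting the steps so far together, here is an end-to-end sketch that runs without carprices.csv by inlining the twelve rows from the dataset. Two things in it are my assumptions: that the model is scikit-learn's LinearRegression, and that 'Volvo V90' is the dummy column that gets dropped (inferred from the X columns listed above).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The twelve rows of the dataset, inlined in place of reading the CSV file.
rows = [
    ("Subaru Outback", 61373, 27878, 5),
    ("Subaru Outback", 59999, 28576, 3),
    ("Subaru Outback", 10190, 35999, 2),
    ("Subaru Outback", 60721, 22576, 7),
    ("Volvo V90", 11093, 56000, 2),
    ("Volvo V90", 32678, 50813, 4),
    ("Volvo V90", 30552, 55887, 3),
    ("Volvo V90", 33425, 38980, 4),
    ("Chrysler 300C", 20550, 26986, 4),
    ("Chrysler Pacifica", 44372, 29987, 5),
    ("Chrysler Pacifica", 60466, 25998, 2),
    ("Chrysler 300C", 51989, 21991, 6),
]
df = pd.DataFrame(rows, columns=["Model", "Mileage", "Price($)", "Age(yrs)"])

# One 0/1 column per model name, appended to the original columns.
dummies = pd.get_dummies(df["Model"], dtype=int)
merged = pd.concat([df, dummies], axis="columns")

# Drop the text column and one dummy column, which becomes the baseline.
final = merged.drop(["Model", "Volvo V90"], axis="columns")

X = final.drop("Price($)", axis="columns")  # features
y = final["Price($)"]                       # target

model = LinearRegression()
model.fit(X, y)

print(list(X.columns))
print(dict(zip(X.columns, model.coef_)))    # one coefficient per feature column
```

The coefficients on the dummy columns can be read as price offsets relative to the dropped baseline model.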

Finally, we can predict price (find dependent var) from any given features (independent vars).

IMPORTANT: The order of the values passed to predict() must match the column order of X, not the column order of the data source. e.g. the data source has Model, Mileage, Price($), Age(yrs), whereas X has Mileage, Age(yrs), followed by the dummy variable columns.

To find the price of a Chrysler 300C with 30,000 miles and 5 years of age, first look at the X dataframe to see the order:

    Mileage  Age(yrs)  Chrysler 300C  Chrysler Pacifica  Subaru Outback
0     61373         5              0                  0               1
1     59999         3              0                  0               1
...
11    51989         6              1                  0               0

You can ignore the very first column (0, 1 … 11), as it’s the default index that pandas prints. It is not in the dataset, so our first column is Mileage, and ‘Chrysler 300C’ is the third column. Also note that we have 3 dummy variables as columns in our dataframe and Chrysler 300C is the first dummy column.

We call model.predict() to get the price:

p=model.predict([[30000,5, 1,0,0]])

The first 2 values are the mileage and age respectively. The last 3 values are for the dummy variables (we have 3 in our dataframe). The first of these is set to 1 and the rest are set to 0, because the model we’re predicting for (Chrysler 300C) is the first dummy variable in order.
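
Since getting the order wrong silently produces a wrong prediction, it can help to build the row by column name instead of by hand. The helper below, make_row, is a hypothetical function of my own, not something from pandas or scikit-learn; it assumes the column names shown above.

```python
# Column order of X, as shown above.
FEATURE_COLUMNS = ["Mileage", "Age(yrs)", "Chrysler 300C", "Chrysler Pacifica", "Subaru Outback"]

def make_row(columns, mileage, age, car_model):
    """Build one feature row in the same order as `columns`.

    Any column that is not Mileage or Age(yrs) is treated as a dummy column:
    it gets 1 if its name equals car_model and 0 otherwise. A model with no
    dummy column of its own (the dropped baseline, or a misspelled name)
    therefore gets 0 in every dummy column.
    """
    numeric = {"Mileage": mileage, "Age(yrs)": age}
    return [numeric[name] if name in numeric else int(name == car_model)
            for name in columns]

chrysler_row = make_row(FEATURE_COLUMNS, 30000, 5, "Chrysler 300C")
volvo_row = make_row(FEATURE_COLUMNS, 30000, 5, "Volvo V90")

print(chrysler_row)  # [30000, 5, 1, 0, 0]
print(volvo_row)     # [30000, 5, 0, 0, 0]
```

With a fitted model, the result could be passed along as model.predict([make_row(list(X.columns), 30000, 5, 'Chrysler 300C')]), so the order always follows X.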

Now, p contains the predicted price! The output looks like this:

[25487.48831031]

This means a Chrysler 300C that has 30K miles on it and is 5 years old is predicted to cost about $25,487. If you look at the original dataset above, that sits sensibly between the prices of the two Chrysler 300C cars listed there.

Similarly, to find the price of a Volvo V90 with 30K miles and 5 years of age, we set all three dummy variables to 0, because Volvo V90 is the model whose dummy column we dropped:

p=model.predict([[30000,5, 0,0,0]])
print(p)

And we get [48080.60309935] or $48,080. Again, it makes perfect sense when we look at our dataset.

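We can also check how well the model fits the data. This gives us some idea of how much we can trust the prediction. The nature of the dataset as well as the method used can affect this value. To get the score, we do this:

print(model.score(X,y))
>>> 0.9113771853190252

Meaning, the model explains about 91% of the variation in price. Note that score() returns the R² value (coefficient of determination), not a percentage of correct predictions, and that it is computed here on the same data used for training, so treat it as an optimistic estimate.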
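
For a regression model, the number returned by score() is the R² value (coefficient of determination). To make that concrete, here is a small plain-Python sketch of the formula, R² = 1 - SS_res / SS_tot. The function name r_squared is my own, and the four prices are simply the first four from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

prices = [27878, 28576, 35999, 22576]
mean_price = sum(prices) / len(prices)

perfect = r_squared(prices, prices)                          # predictions equal the truth
baseline = r_squared(prices, [mean_price] * len(prices))     # always predict the mean

print(perfect)   # 1.0
print(baseline)  # 0.0
```

A score of 1.0 means the predictions match the data exactly, and 0.0 means the model does no better than always predicting the average price.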

This took several small steps, but understanding each step and its purpose can make the difference between getting the intended results and getting completely incorrect ones. I hope this was educational and interesting. Now we know how to work with univariate and multivariate regressions, handle categorical values, and make the computer predict from multiple features using machine learning.

