Wednesday, July 17, 2024

# Data Normalization & Rescaling

Normalizing data is a common task in many applications, especially when working with large datasets, machine learning, or statistical analysis. There are two common statistical methods for normalization: Min-Max Scaling, Standardization or Z-score Normalization. But there are other ways too, which I will demonstrate in the examples below.

1. Min-Max Scaling (Normalizes Data to Between 0 and 1):
Min-max scaling normalizes data so that the minimum value becomes 0, and the maximum value becomes 1. This method is useful when you want to compare values across different scales.
The formula for min-max scaling is:
` MIN-MAX Normalized value = (Value - Min Value) / (Max Value - Min Value) `

2. Standardization or Z-score Normalization. (Normalizes data so Mean=0 and Standard Deviation=1):
Standardization modifies the dataset so that the mean is 0, and the standard deviation is 1. It allows effective comparison of data with different scales.
The formula for standardization is:
` Normalized value = (Value - Mean ) / Standard Deviation ` Or, `STANDARIZE(x, mean, stdev)` in Excel

The STANDARDIZE function in Excel can be used to normalize data instead of the above manual calculation in Standardization (Z-score Normalization).
It returns a normalized value (z-score) based on the mean and standard deviation of the dataset, effectively normalizing the data.

3. There is another common method of rescaling, that’s by using Proportion.
This is very useful to normalize to specific or arbitrary scales (not just 0 to 1 or Z-scores). And is useful to compare values that were originally measured on different scales.
Let’s look at some examples with real numbers and scenarios.

EXAMPLE 1:
We have the following student scores from 3 different tests where each test’s maximum score are different.

We can easily convert them to percentage (a common scale across all tests) as shown in ‘Proportion or Pct.’ column. The ‘Proportion or Pct.’ column gives us the scores if all tests were brought to 100 max score scale. We will also use MIN-MAX, and Standaridzation methods (using both STANDARDIZE() and manual formula above) to bring the scores to a common scale.

In the above example, we used MIN-MAX normalization as well as STANDARDIZATION (using STANDARDIZE()) methods to rescale the ‘Original Score’ values. We can see that thee MIN-MAX values have converted the ‘Original Score’ values to between 0 and 1. 0 being the lowest score, 1 being the highest. And 60/80 given 0.545. Both STANDARDIZE() and Normalized return the same values (as expected as they use the same method) into Z-score values range. Remember, Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.

What are the unique benefits of using Standardization or Z-score Normalization?

Standardization, also known as Z-score normalization, is another common method for scaling features. It transforms the data to have a mean of 0 and a standard deviation of 1.
Here are some unique benefits of using standardization:

1. Handling Outliers: Unlike min-max normalization, standardization does not bound values to a specific range. This means that it is less sensitive to outliers.
A single outlier can shift the minimum or maximum of the data when using min-max normalization, but standardization is more robust to this.
2. Useful for Certain Algorithms: Some machine learning algorithms, like linear regression, logistic regression, and linear discriminant analysis, assume that all features are normally distributed.
These algorithms can perform better if the features are standardized.
3. Scale Invariance: Standardization makes the results of your analysis scale invariant, meaning the same scale of measurements in different units can be compared.
For example, you can compare the weight of something (in pounds or kilograms) with the height (in inches or meters).
4. Interpretability: After standardization, the transformed feature values can be interpreted as the number of standard deviations away from the mean.
This can be useful for understanding the relative importance of features.

EXAMPLE 2 | Proportion Method:

Objective: I have several student test scores. Some tests had a maximum possible of 40 points, some had 80, some had 100. I want to bring all the scores to a common scale such as to a common scale (e.g. 0 to 100 points). We’ll use the same dataset of tests as in previous example.

SOLUTION:
We can normalize the scores to a common scale using a simple formula in Excel. The formula we need is a basic proportion calculation.
The formula is: `(OriginalScore/OriginalMaxPossibleScore)*NewMaxPossibleScore`
This method brings all scores to a common scale of 0 to the new scale. If the new scale’s maximum is 100, then we multiply the division by 100. If the new scale is 50, then we would multiply the division by 50, and so on. Pretty straightforward.

On a scale of 50 max, the scores would look like this:

EXAMPLE 3 | Logarithmic Method:
Here we have population data of a few cities where the numbers vary greatly between the smallest and largest numbers.

We can see below that without rescaling the chart makes it hard to see the smaller cities portions.
Even with MIN-MAX scaling, the proportions remain the same although we have the numbers much smaller, making the smaller cities hard to see on chart.
Same challenge with Proportions method as a percentage or x/max_value.
But with STANDARDIZE method, by bringing in negative to positive range, we have a broader visual range.
However, the differences between the smaller cities are hard to see, and the Data values or labels would not be useful after any rescaling (other than Proportion method).
Another method is Logarithmic. Here we select the original data without rescaling anything, and chart them. Then in Excel, just for the axis to Logarithmic scale.
We could manually rescale using LOG10() and visually it’ll be same as using the Logarithmic method above, but the values will be changed, so data labels won’t be accurate.

The best solution in this case is the Logarithmic scale for visuals in this case, as we can also add Data Labels which read correctly.

Can we compare the proverbial apples with oranges using standardization?
Yes, we can. Standardization and normalization are techniques that allow us to compare “apples to oranges” in a meaningful way. They transform different features (or “fruits”) to a common scale, allowing us to make valid comparisons. Just like you might compare apples and oranges based on their weight or color, these techniques let us compare different features based on their standardized or normalized values. Let’s consider weights of a sample data of people in completely different units.

First, we convert to a common unit, then we apply standardization and the standardized weights look like this:

Now, all the weights have a mean of 0 and a standard deviation of 1, making them easier to compare. This is the essence of standardization. This clearly shows that anything positive is above average and anything negative is below average weight. So, a standardized weight of 1.19 (like for Person C in our example) means that Person C’s weight is approximately 1.19 standard deviations above the average weight. (You can verify the person’s weight with the dataset above by manually adding the appropriate standard deviation as shown in the ‘Standardized Weight’ column and they’ll be very close (the standard deviation is: 6.063569081 and the Mean is 74.452), any difference may be due to the decimal precision in your calculation.

EXAMPLE 4 | More Advanced MIN-MAX
We have the following dataset from some survey feedback on ratings. Unfortunately, the results were recorded in a different way but we need to bring them to 1 to 5 star rating system so that minimum is 1, and highest is 5. How can we do this?

We can use the MIN-MAX normalization. Although it usually brings values between 0 and 1 (inclusive), we can tweak the formula to set any new min and new max values. We just need the min, max values from the dataset, and then use the new min, new max for the new scale.

Use this formula: `Normalized value = ((Value - Min Value) / (Max Value - Min Value)) * (new max - new min) + new min`

And boom! We have our new standardized numbers in positive scale from 1 to 5 inclusive. Let’s take it a step further. Let’s make them whole numbers so we can actually report these visually as stars.

This is done by rounding the standardized results to nearest whole number, and then to draw the stars, we just use the Unicode codepoint 9733. Then using REPT function of Excel, we can plug in this codepoint as an argument to UNICHAR function, and there we get the exact depiction of star ratings visually.

Check out my detailed post on this type of visualization here: https://flyingsalmon.net/mckinsey-style-bar-chart-in-excel/

I hope this was educational and interesting. Some of the related concepts, on which I have separate posts, include: Z-score, Normal distribution, t-test, Binomial and Poisson distributions. You can search for them on this site, and remember, I’ll be adding more on related topics. Enjoy!

Related: