Friday, September 22, 2023

# Compare and understand: Spread and Consistency

Imagine you have a product line with an average price of \$20, and another product line where the average is \$100. Which has more spread?

Imagine you have 3 players whose bowling scores you have (which are all over the place by the way) and you want to know which player is more consistent? Or, put another way, you want to compare the 3 players perhaps to decide who to recruit.

Also imagine, you have house prices from two different neighborhoods and you want to know which neighborhood has large discrepancies between house prices.

Finally, imagine you have data sets of heights (in inches) and weights (in pounds), and you want to know which data set has the bigger spread or variation.

How would you find the answers? How would you compare inches and pounds when they’re different units? In this blog, I show you how to do exactly that.

First, we need to utilize Standard Deviation (SD), and then we’ll take it a step further and use Coefficient of Variation (CoV) to find our answers. I won’t discuss the statistical theories of what they are here (you can study them free online) but rather the practical usage of them in these real scenarios.

### Why Coefficient of Variation (CoV)?

In most cases when the units are the same and datasets are homogeneous, we can use SD to compare. The higher the SD, the more the spread. If we’re lazy, we would simply compare the MEANs of each dimension and that would be the wrong thing to do. Also, note that while SD can be used for comparing apples and apples, it cannot be meaningful when comparing apples and oranges! Can we even compare apples and oranges? Yes, if the data sets are in different scale or units, we need to use CoV instead.

We’ll see a few examples of these application below for different scenarios.

### Example: House Prices

The prices for 20 sample houses per zipcode are below. Which zipcode (neighborhood) house prices have the bigger spread? Looking at the average prices, Zip 99099 clearly has the largest prices. However, when we calculate CoV, we see that the spread is actually more in Zip 11011. Meaning, the difference between house prices fluctuate more there: 47.4% over 44.78%. To find the CoV, we calculated the SD using STDEV.S() or .P() function depending on your sample and population of the data. They we find the CoV by the simple formula: CoV = (SD/Mean)*100

### Example: Height vs Weight

We have the following heights and weights data. Which data set (height or weight) has more spread? Just by looking at the SD value, you’d think Weight, but the fact is Height has more spread (6.9% vs 6.5%). You can verify the CoV values above.

### Example: Bowling Scores

We have scores from the following three players over 20 games. Which player is most consistent? Who would you recruit? Gomez’s average score is the best! However, is he always reliable? The SD shows a large variation but it’s confirmed looking at CoV that his scores are more “all over the place”. (You can also verify this visually by plotting a scatter plot for each)

Getting back to the data set, we can plot the average and CoV (chart titled CV below) and Zscores. The key element to notice is that while Gomez’s average score is highest, his CoV is also highest. What does that mean? That means, he’s most INCONSISTENT but capable of scoring sometimes the highest points!

(You can also see Zscore calculated and charted above, we’ll get to that in a moment.)

So, the question is: do you recruit him?

What if we took out the outliers? Take out the really high and low scores that are identified as outliers and re-average…then, what would it show?

To identify the outliers, I use quartiles to find the lowest and highest buckets. An easier way to do it is a box & whisker chart. Now we clearly find the outlier score for each. If we replace just those for respective players with their average (without the outliers), we will get the new, re-calibrated average as below (replaced scores shown in red in table). As you can see, Lucas actually scores on average the highest, outliers not considered. Why did we find Zscore? A z-score, or standard score, is a way of standardizing scores on the same scale by dividing a score’s deviation by the SD. It shows the number of standard deviations the overall score is from the mean. So, Gomez and Lucas are both above the mean by that many units (see ZScores chart above) but they’re both pretty close.

So, now you have different analytics on these three players. You’re the manager! Who would you pick?

Hope this was a fun experiment for you. As you can see the applications are many in real-world situations. The formulas are generally simpler, however, the key to getting the right information lies in understanding which method/tool/measure to use and why.