STEM

Combo Scatter Plots, Normalization

In this article, I am presenting two key concepts: combining multiple scatter plots into a single plot for easy comparisons, and bringing data on different scales to a common one via normalization so that they can be accurately compared. To illustrate those concepts, let’s start with two datasets: 1) Data on a class of students with their weekly cups of coffee consumption, and hours they studied during that time. 2) Data on a class of students with their weekly cans of soda consumption (instead of coffee), and hours they studied during that time.

The dataset may be arranged in 3 columns for the coffee data: Student#, Coffee consumed per week (cups), Hours studied per week. Similarly, for the soda data: Student#, Soda consumed per week (cans), Hours studied per week. The number of students tracked is 20. For this example, we’ll keep the number of units for coffee and soda the same. Afterward, I introduce data where the units are different while keeping the number of students same, which will require normalization.

First of all, we can plot the datasets as two different scatter plots as that type of chart is best for seeing relationships between x (units consumed) and y (hours studied). They are shown below:

It’s quite easy to see the number of units (cups or cans) consumed and the correlated hours put into studying on both charts, but it’s not so easy to see each student’s effects with coffee vs soda! It’d be more useful to see them overlaid on one plot so the distance or difference of a student’s study behavior with coffee and with soda can be readily drawn. That combined scatter plot is shown below.

Now it’s far more useful. We find that in almost all cases the students put in more hours with coffee than with just soda. The exception(s) are also easy to see. For example, we see in only one instance a student studied more hours under soda than under coffee. And in another instance, they were the same (overlapping at intersection of 10th cup and 15 hours). We can still do better.

Without having to even match the legend symbols, we can use obvious images for the datapoint markers for each series: a cup of coffee for each coffee data point, a can of soda for each soda datapoint. Additionally, we can make it more readable by controlling the transparency to address overlapping datapoints between coffee and soda so that in such cases the datapoint overlapped by another will still show through. The customized combo scatter plot is shown below.

It’s clean, informative, and intuitive!

Remember, in the above example, the number of cofee cups and soda cans are same across all students (the total number of students is kept fixed at 20) and only the hours vary under coffee vs soda. In other words, we’re exploring the change in hours studied given the same amount of units (cups or cans) but different types (coffee vs soda). This is a very common scenario in medicine by the way.

But what if the numbers of coffee cups and soda cups were different but we still want to compare their effects on hours studied but on a common scale for the units (cups or cans)?

For such cases, we need to normalize the data. Normalization scales the number of units for both beverages to a common range while keeping the comparison intact.

Let’s say we have the following dataset for Soda cans which is now different in numbers of consumption per week than coffee (we’ll use the same one as above for Coffee).

Rescaling or normalization can be done in various ways, so whichever method you use, be sure to ensure it’s a proven method and preserves the integrity of the original data. In this example, I’m using the common normalization method which takes into account Mean, Standard Deviation to calculate a z-score for each datapoint for both drinks and z-score becomes the common scale. In Excel, you use AVERAGE() for Mean, and STDEV.P() or STDEV.S() for Standard Deviation of each dataset. I’m using .P() version because the class is entirely made of 20 students and it’s not a random sample subset of the class population. Then for each data point, I calculate the new normalized unit value by deducting the Mean for the series (coffee or soda…each needs to be computed separately as with Standard Deviation) and dividing the difference with the Standard Deviation of the units. Yes, this is exactly the z-score formula: With this done, we bring all the units to a range within -3 to 3.

The normalized data now looks like this:

If we now plot the chart with normalized column values instead of the raw values, we can get a fair comparison between the coffee and soda effects even though the number of units varied over the weeks. The plot now looks like this:

But now there’s a new problem! The x-axis shows the normalized z-scores and not the number of cups anymore. It can be confusing and virtually impossible to match up the x-axis z-score to actual number of cups without looking at the data table. So, how do we get around this so the observer can also see the actual number of cups?

The easiest solution (without having to deal with secondary axis and risk cluttering up the plot) would be to add data labels to the datapoints and remove the x-axis labels. But the data labels need to show the actual/raw values consumed while maintaining their plots positions exactly as their z-scores. To achieve this, we need to first add Data Labels for each series, and instead of the default Values, we’ll instead point the series values to the actual values: Coffee Cups per Week column for the coffee data series; Soda Cans per Week column for the soda series. With some further tuning such as positioning the labels to make them appear clean and visible, and by removing the x-axis, we have our final combo scatter plot that can handle any disparate data and still accuracy depict the relationship!

We can clearly tell the first datapoint shows 3 cans of soda consumption (data label) and studied 5 hours (y-axis) of studying. The second datapoint shows 2 cups of coffee consumption and 5 hours of studying.

If you need the actual workbook containing all these different charts and datasets, that you can use for your own purposes; and detailed step-by-step instructions, contact me here or via Patreon following this link. I hope this was educational as well as interesting. Until next time…

Leave a Reply

Your email address will not be published. Required fields are marked *

Back To Top