Sunday, September 15, 2024

# Three sure ways to find Outliers in your data

I often see questions about how to find outliers in a dataset of any size the effective way. From what I’ve seen, there are many suggestions/responses and most of them are INCORRECT unfortunately. Here, I’ll show you 3 accurate and proven ways to find outliers in any data.

I’ll get to the other columns soon (which I built to find the outliers), but for now focus on the column titled “Data”. Notice that the data is NOT sorted. And we won’t need it sorted either!

METHOD 1: FORMULAS

The first method is to simply use the foundational concepts of outliers and applying them using formulas. I’m showing the solutions here in Excel but any spreadsheet will do.

First, we need to find the first quartile and 3rd quartile of the dataset. That’s easily done using QUARTILE.INC() formulas with k=1 and 3 respectively. Do this for each value in the dataset. (We could use the exclusive formula QUARTILE.EXC() but if you do, you’ll need to be consistent also with the Box & Whisker chart later on. Remember, INC version is for greater than or equal to the limit set, whereas EXC version is for greater than only.)

Once we have the quartiles, we need to find the inter-quartile range which is the absolute difference between the two quartiles. That’s shown in IQR column. Then we need to find the upper and lower bounds of the dataset. The surest way (and most acceptable way in Statistics) to do this is to find values that fall OUTSIDE of the 1.5 times the value of IQR. So, 1.5XIQR above the 3rd quartile’s max value or 1.5X below the 1st quartile’s minimum value.

In other words, to find the upper-bound, you can use this formula =F3+(1.5*G3) and to find the lower-bound, you would use this: =E3-(1.5*G3) assuming E3 contains the first quartile value, F3 contains the 3rd quartile value, and F3 contains the IQR value.

So, anything that falls outside of this upper and lower range are Outliers.

We can take it a step farther—instead of manually matching each row (we could be dealing with thousands of records or more) and seeing it’s within that range or not, I added another calculated column called “Outlier?” which basically yields a TRUE/FALSE based on if the cell next to it is greater than the upper-bound or less than the lower-bound! Essentially, I check if a cell value is >= to the Upper bound or <= to the Lower bound values by using =IF(OR(condition1, condition2), TRUE, FALSE). If any of the condition1 or condition2 is true, we get TRUE returned, otherwise FALSE. This means any cell returning TRUE is an outlier, otherwise they’re not. Finally, I added a conditional formatting on that column so it automatically colors only the TRUE values so they stand out from the rest.

We can clearly see the outliers are 985, 923, 769 sitting above the upper-bound. And -411, -668 sitting below the lower-bound.

Here’s an example of identifying outlier weights of some basketball players. In this example, I have a dataset of basketball players with their names, heights, and weights. I want to identify players whose weights are unusual (outliers) based on the sample data. Here’s an example output:

We can easily see with the conditional formatting based on our quartile, IQR, and multiplier calculations that Ivan R. weighed unusually heavy for the pool, while Davis R. weighed unusually light for the pool of players.

We can take this even further by only showing the outlier information, meaning only extract rows (with all its column information) that are outliers in a neat report. This is extremely useful when the sample data is very large and it’s erroneous or tedious to visually inspect the color codes. In order to achieve that, I used the ‘Outlier?’ column using IF(OR()) formula, and then I used a FILTER() function on that range only to extract those rows that have TRUE in them (meaning, just the outlier rows). And done!

METHOD 2: SCATTER PLOT

To visually find the outlier points, plot a scatter plot using the dataset as above. We get something like this:

I have highlighted with red-dots the outliers to make them stand out. Although this is the easiest way to identify some outliers, this is not the most accurate way because it’s not clear if you should pick 2 outliers or 4 or 7 in the beyond upper-bound range. Only by doing the math in METHOD 1, you’ll know for sure, but this is a quick-and-dirty way to get a sense and when used in conjunction with METHOD 1, it’s very powerful.

METHOD 3: BOX & WHISKER CHART

Using the same dataset, we insert a box & whisker chart as below. Remember, we use INCLUSIVE version of the formula for finding quartiles in METHOD 1? To match this visual identical to that method’s findings, we will need to tell it to use INCLUSIVE method to calculate its mean after the chart is plotted (in Excel, click on the Box and select Quartile Calculation: Inclusive Median from the Format Data Series pane).

This is clearly showing the outliers  beautifully (I’ve highlighted the numbers in orange, and formatted the mean text for explanation only here). The ‘X’ in the box is showing the Mean value of the dataset: 299.04 (when reduced to 2 decimals).

The upper end of the vertical line shows the max value (that’s NOT an outlier): 666

The lower end of the vertical is showing the min value (that’s NOT an outlier): -33. That is they’re both edges within Upper and Lower bounds.

The remaining points (dots) are the outliers! And they match perfectly with our findings in previous 2 methods—top: 985, 923, 769 and bottom: -411, -668