Sunday, October 01, 2023

# Three sure ways to find Outliers in your data

I often see questions about how to find outliers in a dataset of any size the effective way. From what I’ve seen, there are many suggestions/responses and most of them are INCORRECT unfortunately. Here, I’ll show you 3 accurate and proven ways to find outliers in any data.

I’ll get to the other columns soon (which I built to find the outliers), but for now focus on the column titled “Data”. Notice that the data is NOT sorted. And we won’t need it sorted either! METHOD 1: FORMULAS

The first method is to simply use the foundational concepts of outliers and applying them using formulas. I’m showing the solutions here in Excel but any spreadsheet will do.

First, we need to find the first quartile and 3rd quartile of the dataset. That’s easily done using QUARTILE.INC() formulas with k=1 and 3 respectively. Do this for each value in the dataset. (We could use the exclusive formula QUARTILE.EXC() but if you do, you’ll need to be consistent also with the Box & Whisker chart later on.)

Once we have the quartiles, we need to find the inter-quartile range which is the absolute difference between the two quartiles. That’s shown in IQR column. Then we need to find the upper and lower bounds of the dataset. The surest way (and most acceptable way in Statistics) to do this is to find values that fall OUTSIDE of the 1.5 times the value of IQR. So, 1.5XIQR above the 3rd quartile’s max value or 1.5X below the 1st quartile’s minimum value.

In other words, to find the upper-bound, you can use this formula =F3+(1.5*G3) and to find the lower-bound, you would use this: =E3-(1.5*G3) assuming E3 contains the first quartile value, F3 contains the 3rd quartile value, and F3 contains the IQR value.

Now, anything that falls outside of this upper and lower range are Outliers.

We can take it a step farther—instead of manually matching each row (we could be dealing with thousands of records or more) and seeing it’s within that range or not, I added another calculated column called “Outlier?” which basically yields a TRUE/FALSE based on if the cell next to it is greater than the upper-bound or less than the lower-bound! Finally, I added a conditional formatting on that column so it automatically colors only the TRUE values so they stand out from the rest.

We can clearly see the outliers are 985, 923, 769 sitting above the upper-bound. And -411, -668 sitting below the lower-bound.

METHOD 2: SCATTER PLOT

To visually find the outlier points, plot a scatter plot using the dataset as above. We get something like this: I have highlighted with red-dots the outliers to make them stand out. Although this is the easiest way to identify some outliers, this is not the most accurate way because it’s not clear if you should pick 2 outliers or 4 or 7 in the beyond upper-bound range. Only by doing the math in METHOD 1, you’ll know for sure, but this is a quick-and-dirty way to get a sense and when used in conjunction with METHOD 1, it’s very powerful.

METHOD 3: BOX & WHISKER CHART

Using the same dataset, we insert a box & whisker chart as below. Remember, we use INCLUSIVE version of the formula for finding quartiles in METHOD 1? To match this visual identical to that method’s findings, we will need to tell it to use INCLUSIVE method to calculate its mean after the chart is plotted (in Excel, click on the Box and select Quartile Calculation: Inclusive Median from the Format Data Series pane). This is clearly showing the outliers  beautifully (I’ve highlighted the numbers in orange, and formatted the mean text for explanation only here). The ‘X’ in the box is showing the Mean value of the dataset: 299.04 (when reduced to 2 decimals).

The upper end of the vertical line shows the max value (that’s NOT an outlier): 666

The lower end of the vertical is showing the min value (that’s NOT an outlier): -33. That is they’re both edges within Upper and Lower bounds.

The remaining points (dots) are the outliers! And they match perfectly with our findings in previous 2 methods—top: 985, 923, 769 and bottom: -411, -668