Box and Whisker Plot (seaborn example) – Musings by FlyingSalmon

In a previous post, I shared an example of a Box and Whisker plot with a small sample dataset. In this post, I’m using a larger dataset from Python’s Seaborn library and going a little deeper with more examples.

The Seaborn library ships with a dataset called ‘tips’ that’s perfect for illustrating Box and Whisker concepts and how to create the plots in Python. It contains transactions from a restaurant (or several restaurants): 244 rows of data plus a header row, with these columns: total_bill, tip, sex, smoker, day, time, size. The first two are dollar figures (floats), and ‘size’ is the number of customers in the party (integers from 1 to n). ‘sex’ contains categorical values: Male or Female as strings. ‘day’ contains a 3 to 4-letter abbreviation for the day of the week when the customers transacted, and ‘time’ contains categorical values: Lunch or Dinner. ‘smoker’ contains just Yes or No (we won’t be using this column for our analysis).

Some of the things we want to glean from this data are: which days generate the largest transactions, and in which part of the day; where the top x% of sales fall by day; which days generate the most tips; whether there are outliers; on which days and at what times customers arrive in larger parties; whether tipping patterns differ between male and female customers; and what the distribution looks like for each of the metrics. We could ask even more, but I think these will get the point and usefulness of Box and Whisker plots across. These can be done in Excel, Python or other tools. I have chosen to do this exercise in Python.

The Process

Loading the dataset: First, we need to import the seaborn library (under the alias sns), then use its load_dataset() function to load the entire dataset into a dataframe. We’ll use matplotlib.pyplot to display the plots. The statements for those are:

import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset('tips')

We should always inspect the dataset for missing values and check its shape. For example, df.isna().sum() shows how many values are missing in each column. If every column shows 0, all is good; otherwise, we’ll need to decide how to address that (for example, we might ignore the missing values, drop a specific column entirely, or interpolate the missing data using one of several methods, all depending on the objective of the analysis). In this case, we find no missing values. However, just because there are no blanks in any row doesn’t mean the data is 100% complete. For example, if we count how often each day name appears in the dataset, we find that most of the data is for Sat (87 rows), the least is for Fri (19 rows), and there’s no data at all for Mon, Tue or Wed. For this analysis that is not a problem, but if we were computing metrics across all 7 days it would certainly need addressing. Using df.describe() we can get an overview of the numeric columns (count, mean, standard deviation, minimum, quartiles and maximum).
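The inspection steps described above can be sketched roughly as follows. To keep the snippet runnable without a download, it builds a tiny stand-in dataframe with the same column names; the rows are hypothetical and only there for illustration. In the actual exercise, df comes from sns.load_dataset('tips').

```python
import pandas as pd

# Hypothetical stand-in rows with the same columns as 'tips'.
# In the post itself, df = sns.load_dataset('tips') is used instead.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "No", "No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sat", "Sat", "Thur", "Fri"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Lunch", "Lunch"],
    "size": [2, 3, 3, 2, 4, 4],
})

missing_per_column = df.isna().sum()     # number of missing values in each column
rows_per_day = df["day"].value_counts()  # how many transactions each day has
summary = df.describe()                  # statistics for the numeric columns only

print(df.shape)            # (rows, columns)
print(missing_per_column)
print(rows_per_day)
print(summary)
```

Days that never occur (Mon, Tue, Wed in the real dataset) simply don’t show up in the value_counts() result for a plain string column, which is how the gap becomes visible.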

From this summary we can easily see the largest sale, the largest tip and the largest party size, along with their smallest and average counterparts, because these are numeric columns. To plot the charts, we use the boxplot() function of the seaborn module, imported as sns above. To create a chart of total bills by the day of the week, use this statement: sns.boxplot(data=df, x='day', y='total_bill') followed by plt.show() to display the plot on screen. Without further ado, let’s take a look at some interesting plots I generated from this dataset.
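Here is a minimal sketch of that plotting call with a title and axis labels added. The dataframe is a hypothetical stand-in with made-up bills so the snippet runs on its own; the title and label texts are my own choices, not something seaborn requires.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows; in the post, df = sns.load_dataset('tips').
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Fri",
            "Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "total_bill": [12.5, 17.0, 21.3, 11.8, 15.4, 22.9,
                   14.2, 20.6, 31.7, 16.9, 23.4, 28.1],
})

fig, ax = plt.subplots()
# One box per day on the x axis, bill amounts on the y axis
sns.boxplot(data=df, x="day", y="total_bill", ax=ax)
ax.set_title("Total bill by day")
ax.set_xlabel("Day of the week")
ax.set_ylabel("Total bill ($)")
```

Calling plt.show() afterwards displays the figure. Swapping y="total_bill" for "tip" or "size" gives the corresponding plots for tips and party sizes.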

Sundays are big sales days for this restaurant.

Showing 98% of the sales by excluding the top 2%. The bottom range is kept the same.
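One possible way to produce a view like this is to drop the rows above the 98th percentile of total_bill before plotting. This is only a sketch under that assumption (the cutoff here is computed over all days together), and the bill amounts are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in bills with one extreme value at the end
df = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sat", "Sat",
            "Sun", "Sun", "Sun", "Sun", "Sun"],
    "total_bill": [10.0, 12.0, 14.0, 15.0, 16.0,
                   18.0, 20.0, 22.0, 25.0, 50.0],
})

cutoff = df["total_bill"].quantile(0.98)  # 98th percentile across all rows
trimmed = df[df["total_bill"] <= cutoff]  # keep everything at or below the cutoff

print(f"cutoff: {cutoff:.2f}")
print(f"rows kept: {len(trimmed)} of {len(df)}")
```

Passing trimmed instead of df to sns.boxplot() then draws the same chart without the top 2% of bills, while the lower end is left untouched.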

No one arrived on Sundays alone to dine.

More sales usually means more tips, no big surprise. However, Saturdays can be wild with tipping.

Larger groups on Sundays: 2 to 6 but mostly between 2 and 4 people per party.

Saturdays brought in some of the largest tips, although Sundays are more consistent with good tips. There were slightly more male patrons across all days, and they also contributed larger tips.

The boxplot() function accepts several parameters, and it’s defined as:

sns.boxplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, orient=None, color=None, palette=None, saturation=0.75, width=0.8, dodge=True, fliersize=5, linewidth=None, whis=1.5, ax=None)

As we see, the whiskers by default reach out to 1.5 times the interquartile range (IQR) beyond the quartiles (the ‘whis’ parameter); points farther out are drawn as outliers. If we want fewer outliers and whiskers that cover a broader range of data points, we can increase this value; to be more restrictive, we can decrease it. To place the whiskers at specific lower and upper percentiles instead, we can pass a pair of values (e.g. a tuple or list in Python) to the ‘whis’ parameter. There’s a lot more to explore there, so head over to https://seaborn.pydata.org/generated/seaborn.boxplot.html for complete documentation.
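To make the effect of ‘whis’ concrete, here is a small worked sketch that computes the whisker limits by hand using only Python’s standard library. The bill amounts are hypothetical, and whisker_fences() and outliers() are helper names I made up for illustration; they are not part of seaborn.

```python
from statistics import quantiles

# Hypothetical bill amounts, for illustration only
bills = [10, 12, 14, 15, 16, 18, 20, 22, 25, 50]

def whisker_fences(values, whis=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - whis * iqr, q3 + whis * iqr

def outliers(values, whis=1.5):
    """Return the values that fall outside the whisker limits."""
    low, high = whisker_fences(values, whis)
    return [v for v in values if v < low or v > high]

print(whisker_fences(bills))      # → (3.375, 32.375)
print(outliers(bills))            # → [50]
print(outliers(bills, whis=4.0))  # → []
```

With the default of 1.5 the bill of 50 lies above the upper limit and would be drawn as an outlier; raising the multiplier to 4.0 widens the limits enough to include it. In seaborn the equivalent is passing whis=4.0 to sns.boxplot(), or a pair such as whis=(5, 95) to put the whiskers at the 5th and 95th percentiles.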
