In a previous post, I shared an example of a Box and Whisker plot with a small sample dataset. In this post, I’m using a larger dataset from Python’s Seaborn library to go a little deeper with more examples.
The Seaborn library ships with a dataset called ‘tips’ that’s perfect for illustrating Box and Whisker concepts and how to create the plots in Python. It contains information about transactions at a restaurant (or several restaurants). It has 244 rows of data plus a header row, with these columns: total_bill, tip, sex, smoker, day, time, size. The first two are dollar figures (floats), and ‘size’ is the number of customers in the party (integers from 1 to n). The ‘sex’ column contains categorical values: Male or Female, as strings. The ‘day’ column contains a 3 to 4-letter abbreviation of the weekday on which the customers transacted (Thur, Fri, Sat, Sun), and ‘time’ contains the categorical values Lunch or Dinner. The ‘smoker’ column contains just Yes or No (we won’t be using this column for our analysis).
Some of the things we want to glean from this data are: which days generate the largest transactions, and in which part of the day; where the top x% of sales fall on each day; which days generate the most tips; whether there are outliers; on which days and at what times customers arrive in larger parties; whether tipping patterns differ between male and female customers; and what the distribution of each metric looks like. We could ask even more, but I think these are enough to get the point and usefulness of Box and Whisker plots across. These can be done in Excel, Python, or other tools. I have chosen to do this exercise in Python.
The Process
Loading the dataset: First, we need to import the seaborn library (aliased as sns), then use its load_dataset() function to load the entire dataset into a pandas dataframe. We’ll use matplotlib.pyplot for the plots. The statements for those are:
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset('tips')
We should always inspect the dataset for missing values and its shape. For example, using df.isna().sum()
we can see whether any column has missing values. If every column shows 0, all is good; otherwise, we need to decide how to address the gaps (for example, we might ignore the missing values, drop a specific column altogether, or fill in the missing data using one of several methods, all depending on the objective of the analysis). In this case, we find no missing values. However, just because there are no blanks in any row doesn’t mean the data is 100% complete. For example, if we count how often each weekday name appears in the dataset, we find that Sat has the most rows (87), Fri has the fewest (19), and there is no data at all for Mon, Tue, or Wed. For this analysis that is not a problem, but if we were computing metrics over a full 7-day week it would certainly need addressing. Using df.describe()
we can get a statistical overview of the numeric columns in the dataset (count, mean, standard deviation, minimum, quartiles, and maximum).
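The inspection steps described above can be sketched roughly as follows. To keep the snippet runnable without downloading anything, it uses a handful of made-up rows that only mimic the column layout of ‘tips’ (the values and the `inspect` helper are illustrative assumptions, not part of seaborn or of the real dataset); with the real data you would pass the dataframe returned by sns.load_dataset('tips') instead.

```python
import pandas as pd

# Made-up rows that only mimic some 'tips' columns (NOT the real data).
sample = pd.DataFrame({
    "total_bill": [12.50, 20.00, 31.75, 18.20, 45.10, 27.30],
    "tip": [2.00, 3.00, 5.00, None, 7.50, 4.00],  # one deliberate gap
    "day": ["Thur", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "size": [2, 2, 4, 3, 6, 4],
})


def inspect(df):
    """Hypothetical helper: collect the basic checks in one place."""
    return {
        "shape": df.shape,                                   # (rows, columns)
        "missing": df.isna().sum().to_dict(),                # blanks per column
        "rows_per_day": df["day"].value_counts().to_dict(),  # coverage per weekday
    }


report = inspect(sample)
print(report["shape"])
print(report["missing"])
print(report["rows_per_day"])

# Summary statistics for the numeric columns only.
print(sample.describe())
```

Counting rows per weekday is what reveals the uneven coverage mentioned above, something a missing-value check alone would not show.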

From this summary we can easily see the largest sales, largest tips, and largest party size, along with their smallest and average counterparts, because those are numeric columns. To plot the charts, we use the boxplot() function of seaborn, imported above as sns. To create a chart of total bills by day of the week, use this statement: sns.boxplot(data=df, x='day', y='total_bill')
followed by plt.show()
to display the plot on screen. Without further ado, let’s take a look at some interesting plots I generated from this dataset.
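As a rough sketch of how plots like these might be produced, the snippet below draws three box plots side by side, one for each of a few questions raised earlier (bill size by day and time, tips by day and sex, party size by day). The `draw_boxplots` helper, the titles, and the figure size are my own illustrative choices, and the tiny dataframe is made up so the code runs without a download; pass the real ‘tips’ dataframe to get meaningful plots.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic the 'tips' column layout (NOT the real data).
sample = pd.DataFrame({
    "total_bill": [14.0, 22.5, 18.0, 30.0, 41.0, 26.5, 35.0, 19.5],
    "tip": [2.0, 3.5, 3.0, 4.5, 6.0, 4.0, 5.0, 3.0],
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female", "Male", "Female"],
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 2, 4, 5, 3, 4, 2],
})


def draw_boxplots(df):
    """Hypothetical helper: three box plots on one figure."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Which days and times of day bring the largest bills?
    sns.boxplot(data=df, x="day", y="total_bill", hue="time", ax=axes[0])
    axes[0].set_title("Total bill by day and time")

    # Do tipping patterns differ between male and female customers?
    sns.boxplot(data=df, x="day", y="tip", hue="sex", ax=axes[1])
    axes[1].set_title("Tip by day and sex")

    # On which days do larger parties arrive?
    sns.boxplot(data=df, x="day", y="size", ax=axes[2])
    axes[2].set_title("Party size by day")

    fig.tight_layout()
    return fig, axes


fig, axes = draw_boxplots(sample)
```

The hue parameter splits each day’s box into one box per category, which is what allows comparisons such as Lunch versus Dinner within the same day. Calling plt.show() afterwards displays the figure on screen.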







The boxplot() function accepts several parameters and is defined as:
sns.boxplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, orient=None, color=None, palette=None,
saturation=0.75, width=0.8, dodge=True, fliersize=5, linewidth=None, whis=1.5, ax=None)
As we can see, the ‘whis’ parameter defaults to 1.5. It is a multiplier on the interquartile range (IQR, the distance between the first and third quartiles): each whisker extends to the farthest data point that lies within 1.5 × IQR of the edge of the box, and any point beyond that is drawn as an outlier. If we want fewer outliers, so that the whiskers include data points over a broader range, we can increase this value; to be more restrictive, we decrease it. To set lower and upper limits as percentiles of the data instead (i.e. a percentile range), we can pass a pair of values (a list or tuple in Python) to the ‘whis’ parameter. There’s a lot more to explore there, so head over to https://seaborn.pydata.org/generated/seaborn.boxplot.html for the complete documentation.
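To make the effect of ‘whis’ concrete, here is a small hand-rolled sketch of the whisker rule using NumPy. It is only an illustration of the rule as described above, not seaborn’s internal code; the `flag_outliers` helper and the bill amounts are made up for the example.

```python
import numpy as np


def flag_outliers(values, whis=1.5):
    """Hypothetical helper: return (low limit, high limit, outliers).

    A single number is treated as a multiplier on the IQR;
    a pair of numbers is treated as (low, high) percentiles.
    """
    data = np.asarray(values, dtype=float)
    if np.ndim(whis) == 0:
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        low, high = q1 - whis * iqr, q3 + whis * iqr
    else:
        low, high = np.percentile(data, list(whis))
    outliers = data[(data < low) | (data > high)]
    return float(low), float(high), outliers.tolist()


# Made-up bill amounts with one unusually large value.
bills = [10, 12, 13, 15, 16, 18, 20, 22, 25, 40]

print(flag_outliers(bills, whis=1.5))      # default rule: 40 falls outside
print(flag_outliers(bills, whis=3.0))      # wider limits: nothing is flagged
print(flag_outliers(bills, whis=(5, 95)))  # percentile limits: 10 and 40 fall outside
```

For this sample the quartiles are 13.5 and 21.5, so the IQR is 8 and the default limits are 1.5 and 33.5. The same values of ‘whis’ can be passed to sns.boxplot() to change which points are drawn as outliers.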