In this post, I present a use case for the beautiful Violin plot. We have a sample of n drivers that records each driver’s gender, the type of vehicle they drive (Sedan, Van, SUV, Pickup), and their age. The data is fictional: I generated it with formulas in Excel and saved it as an Excel workbook, which my Python script then reads, analyzes, and plots using the Seaborn library. In an older post, I shared another example of Violin charts created in Power BI using an add-on. At the time of this writing, there isn’t a built-in way to create such charts in Excel.
About Violin Plots
Violin plots are best used when we want to see the distribution of numeric data, and are especially useful when we are comparing distributions between two or more groups.
Some elements of Box plots (Mean, Median, Quartiles) can be incorporated into a Violin plot to provide additional details.
Violin Plot vs Box Plot and Histogram
In a Violin plot, each density curve is mirrored around its center line rather than drawn up from a baseline as in a standard density plot or Histogram. Other than this difference in display pattern, the curves in a Violin plot are constructed and interpreted exactly like ordinary kernel density curves.
Box plots (aka Box-and-Whisker) are limited in what information they can show, but they are easier to interpret. When available space for the plot is a concern or showing a statistical summary is of top importance, the Box plot may be preferable to a Violin plot.
Compared to density curves such as Violin plots, the Histogram is the more common chart type for depicting distributions. Creating a kernel density estimate (KDE) requires consideration of kernel shape and bandwidth, while creating a histogram requires consideration of bin sizes.
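To make the bandwidth and bin-size trade-off concrete, here is a minimal sketch using only NumPy. The helper `gaussian_kde_1d`, the simulated ages, and the specific bandwidth and bin values are all illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Hypothetical helper: average one Gaussian bump per sample, evaluated on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
ages = rng.normal(loc=45, scale=8, size=500)  # simulated ages, for illustration only
grid = np.linspace(0, 90, 901)

# Same data, two bandwidths: a small one gives a bumpy curve, a large one a smooth curve.
density_narrow = gaussian_kde_1d(ages, grid, bandwidth=1.0)
density_wide = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# Same data, two bin counts: the histogram analogue of the bandwidth choice.
hist_fine, edges_fine = np.histogram(ages, bins=40, density=True)
hist_coarse, edges_coarse = np.histogram(ages, bins=8, density=True)
```

Both density curves integrate to roughly 1 over the grid, and both histograms have total bar area 1; only the amount of smoothing differs.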
Creating the Plots
Let’s start with the dataset. It has 3 columns: Gender, Vehicle, Age. The only numeric column is Age (as integer). The Gender and Vehicle columns contain categorical values (as strings). The possible Gender values are “Male” or “Female”. The Vehicle column contains these possibilities: “Sedan”, “SUV”, “Pickup”, “Van”. For efficiency, the data are generated in Excel using a combination of the CHOOSE and RANDBETWEEN functions. This dataset is then read by the Python code using the pandas library: read_excel().
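The workbook itself is not included here, so the following is only a sketch of a stand-in dataset with the same three columns, built in pandas instead of Excel. The sample size, the age range of 18 to 80, the uniform random choices, and the file name `drivers.xlsx` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200  # assumed sample size

df = pd.DataFrame({
    # Rough analogue of CHOOSE(RANDBETWEEN(1, 2), "Male", "Female")
    "Gender": rng.choice(["Male", "Female"], size=n),
    # Rough analogue of CHOOSE(RANDBETWEEN(1, 4), ...)
    "Vehicle": rng.choice(["Sedan", "SUV", "Pickup", "Van"], size=n),
    # Rough analogue of RANDBETWEEN(18, 80); the upper bound of integers() is exclusive
    "Age": rng.integers(18, 81, size=n),
})

# With the real workbook, the loading step would look something like:
# df = pd.read_excel("drivers.xlsx")  # hypothetical file name
```

Uniform random ages like these produce flat, featureless violins, so the original Excel formulas presumably shaped the ages per group in some way that this sketch does not attempt to reproduce.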
The Violin plot is created using seaborn.violinplot(), and the summary statistics are generated using the pandas DataFrame methods groupby() and agg(). The Box plot is created with sns.boxplot().
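A minimal sketch of both plotting calls follows. It runs on randomly generated stand-in data rather than the original workbook, and the figure size, titles, and output file name are assumptions. `split=True` is the option discussed later in the post; `inner="box"` is Seaborn's default and is spelled out here only for clarity.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in data with the post's three columns (values are random, not the original data).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Vehicle": rng.choice(["Sedan", "SUV", "Pickup", "Van"], size=n),
    "Age": rng.integers(18, 81, size=n),
})

fig, (ax_violin, ax_box) = plt.subplots(1, 2, figsize=(14, 6))

# One violin per vehicle type, with each gender drawn as one half of the violin.
sns.violinplot(data=df, x="Vehicle", y="Age", hue="Gender",
               split=True, inner="box", ax=ax_violin)
ax_violin.set_title("Violin plot")

# Same data as side-by-side boxes for comparison.
sns.boxplot(data=df, x="Vehicle", y="Age", hue="Gender", ax=ax_box)
ax_box.set_title("Box plot")

fig.tight_layout()
# fig.savefig("age_by_vehicle.png")  # hypothetical output file name
```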
The Violin chart is shown below.
We can also complement the chart programmatically with descriptive statistics, namely the median and mean age of each gender within each vehicle type, using: grouped_df = df.groupby(['Vehicle', 'Gender'])['Age'].agg(['median', 'mean'])
which outputs the following summary:
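To illustrate how this aggregation behaves, here is a minimal sketch on a small hand-made dataset. The six rows below are made up for the example and are not the post's data.

```python
import pandas as pd

# Tiny illustrative dataset (hypothetical values).
df = pd.DataFrame({
    "Vehicle": ["Sedan", "Sedan", "Sedan", "SUV", "SUV", "SUV"],
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Age": [30, 40, 80, 25, 35, 45],
})

# One row per (Vehicle, Gender) pair, one column per statistic.
grouped_df = df.groupby(["Vehicle", "Gender"])["Age"].agg(["median", "mean"])
print(grouped_df)
```

In this toy data the Sedan/Male group has a median of 40 but a mean of 50, because the single 80-year-old pulls the mean upward; the SUV/Female group is symmetric, so both statistics are 35.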
We can also create a Box and Whisker chart from the same data using the Seaborn and Matplotlib libraries, as shown below. It conveys the same information but with a little more emphasis on the statistical aspects, including outliers (the dots outside of the boxes, when present). When we compare the thick black bars along the center lines of the violins with the boxes, we can see the strong similarities between the chart types.
Interpreting the Violin Chart
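Thick black bar along the center of each violin: With Seaborn’s default inner="box" setting, this bar is a miniature box plot. It spans the interquartile range of ages for each group (gender and vehicle type), from the 25th to the 75th percentile, and the thinner line extending from it plays the role of the whiskers. Together they show where the middle half of the ages falls for each gender and vehicle type.
White dot: The white dot inside the black bar represents the median age for each group. The median is the middle value of our dataset when it’s ordered from lowest to highest, splitting the data into two equal halves, so it indicates the central tendency of ages. The mean (average) is not drawn on the chart; comparing it with the median in the summary table can indicate whether the age distribution is skewed (mean and median will be farther apart in skewed distributions).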
Violin shape: The width of each violin at any given point along the y-axis represents the estimated density of the data at that age. A wider section indicates a higher concentration of data points (more people around that age), while a narrower section indicates fewer data points. The overall shape of the violin plots shows where most of our data points lie and the spread of ages.
Split Violin plot: Since we used split=True, the Violin plot is split to show the distribution for each gender side by side. The left half shows the distribution for one gender (e.g., Male). The right half shows the distribution for the other gender (e.g., Female).
Put together, a wider section of a violin means that a larger share of that group’s drivers are around that age, and the height of the violin shows the spread of ages for each vehicle type. Widths describe relative density within a group; with Seaborn’s default scaling they are not head counts, so confirming that one group is larger than another is best done with the counts themselves. For example, for Pickup the chart suggests many more male than female drivers, with ages concentrated between 45 and 55 and most of them around 52.
The x-axis differentiates between vehicle types. The y-axis shows the range of ages. The width at each y-value represents the relative density of individuals at that age within a particular vehicle type.
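The quantities summarized inside each violin can also be computed directly, along with the group sizes that the violin widths do not show on their own. This is a minimal sketch on a handful of made-up Pickup rows; the values and the output column names `count`, `q1`, `median`, and `q3` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical Pickup drivers: six male, two female.
df = pd.DataFrame({
    "Vehicle": ["Pickup"] * 8,
    "Gender": ["Male"] * 6 + ["Female"] * 2,
    "Age": [45, 48, 50, 52, 52, 55, 47, 53],
})

# Named aggregation: group size plus the quartiles that an inner box plot is based on.
summary = df.groupby(["Vehicle", "Gender"])["Age"].agg(
    count="count",
    q1=lambda s: s.quantile(0.25),
    median="median",
    q3=lambda s: s.quantile(0.75),
)
print(summary)
```

The `count` column is what answers "how many drivers are in each group", while `q1`, `median`, and `q3` describe where those drivers' ages sit.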