Sunday, September 15, 2024
STEM

Venn Diagrams (Python)

Venn diagrams are versatile and useful for a variety of purposes. Some common uses are:

Visualizing Relationships: They illustrate the logical relationships between different sets. For example, they can show how different groups overlap, which elements they share, and which they don’t.
Comparing and Contrasting: They are great for comparing and contrasting different items, ideas, or groups. For instance, we can use them to compare the features of two different products or the characteristics of different species.
Data Analysis: They are useful in data analysis to show the relationships between different datasets. This can be particularly helpful in fields like computer science, business, and research.
Venn diagrams are often used in presentations and reports to visually organize information in a clear and impactful way. They make it easier to communicate complex information to an audience.

Of course, once we understand the relationships and how to draw Venn diagrams, we can draw them by hand, with graphics software, or with tools such as PowerPoint and Excel, which make drawing shapes and applying colors easier. However, most of these diagrams are static and not linked to dynamic datasets. At present, there’s no built-in chart type in Excel that lets you draw Venn diagrams dynamically based on data. In this post, I share how to create truly dynamic Venn diagrams (involving 2, 3, or 4+ sets) based on live data using Python. It requires some coding and knowledge of Python.

Scenario with 2 sets

Let’s start with a simple example involving two sets. Suppose we have data on which customers, and how many of them, bought electronic items and clothing items from a supermarket. For brevity, I’ll keep the customer pool and the number of categories (sets in this example) small, but the method scales easily to larger datasets. So, we have two sets: Electronics and Clothing. The customers who bought electronic items belong to the Electronics set, and the customers who bought clothing items belong to the Clothing set. To make this example realistic, we’ll use customer names for each set.

Alice, Bob, and Charlie purchased electronics. Bob, Charlie, Eve, and David purchased some clothing. We want to see who’s buying which category of items, who’s buying both, and how large the overlap is. From this small dataset, we can manually deduce that Alice purchased only electronics, Bob and Charlie purchased both electronics and clothing, and Eve and David purchased only clothing. Next, we want to make this task automated, dynamic, and visual using Python.
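Before drawing anything, the manual deduction above can be checked with Python’s built-in set type. This is only a small sketch; the variable names (electronics, clothing, and so on) are my own choices for illustration.

```python
# Customer sets from the two-set example.
electronics = {"Alice", "Bob", "Charlie"}
clothing = {"Bob", "Charlie", "Eve", "David"}

only_electronics = electronics - clothing  # in Electronics but not in Clothing
only_clothing = clothing - electronics     # in Clothing but not in Electronics
both = electronics & clothing              # in both sets (the overlap)

print("Only electronics:", sorted(only_electronics))
print("Both:", sorted(both))
print("Only clothing:", sorted(only_clothing))
print("Overlap size:", len(both))
```

These three results correspond to the three regions a two-set Venn diagram displays.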

The output generated by our Python code for this two-set example looks as follows:

We can immediately appreciate the diagram and the value it conveys so easily and clearly.

Venn Diagram Creation Code: 2 sets

The number of lines of code required is amazingly small. This is because the matplotlib_venn library does much of the heavy lifting, along with matplotlib. From matplotlib_venn, we only need the venn2 function for diagrams with 2 sets, and from matplotlib we need pyplot for plotting the diagram.

Beyond that, we only need to define the sets and call the venn2() function to get the diagram. Finally, we use show(), as with all matplotlib plots, to display the diagram on screen. If we take out lines 9 through 11, which are responsible for putting labels in place of the default numbers, then we would get the following diagram:

This is still useful for counts. In other words, if we’re only interested in the number of customers that belong to each set and to the overlapping segment, then the counts are all we need. For example, we see above that 2 customers bought both electronics and clothing items, 2 bought only clothing, while 1 bought only electronics.

If you’re only interested in getting the counts and not set the labels (in this case, customer names from each set) on the diagram, then you don’t need to bother with lines 9 through 11. However, I wanted the names as well in this example and I wanted to demonstrate how to set the labels correctly for each region of the plot.

To do so, we need to understand at least the concepts of set intersections, unions, differences, and subsets, because setting up labels requires us to specify region IDs that depend on these operations. Those IDs are specified as strings but take the form of binary digits: '10', '01', '11', as shown in code lines 9 through 11. For just two sets, it’s simple to explain how you’d determine those and pass the correct IDs so the labels appear correctly in the diagram. This is how it’s done…

We have 2 sets in the code: set1 and set2. So we’ll need a 2-digit binary code for each of the 3 regions. '10' refers to the region that contains elements of set1 that are not present in set2. '01' refers to the region that contains elements of set2 that are not present in set1. '11' refers to the region that contains elements present in both set1 and set2. Once you have this down, the rest is easy as pie.
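To make the region IDs concrete, here is a minimal sketch of how the labels could be set with venn2() and get_label_by_id(). It is a reconstruction for illustration, not necessarily identical to my original listing, so its line numbers will not match the ones mentioned above. The helper names two_set_region_labels and draw_two_set_venn are hypothetical, and the plotting imports sit inside the drawing function only so that the label helper can be tried without the plotting libraries installed.

```python
def two_set_region_labels(set1, set2):
    """Map each 2-digit region ID to a newline-separated list of its members."""
    regions = {
        "10": set1 - set2,  # only in set1
        "01": set2 - set1,  # only in set2
        "11": set1 & set2,  # in both set1 and set2
    }
    return {rid: "\n".join(sorted(members)) for rid, members in regions.items()}


def draw_two_set_venn(set1, set2, names=("Electronics", "Clothing")):
    """Draw a 2-set Venn diagram whose regions show names instead of counts."""
    # Imported here so the helper above works without matplotlib installed.
    import matplotlib.pyplot as plt
    from matplotlib_venn import venn2

    diagram = venn2([set1, set2], set_labels=names)
    for region_id, text in two_set_region_labels(set1, set2).items():
        label = diagram.get_label_by_id(region_id)
        if label is not None:  # guard: a region without a label may yield None
            label.set_text(text)
    plt.title("Customer purchases")
    return diagram


electronics = {"Alice", "Bob", "Charlie"}
clothing = {"Bob", "Charlie", "Eve", "David"}
labels = two_set_region_labels(electronics, clothing)
print(labels)
```

Calling draw_two_set_venn(electronics, clothing) and then matplotlib.pyplot.show() should display the labeled diagram; leaving out the label loop keeps the default counts.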

Scenario with 3 sets

Let’s move on to the most common real-world scenario for Venn diagrams, which involves three sets. Extending the previous example, I have added a new category or department: Groceries. So we have three categories (Electronics, Clothing, and Groceries) and need three sets: set1, set2, and set3. Assume the following:
set1 has these customers (purchased electronics): Alice, Bob, Charlie, and David.
set2 has these customers (purchased clothing): Bob, Charlie, Eve, and Frank.
set3 has these customers (purchased groceries): Charlie, David, Eve, Grace, and Tony.

Venn Diagram Creation Code: 3 sets

Without labels (i.e. not showing element names of the sets, only count), the diagram would look like this:

This clearly shows the overlaps and the count of customers in each region, including the Grocery-exclusive region, which has 2 customers (meaning they bought only grocery items and nothing else).

Also, to generate a Venn diagram involving 3 sets, we have to use the venn3 function of the matplotlib_venn library. So we need these import lines instead:

import matplotlib.pyplot as plt
from matplotlib_venn import venn3

And creating the venn object requires a call to the venn3() function, like this:

venn = venn3([set1, set2, set3], ('Electronics', 'Clothing', 'Grocery'))

The rest of the lines look the same for drawing without labels. However, to draw actual labels (e.g. the customer names that belong to each region), we need seven 3-digit binary string IDs for the 7 possible regions. They are '100' (for set1 - set2 - set3), '010' (for set2 - set1 - set3), '001' (for set3 - set1 - set2), '110' (for (set1 & set2) - set3), '101' (for (set1 & set3) - set2), '011' (for (set2 & set3) - set1), and finally '111' (for set1 & set2 & set3). The idea is the same as with two sets; there are simply more regions, which require more IDs, and we need 3-digit instead of 2-digit binary strings to accommodate 3 sets. For example, the operation (set1 & set2) - set3 refers to the region containing elements that are in both set1 and set2 but not in set3. The operation set3 - set1 - set2 refers to the region that contains elements of set3 that are in neither set1 nor set2. The operation set1 & set2 & set3 refers to the region that contains elements common to set1, set2, and set3. You get the idea! Once the above 7 labels are set, we get the following plot with 7 labeled regions.

Sure enough, we see the 7 regions and see that Tony and Grace bought only groceries, whereas Alice only bought electronics, Frank only clothing, Charlie bought items from all 3 departments, Eve bought groceries and clothing items, and so on.
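Rather than writing out seven set expressions by hand, the binary IDs can be turned into regions programmatically: each '1' means "in this set" and each '0' means "not in this set". The sketch below is one way to do that, assuming Tony belongs to the grocery set as the diagram shows. The names region_members, REGION_IDS, and draw_three_set_venn are hypothetical helpers of mine, not part of matplotlib_venn.

```python
REGION_IDS = ["100", "010", "001", "110", "101", "011", "111"]


def region_members(sets, region_id):
    """Return the elements in the region named by a binary ID such as '110'.

    sets      -- list of sets, in the same order they are passed to venn3()
    region_id -- one digit per set: '1' = in that set, '0' = not in that set
    """
    included = [s for s, bit in zip(sets, region_id) if bit == "1"]
    excluded = [s for s, bit in zip(sets, region_id) if bit == "0"]
    if not included:
        raise ValueError("a region ID needs at least one '1'")
    members = set.intersection(*included)  # must be in every '1' set
    for s in excluded:                     # and in none of the '0' sets
        members = members - s
    return members


def draw_three_set_venn(sets, names=("Electronics", "Clothing", "Grocery")):
    """Draw a 3-set Venn diagram with member names in each region."""
    # Imported here so region_members() works without matplotlib installed.
    from matplotlib_venn import venn3

    diagram = venn3(sets, set_labels=names)
    for rid in REGION_IDS:
        label = diagram.get_label_by_id(rid)
        if label is not None:  # guard: a region without a label may yield None
            label.set_text("\n".join(sorted(region_members(sets, rid))))
    return diagram


set1 = {"Alice", "Bob", "Charlie", "David"}          # electronics
set2 = {"Bob", "Charlie", "Eve", "Frank"}            # clothing
set3 = {"Charlie", "David", "Eve", "Grace", "Tony"}  # groceries (Tony assumed)

regions = {rid: region_members([set1, set2, set3], rid) for rid in REGION_IDS}
for rid, members in regions.items():
    print(rid, sorted(members))
```

Calling draw_three_set_venn([set1, set2, set3]) followed by matplotlib.pyplot.show() should produce the labeled plot. The same region_members() function also works for two sets with 2-digit IDs.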

You may be wondering: is it possible to draw a Venn diagram with more than 3 sets? Yes, it is. However, that requires a different library such as venn (yes, it’s simply called venn, and it is not the same as the venn2 or venn3 functions we’ve seen so far). Using that library, we can draw diagrams with up to 6 sets! To demonstrate 4 sets quickly, I added yet another set called 'Books' to the previous three sets in our example. The sets are defined as follows:

sets = {
    'Electronics': {'Alice', 'Bob', 'Charlie', 'David'},
    'Clothing': {'Bob', 'Charlie', 'Eve', 'Frank'},
    'Grocery': {'Charlie', 'David', 'Eve', 'Grace', 'Rodger'},
    'Books': {'Frank', 'Grace', 'Hank'}
}

Then, with the venn function imported from that library (from venn import venn), we generate the Venn diagram by calling:

venn_diagram = venn(sets)

The syntax the venn library uses for its venn function is indeed different from that of matplotlib_venn. Finally, we display the plot:

plt.show()

The Venn diagram with 4 sets looks like this:

By comparing the diagram with our set elements, we can see Hank is the only one who purchased just book item(s). While it’s academically and technically impressive, the real-world usefulness of Venn diagrams involving more than three sets in a professional environment seems lower as they get gnarly and increasingly difficult to follow.
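With four or more sets, reading the diagram by eye gets harder, so it can help to double-check a region with plain set operations. The sketch below computes who belongs to exactly one category; exclusive_members is a hypothetical helper name of mine, and the data is the same sets dictionary defined earlier.

```python
sets = {
    "Electronics": {"Alice", "Bob", "Charlie", "David"},
    "Clothing": {"Bob", "Charlie", "Eve", "Frank"},
    "Grocery": {"Charlie", "David", "Eve", "Grace", "Rodger"},
    "Books": {"Frank", "Grace", "Hank"},
}


def exclusive_members(sets, name):
    """Return the members of sets[name] that appear in no other set."""
    others = set().union(
        *(members for other, members in sets.items() if other != name)
    )
    return sets[name] - others


exclusive = {name: exclusive_members(sets, name) for name in sets}
for name, members in exclusive.items():
    print(name, "only:", sorted(members))
```

For this data, the Books-only result should contain just Hank, matching what the diagram shows.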

I hope this was educational and interesting. If you need the full source code, feel free to contact me via email trseattle at outlook dot com. All verified donors can request the code free of charge. To become a donor easily (one-time or recurring), please click here (processed securely by PayPal).
