In two my earlier blogs, I shared two methods & codes in Excel and Python on how to create Word Clouds from any text or document. You can find them here:
How To Find The Most Occurring Words In A Document? (python) and Creating Word Cloud Accurately And Easily In Excel (Excel)
Now, I’ll turn it up a notch and demonstrate more advanced techniques including exporting the words frequency information to a CSV file and then generate a professional-looking Word Cloud graphic. This is another example of how I can leverage different strengths of Python and Excel in tandem to solve certain problems.
Specifically, the Python part will show you how to turn any text or document content into a list, parse out the noise and punctuations, create a frequency calculating object using Counter and save the final information in a dictionary object so we can access a word to search and find out how many times it occurred in the given text/document.
At a high-level, to create a word cloud from any text, I take these steps:
- Remove the punctuations from the text.
- Break up the text into individual words and store each word as a singular item in an array.
- Build a list of unwanted words (‘noise’) that we want to remove from the count.
- Build a new list of words that doesn’t contain any unwanted words.
- Create a counter object for getting frequency.
- Get the most common words in sorted (DESC) order!
- Allow searching for any word to see if it occurs in the given text. If it does, show its frequency. If it doesn’t fail gracefully.
- Save the dictionary object from memory to a CSV format.
- Generate a Word Cloud graphic in Excel using the exported CSV data.
Let’s start with an article that appeared in a Canadian news site recently. The actual text is below:
A Canadian woman has returned five artifacts she took from Pompeii in 2005, saying they have plagued her with bad luck. The woman, identified only as Nicole, sent two white mosaic tiles, two pieces of amphora vase and a piece of ceramic wall to the Archaeological Park of Pompeii, along with a letter explaining her decision.
‘I wanted to have a piece of history that couldn’t be bought,’ wrote the woman, who said she was ‘young and dumb’ at the time.
Since returning to Canada, she said, she has suffered two bouts of breast cancer, resulting in a double mastectomy, and her family has also been in financial trouble.
‘We can”t ever seem to get ahead in life,’ she wrote, blaming the bad luck on the tiles.’I took a piece of history captured in a time with so much negative energy attached to it,’ she wrote.
‘People died in such a horrible way and I took tiles related to that kind of destruction.’
Nearby Mount Vesuvius erupted in 79 AD, showering Pompeii with hot rock, volcanic ash and noxious gas and burying its residents. The woman related how she had given another tile to a friend and told her of the decision to send her artifacts back, but she said she doesn’t know if the friend will return hers.
‘We are good people and I don’t want to pass this curse on to my family, my children or myself anymore,’ she wrote. ‘Please forgive my careless act that I did years ago.’
First, I want to convert all of this into a common case such that our Search function later will work in case-insensitive fashion…
Now I need to remove punctuations as they’re not relevant for my Word Cloud.
Next, I want to remove the ‘noise’ words…words that are just artifacts of grammar, like “a”, “of”, etc. This list is extensible without changing the code.
Now, we have a pretty clean set of text where to look for frequencies! Then, I create a counter object which will allow me to get back a list of the words and their associated number of occurrences using cntr.most_common() where cntr is my counter object of wordlist list object. It looks like this at this point:
This is already good, but I actually would like to turn this into a key-value structure. So, a word is the key and its value will be its associated frequency. I can quickly do this by:
The dictionary object looks like this…
Beautiful! So, we can search for any word entered by me/user and quickly show its frequency or say it wasn’t found…
Here’s a sample session and output:
Remember, I took out the noise words, so ‘she’, ‘he’ etc. are removed.
Now that we have all the data we need, we’re ready to export this to a CSV file for charting. I do this using CSV library.
Opening the file in Excel now looks like this: Now I can easily sort the data by frequency or by words and create whatever chart I want!
So, using this table data, we can create a Word Cloud as below:
And there you have it. A beautiful infographic that started with an article in a few steps using Python and Excel. Want the entire program? See below how to support and get access to all the details.
▛ This post is not meant to be a formal tutorial, instead it is to offer key concepts, and approaches to problem-solving. Basic->Intermediate technical/coding knowledge is assumed. If you like more info or source file to learn/use, or need more basic assistance, you may contact me at tanman1129 at hotmail dot com. To support this voluntary effort, you can donate via Paypal from the button in my Home page, or become a Patron via my Patreon site. A supporter/patron gets direct answers on any of my blogs including code/data/source files whenever possible. At the very least, it’ll encourage me to share more posts/tips/knowledge in the future. ▟