
One of the toughest parts of analyzing survey feedback is processing free-form responses into quantifiable measures. People express the same thoughts differently, using different styles, spellings, lingo, and abbreviations, and, not surprisingly, they inadvertently introduce spelling and grammatical errors. To quantify such entries, we have to assign each response to a specific category, and when the volume of text received is large, it's extremely time-consuming to manually read and categorize each one. There's also the challenge of understanding sentiment: some feedback may be sarcastic yet composed of "positive" words, and so on.
This is where a machine learning approach can come in extremely handy. While it may not analyze sentiment or categorize with 100% accuracy in one pass, it certainly saves a tremendous amount of time and effort, making the final cleanup and adjustments much more efficient and worthwhile. There are specialized, packaged products to aid humans, but they are generally expensive and probably not easily accessible for learning purposes. In this post, I'll share one of my general machine learning (ML) approaches to analyzing and categorizing user feedback. It does require some Python programming and at least a basic understanding of ML.
Suppose we have saved the responses as comma-separated values (CSV), along with some sort of ID (the survey-taker, the questionnaire, etc.) for each response so we can refer to them later on easily. The simplest format may look like this:
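(Hypothetical example; the 'id' column name and the sample rows are purely illustrative, but the 'responses' column is the one used throughout this post.)

id,responses
101,"The staff was very helpful and the process was quick."
102,"Waited over an hour and nobody could answer my question."
103,"It was okay, nothing special."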

This is our dataset. The CSV file simply contains an ID and the free-form text as submitted by each survey-taker. From a quick glance, we can see that positive, negative, and neutral feedback are all spread throughout the data. We will use this dataset with an ML model that understands written English well enough to tackle spelling and grammatical mistakes, understand semantics, and glean the overall sentiment of each piece of feedback. Once all the text is processed and understood, the model (or perhaps a different model, or one working in conjunction with another) will categorize the responses into a specific number of categories so we can visually interpret the results and get useful insights into the overall feedback.
My overall approach is as follows, implemented in Python code:
- Read data (CSV) ->
- Load model (more on this below) ->
- Encode data (generate embeddings) ->
- Train model on embeddings ->
- Cluster responses ->
- Visualize clusters/results (more on this later)
The code reads the survey responses from a CSV file using the 'pandas' library. In the example CSV above, the column we need to analyze is 'responses', so when extracting the values, the code removes any missing (null) values and stores the remaining responses as a list.
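A minimal sketch of this step (the filename 'survey_responses.csv' is an assumption):

import pandas as pd

# Load the survey data and drop rows with missing responses, so the
# DataFrame stays aligned with the list we pass to the model later.
df = pd.read_csv('survey_responses.csv')
df = df.dropna(subset=['responses'])
responses = df['responses'].tolist()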
Next, the code uses a sentence-transformer model from Hugging Face that is pre-trained to understand the meaning and context of sentences. I used the 'paraphrase-MiniLM-L6-v2' model. The model turns each response into a numerical representation called an "embedding": a vector that carries the semantic meaning of the response in a way a computer can work with.
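In code, this step can be as short as the following (building on the 'responses' list above):

from sentence_transformers import SentenceTransformer

# Download/load the pre-trained model and encode every response
# into a fixed-length vector (384 dimensions for this model).
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings = model.encode(responses)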
A clustering algorithm (I used K-Means) is then applied to the embeddings. This model is trained on the data at run time, and it groups the responses into clusters based on their semantic similarity. Each response is assigned a cluster label ('0', '1', or '2') representing the cluster it belongs to. The original survey data is then updated in memory with a new column, "cluster", containing the label for each response, linking the clusters back to the data.
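A minimal sketch of the clustering step (starting with k = 3, as discussed next):

from sklearn.cluster import KMeans

# Fit K-Means on the embeddings and attach each response's cluster
# label back to the DataFrame; random_state makes runs repeatable.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(embeddings)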
We do have to specify the number of clusters to create, however, and I started with 3 (given the nature of the questionnaire, I expected three general sentiments: positive, negative, neutral). Later, I also found the optimal number of clusters using two different methods: the Silhouette score and the Elbow method. Each one computes an optimal value of k (the number of clusters) for our data. We can then adjust k accordingly and run the clustering again.
To summarize, the approach involves these components:
Libraries:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
For visualizations:
import matplotlib.pyplot as plt
import seaborn as sns
For Silhouette score:
from sklearn.metrics import silhouette_score
Apparatus:
Model/ML methods: Sentence transformer (paraphrase-MiniLM-L6-v2), K-Means clustering
Optional model: TF-IDF vectorizer to exclude stop words and irrelevant words from the analysis and further improve clustering (see the sketch after this list).
Visualization methods: Scatter plot, and/or bar chart.
Optional: Find optimal k (clusters) using Embeddings (using either Silhouette score or Elbow method)
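For the optional TF-IDF route, a minimal sketch (the parameters shown are assumptions, not tuned values):

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert responses to TF-IDF vectors, dropping common English stop
# words; the resulting matrix can be fed to K-Means instead of (or
# alongside) the sentence embeddings.
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(responses)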
My Silhouette Score method function: returns an optimal k value (int)
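A minimal version of such a function might look like this (the k range of 2-10 is an assumption):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_optimal_k(embeddings, k_min=2, k_max=10):
    # Try each candidate k and keep the one with the highest
    # silhouette score (silhouette needs at least 2 clusters).
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k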

My Elbow method function: plots a line chart
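A minimal version might look like this (again, the k range is an assumption):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(embeddings, k_min=1, k_max=10):
    # Plot within-cluster SSE (scikit-learn's inertia_) for each k;
    # the "elbow" in the curve suggests a good number of clusters.
    ks = list(range(k_min, k_max + 1))
    sse = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(embeddings).inertia_ for k in ks]
    plt.plot(ks, sse, marker='o')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('SSE (inertia)')
    plt.title('Elbow Method')
    plt.show()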

A sample chart showing the Elbow method output:

For more information on the Elbow method, see here.
Silhouette Score vs Elbow Method — which one to use?
Both the Silhouette Score and the Elbow Method have their merits, but they evaluate clustering quality in different ways. Here are some points to keep in mind:
- The Elbow Method focuses on the within-cluster SSE (Sum of Squared Errors) and identifies the point at which adding more clusters doesn't significantly reduce SSE. It's more about balancing the simplicity of fewer clusters with the improved fit of more clusters.
- The Silhouette Score measures how similar data points are to their own cluster compared to other clusters, offering insight into the separation between clusters. It ranges from -1 (poor clustering) to 1 (perfect clustering), and higher scores usually indicate well-defined clusters.
- The Silhouette Score favors fewer clusters because cohesion tends to decrease as the number of clusters increases. If we aim for more nuanced groupings, we might consider the Elbow Method's result.
Clustering
A sample partial output from the program is shown below. As you can see, most responses were categorized properly (about 70%), but a few are misplaced, marked by red arrows. Although the result was not perfect, tweaking the hyperparameters and training on larger datasets should yield better results.

In addition to the textual output above, we can also visualize the clusters with a bar chart showing the cluster distribution, and with a scatter plot in 2D space using t-SNE.
Cluster distribution chart:

Code for bar chart:
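A minimal version (the cluster labels come from the 'cluster' column added earlier):

import matplotlib.pyplot as plt
import seaborn as sns

# Count how many responses landed in each cluster and plot the counts.
counts = df['cluster'].value_counts().sort_index()
sns.barplot(x=counts.index, y=counts.values)
plt.xlabel('Cluster')
plt.ylabel('Number of responses')
plt.title('Cluster distribution')
plt.show()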

Clusters in 2D space:

Code for 2D space chart:
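A minimal version using scikit-learn's t-SNE (perplexity is left at its default, which must be smaller than the number of responses):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the high-dimensional embeddings down to 2D and color
# each point by its cluster label.
coords = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=df['cluster'], s=20, cmap='viridis')
plt.colorbar(label='Cluster')
plt.title('Survey responses in 2D (t-SNE)')
plt.show()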

I hope you found this post helpful and interesting. Explore this site for more tips and articles. Be sure to also check out my Patreon site where you can find free downloads and optional fee-based code and documentation. Thanks for visiting!