Tuesday, July 16, 2024

# Analysis of Wine Data (analytics)

You’re given a massive list of data about wines around the world: their prices (a huge range, from \$4 to \$3,300 per bottle!), expert tasters’ scores, who scored them, how many scored them, the name of the wine, a description of the wine, their origins (country, province, region, subregion), and their variety or category, just to name a few.

Your job is to sift through this culled data (about 130,000 unique wine entries!) and answer the following questions:

###### 3.  Is there a way to predict price for a wine given its score? And predict its score, given its price?

A lot to ask, you say! And you’d be right. But it can be done by front-loading some thinking before you dive into calculations and charting. By first thinking deeply about the story to tell or the questions to answer, we can organize the data, or just extract what we need, from this massive dataset of 130,000 rows × 13 columns.

Once that planning part is done, the actual calculations will be done by your computer and software…while you sit back. Then you’ll need to verify whether your planning was correct. That’s why data modelling is so critical and complex, and why it’s very different from just “pretty” charting. In the end, the result is far more valuable to you and your organization.

Putting all this data into perspective, I’ve already simplified the challenge into 3 discrete goals/questions for an analyst. In reality, most of the time we don’t even have the guidance to chase such specific questions. It’s usually as vague as “What does this mean?” or “How can we leverage this data to…” (fill in the blanks). But that’s a much larger discussion, beyond this post. So I’ll just demonstrate here some techniques to make sense of the data and come up with some meaningful analysis. This is also NOT a training on Excel, Power BI, or analytics. To learn those, refer to the many serviceable courses/trainings available elsewhere.

The Data:
Here’s an extremely-cropped-dataset snapshot…

And then it continues till the last data rows…

First, because of the vast range of prices and the sheer number of rows, it’s not going to be useful for me to show each data point. Nor is it going to help me reach concise conclusions. So the first thing I’d do is bucket the wines into groups by, say, price range. That’d be a good starting point.
I gave them a price interval of \$50, grouped the dataset, and created a pivot table from the raw data on a new canvas. So now it’s more meaningful and manageable:
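If you prefer to script this step rather than use Excel’s grouping, the same \$50 bucketing (plus the per-bucket average of points added later) can be sketched in Python with pandas. The column names and the handful of rows below are purely illustrative stand-ins for the real 130,000-row file:

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset
# (~130,000 rows x 13 columns); column names are assumptions.
wines = pd.DataFrame({
    "price":  [12, 48, 95, 150, 320, 510, 44, 78],
    "points": [85, 88, 91, 93, 95, 96, 87, 90],
})

# Bucket prices into $50-wide intervals, mirroring the pivot-table grouping.
bins = range(0, 601, 50)  # (0, 50], (50, 100], ..., (550, 600]
wines["price_bucket"] = pd.cut(wines["price"], bins=bins)

# Per-bucket count and "Average of Points", like the pivot table's columns.
summary = wines.groupby("price_bucket", observed=True)["points"].agg(["count", "mean"])
print(summary)
```

On the full dataset you would simply read the CSV into `wines` first; the bucketing and summarizing lines stay the same.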

NOTE: I’m showing it in 2 columns for readability (it’s just 1 continuous column of 30 rows at this point).
On this newly created pivot table, I added a new column “Average of Points” (i.e. average scores) by summarizing the Points column (column L) of the original dataset and taking its mean (see the image snapshot above).

This now renders the new pivot table as below:

At this point, I don’t actually need to bother with the rest of the noise. I feel I have enough information to draw some trends, correlations, and visuals.
Using these 2 columns, I can draw a clustered pivot chart as follows:

We can clearly see the price buckets (x-axis) and the scores (y-axis). Additionally, I added a logarithmic trendline and asked Excel to show the exact formula for my dataset right on the chart (see the image above). So we have this magic formula: Score = 1.6329\*LN(Price) + 90.563
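As a side note, Excel’s logarithmic trendline is just an ordinary least-squares fit of score against LN(price). The sketch below recovers the chart’s coefficients from points generated by that very curve, so it verifies the method rather than reproducing the real data:

```python
import numpy as np

# Synthetic bucket midpoints, with scores generated from the article's
# fitted curve Score = 1.6329*ln(Price) + 90.563, so the fit below
# should recover exactly those coefficients.
prices = np.array([25, 75, 125, 175, 225, 275, 325], dtype=float)
scores = 1.6329 * np.log(prices) + 90.563

# Excel's "logarithmic trendline" = least-squares fit of
# score against ln(price):  score = a*ln(price) + b
a, b = np.polyfit(np.log(prices), scores, deg=1)
print(round(a, 4), round(b, 3))
```

Running the same two lines against the real bucketed data (prices and average points) would produce the coefficients Excel printed on the chart.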

With simple algebra, I can rearrange this to solve for price: LN(Price) = (Score − 90.563)/1.6329

But since we need Price in dollars (not the logarithm of price), we take the anti-log of LN(Price): Price = EXP((Score − 90.563)/1.6329).
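Written out as code (a sketch; the coefficients are the ones Excel reported for this dataset), the pair of formulas looks like this:

```python
import math

def score_from_price(price):
    # The chart's fitted curve: Score = 1.6329*LN(Price) + 90.563
    return 1.6329 * math.log(price) + 90.563

def price_from_score(score):
    # The anti-log of LN(Price): Price = EXP((Score - 90.563)/1.6329)
    return math.exp((score - 90.563) / 1.6329)

print(round(score_from_price(100), 2))  # predicted score for a $100 bottle
print(round(price_from_score(95), 2))   # predicted price for a 95-point wine
```

These two functions are exact inverses of each other, which is what makes the two-way prediction possible.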

To create a working sheet based on these formulae (one going price → score, the other score → price), I created 2 tables with the above formulas respectively. The highlighted column is where values are calculated, and the green cells are where we can enter any test values.

Let’s test this out!

Table 1 (price -> score prediction):

Table 2 (score -> price prediction):

Now, verify how Table 1’s values match up with the Pivot Chart we created above. They do, beautifully. Then match Table 2’s values with the same Pivot Chart. They match again, as expected. Finally, to triple-verify, cross-check the numbers in Table 1 against Table 2 (inputs vs. outputs and vice versa): they are PERFECT.

So, our formulas are all good. And we can confidently answer the following question for ANY value thrown at us!
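If you want a programmatic version of that triple-check, a quick round trip through both formulas confirms they are exact inverses of each other (a sketch using the chart’s coefficients):

```python
import math

A, B = 1.6329, 90.563  # trendline coefficients from the chart

# Push each test price through the score formula (Table 1), then back
# through the price formula (Table 2); we should land on the original.
for price in [10, 50, 100, 250, 500, 1000, 3300]:
    score = A * math.log(price) + B          # Table 1: price -> score
    back = math.exp((score - B) / A)         # Table 2: score -> price
    assert abs(back - price) < 1e-9 * price  # matches up to float error
print("Tables 1 and 2 are mutually consistent")
```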

###### 2.  What’s a range (if any) that provides the best quality (points) at a reasonable price?

We just need to look at our meaningful, visual chart:

Let’s answer this: Is higher price correlated with higher points?

Only up to a certain point (between \$100 and \$200)! Then it plateaus, and the ROI doesn’t seem to be much: the quality (score/points) doesn’t seem to increase much beyond that in relation to price.

Finally, let’s answer this: What’s a range (if any) that provides the best quality (points) at a reasonable price?

This might depend on your personal taste and budget, but I’d be comfortable saying (for myself) that if I’m going to venture beyond my usual wines and pay more to taste something unique, I’d stick with the \$300-\$700 range, but would be quite content staying within \$250-\$400.

So, how cool is that? The tactical part didn’t take nearly as long as it seemed at first (it’s the upfront thinking that requires some time for most analyses).

P.S. The wine glass is just my touch, using a public-domain image from Office Online. You could perfectly well achieve the same results with stock histograms, but I thought an image that scales automatically to the data and matches the topic of investigation (wine) would be more suitable.

P.P.S. The data presented is an actual snapshot of data exported from publicly available sources.

EXERCISE:

• I took the average points for each bucket in this solution. What if there were vastly different outliers in a couple of buckets? How would you accommodate for that, so that the points considered are NOT skewed?
• Some wines, as it turns out, were given very high scores, but we notice that the number of tasters for those is extremely low (as low as 1!), whereas wherever the number of tasters is above 1,000, the numbers seem far more consistent. How would you address that so the scoring is fairly distributed?
• What interesting stories could you tell given the dataset?

HINT: There are very effective solutions to the first two of these challenges (when/why do you use filters? When/why do we have mean and median?).
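As a nudge in the direction the HINT points (not a full answer): filtering by the number of tasters and preferring the median over the mean can both be sketched in a few lines. The column names here are assumptions, not the dataset’s actual headers:

```python
import pandas as pd

# Hypothetical mini-sample: one suspiciously high score backed by a
# single taster, one very low outlier with only 3 tasters.
df = pd.DataFrame({
    "points":      [99, 90, 91, 89, 40],
    "num_tasters": [1,  1500, 1200, 2000, 3],
})

# Filter: keep only wines scored by a meaningful number of tasters.
trusted = df[df["num_tasters"] >= 1000]

# Mean vs. median: the median shrugs off outliers that drag the mean.
print(trusted["points"].mean())  # 90.0
print(df["points"].median())     # 90.0
```

Both routes arrive at roughly the same “true” score here, which is exactly the point: filters remove untrustworthy rows, while the median resists the skew of whatever outliers remain.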