Thursday, September 28, 2023

# A Real Data Merge Challenge (Excel)

Instead of demonstrating what each of the thousands of functions does in Excel, it’s most helpful, I think to demonstrate how to solve real-life scenarios and the formulas that come in handy as secondary. Today, I present this.

The Challenge: You are given 2 datasets of fruits and their prices. One set includes some items/fruits that the other may or may not have! If they appear on both lists, the 2nd list (list2) has the latest price. If the list2 doesn’t have the fruit listed but appears in 1st list (list1), we have to use that fruit and its price from list1. List2 may or may not have new fruits listed and the number of items in list1 and list2 cannot be assumed to be same. And imagine the lists are large…meaning, you cannot practically do this manually and your solution has to be scalable to millions of rows if needed.

Your Objective: Create one definitive list of fruits that contain ALL the fruits from list1 and list2, and with the correct and latest prices of pre-existing fruits, including adding the latest fruits and their prices if any.

The Solution: This is a good example of forethought: how to arrange the data, and what’d be a practical yet efficient way to produce the final list. There are numerous ways to achieve this, but here I’ll share one of my methods.

The Lists: The two lists provided are shown below and I’ve kept each of them as a separate tab in the workbook (this is not necessary for you but will help in following along below). NOTE for brevity I’ve kept the # of rows to minimal but the solution I offer is fully scalable to very large datasets.

List 1 (tab name: list1):

List 2 (tab name: list2):

As you notice some fruits appear only in List1, some only in List2, and there are some fruits that appear on both but the prices have changed! The data ranges are also not the same.

The Steps

First, we have to have one full list of all items (fruits). So, let’s start a new blank sheet and create a new field called “Fruit (Combined)”…which as you’ve guessed will include the combined list of fruits from list1 and list2.

The new column looks like this:

Yes, it includes duplicate entries. We need them now as the prices may be different depending on which list and we need to consider them both. More on that later.

At this point, I’d like to make this into a Table. Adding new columns and auto-filling formulas to it, and sorting will be easier as we’ll see later.

Let’s add a new column (col) in the table “From List1” to bring in the prices from List1  using Vlookup, something like: VLOOKUP(A7,list1!\$A\$2:\$B\$26,2,FALSE)

Basically, I’m saying: Get me the price of each fruit in my combined list by looking at list1. In our case, not all fruits will be found in list1 or list2 since they’re dispersed with overlaps and gaps amongst the two lists.

Therefore, all fruits that are NOT in list1 will show #N/A otherwise, it’ll show their prices from List1.

Now, let’s add another col in the same table and call it “From List2“. This is to bring in the prices from List2 tab using Vlookup. Similarly : =VLOOKUP(A7,list2!\$A\$2:\$B\$26,2,FALSE)

Remember: For prices that show for both of these new cols, we only must consider the latest numbers from List2 col lookup values.

For prices that show #N/A in “From List1” col  but a value in “From List2” means we only need to consider the  latest numbers from List2 col lookup values.

So, for all values first, we consider prices from “From List2” col. Only if it contains “#N/A”, then we consider the associated value from “From List1” col.

The table now looks like this (NOT exactly as you’ll have some duplicate items of some fruits (I cleaned it up before this snapshot, which I’ll explain below):

Then we add one more (final) col in the table called “Updated Price” and there we only bring in the price value from “From List2” col UNLESS it’s “#N/A” in which case, we bring in the value from “From List1“. By our definition and configuration then we can undeniably derive that #N/A cannot possibly exist in both From List1 and From List2 for the same fruit.

So, I need to apply a logic that says, “if the From List2 column contains a price, use it, otherwise use the price from From List1 column (meaning, it’s a fruit that existed only in List1 and no change in price).

This is where Excel’s IFNA() formula comes in very handy. I can say something like: IFNA(C7,B7) and copy the formula down to whatever range I like!

With the Updated Price col, the table now looks like this:

Don’t worry if you have some duplicate fruits in your table at this point…in fact, if you’re following these steps, you should. I just removed them in later steps and took the snapshot.

Now we have the full list of fruits and final prices but only need to remove the duplicates. Doing that is the easiest part of all 🙂 Click on Fruit (Combined) col header and choose Data->Remove duplicates by selecting the combined Fruit (Combined) col.

Now the table’s Fruit (Combined) and Updated Price cols are part of the cleaned, merged, final table/dataset!

For presentation/aesthetic purposes, you could hide the helper vlookup cols. Since the final col Updated Price is calculated dependent on the vlookup col, you cannot delete them unless you convert the Updated Price col from formulas to values. If you do, the final table should like this (showing sorted by Fruit column).

Hope this was useful. If you have other methods to resolve this problem, please share it. As always, you can get the original data and solutions by contacting me and/or donating from the Home page of this site.