Identifying anomalies, duplicates, and updating data necessitates comparing information from various sources. Accurate execution of these tasks is crucial, whether working solely with spreadsheets or using a mix of tools and languages like databases and web services. In this post, I will demonstrate various methods for comparing lists of identical or differing sizes across different scenarios, primarily using Excel, and as an added bonus, I’ll show how to harness Python’s capabilities to pinpoint differences in similar datasets.
Why is comparing lists accurately so critical? It’s probably best to exemplify by some scenarios. Let’s look at some examples.
Scenario 1:
METHOD 1: Do it in one shot (using ranges in the formula, not row by row and copying the formula down). This method is useful to find ANY differing value ANYWHERE in the range quickly. Different range sizes are okay.
Explanation: We compare two ranges with IF() and mark the differences, if any, with 1s, otherwise, they’ll appear as 0s. The formula takes this syntax: IF(B10:E22=G10:J22, 0,1) where List 1 is in range B10:E22, and List 2 is in range G10:J22. You should adjust these values according to your data layout.
Then, optionally, apply conditional formatting to highlight the differences with color. The results of two different comparisons are shown below:
Shortcoming: This method does not return the actual differing values, just marks the position in resulting grid (which is same as the first or larger range) with 1, so we need to refer back to ranges for actual differing value(s).
METHOD 2: Use IF(AND()) in first row and copy it down to rows needed.
Explanation: We compare each pair of cells (from list 1 and list 2) per corresponding columns, per row using IF(AND()…). If there’s a difference, we mark that row as different (with any string we want), otherwise, we leave the resulting row blank.
Shortcoming: It won’t tell you immediately which pairs triggered the difference in the row.
METHOD 3: Using just conditional formatting. This is one of the easiest but requires putting the values one after the other (by copy/paste or by VSTACK()). Here, we use VSTACK(List1 range, List2 range)…this places them vertically one after another in a new range. Then apply conditional formatting->Highlight Cell Rules->Duplicate Values.
Explanation: The duplicate values are highlighted accordingly, leaving the unique values unformatted and easily identifiable. This has the advantage of seeing the values as well as the cells where they differ by columns and relative row numbers.
METHOD 4: Using Lookup function (Xlookup or Vlookup)
Here we have two lists of items and prices for example: List 1, List 2. List 1 and 2 may or may not have duplicate items. List 2 contains the latest prices but only items with changed prices are listed in List 2 (not everything!).
Objective: Some items appear only in List 1, some only in List 2, and there are some fruits that appear on both but the prices have changed! The data ranges are also not the same. We need to create an updated list of prices list that has the latest prices for any changes from List 2, and the updated list must contain all items (fruits) with latest prices, and must have no duplicates!
The first challenge is: We need to find the differences between the lists (then we can merge with new values for items that changed in prices).
This requires a set of steps followed carefully but it has the advantage of being accurate and capable of handling complex scenarios of comparing, merging, and removing duplicates. This method deserves its own post, so I created one with step-by-step details here: A Real Data Merge Challenge (Excel)
What if the datasets or lists are in different sheets?
We can employ any of the above methods even if the lists, tables, or ranges are in different sheets. For example, in a new sheet, use the formula: IF(sheetA!A1:C48=sheetB!A1:C48,”0″,1)
In this example, the sheets are named ‘sheetA’ and ‘sheetB’ respectively and the data range in each is A1:C48. So, it’s easy to adjust the range and sheet names as per your need. The idea is that any difference in a cell (in any column or row) will appear as 1, otherwise it’ll be 0. After using the formula, it may be even more helpful to apply conditional formatting as well for the differences (if any) to jump out, especially when the datasets are large. However, as explained in METHOD 1, this will not actually retrieve the actual values that differ, rather show you relative position of the cells that differ.
In essence, there are pros and cons with each of these methods, the more details you want, the more it’ll take, the quicker you want results, the less details it’ll contain…so, choose what works best for your scenario.
With that, I want to introduce another challenge…that is comparing two different Excel workbooks or Excel sheets in the same or different workbooks, or different CSV files.
How to compare different XLSX (Excel) files, sheets in Excel files, or CSV files:
This can be efficiently done in Python, and much quicker than in Excel (even when using PowerQuery and such). The method I use is to use pandas library that provides us abilities to read and compare CSV and Excel files. Be sure to install and import the library first of course:
import pandas as pd
The idea is to use pandas.read_excel() function if they you’re comparing two different Excel workbooks. e.g.
file1 = 'pandastest_1.xlsx'
file2 = 'pandastest_2.xlsx'
df1 = pd.read_excel(file1)
df2 = pd.read_excel(file2)
If you’re comparing different sheets in the same workbook or even different workbooks, use pandas.read_excel() again but set sheet_name attribute to the name of the sheet. e.g.
workbook = 'SheetsDiffTest.xlsx'
sheet1 = 'Sheet1'
sheet2 = 'Sheet2'
df1 = pd.read_excel(workbook, sheet_name=sheet1)
df2 = pd.read_excel(workbook, sheet_name=sheet2)
The above snippets just opens the files, or sheets into memory into pandas dataframes. Once they’re loaded into dataframes, we have poweful features at our fingertips to manipulate, reshape, plot, convert the data and more. In our case, all we need is the compare method in pandas. But first, it’s important to note that this method should be used when the number of columns in the files, or worksheets you’re comparing are the same; the rows can be different…and that’s exactly what this method compare() is designed for. You can check if the columns are the same or not by:
df2 = df2[df1.columns] # returns True if same number of columns, otherwise False
Then call compare() method on a dataframe to compare with the other and save the result into a new dataframe as:
diff_df = df1.compare(df2)
Then just display the results in your Python IDE/shell window with:
print(diff_df)
If the files are identical, output will be something like:Empty DataFrame
Columns: []
Index: []
Otherwise, you’ll get the row number, and the cell values that differ. This is really powerful and fast. The caveat here is that it requires a Python development environment and knowledge of Python.
These are just some of the ways to identify differences across lists, tables, data ranges, sheets in a workbook, different workbooks/spreadsheets, CSV files. The actual method you decide to use should depend on your needs and of course one that yields accurate results every time regardless of the effort it takes. I hope this was educational.
Related: