Wednesday, September 11, 2024

# How to treat Excel like a database to do complex SQL-like queries

One of the most under-utilized, most powerful query features come from Excel’s database functions. While primary usage of Excel is more as a traditional spreadsheet, and it myriad of formulas (and VBA macros), the database functions allow us to treat it like a database and query from its dataset much like SQL statements would from a typical database table. In this blog, I’ll perform just a sample of such queries to demonstrate their power.

I have a dataset of 100,000 records of product sales as below:

There’s no grouping, sorting, or any particular arrangement. I want to retrieve some results from this to answer the following questions:

• Query 1: Find the maximum sales where OrderID >230000 AND quantity greater than OR 1 AND Sales <1000
• Query 2: Find the orderID where sales were between a range of 10 and 30 inclusive
• Query 3: Find an  order where sales were BELOW average
• Query 4: Get the total quantities sold of products  whose names start with Apple
• Query 5: Get total sales amount within a time-range AND only ordered from Seattle!

I even won’t turn it into a table, pivot table, or anything, but query right out of this large dataset as-is. Best of all, I won’t be using any SQL queries either without depending on any database application. It may sound like we’d need some gnarly formulas to answer the above queries, but with D-functions, it’s very quick and easy. They handled hundreds of thousands of records in my tests without any lag! I’m talking about DGET(), DSUM(), DMIN(), DAVERAGE(), etc.

These are great for quickly retrieving quantifiable results; not suitable for getting array or categorical results however. But for queries as above, they work perfectly with minimal formulas. The key is to arrange a table correctly that’s used for the criteria first. Then apply the appropriate function with those constraints. Let’s go through each of them and look at their results.

Query 1: Find the maximum sales where OrderID# is larger than 230000 AND quantity sold is greater than 1 OR equal to 1 AND Sales  was under \$1000.

So, if we arrange the column headers exactly as in the dataset to query, and enter each field’s condition under neath, and specify that range as the condition, Excel will retrieve the matching data and then apply the formula we specify to do the calculation on the retrieved data.

So, we want OrderID field to have criteria >230000, Qty field to >=1, and Sales to be <1000. Next, we put those criteria in a few cells and call this function: DMAX(data!A1:G100001,”Sales”,A23:C24)

where the dataset is in a sheet called “data” and its range is A1 to G100001. The Sales column is called Sales. The criteria table is then arranged in A23 to C24 including the header. The headers in the criteria must match the headers in the data we’re querying so Excel knows what condition to apply to which field (we could also do it by column number instead of name). The criteria table is below and the answer is in blue cell:

Query 2: Find the orderID where sales were between a range of \$10 and \$30 inclusive.

Translating the question into a criteria table, we have:

Notice the power of adding multiple conditions within each cell above! Answer is retrieved using: DMIN(data!A1:G100001,”OrderID”,A44:B45)…which says, look at the data range specified as 1st parameter, give me the OrderID if there’s any matching record as specified in the criteria table in range A44 to B45. And there’s the answer…order# 141235

Query 3: Find an  order where sales were BELOW average

For this, we just have one field sales to apply the condition, however, in order to compare average of 100,000 sales we need to calculate it and then find the ones that are BELOW that figure. Without having to do any average calculations separately, or having any helper columns, we can just compute the average using AVERAGE(sales datarange) and put that in the criteria table’s cell dynamically and tell Excel to

Using DMIN(data!A1:G100001,”OrderID”,A50:A51) where A51 contains the result of the AVERAGE() function, it gives us the order number!

Query 4: Get the total quantities sold of products  whose names start with Apple

Note that for this one, we want to do a partial match on product name as we’re looking for any product name that starts with “Apple” in our dataset. Happily, Excel allows wildcards as well here! So, the following criteria table gives as the result: 8,415 items were sold whose product names start with “Apple”!

Query 5: Get total sales amount within a time-range AND only ordered from Seattle!

To specify a time-range, in this case, we want monthly data for January 2019 where orders came from Seattle. If you notice that the date column in the dataset does not have Month, Day, Year separated. No problem! Also, notice that the Address field contains a long string of the full address and NOT broken up by zip code, city, etc. Still, not a problem at all for Excel! Although we could split them into separate fields using Excel’s other functions, why bother? We can use a wildcard here and if we specify starting boundary of date and the ending boundary, we don’t even need to know the month or filter for January!

And just like that, we have our answer:

Using DSUM() we are taking the Sales value of each record with “Seattle” as a word in the Address field and keep adding them throughout the dataset in just one go!

These are amazingly powerful and performant functions. Without these, we’d have to resort to splitting out the columns, use additional functions like Xlookup, Vlookup, Filter, Search, Match, Index, etc. or use PowerQuery, Pivot tables, etc. to extract the answers.

Based on what I see out there, I feel these are seriously under-utilized and most solutions resort to formulas and pivots right away when they’re not really needed. Hope you found this information/tip interesting.

This post is not meant to be a formal tutorial, instead it is  to offer key concepts, and approaches to problem-solving. Basic->Intermediate technical/coding knowledge is assumed. If you like more info or source file to learn/use, or need more basic assistance, you may contact me at tanman1129 at hotmail dot com. To support this voluntary effort, you can donate via Paypal from the button in my Home page, or become a Patron via my Patreon site. A supporter/patron gets direct answers on any of my blogs including code/data/source files whenever possible. At the very least, it’ll encourage me to share more posts/tips/knowledge in the future.