Joining tables based on complex conditions in Power Query

Mar 20, 2023
#power-bi #power-query

Going beyond field equality-based join conditions in Power Query.

Introduction

Power Query lets you join tables based on one or more columns and using a variety of join types (inner, left outer, etc.). The join condition, however, is always based on value equality.

This is sufficient in most scenarios, such as a when joining tables like Customers and Countries based on a common country code column. Any other type of condition, such as “greater than”, wouldn’t make sense here.

“Equality-based join”
A simple equality-based join

What about the case where the join condition becomes more complex to a point that an equality rule is no longer sufficient? Power Query does not guide you towards a way to implement other types of conditions, such as “greater than” or “on or before” for date columns.

Problem example

Let’s form an example with two tables. The first one, Sales, represents sold products:

Date Product Quantity Unit Price
1/20/2023 Yellow sweater 3 $48.00
1/25/2023 Yellow sweater 1 $48.00
2/07/2023 Green hat 1 $10.00
3/20/2023 Yellow sweater 5 $75.00

The second one, CostOfGoods, is a historical record of your supplier’s monthly price list. Note how there is no relation between the two tables, and especially the fact that dates do not match up:

Date Product Cost
1/1/2023 Yellow sweater $15.00
1/1/2023 Green hat $3.00
2/1/2023 Yellow sweater $17.00
2/1/2023 Green hat $4.00
3/1/2023 Yellow sweater $23.00
3/1/2023 Green hat $4.50

Our goal is to join the two tables such that each Sales row contains a new Unit Cost column, whose value will be the supplier’s cost at the time of the sale. This will enable us to easily calculate a margin later on.

Walkthrough

The logic to determine a sold product’s cost, in general terms, is simple:

  • Look for rows matching the product
  • Keep rows dated on or before the sale date
  • Take the most recent row by date

Expressing such logic in Power Query can easily be done by writing a function in M code. Let’s write this function, and then see how we can use it to populate our Sales table!

To start, create a new Blank Query and open it in the Advanced Editor. The GetItemCostOnDate function definition is as follows:

1
2
3
4
5
6
7
8
9
(CostOfGoods as table, Product as text, SalesDate as date) =>
    let
        #"Product filter" = Table.SelectRows(CostOfGoods, each ([Product] = Product)),
        #"Date filter" = Table.SelectRows(#"Product filter", each [Date] <= SalesDate),
        #"Sort by date desc" = Table.Sort(#"Date filter",{{"Date", Order.Descending}}),
        #"Most recent entry" = Table.First(#"Sort by date desc"),
        #"Cost" = #"Most recent entry"[Cost]
    in
        #"Cost"

Notice how the first input to this function is a reference to a table. The caller will be expected to pass CostOfGoods as an argument. While the function could access this table directly, this approach enables a significant performance optimization that will be explained later.

The function performs multiple steps to identify the correct row in which to find the product’s cost. This is where you can use the full range of M functions to express your logic (substring matching, …)!

Once saved, you should see Power Query offering you a form to invoke the function in an ad-hoc manner. Try it out to confirm it works!

&ldquo;Function invocation UI&rdquo;
UI to invoke the function on-demand

With the function now available, we can now use it to populate a new Cost column in our Sales table. Open the Advanced Editor of the Sales table and insert these steps where adding the column is desired:

1
2
BufferedCostOfGoods = Table.Buffer(CostOfGoods),
#"Invoke GetItemCostOnDate" = Table.AddColumn(#"Previous Step", "Unit Cost", each GetItemCostOnDate(BufferedCostOfGoods, [Product], [Sales Date])),

Things to note:

  • The CostOfGoods table is buffered before being passed as a reference in the next step. This will create an in-memory copy of the table and re-use it across function invocations, which can bring a significant performance improvement.
  • The GetItemCostOnDate function is called within an each statement, i.e. it will be executed for each row in our Sales table.

This gives us the desired table:

Date Product Quantity Unit Price Unit Cost
1/20/2023 Yellow sweater 3 $48.00 $15.00
1/25/2023 Yellow sweater 1 $48.00 $15.00
2/07/2023 Green hat 1 $10.00 $4.00
3/20/2023 Yellow sweater 5 $75.00 $23.00

Final words

Encapsulating complex join logic in a function isn’t the only way to achieve this task. When consuming the dataset in Power BI for example, an equivalent result could be achieved using DAX. In my opinion however, this approach is simpler and easier to trust.

An important factor to consider in evaluating this approach is that it doesn’t enable integrating report parameters in the join/filtering logic. This technique pre-computes values before the user can specify report filters, and as such the logic cannot depend on report filter selections.

Note that the function can also return an entire row (or even multiple rows), such that the “main” table could itself decide which columns to import, possibly how to aggregate values as well.

More reading

Table.Buffer (learn.microsoft.com)