Selecting the Latest Record in a Large Version History Table

This demonstrates the use of a window function to perform a select of the latest version of a record with a single query and no joins

Challenge

A table that contains many versions of each record is available but you must use the latest version.

Discussion

This problem could be solved by selecting the ID and MAX update date into a temporary table. Then that temporary table could be INNER JOINED back to the history table to obtain the result. Unfortunately, this requires two steps and storing an intermediate table that has no function other than finding the latest update.

The more elegant solution to perform this operation in a single query uses a Window Function with sort plus a filter.

Solution

The version history table

employee_iddepartmentsalaryupdate_date
3IT900002024-09-17
2HR850002024-09-17
5HR820002024-09-17
3IT770002023-10-01
3IT750002022-10-04
5IT720002024-07-12
2IT670002024-03-18
1Sales620002022-02-28
5Sales600002023-01-14
4Sales580002021-11-19

Step Setup

Using an extract step, create a window function expression in a column called Rank like:

func.rank().over(order_by=table.updated_date.desc(), partition_by=table.employee_id)

On the filter tab in the Extract step, set a filter like:

table.Rank == 1

The Result

employee_iddepartmentsalaryupdate_dateRank
3IT900002024-09-171
2HR850002024-09-171
5HR820002024-09-171
1Sales620002022-02-281
4Sales580002021-11-191

This approach is highly efficient and allows selection of the latest record in a multi-version history table in a single step. This works by ranking each record within the employee_id group by the update_date and then only picking the first record.

If there are multiple columns that make up the unique row key, you can add them to the partition_by argument as a list like:

partition_by=[table.first_column, table.second_column, table.third_column]

If you need to apply multi-column sorts you can apply that with a list of columns too like:

order_by=[table.first_column.desc(), table.second_column, table.third_column.desc()]