Four ways to forward-fill values in T-SQL (the last non NULL problem)

In this post, we will look at how to forward-fill values in T-SQL, also known as the last non-NULL problem.

By forward-filling, we're taking the previous row's value and using it in the current row if the current row value is NULL - in effect carrying the last non-NULL value forward.

The table below demonstrates forward-filling:

| value_with_nulls | forward_filled |
|------------------|----------------|
| 12               | 12             |
| NULL             | 12             |
| NULL             | 12             |
| 93               | 93             |
| 27               | 27             |
| NULL             | 27             |

Setting up the data

Each solution listed will use the same example data, set up as a temporary table in the SQL below.

DROP TABLE IF EXISTS #demo_data
CREATE TABLE #demo_data (
    event_id INT
    ,measured_on DATE
    ,measurement INT
)

INSERT INTO #demo_data
VALUES
(1,'2021-06-06',NULL)
,(1,'2021-06-07', 5)
,(1,'2021-06-08',NULL)
,(1,'2021-06-09',NULL)
,(2,'2021-05-22',42)
,(2,'2021-05-23',42)
,(2,'2021-05-25',NULL)
,(2,'2021-05-26',11)
,(2,'2021-05-27',NULL)
,(2,'2021-05-27',NULL)
,(3,'2021-07-01',NULL)
,(3,'2021-07-03',NULL);

There's just one thing the data needs, and that's a column to order the rows. In this data, I've used a date column, measured_on.

I've also included a column event_id. While it's not strictly required, it adds a grouping element that is more like real-world data.

Here's what the final result set should look like:

| event_id | measured_on | measurement | forward_filled |
|----------|-------------|-------------|----------------|
| 1        | 2021-06-06  | NULL        | NULL           |
| 1        | 2021-06-07  | 5           | 5              |
| 1        | 2021-06-08  | NULL        | 5              |
| 1        | 2021-06-09  | NULL        | 5              |
| 2        | 2021-05-22  | 42          | 42             |
| 2        | 2021-05-23  | 42          | 42             |
| 2        | 2021-05-25  | NULL        | 42             |
| 2        | 2021-05-26  | 11          | 11             |
| 2        | 2021-05-27  | NULL        | 11             |
| 2        | 2021-05-27  | NULL        | 11             |
| 3        | 2021-07-01  | NULL        | NULL           |
| 3        | 2021-07-03  | NULL        | NULL           |

If you'd like to jump straight into a solution, here is the complete list

1. With a subquery in the SELECT clause

The first method uses a subquery inside the SELECT clause to get the first non-NULL value before the current row.

First, we create a subquery that returns the first non-NULL value before the current row.

SELECT
    *
    ,(SELECT TOP 1
            inner_table.measurement
        FROM
            #demo_data as inner_table
        WHERE
                inner_table.event_id = #demo_data.event_id
            AND inner_table.measured_on < #demo_data.measured_on
            AND inner_table.measurement IS NOT NULL        ORDER BY
            inner_table.measured_on DESC) as step_1FROM
    #demo_data
| event_id | measured_on | measurement | step_1 |
|----------|-------------|-------------|--------|
| 1        | 2021-06-06  | NULL        | NULL   |
| 1        | 2021-06-07  | 5           | NULL   |
| 1        | 2021-06-08  | NULL        | 5      |
| 1        | 2021-06-09  | NULL        | 5      |
| 2        | 2021-05-22  | 42          | NULL   |
| 2        | 2021-05-23  | 42          | 42     |
| 2        | 2021-05-25  | NULL        | 42     |
| 2        | 2021-05-26  | 11          | 42     |
| 2        | 2021-05-27  | NULL        | 11     |
| 2        | 2021-05-27  | NULL        | 11     |
| 3        | 2021-07-01  | NULL        | NULL   |
| 3        | 2021-07-03  | NULL        | NULL   |

The last non-NULL value carries forward but only after the starting row.

To fix this, we wrap the subquery in a CASE statement which returns the subquery if the current value is NULL; otherwise, the non-NULL value.

SELECT
    *
    ,CASE 
        WHEN measurement IS NULL THEN (            SELECT TOP 1
                inner_table.measurement
            FROM
                #demo_data as inner_table
            WHERE
                    inner_table.event_id = #demo_data.event_id
                AND inner_table.measured_on < #demo_data.measured_on
                AND inner_table.measurement IS NOT NULL
            ORDER BY
                inner_table.measured_on DESC
        )
    ELSE        measurement    END as forward_filled
FROM
    #demo_data
| event_id | measured_on | measurement | forward_filled |
|----------|-------------|-------------|----------------|
| 1        | 2021-06-06  | NULL        | NULL           |
| 1        | 2021-06-07  | 5           | 5              |
| 1        | 2021-06-08  | NULL        | 5              |
| 1        | 2021-06-09  | NULL        | 5              |
| 2        | 2021-05-22  | 42          | 42             |
| 2        | 2021-05-23  | 42          | 42             |
| 2        | 2021-05-25  | NULL        | 42             |
| 2        | 2021-05-26  | 11          | 11             |
| 2        | 2021-05-27  | NULL        | 11             |
| 2        | 2021-05-27  | NULL        | 11             |
| 3        | 2021-07-01  | NULL        | NULL           |
| 3        | 2021-07-03  | NULL        | NULL           |

2. Using Outer Apply

This solution is similar to the above but uses OUTER APPLY in place of the subquery in the SELECT clause.

If you're unfamiliar with CROSS & OUTER APPLY in t-sql, these apply their subquery results to each row of the table to the left.

SELECT
    *
FROM
    #demo_data
        OUTER APPLY (
            SELECT TOP 1
                inner_resultset.measurement as forward_filled
            FROM
                #demo_data as inner_resultset
            WHERE
                    inner_resultset.event_id = #demo_data.event_id
                AND inner_resultset.measured_on <= #demo_data.measured_on
                AND inner_resultset.measurement IS NOT NULL
            ORDER BY
                inner_resultset.measured_on DESC
        ) as first_non_null
| event_id | measured_on | measurement | forward_filled |
|----------|-------------|-------------|----------------|
| 1        | 2021-06-06  | NULL        | NULL           |
| 1        | 2021-06-07  | 5           | 5              |
| 1        | 2021-06-08  | NULL        | 5              |
| 1        | 2021-06-09  | NULL        | 5              |
| 2        | 2021-05-22  | 42          | 42             |
| 2        | 2021-05-23  | 42          | 42             |
| 2        | 2021-05-25  | NULL        | 42             |
| 2        | 2021-05-26  | 11          | 11             |
| 2        | 2021-05-27  | NULL        | 11             |
| 2        | 2021-05-27  | NULL        | 11             |
| 3        | 2021-07-01  | NULL        | NULL           |
| 3        | 2021-07-03  | NULL        | NULL           |

Note that if we use CROSS APPLY instead of OUTER APPLY, we will eliminate the rows with NULL inside the column we're trying to carry forward.

3. Combining window functions and subqueries

The first component of this approach creates a column that groups the last non-NULL and NULL rows by event_id.

To create this column, we use the COUNT aggregate function with the OVER clause. OVER turns the COUNT into a window function and applies a COUNT function per group of event_id rows.

SELECT
    *
FROM
    (
        SELECT
            event_id
            ,measured_on
            ,measurement
            ,COUNT(measurement) OVER (PARTITION BY event_id ORDER BY measured_on) as grouper        FROM
            #demo_data
    ) as grouped

Run this query. You'll see that the grouper column increments only if a value is inside the measurement column and per event_id.

| event_id | measured_on | measurement | grouper |
|----------|-------------|-------------|---------|
| 1        | 2021-06-06  | NULL        | 0       |
| 1        | 2021-06-07  | 5           | 1       |
| 1        | 2021-06-08  | NULL        | 1       |
| 1        | 2021-06-09  | NULL        | 1       |
| 2        | 2021-05-22  | 42          | 1       |
| 2        | 2021-05-23  | 42          | 2       |
| 2        | 2021-05-25  | NULL        | 2       |
| 2        | 2021-05-26  | 11          | 3       |
| 2        | 2021-05-27  | NULL        | 3       |
| 2        | 2021-05-27  | NULL        | 3       |
| 3        | 2021-07-01  | NULL        | 0       |
| 3        | 2021-07-03  | NULL        | 0       |

To forward-fill, all we do is retrieve the MAX value by the new grouper column.

SELECT
    event_id
    ,measured_on
    ,measurement
    ,MAX(measurement) OVER (PARTITION BY event_id, grouper) as forward_filledFROM
    (
        SELECT
            event_id
            ,measured_on
            ,measurement
            ,COUNT(measurement) OVER (PARTITION BY event_id ORDER BY measured_on) as grouper
        FROM
            #demo_data
    ) as grouped
ORDER BY
    event_id
    ,measured_on
| event_id | measured_on | measurement | forward_filled |
|----------|-------------|-------------|----------------|
| 1        | 2021-06-06  | NULL        | NULL           |
| 1        | 2021-06-07  | 5           | 5              |
| 1        | 2021-06-08  | NULL        | 5              |
| 1        | 2021-06-09  | NULL        | 5              |
| 2        | 2021-05-22  | 42          | 42             |
| 2        | 2021-05-23  | 42          | 42             |
| 2        | 2021-05-25  | NULL        | 42             |
| 2        | 2021-05-26  | 11          | 11             |
| 2        | 2021-05-27  | NULL        | 11             |
| 2        | 2021-05-27  | NULL        | 11             |
| 3        | 2021-07-01  | NULL        | NULL           |
| 3        | 2021-07-03  | NULL        | NULL           |

4. Using a Recursive CTE

This solution uses a more advanced technique known as a Recursive CTE. Recursive CTEs are a special kind of Common Table Expression in which the CTE references itself. The goal of this post is not to explain CTEs, so I'll assume you're familiar with them.

There's a bit to unpack in this solution, but let's start with the recursive CTE itself.

We start with creating a base query and include a column with an incrementing row number per event_id. We call this column idx

;WITH base_data AS (
    SELECT 
        ROW_NUMBER() OVER (PARTITION BY #demo_data.event_id ORDER BY #demo_data.measured_on) AS idx        ,#demo_data.event_id
        ,#demo_data.measured_on
        ,#demo_data.measurement
    FROM
        #demo_data
)

To set up the recursive query, we first retrieve the first row of each event_id. Then we UNION ALL the remaining rows from the base query.

In the FROM clause, we INNER JOIN the recursing query and offset the idx column by 1. This offsetting allows us to access the previous row.

To achieve the forward-filling, we use ISNULL in the SELECT statement to substitute the previous row's value if the current row value is NULL.

;WITH base_data AS (
    SELECT 
        ROW_NUMBER() OVER (PARTITION BY #demo_data.event_id ORDER BY #demo_data.measured_on) AS idx        ,#demo_data.event_id
        ,#demo_data.measured_on
        ,#demo_data.measurement
    FROM
        #demo_data
), recursing_query AS (
    SELECT 
        idx
        ,event_id
        ,measured_on
        ,measurement
    FROM 
        base_data
    WHERE 
        idx = 1    
    UNION ALL

    SELECT 
        base_data.idx
        ,base_data.event_id
        ,base_data.measured_on
        ,ISNULL(base_data.measurement, recursing_query.measurement)    FROM
        base_data
            INNER JOIN recursing_query
                ON recursing_query.event_id = base_data.event_id                 AND recursing_query.idx = base_data.idx - 1)

Now we can SELECT the values from the CTE. The ORDER BY ensures the rows come out as expected.

;WITH base_data AS (
    SELECT 
        ROW_NUMBER() OVER (PARTITION BY #demo_data.event_id ORDER BY #demo_data.measured_on) AS idx
        ,#demo_data.event_id
        ,#demo_data.measured_on
        ,#demo_data.measurement
    FROM
        #demo_data
), recursing_query AS (
    SELECT 
        idx
        ,event_id
        ,measured_on
        ,measurement
    FROM 
        base_data
    WHERE 
        idx = 1
    
    UNION ALL

    SELECT 
        base_data.idx
        ,base_data.event_id
        ,base_data.measured_on
        ,ISNULL(base_data.measurement, recursing_query.measurement)
    FROM
        base_data
            INNER JOIN recursing_query
                ON recursing_query.event_id = base_data.event_id 
                AND recursing_query.idx = base_data.idx - 1
)
SELECT
    recursing_query.event_id
    ,recursing_query.measured_on
    ,recursing_query.measurement as forward_filled
FROM
    recursing_query
ORDER BY    recursing_query.event_id    ,recursing_query.measured_on
| event_id | measured_on | forward_filled |
|----------|-------------|----------------|
| 1        | 2021-06-06  | NULL           |
| 1        | 2021-06-07  | 5              |
| 1        | 2021-06-08  | 5              |
| 1        | 2021-06-09  | 5              |
| 2        | 2021-05-22  | 42             |
| 2        | 2021-05-23  | 42             |
| 2        | 2021-05-25  | 42             |
| 2        | 2021-05-26  | 11             |
| 2        | 2021-05-27  | 11             |
| 2        | 2021-05-27  | 11             |
| 3        | 2021-07-01  | NULL           |
| 3        | 2021-07-03  | NULL           |

At this point, we've forward-filled. To arrive at a solution that includes the original last non-NULL values, LEFT JOIN the base data.

--- ... Recursive setup

SELECT
    recursing_query.event_id
    ,recursing_query.measured_on
    ,base_data.measurement    ,recursing_query.measurement as forward_filled
FROM
    recursing_query
        LEFT OUTER JOIN base_data            ON base_data.event_id = recursing_query.event_id            AND base_data.idx = recursing_query.idxORDER BY
    recursing_query.event_id
    ,recursing_query.measured_on

Conclusion

As we've seen, there are multiple ways to solve the last non-NULL problem. The best solution will likely depend on the profile of your data.

Further Reading

© 2024 Andrew Villazon. All rights reserved.