Identify non-consecutive values in SQL Server

In this two-part series, we look at different solutions to a SQL problem - how to identify consecutive or non-consecutive values in a column. Also known as the Gaps and Islands problem.

Understanding the Gaps and Islands problem is useful for analyzing data that features sequences or breaks in sequences.

This post, part one of the series, will look at methods for identifying non-consecutive values (Gaps). In part two, we explore ways to identify consecutive values (Islands).

What is the Gaps and Islands problem?

As the name implies, there are two components.

Gaps - rows where a row value does not sequentially follow another...

1
289
10
111516
1722

...and Islands - rows where a row value follows another in an unbroken succession.

12
891011
15161722

Identifying Gaps

To identify Gaps, we need to determine where a sequence ends, and another begins. In between these points are where a gap starts and ends.

In general, each approach compares the current row with the next row to determine if it's the next value in the sequence. If it isn't, then we've found a gap.

With the LEAD window function

This approach uses the LEAD window function. The LEAD function lets you access values from rows that follow the current row.

First, we apply the LEAD function to generate a result set of the current row value and next row value.

-- Data setup
DECLARE @sequences TABLE
(
    value_of_interest INTEGER
)
INSERT INTO @sequences
VALUES (1),(2),(3),(6),(7),(9),(11),(12),(20),(25)

-- Solution begins
SELECT
    value_of_interest
    ,LEAD(value_of_interest,1, value_of_interest)         OVER(ORDER BY value_of_interest) as next_row_valueFROM
    @sequences
| value_of_interest | next_row_value |
|-------------------|----------------|
| 1                 | 2              |
| 2                 | 3              |
| 3                 | 6              |
| 6                 | 7              |
| 7                 | 9              |
| 9                 | 11             |
| 11                | 12             |
| 12                | 20             |
| 20                | 25             |
| 25                | 25             |

To identify gaps, we subtract the current row value from the next row value. For rows where a sequence ends, the difference will be greater than 1.

/* Data setup */

SELECT
    value_of_interest
    ,next_row_value
    ,next_row_value - value_of_interest as sequence_ind
FROM
    (
        SELECT
            num_sequence.value_of_interest
            ,LEAD(value_of_interest,1, value_of_interest) 
                OVER(ORDER BY value_of_interest) as next_row_value
        FROM
            @sequences as num_sequence
    ) as curr_and_next
| value_of_interest | next_row_value | sequence_ind |
|-------------------|----------------|--------------|
| 1                 | 2              | 1            |
| 2                 | 3              | 1            |
| 3                 | 6              | 3            || 6                 | 7              | 1            |
| 7                 | 9              | 2            || 9                 | 11             | 2            || 11                | 12             | 1            |
| 12                | 20             | 8            || 20                | 25             | 5            || 25                | 25             | 0            |

To isolate the sequence end and starts, we filter for the rows where the difference is greater than 1.

To arrive at the gaps, we add 1 to the sequence end and subtract 1 from the sequence start.

/* Data setup */

SELECT
    value_of_interest + 1 as gap_starts    ,next_row_value - 1 as gap_endsFROM
    (
    SELECT
        num_sequence.value_of_interest
        ,LEAD(value_of_interest,1, value_of_interest) 
            OVER(ORDER BY value_of_interest) as next_row_value
    FROM
        @sequences as num_sequence
    ) as curr_and_next
WHERE
    next_row_value - value_of_interest > 1
| gap_starts | gap_ends |
|------------|----------|
| 4          | 5        |
| 8          | 8        |
| 10         | 10       |
| 13         | 19       |
| 21         | 24       |

With the ROW_NUMBER function

This approach leverages a Common Table Expression (CTE) joined to itself to create a result set of the current row and next row. Like the above approach, we compare the current row and the next row to identify a gap.

First, we'll create a CTE of the rows and their row number.

/* Data setup */

;
WITH current_row AS
(
SELECT
    num_sequence.value_of_interest
    ,ROW_NUMBER()         OVER(ORDER BY num_sequence.value_of_interest) as row_numFROM
    @sequences as num_sequence
)
SELECT
    value_of_interest
    ,row_num
FROM
    current_row;
| value_of_interest | row_num |
|-------------------|---------|
| 1                 | 1       |
| 2                 | 2       |
| 3                 | 3       |
| 6                 | 4       |
| 7                 | 5       |
| 9                 | 6       |
| 11                | 7       |
| 12                | 8       |
| 20                | 9       |
| 25                | 10      |

To get the current row and the next row, we join the CTE to itself based on the row number plus 1.

/* Data setup */

;
WITH current_row AS
(
SELECT
    num_sequence.value_of_interest
    ,ROW_NUMBER() 
        OVER(ORDER BY num_sequence.value_of_interest) as row_num
FROM
    @sequences as num_sequence
)
SELECT
    current_row.value_of_interest
    ,next_row.value_of_interest as next_row_value
FROM
    current_row
        INNER JOIN current_row as next_row            ON next_row.row_num = current_row.row_num + 1;
| value_of_interest | next_row_value |
|-------------------|----------------|
| 1                 | 2              |
| 2                 | 3              |
| 3                 | 6              |
| 6                 | 7              |
| 7                 | 9              |
| 9                 | 11             |
| 11                | 12             |
| 12                | 20             |
| 20                | 25             |

To identify where a sequence ended, we subtract the current row from the next row. When the sequence ends, the result of the subtraction is greater than 1.

As we did in the previous solution, we isolate the sequence ends and starts by placing the subtraction in the WHERE clause.

/* Data setup */

;
WITH current_row AS
(
SELECT
    num_sequence.value_of_interest
    ,ROW_NUMBER() 
        OVER(ORDER BY num_sequence.value_of_interest) as row_num
FROM
    @sequences as num_sequence
)
SELECT
    current_row.value_of_interest as sequence_ended
    ,next_row.value_of_interest as sequence_began
FROM
    current_row
        INNER JOIN current_row as next_row
            ON next_row.row_num = current_row.row_num + 1
WHERE
    next_row.value_of_interest - current_row.value_of_interest > 1;
| sequence_ended | sequence_began |
|----------------|----------------|
| 3              | 6              |
| 7              | 9              |
| 9              | 11             |
| 12             | 20             |
| 20             | 25             |

To derive the gap start and end points, we add 1 to the sequence end and subtract 1 from the sequence start.

/* Data setup */

;
WITH current_row AS
(
SELECT
    num_sequence.value_of_interest
    ,ROW_NUMBER() 
        OVER(ORDER BY num_sequence.value_of_interest) as row_num
FROM
    @sequences as num_sequence
)
SELECT
    current_row.value_of_interest + 1 as gap_start    ,next_row.value_of_interest - 1 as gap_endFROM
    current_row
        INNER JOIN current_row as next_row
            ON next_row.row_num = current_row.row_num + 1
WHERE
    next_row.value_of_interest - current_row.value_of_interest > 1
;
| gap_start | gap_end |
|-----------|---------|
| 4         | 5       |
| 8         | 8       |
| 10        | 10      |
| 13        | 19      |
| 21        | 24      |

Using Subqueries

To see how this technique works, we'll explore its parts and put them together for a final result set.

The first thing we'll need to do is identify where a sequence ended. To do this, we use a subquery inside the SELECT. The subquery result is the next row's value if it is the next in the sequence; otherwise, NULL.

/* Data setup */

SELECT
    sequences_main.value_of_interest
    ,(        SELECT            sub.value_of_interest        FROM            @sequences as sub        WHERE            sub.value_of_interest = sequences_main.value_of_interest + 1    ) as sequence_end_indFROM
    @sequences as sequences_main

Rows with NULL indicate a sequence ended.

| value_of_interest | sequence_end_ind |
|-------------------|------------------|
| 1                 | 2                |
| 2                 | 3                |
| 3                 | NULL             || 6                 | 7                |
| 7                 | NULL             || 9                 | NULL             || 11                | 12               |
| 12                | NULL             || 20                | NULL             || 25                | NULL             |

Next, we need to determine where the next sequence started (or the end of our gap). We do this with another subquery that returns the first value after the current row.

/* Data setup */

SELECT
    sequences_main.value_of_interest
    ,(
        SELECT
            sub.value_of_interest
        FROM
            @sequences as sub
        WHERE
            sub.value_of_interest = sequences_main.value_of_interest + 1
    ) as sequence_end_ind
    ,(        SELECT            MIN(sub.value_of_interest)        FROM            @sequences as sub        WHERE            sub.value_of_interest > sequences_main.value_of_interest    ) as sequence_start_indFROM
    @sequences as sequences_main
| value_of_interest | sequence_ends | sequence_starts_again |
|-------------------|---------------|-----------------------|
| 1                 | 2             | 2                     |
| 2                 | 3             | 3                     |
| 3                 | NULL          | 6                     |
| 6                 | 7             | 7                     |
| 7                 | NULL          | 9                     |
| 9                 | NULL          | 11                    |
| 11                | 12            | 12                    |
| 12                | NULL          | 20                    |
| 20                | NULL          | 25                    |
| 25                | NULL          | NULL                  |

Now we need to filter for the rows that are NULL in the sequence ends column.

We do this by moving our first subquery to the WHERE clause and combine it with NOT EXISTS. Here we are saying we want the rows where the next row value is not part of a sequence.

/* Data Setup */

SELECT
    sequences_main.value_of_interest
    ,(
        SELECT
            MIN(sub.value_of_interest)
        FROM
            @sequences as sub
        WHERE
            sub.value_of_interest > sequences_main.value_of_interest
    ) as sequence_starts_again
FROM
    @sequences as sequences_main
WHERE NOT EXISTS    (        SELECT            sub.value_of_interest        FROM            @sequences as sub        WHERE            sub.value_of_interest = sequences_main.value_of_interest + 1    )
| sequence_end_ind | sequence_start_ind |
|------------------|-----------------------|
| 3                | 6                     |
| 7                | 9                     |
| 9                | 11                    |
| 12               | 20                    |
| 20               | 25                    |
| 25               | NULL                  |

There's just one problem with this result set. The last row gets included. We fix this by adding another WHERE condition that filters for values less than the maximum value.

/* Data Setup */

SELECT
    sequences_main.value_of_interest
    ,(
        SELECT
            MIN(sub.value_of_interest)
        FROM
            @sequences as sub
        WHERE
            sub.value_of_interest > sequences_main.value_of_interest
    ) as sequence_starts_again
FROM
    @sequences as sequences_main
WHERE NOT EXISTS
    (
        SELECT
            num_sequence_where.value_of_interest
        FROM
            @sequences as num_sequence_where
        WHERE
                num_sequence_where.value_of_interest = sequences_main.value_of_interest + 1
        )
    AND sequences_main.value_of_interest < (SELECT MAX(value_of_interest) FROM @sequences)
| value_of_interest | sequence_starts_again |
|-------------------|-----------------------|
| 3                 | 6                     |
| 7                 | 9                     |
| 9                 | 11                    |
| 12                | 20                    |
| 20                | 25                    |

Great, now we have a result set of rows where a sequence ended, and another began - or where a gap started and ended.

All that's left to is add 1 to the gap start and subtract 1 from the gap end.

/* Data Setup */

SELECT
    sequences_main.value_of_interest + 1 as gap_starts    ,(
        SELECT
            MIN(sub.value_of_interest)
        FROM
            @sequences as sub
        WHERE
            sub.value_of_interest > sequences_main.value_of_interest
    ) - 1 as gap_endsFROM
    @sequences as sequences_main
WHERE NOT EXISTS
    (
        SELECT
            num_sequence_where.value_of_interest
        FROM
            @sequences as num_sequence_where
        WHERE
                num_sequence_where.value_of_interest = sequences_main.value_of_interest + 1
        )
    AND sequences_main.value_of_interest < (SELECT MAX(value_of_interest) FROM @sequences)
| gap_starts | gap_ends |
|------------|----------|
| 4          | 5        |
| 8          | 8        |
| 10         | 10       |
| 13         | 19       |
| 21         | 24       |

Gaps in DATE or DATETIME sequences

What if the sequences are DATE or DATETIME? The solutions are the same but use the DATEDIFF and DATEADD functions.

LEAD approach

DECLARE @date_sequences TABLE (
    date_of_interest DATE
)
INSERT INTO @date_sequences
VALUES
('2021-01-01'),('2021-01-02'),('2021-01-03'),('2021-01-06')
,('2021-01-07'),('2021-01-09'),('2021-01-11'),('2021-01-12')
,('2021-01-20'),('2021-01-25')


SELECT
    DATEADD(DAY,1, date_of_interest) as gap_starts    ,DATEADD(DAY,-1, next_row_value) as gap_endsFROM
(
    SELECT
        date_of_interest
        ,LEAD(date_of_interest,1, date_of_interest) 
            OVER(ORDER BY date_of_interest) as next_row_value
    FROM
        @date_sequences
) as curr_and_next
WHERE
    DATEDIFF(DAY,date_of_interest, next_row_value) > 1

ROW_NUMBER approach

DECLARE @dtm_sequences TABLE (
    dtm_of_interest DATETIME
)
INSERT INTO @dtm_sequences
VALUES
('2021-02-23 00:00:01.000'),('2021-02-23 00:00:02.000'),('2021-02-23 00:00:03.000'),
('2021-02-23 00:00:06.000'),('2021-02-23 00:00:07.000'),('2021-02-23 00:00:09.000'),
('2021-02-23 00:00:11.000'),('2021-02-23 00:00:12.000'),('2021-02-23 00:00:20.000'),
('2021-02-23 00:00:25.000')

;
WITH current_row AS
(
SELECT
    dtm_of_interest
    ,ROW_NUMBER() 
        OVER(ORDER BY dtm_of_interest) as row_num
FROM
    @dtm_sequences
)
SELECT
    DATEADD(SECOND, 1, current_row.dtm_of_interest) as gap_start    ,DATEADD(SECOND, -1, next_row.dtm_of_interest) as gap_endFROM
    current_row
        INNER JOIN current_row as next_row
            ON next_row.row_num = current_row.row_num + 1
WHERE
    DATEDIFF(SECOND, current_row.dtm_of_interest, next_row.dtm_of_interest) > 1;

Subquery approach

DECLARE @date_sequences TABLE (
    date_of_interest DATE
)
INSERT INTO @date_sequences
VALUES
('2021-01-01'),('2021-01-02'),('2021-01-03'),('2021-01-06')
,('2021-01-07'),('2021-01-09'),('2021-01-11'),('2021-01-12')
,('2021-01-20'),('2021-01-25')


SELECT
    DATEADD(DAY,1,sequences_main.date_of_interest) as gap_starts    ,DATEADD(DAY,-1,         (
        SELECT
            MIN(sub.date_of_interest)
        FROM
            @date_sequences as sub
        WHERE
            sub.date_of_interest > sequences_main.date_of_interest
        )
    ) as gap_ends
FROM
    @date_sequences as sequences_main
WHERE NOT EXISTS
    (
    SELECT
        date_sequence_where.date_of_interest
    FROM
        @date_sequences as date_sequence_where
    WHERE
        date_sequence_where.date_of_interest = DATEADD(DAY, 1, sequences_main.date_of_interest)    )
    AND sequences_main.date_of_interest < (SELECT MAX(date_of_interest) FROM @date_sequences)

Conclusion

In this post, we looked at solutions to the Gaps part of the Gaps and Islands problem. The key to identifying Gaps is working out where a sequence ends, and another begins.

One aspect we didn't examine is the performance of each solution. I opted to avoid this to focus on ways to solve the problem.

If you're interested in the performance aspects, I highly recommend the book SQL Server MVP Deep Dives. Chapter 5, Gaps and Islands, by Itzik Ben-Gan, explores the Gaps and Islands solutions in substantial detail. Two solutions in this post are adapted from this chapter.

In the next post we'll look at the opposite problem, identifying consecutive values or Islands.

Further Reading

© 2024 Andrew Villazon. All rights reserved.