2008-08-28

Slow sub-select maybe LAG or LEAD might help

One of the queries in our application is used to edit event entries. Event entries have a start time and an end time, but when they are processed from the data files, they can be entered with just a start time (ie when the event started), but in cases of problems (crashes) these entries will not have matching entries with end time.

So in our editor, we are wanting to know the upper limit of when the event could have ended, which is the next start time.

ROWID    TIME_START   TIME_END  OTHER_FIELDS
1        2008-1-1     null      data
2        2008-1-2     2008-1-3  more data

So for row 1 we want 2008-1-2 as the suggested end time.

Originally we had a query that looked like this:

SELECT other_fields,
  TIME_START,
  TIME_END,
  NVL((
      SELECT MIN(TIME_START)
      FROM some_view
      WHERE MACHINE_ID = T1.MACHINE_ID
        AND TIME_START > T1.TIME_START
    ), TIME_START + 1/24) "NEXT_TIME_START"
  FROM some_view T1
  WHERE MACHINE_ID = machine_id
    AND NVL(TIME_END,
            NVL((
                  SELECT MIN(TIME_START)
                  FROM some_view WHERE MACHINE_ID = T1.MACHINE_ID
                  AND TIME_START > T1.TIME_START),
                TIME_START + 1/24)) > start_time_range
    AND TIME_START < end_time_range ORDER BY TIME_START

So this is a little trickier than described above as it deals with the edge case of when there is no next entry. In that case the current start time plus 1 hour is used.

This performs fine with small data sets, but when we had over a million rows in that view, the selection of one days data (~2700 rows) took 551 seconds (>9 minutes), and this was one of many large views, so the overall effect was a >30 minute wait.

I rewrote the query with the use of LEAD, and the performance went to 4 seconds (with data retrieval) or 0.6 seconds for a count(*) operation. Here’s that query now using LEAD.

SELECT *
FROM(SELECT other_fields,
       TIME_START,
       TIME_END,
       NVL( LEAD(TIME_START) OVER (PARTITION BY machine_id ORDER BY time_start),
            TIME_START + 1/24) AS "NEXT_TIME_START",
       FROM some_view T1
       WHERE MACHINE_ID = machine_id )
WHERE NEXT_TIME_START > start_time_range
  AND TIME_START < end_time_range
ORDER BY TIME_START

Now that I review the code I notice I also rearranged the code to not repeat the next_time_start calculation, so I am now not sure all the performance improvements can be attributed to LEAD, but I’ll take that ~138 times improvement either way.