Archive for the ‘SQL’ Category

Slow sub-select maybe LAG or LEAD might help

Thursday, August 28th, 2008

One of the queries in our application is used to edit event entries.  Event entries have a start time and an end time, but when they are processed from the data files, they can be entered with just a start time (ie when the event started), but in cases of problems (crashes) these entries will not have matching entries with end time.

So in our editor, we are wanting to know the upper limit of when the event could have ended, which is the next start time.

ROWID TIME_START TIME_END OTHER_FIELDS
1 2008-1-1 null data
2 2008-1-2 2008-1-3 more data

So for row 1 we want 2008-1-2 as the suggested end time.

Originally we had a query that looked like this:

SELECT other_fields,
  TIME_START,
  TIME_END,
  NVL((SELECT MIN(TIME_START)          
         FROM some_view
         WHERE MACHINE_ID = T1.MACHINE_ID
           AND TIME_START > T1.TIME_START),
       TIME_START + 1/24) "NEXT_TIME_START"
  FROM some_view T1
  WHERE MACHINE_ID = machine_id
    AND NVL(TIME_END,
            NVL((SELECT MIN(TIME_START)
                   FROM some_view WHERE MACHINE_ID = T1.MACHINE_ID
                   AND TIME_START > T1.TIME_START),
                TIME_START + 1/24)) > start_time_range
    AND TIME_START < end_time_range
  ORDER BY TIME_START

So this is a little trickier than described above as it deals with the edge case of when there is no next entry.  In that case the current start time plus 1 hour is used. I coloured all the optional/variables yellow.

This performs fine with small data sets, but when we had over a million rows in that view, the selection of one days data (~2700 rows) took 551 seconds (>9 minutes), and this was one of many large views, so the overall effect was a >30 minute wait.

I rewrote the query with the use of LEAD, and the performance went to 4 seconds (with data retrieval) or 0.6 seconds for a count(*) operation.  Here’s that query now using LEAD.

SELECT *
    FROM(SELECT other_fields,
           TIME_START,
           TIME_END,
           NVL( LEAD(TIME_START) OVER (PARTITION BY machine_id ORDER BY time_start),
                TIME_START + 1/24) AS "NEXT_TIME_START",
           FROM some_view T1
           WHERE MACHINE_ID = machine_id )
    WHERE NEXT_TIME_START > start_time_range
      AND TIME_START < end_time_range
    ORDER BY TIME_START

Now that I review the code I notice I also rearranged the code to not repeat the next_time_start calculation, so I am now not sure all the performance improvements can be attributed to LEAD, but I’ll take that ~138 times improvement either way.

Oracle’s Lag in MS SQL Server 2005

Monday, August 11th, 2008

I am currently porting our new database from Oracle 10g to MS SQL Server 2005, and I have it all done except the views that use the Oracle LAG and LEAD functions (non-ANSI).

What these functions provide (for the MS SQL camp) is the ability to get the next or previous rows when sorted. In my case I have a value that is the ‘volume change since the start’ at time intervals, and I want the relative change between each interval.

So PL/SQL of the view is:

SELECT t.*,
    t.volume - LAG(volume) OVER (PARTITION BY group_number 
                                 ORDER BY timestamp) AS volume_change
FROM volume_table t;

The partition clause splits the data into different buckets, then each bucket is sorted, with all results returned.

Asking on the NZ .Net User Group mailing list I got a pointer to this MS feedback page, but the solution presented there gives me an error “Incorrect syntax near ‘ROWS’.” when I run this query against SQL 2K5

SELECT MIN(volume) OVER(PARTITION BY group_number 
                        ORDER BY timestamp
                        ROWS BETWEEN 1 PRECEDING 
                        AND 1 PRECEDING) change
FROM volume_table;
GO

I had a side point showing why I wanted to avoid sub-select, as the performance of a different query had an orders of magnitude improvement from changing to using a LAG function, yet that same sub-select query runs just as fast as the “improved” Oracle statement in MS SQL Server, so I’ll just stick to the main topic, and post about that another day…

Chris recently showed how to use Common Table Expressions (CTE) (sort of auto-magic temp table) to find the first entries for a day, which is very close to what I was want, but the filtering is hard coded.  I could not see how to make it dynamic, so I used the idea, and started massaging the concept, till I finally got what I wanted.

Conceptually the Oracle solution could be done using cursors under the hood to provide the rolling previous (LAG) rows, where-as here I’m doing many look-ups but the table is not getting re-created as in the nested select method.

So my code is as follows:

WITH Rows( vol_diff, time, rn, gn ) AS
(
    SELECT v.volume,
        v.timestamp,
        Row_Number() OVER (PARTITION BY group_number
                           ORDER BY timestamp),
        group_number
    FROM volume_table v
),
PrevRows( timestamp, prev_vol, group_number) AS
(
    SELECT a.time, b.vol_diff, a.gn
    FROM Rows a
    LEFT JOIN Rows b 
        ON a.rn = b.rn + 1 
        AND a.gn = b.gn
)
SELECT v.*, v.volume - p.prev_vol as volume_change
FROM volume_table v
LEFT JOIN PrevRows p
    ON v.timestamp = p.timestamp
    AND v.group_number = p.group_number;
GO

So I use two CTE tables, one to partition and sort the data, the second to do a lag based join, then I can select the lagged based data, by matching the time and group to the current entry.It works a treat, and I will do some performance testing tomorrow once my production data has finished loading into my db.

After the results of the not discussed query I expect that the sub-select will be just as performant.

SELECT * FROM USER_ERRORS

Monday, December 3rd, 2007

The most important Oracle pl/sql right now is:

SELECT * FROM USER_ERRORS

I’ve been “developing” some new functions, and I was getting sick and tired of being told:

Warning: execution completed with warning
function MissingProd Compiled.

Now I can know what’s wrong, and fix it, before reloading my complete scheme.   Yay!

Oracle SQL Developer

Monday, August 6th, 2007

Having been developing my PL/SQL in Toad Data Modeler, and running via a customer VB program, the debugging effort has been poor at best. A co-worker pointed me to Oracle SQL Developer, it’s in the league as MS SQL Server Management Studio, but for Oracle!

The other point I’ve been tripping on, in PL/SQL to use a date (via SQL*Plus) to_date(’2007-08-06′,’YYYY-MM-DD’) is the trick.

PL/SQL cryptic error messages

Thursday, May 17th, 2007

I have just found (oh thank you Google and the Oracle forums) the solution to my PL/SQL error

CA_ADD_DESIGN_1 (28,5): PL/SQL: ORA-01747: invalid user.table.column, table.column, or column specification

My first problem was which line does the 28 refer to, as I’m editing one large PL/SQL file, it appears to count from the line before the

CREATE OR REPLACE
PROCEDURE ca_add_design_1(

But the real problem was what did the error mean.

The third Google result had a Oracle forum post, so I should have gone there first, but I didn’t. The first two sites are just AdSense whore sites, in fact 6 of the first 10 results are AdSense whores. It makes me sad that these sites are there, as the information contain was worthless, and been top Google results means they will get a lot of hits. GRRR and for bonus points these results were from google.co.nz, using google.com the useful answer is in 5th place.

Anyway the Oracle forum answer said “you have an extra comma after your lass SET clause before the WHERE” and low and behold I did.

    UPDATE  ca_p_design
    SET     design_type_id=NVL(pDesignType,design_type_id),
            session_id=NVL(SessionID,session_id),
            time_end=NVL(pTimeEnd,time_end),
    WHERE   design_id=ID;

So now my Update looks like:

    UPDATE  ca_p_design
    SET     design_type_id=NVL(pDesignType,design_type_id),
            session_id=NVL(SessionID,session_id),
            time_end=NVL(pTimeEnd,time_end)
    WHERE   design_id=ID;

and it works! I had spent ages triple checking each of my column names, and it turned out to be the column I “didn’t mention” that was upsetting it.