Snowflake data warehouse date-range based joins: 45x speedup

I have been using http://www.snowflake.net for new data processing at work for a few months, and it's just amazing to be able to run large queries over large data sets, and to increase the cluster size when doing development work to get faster turnarounds without impacting the production cluster. Brilliant.

One thing I have noticed is slower than I would like: joins based on tableA.time falling inside the time range of tableB.start and tableB.end, when the time period spans months rather than days.

So the pattern, mapping a value from TABLE_I onto all rows in the time span (not including the end), looks like this:

CREATE OR REPLACE TEMPORARY TABLE WORKING_B AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P tp
JOIN TABLE_I i ON tp.u_id = i.u_id
AND tp.time >= i.start_time AND tp.time < i.end_time;

For one set of data spanning 3 months, the above takes 45 minutes on a small cluster, with TABLE_P at 65M rows and TABLE_I at 10M rows.
Whereas for a similar set spanning 4 days, at ~45M rows, it takes 30 seconds.

So I added TO_DATE(time) and TO_DATE(start_time) columns to the two tables, then added AND tp.time_day = i.start_time_day, and the first query dropped to ~60 seconds. But I was missing a few million rows, as my time ranges span multiple days…
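A sketch of that first attempt (the exact DDL isn't in the post; TABLE_P_B matches the table name used in the final query below, and start_time_day is my guess at the second column's name):

CREATE OR REPLACE TEMPORARY TABLE TABLE_P_B AS
SELECT tp.*, TO_DATE(tp.time) AS time_day
FROM TABLE_P tp;

CREATE OR REPLACE TEMPORARY TABLE TABLE_I_A AS
SELECT i.*, TO_DATE(i.start_time) AS start_time_day
FROM TABLE_I i;

-- fast, but only matches rows whose time falls on the range's first day
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P_B tp
JOIN TABLE_I_A i
  ON tp.u_id = i.u_id
 AND tp.time_day = i.start_time_day
 AND tp.time >= i.start_time AND tp.time < i.end_time;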

So I tried many things that didn't work (like trying to use a GENERATOR with dynamic input) and settled on a simple solution:

CREATE OR REPLACE TABLE TEN_YEARS_OF_DAYS(date) AS
SELECT DATEADD(day, rn - 1, DATEADD(month, 4, DATEADD(year, -10, CURRENT_DATE)))
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY 1) AS rn
    FROM TABLE(GENERATOR(rowCount => 365 * 10)) v
);

CREATE OR REPLACE FUNCTION get_dates_for_N_days ( start_date DATE, end_date DATE )
RETURNS TABLE (date DATE)
AS 'SELECT date FROM TEN_YEARS_OF_DAYS WHERE date BETWEEN start_date AND end_date';

So this creates a table with ten years of days (shifted 4 months into the future) and a table function that selects rows from it, so I can do a LATERAL join on that function:

CREATE OR REPLACE TEMPORARY TABLE TABLE_I_B AS
SELECT e.*, t.date as range_day_part
FROM TABLE_I e, LATERAL get_dates_for_N_days(TO_DATE(e.start_time), TO_DATE(e.end_time)) t;

So the above creates another temp table with one row per TABLE_I record per day of its time span. We now have more rows in the second table, but we can do a date-based match to speed up the query.
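As a quick sanity check, the function fans a span out to one row per day; for example, a three-day span ending today:

-- should return 3 rows: the day before yesterday, yesterday, and today
SELECT date
FROM TABLE(get_dates_for_N_days(CURRENT_DATE - 2, CURRENT_DATE));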

CREATE OR REPLACE TEMPORARY TABLE WORKING_B_B AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P_B tp
JOIN TABLE_I_B i
ON tp.u_id = i.u_id AND tp.time_day = i.range_day_part
AND tp.time >= i.start_time AND tp.time < i.end_time;

This code runs in 60 seconds and gives the same results as the 45-minute version.

Things to note: putting LATERAL table joins on selects with CTEs presently breaks the SQL parser; in fact, even nested selects and LATERAL don't mix, hence the extra tables with _B suffixes. Also, CTEs make life so much easier, but as you start joining to them a lot, performance slips. I have found that the point where I do a complex join is a good time to output to a temporary table, and then the performance is crazy again…

Snowflake is just a beast. I have a man crush…

When removing columns rebuild your views

If you remove a column from a table in your Oracle DB, and you have views that SELECT * FROM that table, you need to recreate those views.

I was comparing my old schema and new schema, and had all the correct column changes and index changes, but some views were still marked invalid and therefore failing to run.

A co-worker noticed (via the DDL) that the view was still referring to the dropped column.

I had assumed that the code received back from the DB was verbose (every selected item named) because the tools were lazy, but it makes sense once you think about how SELECT * gets compiled by the DB. It can catch you out when you are just comparing SQL text changes.
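A minimal illustration of the behaviour (hypothetical table and view names):

CREATE TABLE t (a NUMBER, b NUMBER);
CREATE VIEW v AS SELECT * FROM t;             -- Oracle stores the expanded list: SELECT a, b FROM t
ALTER TABLE t DROP COLUMN b;                  -- v is now INVALID, as it still references b
CREATE OR REPLACE VIEW v AS SELECT * FROM t;  -- recompile against the new column list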

Slow sub-select? Maybe LAG or LEAD might help

One of the queries in our application is used to edit event entries. Event entries have a start time and an end time, but when they are processed from the data files they can be entered with just a start time (i.e. when the event started); in cases of problems (crashes) these entries never get a matching end-time entry.

So in our editor we want to know the upper limit of when the event could have ended, which is the next start time.

ROWID  TIME_START  TIME_END  OTHER_FIELDS
1      2008-1-1    null      data
2      2008-1-2    2008-1-3  more data

So for row 1 we want 2008-1-2 as the suggested end time.

Originally we had a query that looked like this:

SELECT other_fields,
  TIME_START,
  TIME_END,
  NVL((SELECT MIN(TIME_START) FROM some_view
         WHERE MACHINE_ID = T1.MACHINE_ID
           AND TIME_START > T1.TIME_START),
       TIME_START + 1/24) "NEXT_TIME_START"
  FROM some_view T1
  WHERE MACHINE_ID = machine_id
    AND NVL(TIME_END,
            NVL((SELECT MIN(TIME_START)
                   FROM some_view WHERE MACHINE_ID = T1.MACHINE_ID
                   AND TIME_START > T1.TIME_START),
                TIME_START + 1/24)) > start_time_range
    AND TIME_START < end_time_range ORDER BY TIME_START

So this is a little trickier than described above, as it deals with the edge case where there is no next entry; in that case the current start time plus 1 hour is used. The lower-case names (machine_id, start_time_range, end_time_range) are variables supplied by the application.

This performs fine with small data sets, but once we had over a million rows in that view, selecting one day's data (~2700 rows) took 551 seconds (>9 minutes), and this was one of many large views, so the overall effect was a >30-minute wait.

I rewrote the query to use LEAD, and the performance went to 4 seconds (with data retrieval), or 0.6 seconds for a COUNT(*) operation. Here's that query now using LEAD:

SELECT *
    FROM(SELECT other_fields,
           TIME_START,
           TIME_END,
           NVL( LEAD(TIME_START) OVER (PARTITION BY machine_id ORDER BY time_start),
                TIME_START + 1/24) AS "NEXT_TIME_START"
           FROM some_view T1
           WHERE MACHINE_ID = machine_id )
    WHERE NEXT_TIME_START > start_time_range
      AND TIME_START < end_time_range
    ORDER BY TIME_START

Now that I review the code, I notice I also rearranged it so the NEXT_TIME_START calculation is not repeated, so I am not sure all the performance improvement can be attributed to LEAD, but I'll take that ~138x improvement either way.

Oracle’s Lag in MS SQL Server 2005

I am currently porting our new database from Oracle 10g to MS SQL Server 2005, and I have it all done except the views that use the Oracle LAG and LEAD functions (non-ANSI).

What these functions provide (for the MS SQL camp) is the ability to read values from the next or previous row of a sorted result. In my case I have a value that is the 'volume change since the start' at time intervals, and I want the relative change between each interval.

So the SQL of the Oracle view is:

SELECT t.*,
    t.volume - LAG(volume) OVER (PARTITION BY group_number  ORDER BY timestamp) AS volume_change
FROM volume_table t;

The PARTITION BY clause splits the data into different buckets; each bucket is then sorted, and all results are returned.
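For a concrete feel, a tiny worked example with made-up numbers:

-- for one group with volumes 10, 25, 40 on successive days,
-- LAG(volume) returns NULL, 10, 25, so volume_change is NULL, 15, 15
WITH v AS (
  SELECT 1 AS group_number, DATE '2008-01-01' AS ts, 10 AS volume FROM dual
  UNION ALL SELECT 1, DATE '2008-01-02', 25 FROM dual
  UNION ALL SELECT 1, DATE '2008-01-03', 40 FROM dual
)
SELECT ts, volume,
       volume - LAG(volume) OVER (PARTITION BY group_number ORDER BY ts) AS volume_change
FROM v;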

Asking on the NZ .Net User Group mailing list, I got a pointer to an MS feedback page, but the solution presented there gives me the error "Incorrect syntax near 'ROWS'." when I run this query against SQL 2K5:

SELECT MIN(volume) OVER(PARTITION BY group_number ORDER BY timestamp
                        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) change
FROM volume_table;
GO

I had a side point showing why I wanted to avoid a sub-select: a different query improved by orders of magnitude after changing to a LAG function, yet the same sub-select query runs just as fast as the "improved" Oracle statement does when run on MS SQL Server. So I'll stick to the main topic and post about that another day…

Chris recently showed how to use Common Table Expressions (CTEs) (a sort of auto-magic temp table) to find the first entries for a day, which is very close to what I wanted, but the filtering was hard-coded. I could not see how to make it dynamic, so I took the idea and started massaging the concept until I finally got what I wanted.

Conceptually, the Oracle solution could be done using cursors under the hood to provide the rolling previous (LAG) rows, whereas here I'm doing many look-ups, but the table is not getting re-created as in the nested-select method.

So my code is as follows:

WITH Rows (vol, time, rn, gn) AS (
    SELECT v.volume,
           v.timestamp,
           ROW_NUMBER() OVER (PARTITION BY group_number
                              ORDER BY timestamp),
           group_number
    FROM volume_table v
), PrevRows (timestamp, prev_vol, group_number) AS (
    SELECT a.time, b.vol, a.gn
    FROM Rows a
    LEFT JOIN Rows b ON a.rn = b.rn + 1 AND a.gn = b.gn
)
SELECT v.*, v.volume - p.prev_vol AS volume_change
FROM volume_table v
LEFT JOIN PrevRows p
    ON v.timestamp = p.timestamp
    AND v.group_number = p.group_number;
GO

So I use two CTEs: one to partition and sort the data, the second to do a lag-based join. Then I can select the lagged data by matching the time and group to the current entry. It works a treat, and I will do some performance testing tomorrow once my production data has finished loading into my DB.

Based on the results of that query I haven't discussed yet, I expect the sub-select version will be just as performant.
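For reference, my guess at the shape of such a correlated sub-select in SQL 2K5 (not the exact query, which I'll post another day):

-- emulate LAG with a correlated sub-select per row
SELECT v.*,
       v.volume - (SELECT TOP 1 b.volume
                   FROM volume_table b
                   WHERE b.group_number = v.group_number
                     AND b.timestamp < v.timestamp
                   ORDER BY b.timestamp DESC) AS volume_change
FROM volume_table v;
GO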

SELECT * FROM USER_ERRORS

The most important Oracle SQL statement for PL/SQL work right now is:

SELECT * FROM USER_ERRORS

I’ve been “developing” some new functions, and I was getting sick and tired of being told:

Warning: execution completed with warning
function MissingProd Compiled.

Now I can see what's wrong and fix it before reloading my complete schema.   Yay!
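For what it's worth, a slightly more targeted form using the standard dictionary columns:

-- show each error with its location, most useful columns only
SELECT name, type, line, position, text
FROM USER_ERRORS
ORDER BY name, sequence;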