I have been using http://www.snowflake.net for a new data-processing project at work for a few months, and it's just amazing: being able to run large queries over large data-sets, and to increase the cluster size during development work for faster turnarounds without impacting the production cluster. Brilliant.
One thing I have noticed is slower than I would like: joins based on tableA.time falling inside a time range defined by tableB.start and tableB.end, when the time period spans months rather than days.
So the pattern is mapping a value from TABLE_I onto all TABLE_P rows inside the time span (not including the end):
CREATE OR REPLACE TEMPORARY TABLE WORKING_B AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P tp
JOIN TABLE_I i
    ON tp.u_id = i.u_id
    AND tp.time >= i.start_time
    AND tp.time < i.end_time;
For one set of data spanning 3 months, the above takes 45 minutes on a small cluster (TABLE_P: 65M rows, TABLE_I: 10M rows). Whereas for a similar set spanning 4 days, at ~45M rows, it takes 30 seconds.
So I added TO_DATE(time) and TO_DATE(start_time) columns to the two tables, then added
AND tp.time_day = i.start_time_day to the join, and the first query dropped to ~60 seconds. But I was missing a few million rows, as my time ranges span multiple days…
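As a sketch, the day-column attempt looked roughly like this (TABLE_I_D is an illustrative name for the day-augmented copy of TABLE_I; the original staging table names are not shown above):

CREATE OR REPLACE TEMPORARY TABLE TABLE_P_B AS
SELECT tp.*, TO_DATE(tp.time) AS time_day
FROM TABLE_P tp;

-- hypothetical name for the day-augmented TABLE_I
CREATE OR REPLACE TEMPORARY TABLE TABLE_I_D AS
SELECT i.*, TO_DATE(i.start_time) AS start_time_day
FROM TABLE_I i;

-- fast, but misses rows where tp.time falls on a later day of a multi-day range
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P_B tp
JOIN TABLE_I_D i
    ON tp.u_id = i.u_id
    AND tp.time_day = i.start_time_day
    AND tp.time >= i.start_time
    AND tp.time < i.end_time;

The equality on the day columns gives the optimizer an exact-match join key, which is why it is so much faster than the pure range predicate, but it only matches rows landing on the range's first day.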
So I did many things that didn't work (like trying to use a GENERATOR with dynamic input) and settled on a simple solution:
CREATE OR REPLACE TABLE TEN_YEARS_OF_DAYS(date) AS
SELECT DATEADD(day, rn - 1, DATEADD(months, 4, DATEADD(years, -10, CURRENT_DATE)))
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY 1) AS rn
    FROM TABLE(GENERATOR(rowCount => 365 * 10)) v
);

CREATE OR REPLACE FUNCTION get_dates_for_N_days(start_date DATE, end_date DATE)
RETURNS TABLE (date DATE)
AS 'SELECT date FROM TEN_YEARS_OF_DAYS WHERE date BETWEEN start_date AND end_date';
So this creates a table with ten years of days (shifted 4 months into the future) and a table function that selects rows from it, so I can do a LATERAL join against that function.
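For a quick sanity check, the table function can also be called on its own via TABLE() (the literal dates here are just illustrative):

SELECT date
FROM TABLE(get_dates_for_N_days('2016-01-01'::DATE, '2016-01-05'::DATE));

This should return one row per day in the inclusive range, which is exactly what gets fanned out per TABLE_I record in the lateral join below.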
CREATE OR REPLACE TEMPORARY TABLE TABLE_I_B AS
SELECT e.*, t.date AS range_day_part
FROM TABLE_I e,
LATERAL get_dates_for_N_days(TO_DATE(e.start_time), TO_DATE(e.end_time)) t;
So the above code creates another temp table with a duplicate row per TABLE_I record for every day in its range. We now have more rows in the second table, but we can do a date-based match to speed up the query.
CREATE OR REPLACE TEMPORARY TABLE WORKING_B_B AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P_B tp
JOIN TABLE_I_B i
    ON tp.u_id = i.u_id
    AND tp.time_day = i.range_day_part
    AND tp.time >= i.start_time
    AND tp.time < i.end_time;
This code runs in 60 seconds and gives the same results as the 45-minute version.
Things to note: putting LATERAL table joins on selects with CTEs presently breaks the SQL parser; in fact, even nested selects and LATERAL don't mix, hence the extra tables with the _B suffix. Also, CTEs make life so much easier, but as you start joining to them a lot, performance slips. I have found that the point where I do a complex join is a good time to write out to a temporary table, and then the performance is again crazy…
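As a sketch of that CTE-to-temp-table pattern (the CTE body and names here are illustrative, not from the queries above): instead of joining to the same heavy CTE several times,

WITH enriched AS (
    SELECT tp.u_id, tp.time, i.value
    FROM TABLE_P tp
    JOIN TABLE_I i ON tp.u_id = i.u_id
)
SELECT a.u_id, a.value, b.value
FROM enriched a
JOIN enriched b
    ON a.u_id = b.u_id AND a.time < b.time;

materialize it once and join to the result:

CREATE OR REPLACE TEMPORARY TABLE ENRICHED AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P tp
JOIN TABLE_I i ON tp.u_id = i.u_id;

SELECT a.u_id, a.value, b.value
FROM ENRICHED a
JOIN ENRICHED b
    ON a.u_id = b.u_id AND a.time < b.time;

The temporary table costs one extra write, but each subsequent join then scans a plain materialized table instead of re-evaluating the CTE's join.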
Snowflake is just a beast. I have a man crush…