I have been using http://www.snowflake.net for a new data-processing project at work for a few months, and it's just amazing: being able to run large queries over large data-sets, and to increase the cluster size during development work for faster turnarounds without impacting the production cluster. Brilliant.
One thing I have noticed is slower than I would like: joins based on tableA.time falling inside a time range defined by tableB.start and tableB.end, when the time period spans months rather than days.
So the pattern is mapping a value from TABLE_I onto all TABLE_P rows inside the time span (not including the end):
CREATE OR REPLACE TEMPORARY TABLE WORKING_B AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P tp
JOIN TABLE_I i
    ON tp.u_id = i.u_id
    AND tp.time >= i.start_time
    AND tp.time < i.end_time;
For one set of data spanning 3 months, the above takes 45 minutes on a small cluster (TABLE_P: 65M rows, TABLE_I: 10M rows). Whereas for a similar set spanning 4 days, at ~45M rows, it takes 30 seconds.
So I added TO_DATE(time) and TO_DATE(start_time) columns to the two tables, then added
AND tp.time_day = i.start_time_day to the join, and the first query dropped to ~60 seconds. But I was missing a few million rows, as my time ranges span multiple days…
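As a sketch, the day-column attempt looked roughly like this (TABLE_I_D is an illustrative name for the day-augmented copy of TABLE_I; the original staging table names are not shown above):

CREATE OR REPLACE TEMPORARY TABLE TABLE_P_B AS
SELECT tp.*, TO_DATE(tp.time) AS time_day
FROM TABLE_P tp;

-- hypothetical name for the day-augmented TABLE_I
CREATE OR REPLACE TEMPORARY TABLE TABLE_I_D AS
SELECT i.*, TO_DATE(i.start_time) AS start_time_day
FROM TABLE_I i;

-- fast, but misses rows where tp.time falls on a later day of a multi-day range
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P_B tp
JOIN TABLE_I_D i
    ON tp.u_id = i.u_id
    AND tp.time_day = i.start_time_day
    AND tp.time >= i.start_time
    AND tp.time < i.end_time;

The equality on the day columns gives the optimizer an exact-match join key, which is why it is so much faster than the pure range predicate, but it only matches rows landing on the range's first day.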
So I did many things that didn't work (like trying to use a GENERATOR with dynamic input) and settled on a simple solution:
CREATE OR REPLACE TABLE TEN_YEARS_OF_DAYS(date) AS
SELECT DATEADD(day, rn - 1, DATEADD(months, 4, DATEADD(years, -10, CURRENT_DATE)))
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY 1) AS rn
    FROM TABLE(GENERATOR(rowCount => 365 * 10)) v
);

CREATE OR REPLACE FUNCTION get_dates_for_N_days(start_date DATE, end_date DATE)
RETURNS TABLE (date DATE)
AS 'SELECT date FROM TEN_YEARS_OF_DAYS WHERE date BETWEEN start_date AND end_date';
So this creates a table with ten years of days (shifted 4 months into the future) and a table function that selects rows from it, so I can do a LATERAL join against that function.
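For a quick sanity check, the table function can also be called on its own via TABLE() (the literal dates here are just illustrative):

SELECT date
FROM TABLE(get_dates_for_N_days('2016-01-01'::DATE, '2016-01-05'::DATE));

This should return one row per day in the inclusive range, which is exactly what gets fanned out per TABLE_I record in the lateral join below.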
CREATE OR REPLACE TEMPORARY TABLE TABLE_I_B AS
SELECT e.*, t.date AS range_day_part
FROM TABLE_I e,
LATERAL get_dates_for_N_days(TO_DATE(e.start_time), TO_DATE(e.end_time)) t;
So the above code creates another temp table with a duplicate row per TABLE_I record for every day in its range. We now have more rows in the second table, but we can do a date-based match to speed up the query.
CREATE OR REPLACE TEMPORARY TABLE WORKING_B_B AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P_B tp
JOIN TABLE_I_B i
    ON tp.u_id = i.u_id
    AND tp.time_day = i.range_day_part
    AND tp.time >= i.start_time
    AND tp.time < i.end_time;
This code runs in 60 seconds and gives the same results as the 45-minute version.
Things to note: putting LATERAL table joins on selects with CTEs presently breaks the SQL parser; in fact, even nested selects and LATERAL don't mix, hence the extra tables with the _B suffix. Also, CTEs make life so much easier, but as you start joining to them a lot, performance slips. I have found that the point where I do a complex join is a good time to write out to a temporary table, and then the performance is again crazy…
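As a sketch of that CTE-to-temp-table pattern (the CTE body and names here are illustrative, not from the queries above): instead of joining to the same heavy CTE several times,

WITH enriched AS (
    SELECT tp.u_id, tp.time, i.value
    FROM TABLE_P tp
    JOIN TABLE_I i ON tp.u_id = i.u_id
)
SELECT a.u_id, a.value, b.value
FROM enriched a
JOIN enriched b
    ON a.u_id = b.u_id AND a.time < b.time;

materialize it once and join to the result:

CREATE OR REPLACE TEMPORARY TABLE ENRICHED AS
SELECT tp.u_id, tp.time, i.value
FROM TABLE_P tp
JOIN TABLE_I i ON tp.u_id = i.u_id;

SELECT a.u_id, a.value, b.value
FROM ENRICHED a
JOIN ENRICHED b
    ON a.u_id = b.u_id AND a.time < b.time;

The temporary table costs one extra write, but each subsequent join then scans a plain materialized table instead of re-evaluating the CTE's join.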
Snowflake is just a beast. I have a man crush…