2023-11-22

SQL overlapping time ranges

There was a nice stack overflow question today, on one of my favorite SQL themes of overlapping date ranges.

I was quite pleased with my “find distinct’s, join, and pick the winners per block” code:

with data (ITEM_ID, PRICE, START_DATE, END_DATE, LOADED_DATETIME) as (
    select * from values
    ('A', 1.00,	'2023-01-01'::date, '2023-01-31'::date, '2023-11-01'::date),
    ('A', 2.00,	'2023-01-10'::date, '2023-01-15'::date, '2023-11-02'::date),
    ('A', 3.00,	'2023-01-14'::date, '2023-01-17'::date, '2023-11-03'::date),
    ('A', 4.00,	'2023-01-15'::date, '2023-01-18'::date, '2023-11-04'::date),
    ('B', 4.00,	'2023-01-10'::date, '2023-01-31'::date, '2023-11-01'::date)
), dist_dates as (
    select distinct item_id, start_date from data
    union 
    select distinct item_id, end_date from data 
), id_date_ranges as (
    select 
        item_id, 
        start_date, 
        lead(start_date) over (partition by item_id 
            order by start_date) as end_date
    from dist_dates
    qualify end_date is not null
)
select 
    dr.*,
    d.price,
    d.loaded_datetime
from id_date_ranges as dr
join data as d
    on d.item_id = dr.item_id 
        and dr.start_date < d.end_date
        and d.start_date < dr.end_date
qualify row_number() over (partition by dr.item_id, dr.start_date 
    order by d.loaded_datetime desc) = 1
order by 1,2;

But if you use highly overlapping data like the above, you get results like:

This could be worked around with a blind merge, but that might merge rows that appear the same but where not part of the same event/streak to be merged. The workaround is to allocate a sequence value to each row, before chopping, and then after the join, only merges rows of the same source id.

so using SQE8() and CONDITIONAL_CHANGE_EVENT we get:

with data (ITEM_ID, PRICE, START_DATE, END_DATE, LOADED_DATETIME) as (
    select * from values
    ('A', 1.00,	'2023-01-01'::date, '2023-01-31'::date, '2023-11-01'::date),
    ('A', 2.00,	'2023-01-10'::date, '2023-01-15'::date, '2023-11-02'::date),
    ('A', 3.00,	'2023-01-14'::date, '2023-01-17'::date, '2023-11-03'::date),
    ('A', 4.00,	'2023-01-15'::date, '2023-01-18'::date, '2023-11-04'::date),
    ('B', 4.00,	'2023-01-10'::date, '2023-01-31'::date, '2023-11-01'::date)
), sdata as (
    select 
        *, 
        seq8() as seq_id 
    from data
), dist_dates as (
    select distinct item_id, start_date from data
    union 
    select distinct item_id, end_date from data 
), id_date_ranges as (
    select 
        item_id, 
        start_date, 
        lead(start_date) over (partition by item_id 
            order by start_date) as end_date
    from dist_dates
    qualify end_date is not null
), over_sliced as (
    select 
        dr.*,
        d.price,
        d.loaded_datetime
        ,d.seq_id
    from id_date_ranges as dr
    join sdata as d
        on d.item_id = dr.item_id 
            and dr.start_date < d.end_date
            and d.start_date < dr.end_date
    qualify row_number() over (partition by dr.item_id, dr.start_date 
        order by d.loaded_datetime desc) = 1
)
select
    *
    ,conditional_change_event(seq_id) over (partition by item_id 
        order by start_date) as cce
from over_sliced
order by 1,2;

Row 4 & 5 identified as from the same source event

So now we can min/max those dates, to get a single row back:

with data (ITEM_ID, PRICE, START_DATE, END_DATE, LOADED_DATETIME) as (
    select * from values
    ('A', 1.00,	'2023-01-01'::date, '2023-01-31'::date, '2023-11-01'::date),
    ('A', 2.00,	'2023-01-10'::date, '2023-01-15'::date, '2023-11-02'::date),
    ('A', 3.00,	'2023-01-14'::date, '2023-01-17'::date, '2023-11-03'::date),
    ('A', 4.00,	'2023-01-15'::date, '2023-01-18'::date, '2023-11-04'::date),
    ('B', 4.00,	'2023-01-10'::date, '2023-01-31'::date, '2023-11-01'::date)
), sdata as (
    select 
        *, 
        seq8() as seq_id 
    from data
), dist_dates as (
    select distinct item_id, start_date from data
    union 
    select distinct item_id, end_date from data 
), id_date_ranges as (
    select 
        item_id, 
        start_date, 
        lead(start_date) over (partition by item_id 
            order by start_date) as end_date
    from dist_dates
    qualify end_date is not null
), over_sliced as (
    select 
        dr.*,
        d.price,
        d.loaded_datetime
        ,d.seq_id
    from id_date_ranges as dr
    join sdata as d
        on d.item_id = dr.item_id 
            and dr.start_date < d.end_date
            and d.start_date < dr.end_date
    qualify row_number() over (partition by dr.item_id, dr.start_date 
        order by d.loaded_datetime desc) = 1
), slices_grouped as (
    select
        *
        ,conditional_change_event(seq_id) over (partition by item_id 
            order by start_date) as cce
    from over_sliced
)
select 
    item_id
    ,min(start_date) as start_date
    ,max(end_date) as end_date
    ,any_value(price) as price
    ,any_value(loaded_datetime) as loaded_datetime
from slices_grouped
group by item_id, cce
order by 1,2;

2023-11-20

nikonhacker.com update

This post is just to let people know, we know nikonhacker.com is down. Max has been hosting the site since 2011, and it was a wonderful resource.

But it was running on phpBB and over time it became a security problem, this is why for years you could not post new content, as the compatible version of php (5.x) was no longer safe to leave on the internet. The effort to get it all working felt like it was more than my interest in hacking camera’s.

Anyways, changes by Max’s hosting service provider now mean those hosts do not exist, and thus it was agreed due to the read-only nature that a redirect to archive.org would be an acceptable level of effort/value trade off.

https://web.archive.org/web/20230925033615/https://nikonhacker.com/

2023-10-31

Linearization of Values

In my old job at Telogis, we had the fun task of adding up vehicle fuel, and distance, to calculate things like fuel efficiency.

So we might have some data that looks like:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-30 10:00:00', '10.0', '0.5'),
        (1, '2023-10-30 11:00:00', '5.0', '0.3')
)

and from that we would break it into days, and sum it up, like this sort-of:

select vehicle_id
    ,time_ts::date as day
    ,sum(dist_delta_km) as dist_km
    ,sum(fuel_delta_lt) as fuel_lt
    ,round(dist_km / fuel_lt, 2) as mpg_wink
from data
group by 1,2
order by 1,2;

VEHICLE_ID	DAY	DIST_KM	FUEL_LT	MPG_WINK*
1	2023-10-30	15	0.8	18.75

*Yes, that is not mpg, but I do not like the lt/km metric, and mpg is just the word I will use.

And sure enough, for a large fleet looking at a big picture of a week, this makes sense.

But if your data is unaligned, aka you get distance on most data rows, but only get fuel intermittently, over time you can get things like.

(1, '2023-10-29 20:00:00', '10.0', null),
(1, '2023-10-30 00:01:00', null, '5.0'),
(1, '2023-10-30 00:20:00', '25.0', null),
(1, '2023-10-30 00:21:00', null, '1.0')

So this data we have a 8pm distance, just after midnight we get a fuel reading, later some more distance and a smaller fuel change. Now this is a little odd, as the data is shown in delta (change) form

where the gold standard of data in my opinion is absolute (aka the odometer style) as if data is missing the change is still calculable. With this form there is an implicit “last fuel reading” that the after midnight value is “since”.

But all those nit-picking details aside, the core point here is that, “the fuel mostly belongs to the 29th not the 30th”. So how might we share the love?

We linearize it!

So if we mash a zero fuel reading into the delta data, to give us a “when is this data since”, like so:

(1, '2023-10-29 19:59:00', null, '0.0'),
(1, '2023-10-29 20:00:00', '10.0', null),
(1, '2023-10-30 00:01:00', null, '5.0'),
(1, '2023-10-30 00:20:00', '25.0', null),
(1, '2023-10-30 00:21:00', null, '1.0')

We currently get a division by zero error.. sigh, so lets make that safe with DIV0

,round(div0(dist_km, fuel_lt), 2) as mpg_wink

VEHICLE_ID	DAY	DIST_KM	FUEL_LT	MPG_WINK
1	2023-10-29	10	0	0
1	2023-10-30	25	6	4.17

So we can find the prior time of the fuel reading, and then slice that over days I will limit to 4 as here.

So here we are going to use NVL2 to conditionally do stuff if the current row (and the previous row’s fuel) are not null:

select
    vehicle_id
    ,time_ts
    ,fuel_delta_lt
    ,nvl2(fuel_delta_lt,
        lag(fuel_delta_lt) ignore nulls over (partition by vehicle_id order by time_ts) 
        ,null) as prior_fuel_delta
    ,nvl2(fuel_delta_lt,
        lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
        ,null) as prior_fuel_time
from data
order by 1,2;

for the above data we get:

VEHICLE_ID	TIME_TS	FUEL_DELTA_LT	PRIOR_FUEL_DELTA	PRIOR_FUEL_TIME
1	2023-10-29 19:59:00.000	0	null	null
1	2023-10-29 20:00:00.000	null	null	null
1	2023-10-30 00:01:00.000	5	0	2023-10-29 19:59:00.000
1	2023-10-30 00:20:00.000	null	null	null
1	2023-10-30 00:21:00.000	1	5	2023-10-30 00:01:00.000

so we can see that for rows that “have fuel” we find the prior rows values of the fuel, and of the time for that same row. This also could be done with a OBJECT_CONSTRUCT, and then one LAG on that, and then a pull apart, and for more values, that can be more performant, as always test, and re-test later.

So we now have the prior time, so we can now slice that to days, and given we are going to deal with “overnight” splits, we might as well fully chunk this data into N-days chunks.

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(fuel_delta_lt) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_delta
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
)
select pf.*
    ,fr.*
    ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
from prior_fuel as pf
left join max_fuel_range as fr
    on prior_fuel_time is not null
qualify 
   (pf.prior_fuel_time is null OR
    p_days < pf.time_ts)
order by 1,2;

So I am not loving that OR but for hand waving “something like this” purposes, we now have our data with days splits available:

VEHICLE_ID	TIME_TS	FUEL_DELTA_LT	PRIOR_FUEL_DELTA	PRIOR_FUEL_TIME	RN	P_DAYS
1	2023-10-29 19:59:00.000	0	null	null	null	null
1	2023-10-29 20:00:00.000	null	null	null	null	null
1	2023-10-30 00:01:00.000	5	0	2023-10-29 19:59:00.000	0	2023-10-29 00:00:00.000
1	2023-10-30 00:01:00.000	5	0	2023-10-29 19:59:00.000	1	2023-10-30 00:00:00.000
1	2023-10-30 00:20:00.000	null	null	null	null	null
1	2023-10-30 00:21:00.000	1	5	2023-10-30 00:01:00.000	0	2023-10-30 00:00:00.000

So now we can LEAST/GREATEST the times to correctly time the span, and thus get the span duration, and the record duration, and thus linearize that fuel:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0'),
        (2, '2023-10-28 23:00:00', null, '0.0'),
        (2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
)
select pf.*
    ,fr.*
    ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    ,greatest(pf.prior_fuel_time, p_days) as span_start
    ,least(dateadd('day',1,p_days), pf.time_ts) as span_end
    ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start))/3600 as span_len
    ,(date_part('epoch_second', pf.time_ts) - date_part('epoch_second', pf.prior_fuel_time))/3600 as record_len
    ,nvl2(pf.fuel_delta_lt, round(div0(fuel_delta_lt, record_len) * span_len, 3), null ) as linearize_fuel
from prior_fuel as pf
left join max_fuel_range as fr
    on prior_fuel_time is not null
qualify 
   (pf.prior_fuel_time is null OR
    p_days < pf.time_ts)
order by 1,2;

VEHICLE_ID	TIME_TS	FUEL_DELTA_LT	PRIOR_FUEL_TIME	RN	P_DAYS	SPAN_START	SPAN_END	SPAN_LEN	RECORD_LEN	LINEARIZE_FUEL
1	2023-10-29 19:59:00.000	0
1	2023-10-29 20:00:00.000
1	2023-10-30 00:01:00.000	5	2023-10-29 19:59:00.000	0	2023-10-29 00:00:00.000	2023-10-29 19:59:00.000	2023-10-30 00:00:00.000	4.016667	4.033333	4.979
1	2023-10-30 00:01:00.000	5	2023-10-29 19:59:00.000	1	2023-10-30 00:00:00.000	2023-10-30 00:00:00.000	2023-10-30 00:01:00.000	0.016667	4.033333	0.021
1	2023-10-30 00:20:00.000
1	2023-10-30 00:21:00.000	1	2023-10-30 00:01:00.000	0	2023-10-30 00:00:00.000	2023-10-30 00:01:00.000	2023-10-30 00:21:00.000	0.333333	0.333333	1
2	2023-10-28 23:00:00.000	0
2	2023-10-31 04:00:00.000	53	2023-10-28 23:00:00.000	0	2023-10-28 00:00:00.000	2023-10-28 23:00:00.000	2023-10-29 00:00:00.000	1	53	1
2	2023-10-31 04:00:00.000	53	2023-10-28 23:00:00.000	1	2023-10-29 00:00:00.000	2023-10-29 00:00:00.000	2023-10-30 00:00:00.000	24	53	24
2	2023-10-31 04:00:00.000	53	2023-10-28 23:00:00.000	2	2023-10-30 00:00:00.000	2023-10-30 00:00:00.000	2023-10-31 00:00:00.000	24	53	24
2	2023-10-31 04:00:00.000	53	2023-10-28 23:00:00.000	3	2023-10-31 00:00:00.000	2023-10-31 00:00:00.000	2023-10-31 04:00:00.000	4	53	4

yippie!

Which should be jiggled around a little more, and that linearization should have the multiplication prior to the division for numerical stability.

so should look more like:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0'),
        (2, '2023-10-28 23:00:00', null, '0.0'),
        (2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
), fuel_spans as (
    select pf.*
        ,fr.*
        ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    from prior_fuel as pf
    left join max_fuel_range as fr
        on prior_fuel_time is not null
    qualify 
       (pf.prior_fuel_time is null OR
        p_days < pf.time_ts)
)
select fs.* exclude (prior_fuel_time, rn, p_days)
    ,greatest(fs.prior_fuel_time, fs.p_days) as span_start
    ,least(dateadd('day',1,fs.p_days), fs.time_ts) as span_end
    ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start)) as span_len
    ,(date_part('epoch_second', fs.time_ts) - date_part('epoch_second', fs.prior_fuel_time)) as record_len
    ,nvl2(fs.fuel_delta_lt, round(div0(fuel_delta_lt * span_len, record_len),4), null ) as linearize_fuel
from fuel_spans as fs
order by 1,2, span_start;

giving:

VEHICLE_ID	TIME_TS	FUEL_DELTA_LT	SPAN_START	SPAN_END	SPAN_LEN	RECORD_LEN	LINEARIZE_FUEL
1	2023-10-29 19:59:00.000	0
1	2023-10-29 20:00:00.000
1	2023-10-30 00:01:00.000	5	2023-10-29 19:59:00.000	2023-10-30 00:00:00.000	14,460	14,520	4.9793
1	2023-10-30 00:01:00.000	5	2023-10-30 00:00:00.000	2023-10-30 00:01:00.000	60	14,520	0.0207
1	2023-10-30 00:20:00.000
1	2023-10-30 00:21:00.000	1	2023-10-30 00:01:00.000	2023-10-30 00:21:00.000	1,200	1,200	1
2	2023-10-28 23:00:00.000	0
2	2023-10-31 04:00:00.000	53	2023-10-28 23:00:00.000	2023-10-29 00:00:00.000	3,600	190,800	1
2	2023-10-31 04:00:00.000	53	2023-10-29 00:00:00.000	2023-10-30 00:00:00.000	86,400	190,800	24
2	2023-10-31 04:00:00.000	53	2023-10-30 00:00:00.000	2023-10-31 00:00:00.000	86,400	190,800	24
2	2023-10-31 04:00:00.000	53	2023-10-31 00:00:00.000	2023-10-31 04:00:00.000	14,400	190,800	4

So here we have “one signal” fuel, stretched over time linearly, so that fuel used yesterday can be counted against “yesterday”. So bringing the original sum code back to life, and keeping the OG code, and the linear fuel:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0')--,
        --(2, '2023-10-28 23:00:00', null, '0.0'),
        --(2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,dist_delta_km
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
), fuel_spans as (
    select pf.*
        ,fr.*
        ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    from prior_fuel as pf
    left join max_fuel_range as fr
        on prior_fuel_time is not null
    qualify 
       (pf.prior_fuel_time is null OR
        p_days < pf.time_ts)
), fuel_linearized as (
    select fs.* exclude (prior_fuel_time, rn, p_days)
        ,greatest(fs.prior_fuel_time, fs.p_days) as span_start
        ,least(dateadd('day',1,fs.p_days), fs.time_ts) as span_end
        ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start)) as span_len
        ,(date_part('epoch_second', fs.time_ts) - date_part('epoch_second', fs.prior_fuel_time)) as record_len
        ,nvl2(fs.fuel_delta_lt, round(div0(fuel_delta_lt * span_len, record_len),4), null ) as linearize_fuel
    from fuel_spans as fs
)
select 
    vehicle_id
    ,nvl(span_start, time_ts)::date as day
    ,sum(dist_delta_km) as dist_km
    ,sum(iff(span_end = time_ts, fuel_delta_lt, null)) as og_fuel_lt
    ,round(div0(dist_km, og_fuel_lt), 2) as og_mpg
    ,sum(linearize_fuel) as lin_fuel_lt
    ,round(div0(dist_km, lin_fuel_lt), 2) as lin_mpg    
from fuel_linearized as fl
group by 1,2
order by 1,2;

VEHICLE_ID	DAY	DIST_KM	OG_FUEL_LT	OG_MPG	LIN_FUEL_LT	LIN_MPG
1	2023-10-29	10	null	null	4.9793	2.01
1	2023-10-30	25	6	4.17	1.0207	24.49

So we should linearize the distance also… which is a rinse and repeat of the above.

But the next thing to consider is “how much fuel is spent while the vehicle is off”, which is to say, if the vehicle does report “anything” while off, and is off for 1 day, so like that vehicle 2 data lets say, but we also know it was off, we do not want to treat that time equally. This is Weighted Linearization, and I will post on that next.

2022-09-07

A Sequal to SQL - perhaps not Malloy

So I was reading Hacker News today and came across a article A sequel to SQL? An intro to Malloy, in which a near the top comment was

Try writing this SQL. Malloy writes SQL you can’t (I’m pretty sure of it).
https://twitter.com/lloydtabb/status/1556287859601985536

and at the time, there was no SQL responces written.. so as I read the Twitter thread, at no point was my brain like “this is too hard in SQL” and was rather dulled when I got to the auto-generated SQL,

with d1 as (
    select 
        state
        ,name
        ,sum(number) as births
    from test.public.use_1910_current
    where year > 1990
    group by 1,2
), name_pop as (
    select
        name
        ,state
        ,round((births*sum(births)over())/(sum(births)over(partition by state)*sum(births)over(partition by name)),3) as popularity_ratio
    from d1
    qualify births > 1000
), top_ten as (
    select *, name||'('||popularity_ratio||')' as formatted 
    from name_pop
    qualify row_number() over (partition by state order by popularity_ratio desc) <= 10
)
select state
    ,listagg(formatted, ', ') within group(order by popularity_ratio desc) as by_name_list_detail
from top_ten
group by 1
order by 1 limit 6;

STATE	BY_NAME_LIST_DETAIL
AK	James(1.727), Robert(1.627), Logan(1.611), Hannah(1.551), Samuel(1.541), Tyler(1.518), Benjamin(1.484), John(1.466), Ethan(1.453), William(1.443)
AL	Brantley(3.458), Braxton(3.185), Mary(3.060), Raven(2.734), Chandler(2.621), Allie(2.601), Ryleigh(2.573), Kameron(2.554), Jaylen(2.352), Whitney(2.336)
AR	Braxton(3.054), Dalton(2.850), Bentley(2.725), Hayden(2.430), Dakota(2.375), Clayton(2.355), Dustin(2.324), Brooklyn(2.137), Shelby(2.089), Payton(2.066)
AZ	Notnamed(47.491), Ramon(3.489), Tatum(3.408), Jesus(3.397), Francisco(3.145), Ernesto(2.970), Manuel(2.911), Adriel(2.852), Ruben(2.605), Ximena(2.598)
CA	Azucena(4.215), Xitlali(4.180), Joseluis(4.052), Xochitl(4.045), Miguelangel(3.894), Jocelyne(3.850), Lupita(3.839), Viridiana(3.791), Juancarlos(3.738), Arman(3.717)
CO	Aspen(5.148), Angelo(2.248), Quinn(1.879), Ezekiel(1.837), Rowan(1.833), Dillon(1.817), Everett(1.797), Emerson(1.752), Elias(1.688), Tanner(1.681)

Anyways getting the data file from the github Malloy project

Loading parquet data files into Snowflake via the legacy WebUI

Making a new table with the data pulled from the variant column:

create or replace table test.public.use_1910_current as
    select 
        raw:state::text as state
        ,raw:name::text as name
        ,raw:number::int as number
        ,raw:year::int as year
    from test.public.name_data;

At some point doing fixed point math of integers runs into the default 6dp default for division in Snowflake, thus you should rewrite the fraction of fractions as is reciprocal and multiplied form because 106.250 is not 98.857, and even if you do swap to doubles, you should always multiply verse divide first as you get better numerical stability:

select 
    column1 as a,
    column2 as b, 
    column3 as c,
    column4 as d, 
    round((a/b)/(c/d),3) as r1,
    round((a/b)*(d/c),3) as r2,
    round((a*d)/(b*c),3) as r3
from values
(1,2,3,4),
(1108,2606253,1375,319731698);

A	B	C	D	R1	R2	R3
1	2	3	4	0.667	0.667	0.667
1108	2606253	1375	319731698	106.250	98.826	98.857

So transcribing the generated SQL into working Snowflake SQL:

with __stage0 as (
    select
        group_set,
        __lateral_join_bag.name__3,
        __lateral_join_bag.state__3,
        iff(group_set=3, coalesce(sum(names.number),0), null) as a,
        max( iff(group_set=1, coalesce(sum(names.number),0), null) ) over (partition by __lateral_join_bag.state__3) as b,
        max( iff(group_set=2, coalesce(sum(names.number),0), null) ) over (partition by __lateral_join_bag.name__3) as c,
        max( iff(group_set=0, coalesce(sum(names.number),0), null) ) over() as d,
        (a/b) / (c/d) as popularity_ratio__3
    from test.public.use_1910_current as names
    cross join (
        select 
            row_number() over(order by null) -1 as group_set 
        from table(generator(ROWCOUNT =>4))
        //select column1 as group_set from values (0),(1),(2),(3)
    ) as x
    ,lateral  (
        select 
            case when x.group_set in(3,2) then names.name end as name__3,
            case when group_set in(3,1) then names.state end as state__3
    ) as __lateral_join_bag
    
    where names.year>1990
    group by 1,2,3
    having (group_set<>3 or (group_set=3 and (case when group_set=3 then coalesce(sum(names.number),0)end)>1000))
)
,__stage1 as (
    select name__3 as name,
        state__3 as state,
        round(any_value(case when group_set=3 then popularity_ratio__3 end),3) as popularity_ratio
    from __stage0
    where group_set not in (0,1,2)
    group by 1,2
    qualify row_number() over(partition by state order by popularity_ratio desc) <= 10
    order by 3 desc
)
,__stage2 as (
    select group_set,
        case when group_set in (0,1) then
            base.state 
        end as state__0,
        case when group_set=1 then
            base.name
        end as name__1,
        case when group_set=1 then
            base.popularity_ratio
        end as popularity_ratio__1
    from __stage1 as base
    cross join(select 
            row_number() over(order by null) -1 as group_set 
        from table(generator(ROWCOUNT =>2))) as x
    group by 1,2,3,4
)
select
    state__0 as state,
    listagg(case when group_set=1 then name__1||'('||popularity_ratio__1||')' end, ', ') within group (order by popularity_ratio__1 desc) as by_name_list_detail
from __stage2
group by 1
order by 1 asc;

There are some things that should be changed about this code, the order by 3 desc in __stage1 is a obvious waste of processing. But once you look at that, you notice __stage2 is joining to itself twice and then in the final select only taking one of those sets of results.

so the latter half of the code can be written:

,__stage1 as (
    select name__3 as name,
        state__3 as state,
        round(any_value(case when group_set=3 then popularity_ratio__3 end),3) as popularity_ratio
    from __stage0
    where group_set not in (0,1,2)
    group by 1,2
    qualify row_number() over(partition by state order by popularity_ratio desc) <= 10
    --order by 3 desc
)
select
    state,
    listagg(name||'('||popularity_ratio||')', ', ') within group (order by popularity_ratio desc) as by_name_list_detail
from __stage1
group by 1
order by 1 asc;

[July 2023]
10 months later, there was more to this post, but time has moved on… so this appears to just be old ramblings.. nothing new perhaps. Anyways many people rewrote the SQL nicely on HackerNews, so that is nice to see.

2022-05-12

Puzzles in SQL

The other day our office manager brought in an old style puzzle, that I used to do as a child.

After thinking about how I used to solve them, decided to let SQL do the grunt work for me:

I used lower case to mark the tail of the plane, and upper case to front of the plane.

The cards are encoded as these id’s

| 0 | 1 | 2 |
| 3 | 4 | 5 |
| 6 | 7 | 8 |

with data(id,a,b,c,d) as (
    select * from values 
    (0, 'b', 'w', 'B', 'Y' ),
    (1, 'y', 'g', 'B', 'W' ),
    (2, 'w', 'G', 'B', 'y' ),
    (3, 'b', 'y', 'G', 'W' ),
    (4, 'g', 'y', 'G', 'W' ),
    (5, 'b', 'w', 'G', 'Y' ),
    (6, 'B', 'Y', 'g', 'w' ),
    (7, 'g', 'w', 'B', 'Y' ),
    (8, 'w', 'y', 'B', 'G' )
)

then swapped those values to an encoding where “a correct match” sums to 10

), encode as (
    select id,
        decode(a, 'b', 1, 'B', 9, 'g', 2, 'G', 8, 'y', 3, 'Y', 7, 'w', 4, 'W', 6 ) as a,
        decode(b, 'b', 1, 'B', 9, 'g', 2, 'G', 8, 'y', 3, 'Y', 7, 'w', 4, 'W', 6 ) as b,
        decode(c, 'b', 1, 'B', 9, 'g', 2, 'G', 8, 'y', 3, 'Y', 7, 'w', 4, 'W', 6 ) as c,
        decode(d, 'b', 1, 'B', 9, 'g', 2, 'G', 8, 'y', 3, 'Y', 7, 'w', 4, 'W', 6 ) as d
    from data
)

and from there made the permutations of the 4 rotations of each tile

), tiles_rot as (
   select id
        ,decode((column1+0) % 4, 0 , a, 1, b, 2, c, 3, d) as a
        ,decode((column1+1) % 4, 0 , a, 1, b, 2, c, 3, d) as b
        ,decode((column1+2) % 4, 0 , a, 1, b, 2, c, 3, d) as c
        ,decode((column1+3) % 4, 0 , a, 1, b, 2, c, 3, d) as d
        ,column1 as r
   from encode cross join (values(0),(1),(2),(3)) as v
)

then wove those together with the don’t repeat the same tile rules, and enforce the “must match existing tiles on edges” rules for the 12 joins as the tiles are placed:

select 
    a.id || ' (0)' as a,
    b.id || ' (' || b.r || ')' as b, 
    c.id || ' (' || c.r || ')' as c, 
    d.id || ' (' || d.r || ')' as d, 
    e.id || ' (' || e.r || ')' as e, 
    f.id || ' (' || f.r || ')' as f, 
    g.id || ' (' || g.r || ')' as g, 
    h.id || ' (' || h.r || ')' as h, 
    i.id || ' (' || i.r || ')' as i
from encode as a
join tiles_rot as b
    on b.id != a.id 
    and a.b + b.d = 10 
join tiles_rot as c
    on c.id != a.id and c.id != b.id 
    and b.c + c.a = 10 
join tiles_rot as d
    on d.id != a.id and d.id != b.id and d.id != c.id 
    and a.c + d.a = 10 
    and c.d + d.b = 10
join tiles_rot as e
    on e.id != a.id and e.id != b.id and e.id != c.id 
    and e.id != d.id 
    and d.d + e.b = 10 
join tiles_rot as f
    on f.id != a.id and f.id != b.id and f.id != c.id 
    and f.id != d.id and f.id != e.id 
    and f.c +e.a = 10 and f.b + a.d = 10
join tiles_rot as g
    on g.id != a.id and g.id != b.id and g.id != c.id 
    and g.id != d.id and g.id != e.id and g.id != f.id 
    and f.a + g.c = 10
join tiles_rot as h
    on h.id != a.id and h.id != b.id and h.id != c.id 
    and h.id != d.id and h.id != e.id and h.id != f.id 
    and h.id != g.id 
    and h.d + g.b = 10 and h.c + a.a = 10
join tiles_rot as i
    on i.id != a.id and i.id != b.id and i.id != c.id 
    and i.id != d.id and i.id != e.id and i.id != f.id 
    and i.id != g.id and i.id != h.id 
    and i.d + h.b = 10 and i.c + b.a = 10

A	B	C	D	E	F	G	H	I
1 (0)	8 (0)	0 (0)	3 (0)	7 (0)	2 (3)	6 (3)	5 (1)	4 (1)
1 (0)	8 (0)	0 (0)	3 (0)	6 (2)	2 (3)	7 (1)	5 (1)	4 (1)
3 (0)	5 (0)	7 (0)	4 (0)	0 (0)	6 (2)	2 (3)	1 (0)	8 (0)
3 (0)	5 (0)	6 (2)	4 (0)	0 (0)	7 (0)	2 (3)	1 (0)	8 (0)

and it takes just ~1 second to run in Snowflake on an extra small instance.

and the layout is in this pattern:

| C | D | E |
| B | A | F |
| I | H | G |

looking at the four solutions, they are two base solutions where two cards can be swapped.. is interesting.

2021-11-08

Stack Overflow homework questions and deleting them

Well I answered a “homework” question today, trying to help someone “learn”, and as the question evolved recommended follow-up question be asked in another question, answer those also, looked latter points lost, question gone?? following my browser history to the prior question to find it gone? Mention this on the new question and it also gets deleted.

https://stackoverflow.com/questions/69875135/sql-sum-of-multiple-records-on-same-date

https://i.stack.imgur.com/qBQR6.png

https://stackoverflow.com/questions/69877332/fill-in-missing-dates-in-table-sql

Ok, so cloud_user / cloudaemon you knew you were meant to not be asking on StackOverflow, then don’t!

2021-08-09

URL Rewriting in Cloudfront with Lambda@Edge functions

So I previously put a Lambda@Edge function in place to fix the \blog\post\ requests to become \blog\post\index.html which is what the files are, and I also put a 403/404 error handler page to catch those 404 and log then to Analytics events, and sure enough, there where a number of pages that had been have errors.

I fixed the internal page links, assuming those where the primary dirvers, and moved on.

Those failures just keep happening though

So in the weekend I rewrote the lambda@edge function from

function handler(event) {
    var request = event.request;
    var uri = request.uri;
    
    // Check whether the URI is missing a file name.
    if (uri.endsWith('/')) {
        request.uri += 'index.html';
    } 
    // Check whether the URI is missing a file extension.
    else if (!uri.includes('.')) {
        request.uri += '/index.html';
    }

    return request;
}

to a new flashy one, that has a list of urls to rewrite. I was originally doing to use a javascript Set but those are not supported, so went for a dictionary.

'use strict';

var patch_url = 'https://simeonpilgrim.com/nikon-patch/nikon-patch.html';
var patch_response = {
	statusCode: 302,
	statusDescription: 'Found',
	headers: 
		{ "location": { "value": patch_url } }
};

var patch_redirect_set = {'/nikon-patch/Nikon-Patch.html': true,
	'/nikon-patch/nikon-patch-beta.html': true,
	'/nikon-patch-test/index.html': true,
	'/nikon-patch': true,
	'/nikon-patch/': true,
	};

function handler(event) {
    var request = event.request;
    var uri = request.uri;
    
    if( patch_redirect_set[uri] ){
		return patch_response;
	}
	
	if( patch_redirect_set[uri] ){
		return patch_response;
	}
    // Check whether the URI is missing a file name.
    if (uri.endsWith('/')) {
        request.uri += 'index.html';
    } 
    // Check whether the URI is missing a file extension.
    else if (!uri.includes('.')) {
        request.uri += '/index.html';
    }
    
    return request;
}

I rather like the Test page on the lambda pages, so once it was not failing, I published it, and tested. And it worked, so there we are. A framework to fix urls. It might make more sense to put the mapping in the dictionary, and if I end up wanting to fix non-patch pages I might go that route. But it’s more less wrong. So that’s a win.

2021-08-07

Getting Started Disassembling Nikon Firmware

Download the firmware you are wanting to improve. Steve Goossens has a really nice patching guide that has links to the Nikon download archives, if you are looking for older firmware.

Open the firmware in the Nikon-Patch Local UI tool (It is a C# project, so you will need a windows pc and Visual Studio 2019 Community to run it) and decode/exact the firmware

The main UI firmware is the B firmware, we will open that in Ghidra

So demonstrating with the D3300 1.02 firmware, we get the files a860_010101.bin and b860_102.bin from the decode step.

So create a new Ghidra project, then import the B firmware

Then select the ARM v7 little endian

Finding the firmware offset

For now ignore the Options, but we will come back and load the file at the correct offset later. Also for the next couple of step don’t save anything, as we will be deleting and re-importing files as we work out, where the files should be loaded.

Once we open the file, you will be prompted to “Auto disassemble this file” say No you that, and press d key on the first line

this will look like:

    //
    // ram 
    // ram:00000000-ram:016bff89
    //
         assume spsr = 0x0  (Default)
                         Reset                                           XREF[1]:     Entry Point(*)  
    00000000 18 f0 9f e5     ldr        pc=>LAB_90352fe0,[DAT_00000020]                  = 90352FE0h

This is noting the B firware reset jump address is loaded from 0x20 in the firmware, which is address 0x90352FE0, and Nikon B firmwares follow a pattern of being at 0xNN020000 thus this model camera is loaded at 0x90020000

Now we have the correct offset, close the CodeBrowser and don’t save

Deleted the firmware from the project, and confirm yes to permanently deleting

and reimport, but this time set the memory offset

Now we open the CodeBrower again, and again say no to the Analyze option. Because this time we are wanting to learn the address the itron kernel gets loaded to, and if we have that memory mapped before the auto magic process begins, there is a lot of errors avoided.

So here we are going to decode the first jump, and follow that twice, and then we will be in the reset function (at 0x90352fe8 hit f to mark that as a function) and then scroll there are some sections where a repeating pattern occurs, that is a serries of MemCopy operations.

The MemCopy functions usage look like this

clicking on the addres 9035329f we see some code that looks like this. It’s loading 48 bytes into all the spare registers via the LDMIA instruction and then writing them to the destination. This is the memcopy we are looking for.

So now we can right click on the function in the Disassembler window and Edit Function Signature

and then the jumping back to the reset (Alt+LeftArrow steps back) function will look like:

FUN_903533a0(PTR_DAT_903531cc,PTR_DAT_903531d0,0xae000000);
FUN_90020098();
memcopy((undefined4 *)PTR_LAB_903531d4,DAT_903531d8,DAT_903531dc + 3U & 0xfffffffc);
memcopy((undefined4 *)PTR_LAB_903531e0,DAT_903531e4,DAT_903531e8 + 3U & 0xfffffffc);
memcopy((undefined4 *)PTR_LAB_903531ec,DAT_903531f0,DAT_903531f4 + 3U & 0xfffffffc);
memcopy((undefined4 *)PTR_DAT_903531f8,DAT_903531fc,DAT_90353200 + 3U & 0xfffffffc);
memcopy((undefined4 *)PTR_DAT_90353204,DAT_90353208,DAT_9035320c + 3U & 0xfffffffc);
memcopy((undefined4 *)PTR_DAT_90353210,DAT_90353214,DAT_90353218 + 3U & 0xfffffffc);
FUN_900200a4();
dst = (undefined4 *)((DAT_9035321c & 0xfff00000) + 0x100000);

which looks strange, because the destination pointers are not followed. But we can alter the memory mapping to drop Write from the ROM area

and the code then becomes:

memcopy((undefined4 *)&LAB_9158a358,(undefined4 *)0x10000000,0x3b5b4);
memcopy((undefined4 *)&LAB_915c590c,(undefined4 *)0x1003b5b4,0x179c);
memcopy((undefined4 *)&LAB_915c590c,(undefined4 *)0x1003b5b4,0);
memcopy((undefined4 *)&DAT_916a73a0,(undefined4 *)0x1011d420,0x28b3c);
memcopy((undefined4 *)&DAT_916a73a0,(undefined4 *)0x1011d420,0);
memcopy((undefined4 *)&DAT_916a739c,(undefined4 *)0x1011d03c,4);

Now we know the offsets we need to add as memory maps, on the very screen we used to fix the ROM value, the “Memory Map” screen. Really only those three blocks are needed to be mapped, these are RAM area’s and there is some general RAM that needs to be identified, but for now this is the start we need.

Which is exactly what I have been noting in the Firmware Notes in the code repo, when I remember.

So now we can delete all this again, and load the code base firmware at the correct location, and add the memory maps, and then press Auto Analyze and walk away for a cup of tea.

Adding the Kernel memory maps

So using the find or Firmware Notes values we want to add the three blocks of memory for the Code, Data, and ? third data segments:

Adding RAM1 memory map, make sure you tick execute (x) and select Byte Mapped

repeat for RAM2 and RAM3

and you will end up with

Now it’s time to Auto Analyze

Going forward from here

There are number of landmarks to look out for, there is a chunk of kernel debugging, that you can find the strings for, and then find the stubbed print finctions, thus find all calls, those give a lot of details, there is a small amount of RTTI information, there are some documented protocols like the TIFF/JPG header writing code, the PTP handlers are driven by tables, so are quite easy to find, the EyeFi code can be found quickly.

And then there is the method of loading other camera models firmware, side-by-side, and looking for the addresses of existing patches and finding the related code.

I intend to write more on most of these things, but it also has taken me 4+ years to write this up, so it’s not going to happen overnight…

2021-08-01

It helps to plug it in

So at some point, I came across this picture, I feel it was some quality corporate compliance training. Anyways, it had these great chaps, collaborating, and synergising over stuff on those computers screens..

Let us collaborate and synergise, yes lets

Only problems, is those devices are very much “not plugged in” now I know, cables are ugly, and staged shots look best when they are not “real” I get it. But it also irks me. Maybe this is just be saying “I still have not grown up enough to not talking about meaningless things that irk me”.. yes that might be the only value. Still not there yet.

Google image search shows it used in a number of places so it seems like it’s from some Stock photo library.

2021-08-01

SageTV

Goodbye SageTV thanks for all good times

The time has come that the trusty SageTV has been turned off.

This year has not been the best for the SageTV and to be honest, all the problems where between my server (Win10) and the HDHomeRun. Back in September 2020, after a Windows Update, the SageTV process failed to connect to the HDHomeRun. After much fluffing around I installed OpenDCT and got that working, and SageTV could see that.

The a couple of weeks back both Sage/OpenDCT and EPG Collector stopped being able to see my HDHomeRun. Again more fluffing around, installing updates, I got it so the HDHomeRun tools could see the channels, and turn to stuff, and a little while later, so could SageTV see live TV, but EPG Collector never had any luck.

Then Michaela called it and said, the streaming apps where good enough, and it could be shut down. At one level I was super glad, to get to stop thinking about what to try next, or that I really should get back on to that task. And another level, super said to give up on the system. I had been a really neat tool for the last 10 years.

So other neat things, are I have freed up a chunk of disk space. And deleted multiple copies of installations, that were keep to debug changes. And deleted most the code changes I had done to SageTV, and EPG Collector, and tweaks to OpenDCT (I think those were submitted). Anyways. Letting code changes go. Feels really relaxing, permission to just forget those TODO’s.

Uninstalling 'XMLTV Importer for SageTV'

So now I have no Scheduled tasks, and can allow the PC to turn off at night.. so that all feels better, going forward.

I really value the community efforts of: SageTV, EPG Collector, Comskip, OpenDCT, XMLTV Importer for SageTV