Linearization of Values

In my old job at Telogis, we had the fun task of adding up vehicle fuel and distance to calculate things like fuel efficiency.

So we might have some data that looks like:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-30 10:00:00', '10.0', '0.5'),
        (1, '2023-10-30 11:00:00', '5.0', '0.3')
)

and from that we would break it into days and sum it up, sort of like this:

select vehicle_id
    ,time_ts::date as day
    ,sum(dist_delta_km) as dist_km
    ,sum(fuel_delta_lt) as fuel_lt
    ,round(dist_km / fuel_lt, 2) as mpg_wink
from data
group by 1,2
order by 1,2;
VEHICLE_ID DAY DIST_KM FUEL_LT MPG_WINK*
1 2023-10-30 15 0.8 18.75

*Yes, that is not mpg, but I do not like the lt/km metric, and mpg is just the word I will use.

And sure enough, for a large fleet looking at a big picture of a week, this makes sense.

But if your data is unaligned (aka you get distance on most data rows, but only get fuel intermittently), over time you can get things like:

(1, '2023-10-29 20:00:00', '10.0', null),
(1, '2023-10-30 00:01:00', null, '5.0'),
(1, '2023-10-30 00:20:00', '25.0', null),
(1, '2023-10-30 00:21:00', null, '1.0')

So in this data we have an 8pm distance; just after midnight we get a fuel reading, and later some more distance and a smaller fuel change. Now this is a little odd, as the data is shown in delta (change) form,

whereas the gold standard of data, in my opinion, is absolute (aka the odometer style), as if data is missing the change is still calculable. With the delta form there is an implicit "last fuel reading" that the after-midnight value is "since".
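A quick sketch in Python (my own made-up numbers, not real fleet data) of why absolute readings are more forgiving than deltas when rows go missing:

```python
# Absolute (odometer-style) readings vs per-row deltas when a row is lost.
odometer = [100.0, 110.0, 135.0, 136.0]  # absolute km readings
deltas = [10.0, 25.0, 1.0]               # the same trip as per-row changes

# Lose the middle odometer reading: the total distance is still recoverable,
# because every remaining value is absolute.
surviving = [odometer[0], odometer[2], odometer[3]]
total_km = surviving[-1] - surviving[0]  # still 36.0

# Lose a delta row and its change is gone for good.
surviving_deltas = deltas[1:]            # first delta lost
lost_total = sum(surviving_deltas)       # 26.0; the 10 km have vanished
```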

But all those nit-picking details aside, the core point here is that, “the fuel mostly belongs to the 29th not the 30th”. So how might we share the love?

We linearize it!

So if we mash a zero fuel reading into the delta data, to give us a "since when" anchor for the data, like so:

(1, '2023-10-29 19:59:00', null, '0.0'),
(1, '2023-10-29 20:00:00', '10.0', null),
(1, '2023-10-30 00:01:00', null, '5.0'),
(1, '2023-10-30 00:20:00', '25.0', null),
(1, '2023-10-30 00:21:00', null, '1.0')

We now get a division-by-zero error... sigh. So let's make that safe with DIV0:

,round(div0(dist_km, fuel_lt), 2) as mpg_wink
VEHICLE_ID DAY DIST_KM FUEL_LT MPG_WINK
1 2023-10-29 10 0 0
1 2023-10-30 25 6 4.17

So we can find the prior time of the fuel reading, and then slice that span over days (I will limit it to a maximum of 4 days here).

So here we are going to use NVL2 to conditionally compute values when the current row's fuel (and, via the inner NVL2, the previous row's fuel) is not null:

select
    vehicle_id
    ,time_ts
    ,fuel_delta_lt
    ,nvl2(fuel_delta_lt,
        lag(fuel_delta_lt) ignore nulls over (partition by vehicle_id order by time_ts) 
        ,null) as prior_fuel_delta
    ,nvl2(fuel_delta_lt,
        lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
        ,null) as prior_fuel_time
from data
order by 1,2;

for the above data we get:

VEHICLE_ID TIME_TS FUEL_DELTA_LT PRIOR_FUEL_DELTA PRIOR_FUEL_TIME
1 2023-10-29 19:59:00.000 0 null null
1 2023-10-29 20:00:00.000 null null null
1 2023-10-30 00:01:00.000 5 0 2023-10-29 19:59:00.000
1 2023-10-30 00:20:00.000 null null null
1 2023-10-30 00:21:00.000 1 5 2023-10-30 00:01:00.000

so we can see that for rows that "have fuel" we find the prior row's fuel value, and the time from that same prior row. This could also be done with an OBJECT_CONSTRUCT, then one LAG on that, and then a pull-apart; for more values that can be more performant. As always, test, and re-test later.
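For reference, here is the same "prior fuel value and time" logic sketched in plain Python (my own function name, not part of any library), which may make the NVL2/LAG pairing easier to follow:

```python
# What the NVL2 + LAG ... IGNORE NULLS combination computes: for each row
# that has a fuel reading, the value and time of the previous fuel reading.
rows = [
    ("2023-10-29 19:59:00", 0.0),
    ("2023-10-29 20:00:00", None),
    ("2023-10-30 00:01:00", 5.0),
    ("2023-10-30 00:20:00", None),
    ("2023-10-30 00:21:00", 1.0),
]

def prior_fuel(rows):
    out = []
    last = None  # (time, fuel) of the most recent row with a fuel reading
    for ts, fuel in rows:
        if fuel is not None:           # the outer nvl2(fuel_delta_lt, ..., null)
            out.append((ts, fuel, last))
            last = (ts, fuel)
        else:                          # rows without fuel carry nulls
            out.append((ts, fuel, None))
    return out

result = prior_fuel(rows)
# the 00:01 row sees the 19:59 reading; the 00:21 row sees the 00:01 reading
```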

So we now have the prior time, and can slice that into days; given we are going to deal with "overnight" splits, we might as well fully chunk this data into N-day chunks.

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(fuel_delta_lt) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_delta
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
)
select pf.*
    ,fr.*
    ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
from prior_fuel as pf
left join max_fuel_range as fr
    on prior_fuel_time is not null
qualify 
   (pf.prior_fuel_time is null OR
    p_days < pf.time_ts)
order by 1,2;

So I am not loving that OR, but for hand-waving "something like this" purposes, we now have our data with day splits available:

VEHICLE_ID TIME_TS FUEL_DELTA_LT PRIOR_FUEL_DELTA PRIOR_FUEL_TIME RN P_DAYS
1 2023-10-29 19:59:00.000 0 null null null null
1 2023-10-29 20:00:00.000 null null null null null
1 2023-10-30 00:01:00.000 5 0 2023-10-29 19:59:00.000 0 2023-10-29 00:00:00.000
1 2023-10-30 00:01:00.000 5 0 2023-10-29 19:59:00.000 1 2023-10-30 00:00:00.000
1 2023-10-30 00:20:00.000 null null null null null
1 2023-10-30 00:21:00.000 1 5 2023-10-30 00:01:00.000 0 2023-10-30 00:00:00.000

So now we can LEAST/GREATEST the times to correctly bound each span, giving us the span duration and the record duration, and thus linearize that fuel:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0'),
        (2, '2023-10-28 23:00:00', null, '0.0'),
        (2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
)
select pf.*
    ,fr.*
    ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    ,greatest(pf.prior_fuel_time, p_days) as span_start
    ,least(dateadd('day',1,p_days), pf.time_ts) as span_end
    ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start))/3600 as span_len
    ,(date_part('epoch_second', pf.time_ts) - date_part('epoch_second', pf.prior_fuel_time))/3600 as record_len
    ,nvl2(pf.fuel_delta_lt, round(div0(fuel_delta_lt, record_len) * span_len, 3), null ) as linearize_fuel
from prior_fuel as pf
left join max_fuel_range as fr
    on prior_fuel_time is not null
qualify 
   (pf.prior_fuel_time is null OR
    p_days < pf.time_ts)
order by 1,2;
VEHICLE_ID TIME_TS FUEL_DELTA_LT PRIOR_FUEL_TIME RN P_DAYS SPAN_START SPAN_END SPAN_LEN RECORD_LEN LINEARIZE_FUEL
1 2023-10-29 19:59:00.000 0 null null null null null null null null
1 2023-10-29 20:00:00.000 null null null null null null null null null
1 2023-10-30 00:01:00.000 5 2023-10-29 19:59:00.000 0 2023-10-29 00:00:00.000 2023-10-29 19:59:00.000 2023-10-30 00:00:00.000 4.016667 4.033333 4.979
1 2023-10-30 00:01:00.000 5 2023-10-29 19:59:00.000 1 2023-10-30 00:00:00.000 2023-10-30 00:00:00.000 2023-10-30 00:01:00.000 0.016667 4.033333 0.021
1 2023-10-30 00:20:00.000 null null null null null null null null null
1 2023-10-30 00:21:00.000 1 2023-10-30 00:01:00.000 0 2023-10-30 00:00:00.000 2023-10-30 00:01:00.000 2023-10-30 00:21:00.000 0.333333 0.333333 1
2 2023-10-28 23:00:00.000 0 null null null null null null null null
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 0 2023-10-28 00:00:00.000 2023-10-28 23:00:00.000 2023-10-29 00:00:00.000 1 53 1
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 1 2023-10-29 00:00:00.000 2023-10-29 00:00:00.000 2023-10-30 00:00:00.000 24 53 24
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 2 2023-10-30 00:00:00.000 2023-10-30 00:00:00.000 2023-10-31 00:00:00.000 24 53 24
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 3 2023-10-31 00:00:00.000 2023-10-31 00:00:00.000 2023-10-31 04:00:00.000 4 53 4

yippie!

This should be jiggled around a little more; in particular, the linearization should do the multiplication prior to the division, for numerical stability.
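A toy Python illustration of why (my own numbers, pulled from the vehicle 1 row: 5.0 litres over a 14,520-second record, 14,460 seconds of it on the 29th). Python floats do not round intermediates by themselves, so an explicit round stands in for limited intermediate precision:

```python
# Divide-then-multiply rounds a tiny per-second rate, and that rounding
# error gets multiplied back up; multiply-then-divide rounds the final value.
fuel, span_len, record_len = 5.0, 14460, 14520

divide_first = round(fuel / record_len, 4) * span_len    # rate rounds to 0.0003
multiply_first = round(fuel * span_len / record_len, 4)  # rounds the final share

# divide_first is about 4.34 (way off); multiply_first is 4.9793 (correct)
```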

So it should look more like:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0'),
        (2, '2023-10-28 23:00:00', null, '0.0'),
        (2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
), fuel_spans as (
    select pf.*
        ,fr.*
        ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    from prior_fuel as pf
    left join max_fuel_range as fr
        on prior_fuel_time is not null
    qualify 
       (pf.prior_fuel_time is null OR
        p_days < pf.time_ts)
)
select fs.* exclude (prior_fuel_time, rn, p_days)
    ,greatest(fs.prior_fuel_time, fs.p_days) as span_start
    ,least(dateadd('day',1,fs.p_days), fs.time_ts) as span_end
    ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start)) as span_len
    ,(date_part('epoch_second', fs.time_ts) - date_part('epoch_second', fs.prior_fuel_time)) as record_len
    ,nvl2(fs.fuel_delta_lt, round(div0(fuel_delta_lt * span_len, record_len),4), null ) as linearize_fuel
from fuel_spans as fs
order by 1,2, span_start;

giving:

VEHICLE_ID TIME_TS FUEL_DELTA_LT SPAN_START SPAN_END SPAN_LEN RECORD_LEN LINEARIZE_FUEL
1 2023-10-29 19:59:00.000 0 null null null null null
1 2023-10-29 20:00:00.000 null null null null null null
1 2023-10-30 00:01:00.000 5 2023-10-29 19:59:00.000 2023-10-30 00:00:00.000 14,460 14,520 4.9793
1 2023-10-30 00:01:00.000 5 2023-10-30 00:00:00.000 2023-10-30 00:01:00.000 60 14,520 0.0207
1 2023-10-30 00:20:00.000 null null null null null null
1 2023-10-30 00:21:00.000 1 2023-10-30 00:01:00.000 2023-10-30 00:21:00.000 1,200 1,200 1
2 2023-10-28 23:00:00.000 0 null null null null null
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 2023-10-29 00:00:00.000 3,600 190,800 1
2 2023-10-31 04:00:00.000 53 2023-10-29 00:00:00.000 2023-10-30 00:00:00.000 86,400 190,800 24
2 2023-10-31 04:00:00.000 53 2023-10-30 00:00:00.000 2023-10-31 00:00:00.000 86,400 190,800 24
2 2023-10-31 04:00:00.000 53 2023-10-31 00:00:00.000 2023-10-31 04:00:00.000 14,400 190,800 4
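The whole span arithmetic can be sketched in a few lines of Python (my own helper, mirroring the SQL logic): clip the (prior_fuel_time, time_ts] record to day boundaries, then give each day its proportional share of the delta, which by construction sums back to the original value:

```python
from datetime import datetime, timedelta

def linearize(prior_ts, ts, fuel):
    """Split the fuel delta for (prior_ts, ts] proportionally across days."""
    record_len = (ts - prior_ts).total_seconds()
    day = prior_ts.replace(hour=0, minute=0, second=0, microsecond=0)
    out = []
    while day < ts:
        span_start = max(prior_ts, day)                # GREATEST
        span_end = min(day + timedelta(days=1), ts)    # LEAST
        span_len = (span_end - span_start).total_seconds()
        # multiply before dividing, for numerical stability
        out.append((day.date(), fuel * span_len / record_len))
        day += timedelta(days=1)
    return out

# vehicle 2: 53 litres over 53 hours gives 1, 24, 24 and 4 litres per day
parts = linearize(datetime(2023, 10, 28, 23), datetime(2023, 10, 31, 4), 53.0)
```

The per-day pieces always totalling the raw delta makes a handy sanity check when re-testing later.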

So here we have "one signal", fuel, stretched linearly over time, so that fuel used yesterday can be counted against "yesterday". So bringing the original sum code back to life, keeping both the OG calculation and the linear fuel:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0')--,
        --(2, '2023-10-28 23:00:00', null, '0.0'),
        --(2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,dist_delta_km
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
), fuel_spans as (
    select pf.*
        ,fr.*
        ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    from prior_fuel as pf
    left join max_fuel_range as fr
        on prior_fuel_time is not null
    qualify 
       (pf.prior_fuel_time is null OR
        p_days < pf.time_ts)
), fuel_linearized as (
    select fs.* exclude (prior_fuel_time, rn, p_days)
        ,greatest(fs.prior_fuel_time, fs.p_days) as span_start
        ,least(dateadd('day',1,fs.p_days), fs.time_ts) as span_end
        ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start)) as span_len
        ,(date_part('epoch_second', fs.time_ts) - date_part('epoch_second', fs.prior_fuel_time)) as record_len
        ,nvl2(fs.fuel_delta_lt, round(div0(fuel_delta_lt * span_len, record_len),4), null ) as linearize_fuel
    from fuel_spans as fs
)
select 
    vehicle_id
    ,nvl(span_start, time_ts)::date as day
    ,sum(dist_delta_km) as dist_km
    ,sum(iff(span_end = time_ts, fuel_delta_lt, null)) as og_fuel_lt
    ,round(div0(dist_km, og_fuel_lt), 2) as og_mpg
    ,sum(linearize_fuel) as lin_fuel_lt
    ,round(div0(dist_km, lin_fuel_lt), 2) as lin_mpg    
from fuel_linearized as fl
group by 1,2
order by 1,2;
VEHICLE_ID DAY DIST_KM OG_FUEL_LT OG_MPG LIN_FUEL_LT LIN_MPG
1 2023-10-29 10 null null 4.9793 2.01
1 2023-10-30 25 6 4.17 1.0207 24.49

So we should linearize the distance also… which is a rinse and repeat of the above.

But the next thing to consider is "how much fuel is spent while the vehicle is off". Which is to say, if the vehicle does not report anything while off, and is off for a day (like that vehicle 2 data), and we also know it was off, we do not want to treat that off time equally. This is Weighted Linearization, and I will post on that next.