Linearization of Values

In my old job at Telogis, we had the fun task of adding up vehicle fuel and distance to calculate things like fuel efficiency.

So we might have some data that looks like:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-30 10:00:00', '10.0', '0.5'),
        (1, '2023-10-30 11:00:00', '5.0', '0.3')
)

and from that we would break it into days and sum it up, sort of like this:

select vehicle_id
    ,time_ts::date as day
    ,sum(dist_delta_km) as dist_km
    ,sum(fuel_delta_lt) as fuel_lt
    ,round(dist_km / fuel_lt, 2) as mpg_wink
from data
group by 1,2
order by 1,2;
VEHICLE_ID DAY DIST_KM FUEL_LT MPG_WINK*
1 2023-10-30 15 0.8 18.75

*Yes, that is not mpg, but I do not like the lt/km metric, and mpg is just the word I will use.

And sure enough, for a large fleet looking at a big picture of a week, this makes sense.

But if your data is unaligned (aka you get distance on most data rows, but only get fuel intermittently), over time you can get things like:

(1, '2023-10-29 20:00:00', '10.0', null),
(1, '2023-10-30 00:01:00', null, '5.0'),
(1, '2023-10-30 00:20:00', '25.0', null),
(1, '2023-10-30 00:21:00', null, '1.0')

So in this data we have an 8pm distance; just after midnight we get a fuel reading, and later some more distance and a smaller fuel change. Now this is a little odd, as the data is shown in delta (change) form,

whereas the gold standard of data, in my opinion, is absolute (aka the odometer style), as if data is missing the change is still calculable. With the delta form there is an implicit "last fuel reading" that the after-midnight value is "since".
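A quick sketch in Python (my own made-up numbers, not real fleet data) of why absolute readings are more forgiving than deltas when rows go missing:

```python
# Absolute (odometer-style) readings vs per-row deltas when a row is lost.
odometer = [100.0, 110.0, 135.0, 136.0]  # absolute km readings
deltas = [10.0, 25.0, 1.0]               # the same trip as per-row changes

# Lose the middle odometer reading: the total distance is still recoverable,
# because every remaining value is absolute.
surviving = [odometer[0], odometer[2], odometer[3]]
total_km = surviving[-1] - surviving[0]  # still 36.0

# Lose a delta row and its change is gone for good.
surviving_deltas = deltas[1:]            # first delta lost
lost_total = sum(surviving_deltas)       # 26.0; the 10 km have vanished
```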

But all those nit-picking details aside, the core point here is that, “the fuel mostly belongs to the 29th not the 30th”. So how might we share the love?

We linearize it!

So if we mash a zero fuel reading into the delta data, to give us a "since when" anchor for the data, like so:

(1, '2023-10-29 19:59:00', null, '0.0'),
(1, '2023-10-29 20:00:00', '10.0', null),
(1, '2023-10-30 00:01:00', null, '5.0'),
(1, '2023-10-30 00:20:00', '25.0', null),
(1, '2023-10-30 00:21:00', null, '1.0')

We now get a division-by-zero error... sigh. So let's make that safe with DIV0:

,round(div0(dist_km, fuel_lt), 2) as mpg_wink
VEHICLE_ID DAY DIST_KM FUEL_LT MPG_WINK
1 2023-10-29 10 0 0
1 2023-10-30 25 6 4.17

So we can find the prior time of the fuel reading, and then slice that span over days (I will limit it to a maximum of 4 days here).

So here we are going to use NVL2 to conditionally compute values when the current row's fuel (and, via the inner NVL2, the previous row's fuel) is not null:

select
    vehicle_id
    ,time_ts
    ,fuel_delta_lt
    ,nvl2(fuel_delta_lt,
        lag(fuel_delta_lt) ignore nulls over (partition by vehicle_id order by time_ts) 
        ,null) as prior_fuel_delta
    ,nvl2(fuel_delta_lt,
        lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
        ,null) as prior_fuel_time
from data
order by 1,2;

for the above data we get:

VEHICLE_ID TIME_TS FUEL_DELTA_LT PRIOR_FUEL_DELTA PRIOR_FUEL_TIME
1 2023-10-29 19:59:00.000 0 null null
1 2023-10-29 20:00:00.000 null null null
1 2023-10-30 00:01:00.000 5 0 2023-10-29 19:59:00.000
1 2023-10-30 00:20:00.000 null null null
1 2023-10-30 00:21:00.000 1 5 2023-10-30 00:01:00.000

so we can see that for rows that "have fuel" we find the prior row's fuel value, and the time from that same prior row. This could also be done with an OBJECT_CONSTRUCT, then one LAG on that, and then a pull-apart; for more values that can be more performant. As always, test, and re-test later.
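For reference, here is the same "prior fuel value and time" logic sketched in plain Python (my own function name, not part of any library), which may make the NVL2/LAG pairing easier to follow:

```python
# What the NVL2 + LAG ... IGNORE NULLS combination computes: for each row
# that has a fuel reading, the value and time of the previous fuel reading.
rows = [
    ("2023-10-29 19:59:00", 0.0),
    ("2023-10-29 20:00:00", None),
    ("2023-10-30 00:01:00", 5.0),
    ("2023-10-30 00:20:00", None),
    ("2023-10-30 00:21:00", 1.0),
]

def prior_fuel(rows):
    out = []
    last = None  # (time, fuel) of the most recent row with a fuel reading
    for ts, fuel in rows:
        if fuel is not None:           # the outer nvl2(fuel_delta_lt, ..., null)
            out.append((ts, fuel, last))
            last = (ts, fuel)
        else:                          # rows without fuel carry nulls
            out.append((ts, fuel, None))
    return out

result = prior_fuel(rows)
# the 00:01 row sees the 19:59 reading; the 00:21 row sees the 00:01 reading
```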

So we now have the prior time, and can slice that into days; given we are going to deal with "overnight" splits, we might as well fully chunk this data into N-day chunks.

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(fuel_delta_lt) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_delta
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
)
select pf.*
    ,fr.*
    ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
from prior_fuel as pf
left join max_fuel_range as fr
    on prior_fuel_time is not null
qualify 
   (pf.prior_fuel_time is null OR
    p_days < pf.time_ts)
order by 1,2;

So I am not loving that OR, but for hand-waving "something like this" purposes, we now have our data with day splits available:

VEHICLE_ID TIME_TS FUEL_DELTA_LT PRIOR_FUEL_DELTA PRIOR_FUEL_TIME RN P_DAYS
1 2023-10-29 19:59:00.000 0 null null null null
1 2023-10-29 20:00:00.000 null null null null null
1 2023-10-30 00:01:00.000 5 0 2023-10-29 19:59:00.000 0 2023-10-29 00:00:00.000
1 2023-10-30 00:01:00.000 5 0 2023-10-29 19:59:00.000 1 2023-10-30 00:00:00.000
1 2023-10-30 00:20:00.000 null null null null null
1 2023-10-30 00:21:00.000 1 5 2023-10-30 00:01:00.000 0 2023-10-30 00:00:00.000

So now we can LEAST/GREATEST the times to correctly bound each span, giving us the span duration and the record duration, and thus linearize that fuel:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0'),
        (2, '2023-10-28 23:00:00', null, '0.0'),
        (2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
)
select pf.*
    ,fr.*
    ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    ,greatest(pf.prior_fuel_time, p_days) as span_start
    ,least(dateadd('day',1,p_days), pf.time_ts) as span_end
    ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start))/3600 as span_len
    ,(date_part('epoch_second', pf.time_ts) - date_part('epoch_second', pf.prior_fuel_time))/3600 as record_len
    ,nvl2(pf.fuel_delta_lt, round(div0(fuel_delta_lt, record_len) * span_len, 3), null ) as linearize_fuel
from prior_fuel as pf
left join max_fuel_range as fr
    on prior_fuel_time is not null
qualify 
   (pf.prior_fuel_time is null OR
    p_days < pf.time_ts)
order by 1,2;
VEHICLE_ID TIME_TS FUEL_DELTA_LT PRIOR_FUEL_TIME RN P_DAYS SPAN_START SPAN_END SPAN_LEN RECORD_LEN LINEARIZE_FUEL
1 2023-10-29 19:59:00.000 0 null null null null null null null null
1 2023-10-29 20:00:00.000 null null null null null null null null null
1 2023-10-30 00:01:00.000 5 2023-10-29 19:59:00.000 0 2023-10-29 00:00:00.000 2023-10-29 19:59:00.000 2023-10-30 00:00:00.000 4.016667 4.033333 4.979
1 2023-10-30 00:01:00.000 5 2023-10-29 19:59:00.000 1 2023-10-30 00:00:00.000 2023-10-30 00:00:00.000 2023-10-30 00:01:00.000 0.016667 4.033333 0.021
1 2023-10-30 00:20:00.000 null null null null null null null null null
1 2023-10-30 00:21:00.000 1 2023-10-30 00:01:00.000 0 2023-10-30 00:00:00.000 2023-10-30 00:01:00.000 2023-10-30 00:21:00.000 0.333333 0.333333 1
2 2023-10-28 23:00:00.000 0 null null null null null null null null
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 0 2023-10-28 00:00:00.000 2023-10-28 23:00:00.000 2023-10-29 00:00:00.000 1 53 1
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 1 2023-10-29 00:00:00.000 2023-10-29 00:00:00.000 2023-10-30 00:00:00.000 24 53 24
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 2 2023-10-30 00:00:00.000 2023-10-30 00:00:00.000 2023-10-31 00:00:00.000 24 53 24
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 3 2023-10-31 00:00:00.000 2023-10-31 00:00:00.000 2023-10-31 04:00:00.000 4 53 4

yippie!

This should be jiggled around a little more; in particular, the linearization should do the multiplication prior to the division, for numerical stability.
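A toy Python illustration of why (my own numbers, pulled from the vehicle 1 row: 5.0 litres over a 14,520-second record, 14,460 seconds of it on the 29th). Python floats do not round intermediates by themselves, so an explicit round stands in for limited intermediate precision:

```python
# Divide-then-multiply rounds a tiny per-second rate, and that rounding
# error gets multiplied back up; multiply-then-divide rounds the final value.
fuel, span_len, record_len = 5.0, 14460, 14520

divide_first = round(fuel / record_len, 4) * span_len    # rate rounds to 0.0003
multiply_first = round(fuel * span_len / record_len, 4)  # rounds the final share

# divide_first is about 4.34 (way off); multiply_first is 4.9793 (correct)
```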

So it should look more like:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0'),
        (2, '2023-10-28 23:00:00', null, '0.0'),
        (2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
), fuel_spans as (
    select pf.*
        ,fr.*
        ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    from prior_fuel as pf
    left join max_fuel_range as fr
        on prior_fuel_time is not null
    qualify 
       (pf.prior_fuel_time is null OR
        p_days < pf.time_ts)
)
select fs.* exclude (prior_fuel_time, rn, p_days)
    ,greatest(fs.prior_fuel_time, fs.p_days) as span_start
    ,least(dateadd('day',1,fs.p_days), fs.time_ts) as span_end
    ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start)) as span_len
    ,(date_part('epoch_second', fs.time_ts) - date_part('epoch_second', fs.prior_fuel_time)) as record_len
    ,nvl2(fs.fuel_delta_lt, round(div0(fuel_delta_lt * span_len, record_len),4), null ) as linearize_fuel
from fuel_spans as fs
order by 1,2, span_start;

giving:

VEHICLE_ID TIME_TS FUEL_DELTA_LT SPAN_START SPAN_END SPAN_LEN RECORD_LEN LINEARIZE_FUEL
1 2023-10-29 19:59:00.000 0 null null null null null
1 2023-10-29 20:00:00.000 null null null null null null
1 2023-10-30 00:01:00.000 5 2023-10-29 19:59:00.000 2023-10-30 00:00:00.000 14,460 14,520 4.9793
1 2023-10-30 00:01:00.000 5 2023-10-30 00:00:00.000 2023-10-30 00:01:00.000 60 14,520 0.0207
1 2023-10-30 00:20:00.000 null null null null null null
1 2023-10-30 00:21:00.000 1 2023-10-30 00:01:00.000 2023-10-30 00:21:00.000 1,200 1,200 1
2 2023-10-28 23:00:00.000 0 null null null null null
2 2023-10-31 04:00:00.000 53 2023-10-28 23:00:00.000 2023-10-29 00:00:00.000 3,600 190,800 1
2 2023-10-31 04:00:00.000 53 2023-10-29 00:00:00.000 2023-10-30 00:00:00.000 86,400 190,800 24
2 2023-10-31 04:00:00.000 53 2023-10-30 00:00:00.000 2023-10-31 00:00:00.000 86,400 190,800 24
2 2023-10-31 04:00:00.000 53 2023-10-31 00:00:00.000 2023-10-31 04:00:00.000 14,400 190,800 4
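The whole span arithmetic can be sketched in a few lines of Python (my own helper, mirroring the SQL logic): clip the (prior_fuel_time, time_ts] record to day boundaries, then give each day its proportional share of the delta, which by construction sums back to the original value:

```python
from datetime import datetime, timedelta

def linearize(prior_ts, ts, fuel):
    """Split the fuel delta for (prior_ts, ts] proportionally across days."""
    record_len = (ts - prior_ts).total_seconds()
    day = prior_ts.replace(hour=0, minute=0, second=0, microsecond=0)
    out = []
    while day < ts:
        span_start = max(prior_ts, day)                # GREATEST
        span_end = min(day + timedelta(days=1), ts)    # LEAST
        span_len = (span_end - span_start).total_seconds()
        # multiply before dividing, for numerical stability
        out.append((day.date(), fuel * span_len / record_len))
        day += timedelta(days=1)
    return out

# vehicle 2: 53 litres over 53 hours gives 1, 24, 24 and 4 litres per day
parts = linearize(datetime(2023, 10, 28, 23), datetime(2023, 10, 31, 4), 53.0)
```

The per-day pieces always totalling the raw delta makes a handy sanity check when re-testing later.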

So here we have "one signal", fuel, stretched linearly over time, so that fuel used yesterday can be counted against "yesterday". So bringing the original sum code back to life, keeping both the OG calculation and the linear fuel:

with data(vehicle_id, time_ts, dist_delta_km, fuel_delta_lt ) as (
    select 
        $1,
        try_to_timestamp_ntz($2, 'yyyy-mm-dd hh:mi:ss'),
        try_to_number($3, 10, 1),
        try_to_number($4, 10, 1)
    from values 
        (1, '2023-10-29 19:59:00', null, '0.0'),
        (1, '2023-10-29 20:00:00', '10.0', null),
        (1, '2023-10-30 00:01:00', null, '5.0'),
        (1, '2023-10-30 00:20:00', '25.0', null),
        (1, '2023-10-30 00:21:00', null, '1.0')--,
        --(2, '2023-10-28 23:00:00', null, '0.0'),
        --(2, '2023-10-31 4:00:00', null, '53.0')
), prior_fuel as (
    select
        vehicle_id
        ,time_ts
        ,dist_delta_km
        ,fuel_delta_lt
        ,nvl2(fuel_delta_lt,
            lag(nvl2(fuel_delta_lt, time_ts, null)) ignore nulls over (partition by vehicle_id order by time_ts) 
            ,null) as prior_fuel_time
    from data
), max_fuel_range(rn) as (
    select * from values (0),(1),(2),(3)
), fuel_spans as (
    select pf.*
        ,fr.*
        ,dateadd('day', fr.rn, date_trunc('day', pf.prior_fuel_time)) as p_days
    from prior_fuel as pf
    left join max_fuel_range as fr
        on prior_fuel_time is not null
    qualify 
       (pf.prior_fuel_time is null OR
        p_days < pf.time_ts)
), fuel_linearized as (
    select fs.* exclude (prior_fuel_time, rn, p_days)
        ,greatest(fs.prior_fuel_time, fs.p_days) as span_start
        ,least(dateadd('day',1,fs.p_days), fs.time_ts) as span_end
        ,(date_part('epoch_second', span_end) - date_part('epoch_second', span_start)) as span_len
        ,(date_part('epoch_second', fs.time_ts) - date_part('epoch_second', fs.prior_fuel_time)) as record_len
        ,nvl2(fs.fuel_delta_lt, round(div0(fuel_delta_lt * span_len, record_len),4), null ) as linearize_fuel
    from fuel_spans as fs
)
select 
    vehicle_id
    ,nvl(span_start, time_ts)::date as day
    ,sum(dist_delta_km) as dist_km
    ,sum(iff(span_end = time_ts, fuel_delta_lt, null)) as og_fuel_lt
    ,round(div0(dist_km, og_fuel_lt), 2) as og_mpg
    ,sum(linearize_fuel) as lin_fuel_lt
    ,round(div0(dist_km, lin_fuel_lt), 2) as lin_mpg    
from fuel_linearized as fl
group by 1,2
order by 1,2;
VEHICLE_ID DAY DIST_KM OG_FUEL_LT OG_MPG LIN_FUEL_LT LIN_MPG
1 2023-10-29 10 null null 4.9793 2.01
1 2023-10-30 25 6 4.17 1.0207 24.49

So we should linearize the distance also… which is a rinse and repeat of the above.

But the next thing to consider is "how much fuel is spent while the vehicle is off". Which is to say, if the vehicle does not report anything while off, and is off for a day (like that vehicle 2 data), and we also know it was off, we do not want to treat that off time equally. This is Weighted Linearization, and I will post on that next.