Hi Team,
I have a time series Apache Iceberg table that has 3 columns:
Timestamp
ID
Value
I would like to get the latest value and timestamp for each ID in the dataset. This is a query that is likely going to be called to live refresh the data in a dashboard so it needs to be efficient.
Because I am not concerned about if I have a sensor that last operated months ago. I decided the most efficient way to tackle this is to have a time filter that filters the data down to its latest partition. In this case the table is partition into days. I can also filter by only the ids I want to look at.
Therefore I get the below query:
SELECT h.ts, h.id, h.data
FROM iceberg.table AS h
WHERE TO_DATE(h.ts) = CURRENT_DATE()
AND h.id IN (‘2817498’,‘2817497’)
The data however is captured at 15 minute intervals, so really what I want to do now is take this days worth of data from this query and remove the duplicates that don’t correspond to the latest date for each ID. How would I got about doing this? Or is there another way?
Note the latest timestamp for each id isn’t exactly the same hence I can’t just filter on the max timestamp.
This last step is quite trivial in the back end application that calls this query but it would be nice to have it all done in Dremio.