Retrieve rows based on the latest timestamp for each group

Hi Team,

I have a time series Apache Iceberg table that has 3 columns:
Timestamp
ID
Value

I would like to get the latest value and timestamp for each ID in the dataset. This is a query that is likely going to be called to live refresh the data in a dashboard so it needs to be efficient.

Because I am not concerned about if I have a sensor that last operated months ago. I decided the most efficient way to tackle this is to have a time filter that filters the data down to its latest partition. In this case the table is partition into days. I can also filter by only the ids I want to look at.

Therefore I get the below query:

SELECT h.ts, h.id, h.data
FROM iceberg.table AS h
WHERE TO_DATE(h.ts) = CURRENT_DATE()
AND h.id IN (‘2817498’,‘2817497’)

The data however is captured at 15 minute intervals, so really what I want to do now is take this days worth of data from this query and remove the duplicates that don’t correspond to the latest date for each ID. How would I got about doing this? Or is there another way?

Note the latest timestamp for each id isn’t exactly the same hence I can’t just filter on the max timestamp.

This last step is quite trivial in the back end application that calls this query but it would be nice to have it all done in Dremio.

It’s good that you partitioned by date and included a filter on that in your query.

Maybe you can use a window function that partitions on ID with sort on timestamp to get the first record in each ID partition?

@Benny_Chow thanks for the response. Any advice on how to actually write that as a query? Prior to posting this question I had a go at using FIRST_VALUE here.
https://docs.dremio.com/software/sql-reference/sql-functions/functions/FIRST_VALUE/?parent=window

However the output of this is not quite as I had expected as its creates a new column that returns the results of the expression against the original data. Therefore does not quite solve the problem of returning a table with the duplicates removed that don’t correspond to the latest date for each ID.

Perhaps something like this:

SELECT state, city, _id, ROW_NUMBER() OVER(PARTITION BY state ORDER BY _id DESC) AS row_num
FROM "zips.json"
QUALIFY row_num = 1;

Note QUALIFY was added in v23. On prior versions that didn’t support QUALIFY yet, you could achieve the same with an outer select and filter on row_num.

Thanks @Benny_Chow I will give this go. I will need to upgrade my Dremio instance as I am on v22.