Retrieve rows based on the latest timestamp for each group

nikhil.makan · January 13, 2023, 3:13am

Hi Team,

I have a time series Apache Iceberg table that has 3 columns:
Timestamp
ID
Value

I would like to get the latest value and timestamp for each ID in the dataset. This is a query that is likely going to be called to live refresh the data in a dashboard so it needs to be efficient.

Because I am not concerned about if I have a sensor that last operated months ago. I decided the most efficient way to tackle this is to have a time filter that filters the data down to its latest partition. In this case the table is partition into days. I can also filter by only the ids I want to look at.

Therefore I get the below query:

SELECT h.ts, h.id, h.data
FROM iceberg.table AS h
WHERE TO_DATE(h.ts) = CURRENT_DATE()
AND h.id IN (‘2817498’,‘2817497’)

The data however is captured at 15 minute intervals, so really what I want to do now is take this days worth of data from this query and remove the duplicates that don’t correspond to the latest date for each ID. How would I got about doing this? Or is there another way?

Note the latest timestamp for each id isn’t exactly the same hence I can’t just filter on the max timestamp.

This last step is quite trivial in the back end application that calls this query but it would be nice to have it all done in Dremio.

Benny_Chow · January 13, 2023, 7:01pm

It’s good that you partitioned by date and included a filter on that in your query.

Maybe you can use a window function that partitions on ID with sort on timestamp to get the first record in each ID partition?

nikhil.makan · January 15, 2023, 10:25pm

@Benny_Chow thanks for the response. Any advice on how to actually write that as a query? Prior to posting this question I had a go at using FIRST_VALUE here.
https://docs.dremio.com/software/sql-reference/sql-functions/functions/FIRST_VALUE/?parent=window

However the output of this is not quite as I had expected as its creates a new column that returns the results of the expression against the original data. Therefore does not quite solve the problem of returning a table with the duplicates removed that don’t correspond to the latest date for each ID.

Benny_Chow · January 15, 2023, 11:05pm

Perhaps something like this:

SELECT state, city, _id, ROW_NUMBER() OVER(PARTITION BY state ORDER BY _id DESC) AS row_num
FROM "zips.json"
QUALIFY row_num = 1;

Note QUALIFY was added in v23. On prior versions that didn’t support QUALIFY yet, you could achieve the same with an outer select and filter on row_num.

nikhil.makan · January 17, 2023, 7:44am

Thanks @Benny_Chow I will give this go. I will need to upgrade my Dremio instance as I am on v22.

Topic		Replies	Views
SQL Filtering on Timestamps Returning Incorrect Records	7	4773	March 22, 2018
Dremio not using partition filter	3	162	May 19, 2024
Resampling/Group By Time Series Data	3	1375	February 2, 2023
Upcoming support for query Apache Iceberg deleted data?	5	1127	November 25, 2021
Optimizing Incremental Reflections in Dremio: Handling Newly Inserted Data with Timestamp Columns	6	71	January 8, 2025

Retrieve rows based on the latest timestamp for each group

Related topics