Could you please share a link or some steps on how to use a modified date for “Identify new records using the field”? Currently I cannot find “Identify new records using the field” under dataset settings -> Reflection Refresh. How do I make it appear? Do I have to add a new column to the dataset/PDS to hold the modified date?
I think you are asking about the field to define for incremental refreshes.
If it is a file-system-based source, there is only one choice: “Incremental update based on new files”. If it is a source such as an RDBMS (Oracle/Postgres) or a NoSQL store like Mongo, you can define a field with the below restrictions
Kindly let us know if you have any other questions.
Thanks a ton for your quick response. I have one more question: why can’t we set the reflection refresh policy in dataset settings for a VDS? That is, a PDS has a reflection refresh policy in its dataset settings, but a VDS does not. Why?
I ran a test and found that the VDS’s refresh policy follows the one defined on the underlying PDS. Is that the rule you set up?
Currently we take a bottom-up approach where we always refresh the PDS. To refresh a reflection on a VDS, simply refresh one of the PDSs referenced in the VDS definition using a REST API call.
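For anyone who wants to script that call, here is a minimal sketch. It assumes the v3 catalog API exposes the refresh as POST /api/v3/catalog/{id}/refresh and that the login token is sent in an Authorization header prefixed with _dremio; please check both against the docs for your Dremio version. The host, dataset id, and token are placeholders, and build_refresh_request is a hypothetical helper name.

```python
import urllib.request
from urllib.parse import quote


def build_refresh_request(base_url: str, dataset_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST asking Dremio to refresh a physical dataset.

    Assumption: the endpoint shape is {base}/api/v3/catalog/{id}/refresh.
    """
    # Percent-encode the id in case it contains reserved characters such as "/".
    encoded_id = quote(dataset_id, safe="")
    url = f"{base_url.rstrip('/')}/api/v3/catalog/{encoded_id}/refresh"
    return urllib.request.Request(
        url,
        method="POST",
        # Assumption: the token comes from a prior login call.
        headers={"Authorization": f"_dremio{token}"},
    )


# To actually send it (needs a reachable Dremio instance):
#   req = build_refresh_request("http://localhost:9047", "<pds-id>", "<token>")
#   urllib.request.urlopen(req)
```

The id must be that of a PDS referenced by the VDS, not the id of the VDS itself.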
OK, I see. What we have discussed so far is about raw reflections. Do aggregation reflections also support an incremental refresh policy?
From my testing: if I choose “Incremental update based on new files” as the Refresh Method and then click “Refresh Now” under Refresh Policy, the aggregation reflection can be used to accelerate the SQL: SELECT sum("Count") FROM testfile. Does that mean Dremio supports incremental refresh for aggregation reflections?
Thanks a million.
Is there anything that would cause this not to work? I am adding new batches of files to an S3 source and the dependent reflection does not update no matter what I do. In fact, the only way to get a different query result is to have no reflections defined at all.
Edit: Manually hitting Reflect reflections makes it refresh, but that’s the only thing that works. Hitting the dataset refresh API endpoint doesn’t work either.
Reflections refresh every X hours, which can be configured per source under Reflection Refresh. As for triggering “Refresh Now” through the REST API, are you sure you are using the correct id?
OK, so I figured out the issue (mostly repeating what you said here, @doron):
- I was under the impression that incremental refresh updated the reflection as new files arrived, but it still refreshes on the hourly schedule and just uses incremental refresh to make each refresh more efficient.
- The refresh API applies to the physical dataset. Its id was hard to get: I had to URL-encode it and then tweak the result, but I eventually found it. After that the refresh worked, although it was very slow, taking up to 20-30 seconds for a simple dataset that consisted of fewer than 10,000 ids.
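In case it helps others with the id and encoding step, here is a rough sketch of how the URLs could be built. It assumes the v3 catalog API offers a lookup at GET /api/v3/catalog/by-path/{path}, and that a dataset id containing reserved characters (for example one that looks like dremio:/source/folder) has to be percent-encoded before it goes into a URL. The function names, host, source name, and folder names are made up for illustration.

```python
from urllib.parse import quote


def catalog_by_path_url(base_url: str, path_parts: list[str]) -> str:
    """URL for looking up a catalog entity (and thus its id) by its path.

    Each path component is encoded on its own so that the "/" separators
    between components are preserved.
    """
    encoded = "/".join(quote(part, safe="") for part in path_parts)
    return f"{base_url.rstrip('/')}/api/v3/catalog/by-path/{encoded}"


def encode_dataset_id(dataset_id: str) -> str:
    """Percent-encode a dataset id so it fits in a single URL path segment."""
    # safe="" makes quote() encode "/" and ":" as well.
    return quote(dataset_id, safe="")


if __name__ == "__main__":
    # Hypothetical S3 source and folder names.
    print(catalog_by_path_url("http://localhost:9047", ["my s3", "bucket", "batch files"]))
    print(encode_dataset_id("dremio:/my s3/bucket/batch files"))
```

The JSON returned by the lookup should contain the dataset's id, which is the value to use in the refresh call.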
One issue I am noticing may be a big one. I was under the impression that when a reflection is stale, it only affects the speed of the query and not its accuracy, but in testing I find that the query actually returns stale results when reflections are not up to date. Does this match your understanding?