Can you please confirm whether my understanding regarding virtual datasets is correct:
Whenever new data arrives in the source, the physical dataset will be refreshed as per the metadata refresh settings, and the virtual dataset (derived from the physical dataset) will be refreshed automatically as soon as the physical dataset is refreshed?
Hey Monika, if you don’t have any reflections, refreshing metadata for all sources/tables involved in the virtual dataset should be sufficient.
If you have reflections that affect the VDS, you’d first refresh metadata and then the relevant reflections for that VDS (either on the VDS’s parents or on the VDS itself). After this, you should get up-to-date results.
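As a minimal sketch of the no-reflection case (the source, dataset, and column names below are hypothetical, so substitute your own paths):

```sql
-- Refresh metadata on each physical dataset the VDS selects from
-- (hypothetical HDFS and PostgreSQL paths):
ALTER TABLE hdfs_src.sales_data REFRESH METADATA;
ALTER TABLE postgres_src.public.customers REFRESH METADATA;

-- The VDS stores no data itself, so a subsequent query on it should
-- pick up whatever the refreshed parents now expose:
SELECT COUNT(*) FROM my_space.sales_vds;
```

If you do have reflections, this metadata refresh still comes first; the reflection refresh follows afterwards.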
@can Thanks for the quick response. I have created a VDS by joining 2 physical datasets (HDFS & PostgreSQL) and refreshed both, but the VDS still shows old data only. Please suggest. There is no reflection on the VDS.
Hmm. Are you observing the stale data in the UI or when you run queries over JDBC/ODBC? I’m wondering if you are doing a “Preview” instead of a “Run” in the UI. “Previews” are meant to give you a relatively quick peek at the virtual dataset, and are not meant for actual analysis or correctness.
Could you share a query profile for a job where you observed this?
Also, could you describe the expected results vs. the observed? Are the tables not getting joined? Is Dremio treating all your join keys as NULL?
@Can I was testing metadata refresh on a physical dataset with large data. Strange to see only the old record count.
To give you more background, I have mapped around 15 million records from HDFS, and the metadata refresh interval is set to 1 minute. I added a few new files to the HDFS directory. Even after 10 minutes, count(*) displays only the old count.
I even refreshed metadata using ALTER TABLE <> REFRESH METADATA, but no luck.
Old Count : 15652833
New Count : 17045521
Only removing the format and adding it again for the physical data source shows the correct number.
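For reference, the check looks roughly like this (the table path below is a hypothetical placeholder for my HDFS dataset):

```sql
-- Force a metadata refresh on the HDFS-backed physical dataset:
ALTER TABLE hdfs_src.sales_data REFRESH METADATA;

-- Re-check the row count; it should include the newly added files:
SELECT COUNT(*) FROM hdfs_src.sales_data;
```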
I think that shouldn’t be the case. The dataset should be refreshed automatically according to its refresh interval.
@can
I tried hard to replicate this scenario but am unable to do it now. I will let you know if it occurs in the future. Refreshing metadata results in ‘MetaData for table <> refreshed’.
Hi @can
Can you please clarify my understanding regarding reflections:
When we create a reflection on a PDS, can the same reflection be used for optimizing a VDS query as well?
If there is any change in the data source (let’s say new data arrives), then the reflection should be refreshed to use it further (no need to re-create it; let it refresh automatically based on the refresh interval)?
Hey Monika, yep, reflections you have on parent PDSs (that a VDS selects from) may be leveraged to accelerate queries on the VDS. Dremio considers all relevant reflections, even if they are not directly on the dataset you have in your user query (e.g. reflections on the parent, or reflections on other VDSs if they apply).
You can either have automatic refreshes through the refresh policy, or trigger them manually via the APIs. Remember that you’ll need to refresh metadata before refreshing reflections to use the latest data available in your source systems.
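As a rough sketch (dataset and column names are hypothetical, the exact reflection DDL varies by Dremio version, and reflections can also be managed from the dataset’s Reflections tab in the UI), a raw reflection on the HDFS parent could look like this:

```sql
-- A raw reflection defined on the parent PDS; the optimizer may substitute it
-- when queries run against the VDS that joins this PDS with the PostgreSQL table:
ALTER DATASET hdfs_src.sales_data
CREATE RAW REFLECTION sales_raw
USING
  DISPLAY (order_id, customer_id, order_date, amount)
  PARTITION BY (order_date)
  LOCALSORT BY (customer_id);

-- When new files land in the source, refresh metadata first:
ALTER TABLE hdfs_src.sales_data REFRESH METADATA;
-- The reflection then refreshes on its refresh-policy schedule, or it can be
-- triggered manually (via the REST API or the UI) rather than through SQL.
```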
@can Thanks for the confirmation.
I have a scenario below where a reflection on a PDS is not used to accelerate a VDS query:
I created a reflection on the PDS (HDFS source) and a VDS from the HDFS source & PostgreSQL source (by joining both), but the reflection created on the PDS is not used by the VDS.
@Monika_Goel it’s probably better if you post this on a separate thread, as this discussion is no longer about VDS stale data.
Quickly looking at the profiles: the results here are expected. Dremio’s optimizer chooses not to use the reflection you have because it’s pretty much the same format/structure as your original Parquet files (meaning they have the same performance and cost). I’d recommend checking this post from Kelly, as it’s very relevant to what you’re seeing right now (S3 vs. HDFS shouldn’t matter in this case): Why use reflection on reading data from S3?
@can Thanks for your response. I will start a new thread for this, since in the reflection I have used the partitioned & sorted functionality, and in the query I put filters on the partitioned column and an ORDER BY on the sorted column.
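For context, the query shape is roughly the following (dataset and column names are illustrative, matching the kind of partition/sort layout sketched earlier in the thread):

```sql
-- Filter on the reflection's partition column and sort on its LOCALSORT column,
-- which is the pattern expected to benefit from a partitioned/sorted reflection:
SELECT order_id, customer_id, amount
FROM my_space.sales_vds
WHERE order_date >= DATE '2019-01-01'
ORDER BY customer_id;
```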