Virtual Dataset shows stale data

Monika_Goel · September 19, 2018, 7:39pm

Can you please confirm if my understanding regarding virtual dataset is correct

Whenever new data arrive in Source, physical data set will be refreshed as per MetaData refresh settings and virtual data set (derived from physical data set) will be refreshed automatically as soon as physical data set is refreshed?

can · September 19, 2018, 7:49pm

Hey Monika, if you don’t have any reflections, refreshing metadata for all sources/tables involved in the virtual dataset should be sufficient.

If you have reflections that affect the VDS, you’d first refresh metadata and then the relevant reflections for that VDS (either on VDS’s parents, or the VDS itself). After this, you should get up to date results.

Monika_Goel · September 19, 2018, 7:52pm

@can Thanks for quick response. I have created VDS by joining 2 physical Dataset (HDFS & postgreSQL) and refreshed both. Both still VDS show old data only. Please suggest. No reflection on VDS.

can · September 19, 2018, 7:53pm

Do you have reflections on the tables from HDFS and postgreSQL?

Monika_Goel · September 19, 2018, 7:55pm

@can No reflection yet. Just doing a POC with product.

can · September 19, 2018, 8:04pm

Hmm. Are you observing the stale data in the UI or when you run queries over JDBC/ODBC? I’m wondering if you are doing a “Preview” instead of a “Run” in the UI. “Previews” are meant to give you a relatively quick peak of the virtual dataset, and are not meant for actual analysis or correctness.

Monika_Goel · September 19, 2018, 8:31pm

@Can I can view stale data in UI only and executing run command.

Basically join columns from PostgreSQL are mapped as NULL as I Truncated and loaded PostgreSQl again.

can · September 19, 2018, 11:36pm

Could you share a query profile for a job that you observed this?
Also could you describe the expected results vs. the observed? Are the tables not getting joined? Is Dremio thinking all your join keys are NULL?

Monika_Goel · September 20, 2018, 8:15pm

@Can Was testing MetaData Refresh on physical dataset with large data. Strange to see old record count only.
To give you more background, I have mapped around 15 million records from HDFS and MetaData refresh interval is set to 1 minute. Added few new files to HDFS directory. Even after 10 minutes count(*) displays old count only.
Even Refreshed Metadata using ALTER TABLE <> REFRESH METADATA but no luck.

Old Count : 15652833
New Count : 17045521

Only Removing and adding format again for physical data source shows correct number.
I think that shouldn’t be the case. Dataset should be refreshed automatically according to it’s refresh interval.

can · September 20, 2018, 9:18pm

A few questions:

Could you share the output of REFRESH METADATA command? Here is an example:

image.png1121×446 48.3 KB
Do you see repeating errors in server.log? If any, should be around every 1 minutes or so based on your refresh interval.
Could you share a screenshot of the metadata refresh policy for your HDFS source (and dataset if diff. from source refresh policy)?

Monika_Goel · September 21, 2018, 2:11pm

@can
Tried hard to replicate this scenario but unable to do it now. Will let you know if it occurs in future.Refresh MetaData results in ‘MetaData for table <> refreshed’.

Thanks for looking into it.

Monika_Goel · September 21, 2018, 8:22pm

Hi @can
Can you please clarify my understanding regarding reflection:
When we create a reflection on PDS then same reflection can be used for optimizating VDS query as well?
If there is any change in Data Source (lets say new data arrives) then reflection should be refreshed to use it further (No need to re-create it, let it refresh automatically based on refresh interval)?

Thanks

can · September 21, 2018, 8:47pm

Hey Monika, yeap, reflections you have on parent PDSs (that a VDS selects from), may be leveraged to accelerate queries on the VDS. Dremio considers all relevant reflections, even if they are not directly on the dataset you have in you user query (e.g. reflections on the parent, reflections on other VDSs if they apply).

You can either have automatic refreshes through the refresh policy, or trigger them manually yourself via APIs. Remember that you’ll need to refresh metadata before refreshing reflections to use the latest data available in your source systems.

Monika_Goel · September 24, 2018, 3:14pm

@can Thanks for confirmation.
Have below scenario where reflection on PDS is not used for accelerating VDS query;
Created a reflection on PDS (HDFS source) and a VDS from HDFS source & PostgreSQL source (by joining both). But reflection created on PDS is not used by VDS.

can · September 24, 2018, 6:55pm

@Monika_Goel its probably better if you post this on a separate thread – as this discussion is no longer regarding VDS stale data.

Quickly looking at the profiles: The results here are expected, Dremio’s optimizer chooses not to use the reflection you have because its pretty much the same format/structure as your original parquet files (meaning they have the same perf. & cost.) I’d recommend checking this post from Kelly, as it’s very relevant to what you’re seeing right now (S3 vs. HDFS shouldn’t matter in this case): Why use reflection on reading data from S3?

Monika_Goel · September 24, 2018, 7:22pm

@can Thanks for your response. Will start a new thread for this as in reflection I have used partitioned & sorted functionality and in query put filters on partitioned column and order by on sorted column.

Topic		Replies	Views
Some virtual datasets are out of date and need to be manually updated	11	2759	February 20, 2021
Refreshing a VDS reflection	1	1964	September 10, 2018
Bug in the process of refreshing physical dataset : dependent reflections refreshed too soon?	6	1653	November 28, 2018
Refreshing reflection of VDS never triggered	3	2058	July 14, 2021
External_query() reflection refresh schedule - what is it?	2	121	June 27, 2024

Virtual Dataset shows stale data

Related topics