Virtual Dataset shows stale data


#1

Can you please confirm if my understanding regarding virtual dataset is correct

Whenever new data arrive in Source, physical data set will be refreshed as per MetaData refresh settings and virtual data set (derived from physical data set) will be refreshed automatically as soon as physical data set is refreshed?


#2

Hey Monika, if you don’t have any reflections, refreshing metadata for all sources/tables involved in the virtual dataset should be sufficient.

If you have reflections that affect the VDS, you’d first refresh metadata and then the relevant reflections for that VDS (either on VDS’s parents, or the VDS itself). After this, you should get up to date results.


#3

@can Thanks for quick response. I have created VDS by joining 2 physical Dataset (HDFS & postgreSQL) and refreshed both. Both still VDS show old data only. Please suggest. No reflection on VDS.


#4

Do you have reflections on the tables from HDFS and postgreSQL?


#5

@can No reflection yet. Just doing a POC with product.


#6

Hmm. Are you observing the stale data in the UI or when you run queries over JDBC/ODBC? I’m wondering if you are doing a “Preview” instead of a “Run” in the UI. “Previews” are meant to give you a relatively quick peak of the virtual dataset, and are not meant for actual analysis or correctness.

image


#7

@Can I can view stale data in UI only and executing run command.

Basically join columns from PostgreSQL are mapped as NULL as I Truncated and loaded PostgreSQl again.


#8

Could you share a query profile for a job that you observed this?
Also could you describe the expected results vs. the observed? Are the tables not getting joined? Is Dremio thinking all your join keys are NULL?


#9

@Can Was testing MetaData Refresh on physical dataset with large data. Strange to see old record count only.
To give you more background, I have mapped around 15 million records from HDFS and MetaData refresh interval is set to 1 minute. Added few new files to HDFS directory. Even after 10 minutes count(*) displays old count only.
Even Refreshed Metadata using ALTER TABLE <> REFRESH METADATA but no luck.

Old Count : 15652833
New Count : 17045521

Only Removing and adding format again for physical data source shows correct number.
I think that shouldn’t be the case. Dataset should be refreshed automatically according to it’s refresh interval.


#10

A few questions:

  • Could you share the output of REFRESH METADATA command? Here is an example:
  • Do you see repeating errors in server.log? If any, should be around every 1 minutes or so based on your refresh interval.
  • Could you share a screenshot of the metadata refresh policy for your HDFS source (and dataset if diff. from source refresh policy)?

#11

@can
Tried hard to replicate this scenario but unable to do it now. Will let you know if it occurs in future.Refresh MetaData results in ‘MetaData for table <> refreshed’.

Thanks for looking into it.


#12

Hi @can
Can you please clarify my understanding regarding reflection:
When we create a reflection on PDS then same reflection can be used for optimizating VDS query as well?
If there is any change in Data Source (lets say new data arrives) then reflection should be refreshed to use it further (No need to re-create it, let it refresh automatically based on refresh interval)?

Thanks


#13

Hey Monika, yeap, reflections you have on parent PDSs (that a VDS selects from), may be leveraged to accelerate queries on the VDS. Dremio considers all relevant reflections, even if they are not directly on the dataset you have in you user query (e.g. reflections on the parent, reflections on other VDSs if they apply).

You can either have automatic refreshes through the refresh policy, or trigger them manually yourself via APIs. Remember that you’ll need to refresh metadata before refreshing reflections to use the latest data available in your source systems.


#14

@can Thanks for confirmation.
Have below scenario where reflection on PDS is not used for accelerating VDS query;
Created a reflection on PDS (HDFS source) and a VDS from HDFS source & PostgreSQL source (by joining both). But reflection created on PDS is not used by VDS.


#15

@Monika_Goel its probably better if you post this on a separate thread – as this discussion is no longer regarding VDS stale data.

Quickly looking at the profiles: The results here are expected, Dremio’s optimizer chooses not to use the reflection you have because its pretty much the same format/structure as your original parquet files (meaning they have the same perf. & cost.) I’d recommend checking this post from Kelly, as it’s very relevant to what you’re seeing right now (S3 vs. HDFS shouldn’t matter in this case): Why use reflection on reading data from S3?


#16

@can Thanks for your response. Will start a new thread for this as in reflection I have used partitioned & sorted functionality and in query put filters on partitioned column and order by on sorted column.