Physical Datasets disappearing

We are using Dremio 19.0 CE and we frequently find that physical datasets promoted from S3 files disappear from the catalog: a query fails with "Error while expanding view" and we have to promote them again.

Note that we have the "Remove dataset definitions if underlying data is unavailable" option, so we don’t expect them to be automatically forgotten.

What can we do to investigate the issue?

@ma82 Have you unchecked the above option? Under the Dremio log folder there will be a file called metadata_refresh.log, plus older metadata_refresh logs under the log/archive folder with the date as part of the file name. If you look through these logs, do you see Dremio losing connectivity to S3?
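Here is a rough Python sketch of how you could scan those files; the log directory and the search pattern are assumptions on my part, so point them at your installation and at whatever your logs actually contain:

```python
import gzip
import re
from pathlib import Path

# Assumption: default log location; point this at your Dremio log folder.
DREMIO_LOG_DIR = Path("/var/log/dremio")
# Assumption: rough pattern for connectivity trouble; tune it to your log contents.
PATTERN = re.compile(r"S3|connect|timeout|exception", re.IGNORECASE)

def scan(path, opener=open):
    """Print lines from one log file that hint at S3 connectivity problems."""
    with opener(path, "rt", errors="replace") as f:
        for line in f:
            if PATTERN.search(line):
                print(f"{path}: {line.rstrip()}")

# The current log, then the dated archives under log/archive.
scan(DREMIO_LOG_DIR / "metadata_refresh.log")
for archived in sorted((DREMIO_LOG_DIR / "archive").glob("metadata_refresh*")):
    scan(archived, gzip.open if archived.suffix == ".gz" else open)
```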

@balaji.ramaswamy

Thanks for your reply.

Yes, we had the "Remove dataset definitions if underlying data is unavailable" option unchecked; sorry for writing imprecisely.

As a workaround we decided to try enabling the "Automatically format files into physical datasets when users issue queries" option. It seems to work correctly, but we’d prefer not to incur the cost of reformatting datasets that we expect to already be formatted.

I’m looking for metadata_refresh.log, but in our Helm chart-based setup I cannot find it in any of the pods.
Should I look into the Kubernetes logs of these pods instead? If so, can I filter them somehow?

@ma82 You are right, with the Helm chart deployment all logs probably go to standard out (see the sketch after these questions for one way to filter them). A few questions:

  • Does your Dremio coordinator go down unexpectedly?
  • Do you change any settings on your S3 source?
  • Does Dremio lose connection to S3 at any time?
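Since the logs are only on standard out, something like the sketch below could pull and filter them via the official Kubernetes Python client; the namespace, label selector, and keywords are placeholders I made up for illustration, so substitute whatever your Helm release actually applies to the coordinator pods:

```python
from kubernetes import client, config

# Assumptions: namespace, label selector, and keywords are placeholders;
# use the values your Helm release actually applies to the coordinator pods.
NAMESPACE = "dremio"
LABEL_SELECTOR = "app=dremio-coordinator"
KEYWORDS = ("metadata", "refresh", "s3")

config.load_kube_config()  # reads ~/.kube/config; use load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items:
    log = v1.read_namespaced_pod_log(pod.metadata.name, NAMESPACE, timestamps=True)
    for line in log.splitlines():
        if any(keyword in line.lower() for keyword in KEYWORDS):
            print(f"{pod.metadata.name}: {line}")
```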

Hi @balaji.ramaswamy

  • Our Dremio coordinator doesn’t go down unexpectedly. We do sometimes restart it or the executors (e.g. for Kubernetes node updates), and in those cases we do see physical datasets disappearing; however, that doesn’t explain all the sudden losses of catalog entries.
  • We change settings on our S3 source very rarely, so we don’t expect this to be a relevant cause.
  • We can’t see any sign of lost connections to S3; that is not something I would expect to happen frequently from EKS pods inside an AWS VPC, so I don’t think it is a relevant factor.

Please let me know whether there is a particular message I can grep for.

@ma82 Would you be able to check whether a metadata refresh was happening in the background during the restarts? Basically, match the restart timestamps against metadata_refresh.log and see if a refresh was in progress.
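A small correlation script could look like the sketch below; the restart times and the timestamp format are placeholder assumptions, not something Dremio guarantees, so adjust both to your environment:

```python
from datetime import datetime, timedelta

# Assumptions: the restart times are placeholders (take the real ones from
# `kubectl describe pod` or your maintenance records), and the timestamp
# format is a guess at the log layout; adjust both to your environment.
RESTARTS = [datetime(2022, 3, 1, 14, 30)]
WINDOW = timedelta(minutes=10)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # parsed from the first 19 characters of a line

def refreshes_near_restarts(log_path):
    """Print metadata refresh log lines that fall within WINDOW of a restart."""
    with open(log_path, errors="replace") as f:
        for line in f:
            try:
                ts = datetime.strptime(line[:19], TS_FORMAT)
            except ValueError:
                continue  # skip lines without a leading timestamp
            if any(abs(ts - restart) <= WINDOW for restart in RESTARTS):
                print(line.rstrip())

refreshes_near_restarts("metadata_refresh.log")
```

Any line it prints is a refresh that overlapped a restart window and is worth a closer look.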