Refresh Metadata Taking Long Time

Hi,

We are using Dremio with S3-compatible storage as the data source. Our datasets on the data lake are updated (new files are appended or overwritten) every 3 hours. To reflect the new files in the Dremio PDS, we refresh the metadata with the query below after every write to the source.

ALTER PDS <PHYSICAL-DATASET-PATH> REFRESH METADATA

But the metadata refresh takes around 15 minutes. The dataset is ~100 GB.

I am not sure why it takes 15 minutes or more just to refresh the metadata. Is this expected, or is there an alternative to refreshing the metadata?

Note: I also tried forgetting the metadata and recreating the PDS through the API. Recreating the PDS took about the same time (18 minutes).
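
For anyone who wants to script the refresh after each write, here is a minimal Python sketch of how the statement above could be built and submitted over REST. It is a sketch under assumptions, not a verified setup: I am assuming the SQL endpoint is `/api/v3/sql`, that the login token is sent as `_dremio<token>` in the `Authorization` header, and that path components are quoted with double quotes. The source and dataset names are hypothetical placeholders.

```python
import json
import urllib.request


def quote_path(parts):
    """Quote each path component as a SQL identifier, doubling embedded quotes.

    Assumption: double-quoted identifiers, as in standard SQL.
    """
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)


def build_refresh_sql(parts):
    """Return the refresh statement from this post for a given dataset path."""
    return f"ALTER PDS {quote_path(parts)} REFRESH METADATA"


def build_request(base_url, token, sql):
    """Build (but do not send) a POST request carrying the SQL statement.

    Assumptions: the endpoint path and the "_dremio" token prefix.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v3/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "_dremio" + token,
        },
        method="POST",
    )


def submit(request):
    """Send the request and return the decoded JSON reply (needs a live server)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical path: source "s3source", folder "bucket", dataset "my_dataset".
    print(build_refresh_sql(["s3source", "bucket", "my_dataset"]))
```

This only automates the call. It does not make the refresh itself any faster.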

Hi, I’m very interested in this too. My team has always used attached disk storage (EBS on AWS) as the data source, but our goal is to move to S3 as the data grows. We now have almost 400 GB of partitioned Parquet data and yes, the problem you noticed is one of the biggest we are facing, and it is preventing us from moving to S3 with Dremio. When the data source is the disk, the metadata refresh takes just a few seconds, but on S3 it takes 15 to 20 minutes (same data replicated on S3). Our case is even worse, since the data is updated asynchronously by other processes and can change very quickly. We also want to read new data within a few minutes, which is not possible given the metadata refresh delay (and during a metadata refresh you should not write or read data…).
Things get weird when you modify or delete a file that already exists… By default Dremio caches the data from S3 locally, so queries keep returning old data even after the data has been updated or removed, until you run a metadata refresh. You can of course disable the caching, but that only solves the updated-data case: if you delete a file from the collection, you must refresh the metadata as soon as possible to make sure Dremio does not fail with a “NoSuchKeyException” when it tries to read the source object.

I think the functionality Dremio is really missing is a procedure to tell the metastore which individual files in the collection have been updated or removed (someone please tell me if such a procedure already exists). I saw this in other engines like PrestoSQL, which we also found more resilient and consistent under data updates, without the problems listed above (we are reconsidering it right now…). Funny thing is… yesterday I read a Dremio-Presto comparison (written by Dremio) and Dremio does seem faster in many ways, but I wonder whether all the caching and the strict metadata handling play a part in that performance difference.

@Narendra1154,

What is the S3 compatible source? Is the long metadata refresh only observed for the S3 compatible source?

What version of Dremio are you using?

@Luca,
Same question for you: what version of Dremio are you experiencing the issue with?

@ben This is a problem we have had since the first versions (we adopted Dremio a couple of years ago). I am now testing with the latest community release, 4.6.1. The Parquet files are written with pyarrow (version >=0.14) and are Snappy compressed. For the same data, the metadata refresh takes 15 seconds on disk and about 30 minutes on S3. This is why we are still using EBS as storage, but we must move to S3 soon. Disk storage has another problem anyway: if you modify a Parquet file inside the PDS, you must refresh the metadata before querying, otherwise Dremio may fail the query with a “magic number” error. The same does not happen on S3 (as stated in my previous post).

@ben,

The S3 source format is Parquet, partitioned by a column. We are using the Enterprise Edition, version 4.3.1.

Is there a way for a Dremio PDS to automatically read all the files under a folder, something like <folder_name>/*, so that whenever a new file is added we don’t have to do a refresh?

What we observed is that when overwriting, if the file names stay the same, we don’t need to refresh the metadata. It is only needed when new file names appear.
I understand that the metadata refresh also collects some stats, such as counts, which takes more time.
Is there a service that just recognizes the newly added files?
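
Until there is an answer on that, one workaround sketch based on the observation above: compare the folder listing before and after each write, and only trigger the refresh when the set of file names has changed. This is plain Python over two `{object_key: etag}` snapshots. How the listing is obtained from the object store is left out, and the function names are made up for illustration. Removed files are treated as needing a refresh too, given the “NoSuchKeyException” mentioned earlier in the thread.

```python
def diff_listing(before, after):
    """Compare two {object_key: etag} snapshots of the dataset folder."""
    added = sorted(k for k in after if k not in before)
    removed = sorted(k for k in before if k not in after)
    # Same key in both snapshots but a different etag means an in-place overwrite.
    overwritten = sorted(
        k for k in after if k in before and after[k] != before[k]
    )
    return {"added": added, "removed": removed, "overwritten": overwritten}


def needs_refresh(delta):
    """Refresh only when file names were added or removed.

    Assumption taken from this thread: overwrites that keep the same
    file names do not require a metadata refresh.
    """
    return bool(delta["added"] or delta["removed"])


if __name__ == "__main__":
    # Hypothetical keys and etags, for illustration only.
    before = {"col=1/part-0.parquet": "etag-a"}
    after = {"col=1/part-0.parquet": "etag-b", "col=2/part-0.parquet": "etag-c"}
    delta = diff_listing(before, after)
    print(delta, needs_refresh(delta))
```

This only avoids unnecessary refreshes. When new file names do appear, the full refresh is still required.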

@ben
S3-compatible source: IBM Cloud Object Storage
Dremio version: 4.3.1 (Enterprise Edition)

We don’t have any other sources, so I couldn’t test against anything else.

The time taken for a reflection refresh also increases if it is run after executing

ALTER PDS <PHYSICAL-DATASET-PATH> REFRESH METADATA

Is there any other way to update the metadata without increasing the reflection refresh time?