Refresh Metadata Taking Long Time

Hi, I’m very interested in this too. My team has always used attached disk storage (EBS on AWS) as a data source, but our goal is to move to S3 as our data grows. We now have almost 400 GB of partitioned Parquet data, and yes, the problem you noticed is one of the biggest we are facing; it is preventing us from moving to S3 with Dremio. If the data source is the disk, the metadata refresh takes just a few seconds, but if we use S3 it takes 15 to 20 minutes (same data replicated on S3). Our case is even worse, since the data is updated asynchronously by other processes and can change very quickly. We also want to read the new data within a few minutes, and this is not possible given the metadata refresh delay (and during a metadata refresh you should not write/read data…).
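For reference, this is how we trigger the refresh manually when timing it; a minimal sketch, assuming an S3 source named `s3source` with a promoted dataset `s3source.events` (both names are made up for illustration):

```sql
-- Hypothetical dataset path; replace with your own promoted dataset.
-- FORCE UPDATE tells Dremio to re-read the metadata even if it
-- considers the cached copy still valid.
ALTER PDS s3source.events REFRESH METADATA FORCE UPDATE;
```

Against EBS this returns in seconds for us; against S3 the same statement runs for 15 to 20 minutes.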
Things become weird when you modify or delete a file that already exists… By default Dremio caches S3 data locally, so your queries keep returning old data even after the underlying data has been updated or removed, until you run a metadata refresh. You can of course disable the caching, but that only solves the updated-data case: if you delete a file from the collection, you must refresh the metadata as soon as possible to be sure that Dremio does not fail with a “NoSuchKeyException” when attempting to read the source object.
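If it helps anyone hitting the same “NoSuchKeyException”, the workaround we use after deletes is the `DELETE WHEN MISSING` clause of the refresh statement, which (as far as I understand it) drops metadata entries for files that no longer exist in the source; a sketch using the same hypothetical dataset name as above:

```sql
-- Drop metadata for files that have disappeared from S3, so Dremio
-- stops planning reads against deleted objects.
ALTER PDS s3source.events REFRESH METADATA FORCE UPDATE DELETE WHEN MISSING;
```

This still runs a full refresh, though, so it does nothing for the 15 to 20 minute delay itself.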

I think the functionality Dremio is really missing is a procedure to tell the metastore which individual files in the collection have been updated or removed (please, someone tell me if such a procedure already exists). I saw this available in other engines like PrestoSQL (see the sketch below), which we also found more resilient and more consistent under data updates, without showing the problems listed above (we are reconsidering it right now…). Funny thing is… yesterday I read a Dremio-Presto comparison (edited by Dremio) and Dremio does indeed seem faster in many ways, but I wonder whether all the caching and the strict metadata handling play a part in that performance difference.
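For comparison, the PrestoSQL feature I had in mind is the Hive connector’s `sync_partition_metadata` procedure, which reconciles the metastore with what is actually on S3; a minimal sketch, assuming a Hive catalog named `hive` with a `web.page_views` table (names are made up):

```sql
-- Reconcile metastore partitions with the files actually on storage.
-- The mode argument can be ADD (register new partitions),
-- DROP (remove vanished ones) or FULL (both).
CALL hive.system.sync_partition_metadata('web', 'page_views', 'FULL');
```

It works at the partition level rather than the file level, but in practice it covers both the updated and the removed cases I described above.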
