Refresh Metadata Taking Long Time

Hi, I’m very interested in this too. My team has always used attached disk storage (EBS on AWS) as a data source, but our goal is to move to S3 as our data grows. We now have almost 400 GB of partitioned Parquet data, and yes, the problem you noticed is one of the biggest we are facing; it is preventing us from moving to S3 with Dremio. If the data source is the disk, the metadata refresh takes just a few seconds, but if we use S3 it takes 15 to 20 minutes (same data replicated on S3). Our case is even worse, since the data is updated asynchronously by other processes and can change very quickly. We also want to read the new data within a few minutes, and this is not possible given the metadata refresh delay (and during a metadata refresh you should not write/read data…).
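For reference, this is how we trigger the refresh manually when timing it; a minimal sketch, assuming an S3 source named `s3source` with a promoted dataset `s3source.events` (both names are made up for illustration):

```sql
-- Hypothetical dataset path; replace with your own promoted dataset.
-- FORCE UPDATE tells Dremio to re-read the metadata even if it
-- considers the cached copy still valid.
ALTER PDS s3source.events REFRESH METADATA FORCE UPDATE;
```

Against EBS this returns in seconds for us; against S3 the same statement runs for 15 to 20 minutes.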
Things become weird when you modify or delete a file that already exists… By default Dremio caches S3 data locally, so your queries keep returning old data even after the underlying data has been updated or removed, until you run a metadata refresh. You can of course disable the caching, but that only solves the updated-data case: if you delete a file from the collection, you must refresh the metadata as soon as possible to be sure that Dremio does not fail with a “NoSuchKeyException” when attempting to read the source object.
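If it helps anyone hitting the same “NoSuchKeyException”, the workaround we use after deletes is the `DELETE WHEN MISSING` clause of the refresh statement, which (as far as I understand it) drops metadata entries for files that no longer exist in the source; a sketch using the same hypothetical dataset name as above:

```sql
-- Drop metadata for files that have disappeared from S3, so Dremio
-- stops planning reads against deleted objects.
ALTER PDS s3source.events REFRESH METADATA FORCE UPDATE DELETE WHEN MISSING;
```

This still runs a full refresh, though, so it does nothing for the 15 to 20 minute delay itself.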

I think the functionality Dremio is really missing is a procedure to tell the metastore which individual files in the collection have been updated or removed (please, someone tell me if such a procedure already exists). I saw this available in other engines like PrestoSQL (see the sketch below), which we also found more resilient and more consistent under data updates, without showing the problems listed above (we are reconsidering it right now…). Funny thing is… yesterday I read a Dremio-Presto comparison (edited by Dremio) and Dremio does indeed seem faster in many ways, but I wonder whether all the caching and the strict metadata handling play a part in that performance difference.
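For comparison, the PrestoSQL feature I had in mind is the Hive connector’s `sync_partition_metadata` procedure, which reconciles the metastore with what is actually on S3; a minimal sketch, assuming a Hive catalog named `hive` with a `web.page_views` table (names are made up):

```sql
-- Reconcile metastore partitions with the files actually on storage.
-- The mode argument can be ADD (register new partitions),
-- DROP (remove vanished ones) or FULL (both).
CALL hive.system.sync_partition_metadata('web', 'page_views', 'FULL');
```

It works at the partition level rather than the file level, but in practice it covers both the updated and the removed cases I described above.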
