We are using Dremio with S3-compatible storage as the data source. Our datasets on the data lake are updated (new files are appended or overwritten) every 3 hours. To make the new files visible in the Dremio PDS, we refresh the metadata with the query below after every write operation to the source.
ALTER PDS <PHYSICAL-DATASET-PATH> REFRESH METADATA
But it takes around 15 minutes to refresh the metadata. The dataset is ~100 GB.
I am not sure why it takes more than 15 minutes just to refresh the metadata. Is this expected, or is there an alternative to refreshing the metadata?
Note: I also tried forgetting the metadata and recreating the PDS through the API. Recreating the PDS took about the same time (18 minutes).
Hi, I’m very interested in this too. My team has always used attached disk storage (EBS on AWS) as the data source, but our goal is to move to S3 as the data grows. We now have almost 400 GB of partitioned Parquet data and yes, the problem you noticed is one of the biggest we are facing, and it is preventing us from moving to S3 with Dremio. If the data source is the disk, the metadata refresh takes just a few seconds, but on S3 it takes 15 to 20 minutes (same data replicated on S3). Our case is even worse, since the data is updated asynchronously by other processes and can change very quickly. We also want to read the new data within a few minutes, and this is not possible given the metadata refresh delay (and during a metadata refresh you should not write/read data…).
Things become weird when you modify or delete a file that already exists… By default, Dremio caches the data from S3 locally, so your queries keep returning old data even if the data has been updated or removed, until you run a metadata refresh. You can of course disable the caching, but this only solves the updated-data case. If you delete a file from the collection, you must refresh the metadata as soon as possible to be sure that Dremio does not fail with a “NoSuchKeyException” when attempting to read the source object.
I think the functionality Dremio is really missing is a procedure to tell the metastore which individual files in the collection have been updated or removed (please, someone tell me if such a procedure already exists). I saw this available in other engines like PrestoSQL, which we also found more resilient and consistent under data updates, without the problems listed above (we are reconsidering it right now…). Funny thing is… yesterday I read a Dremio-Presto comparison (edited by Dremio) and Dremio does indeed seem faster in many ways, but I wonder whether all the caching and the strict metadata handling play a part in this performance difference.
What is the S3 compatible source? Is the long metadata refresh only observed for the S3 compatible source?
What version of Dremio are you using?
Same question for you: what version of Dremio are you experiencing the issue with?
@ben This is a problem we have had since the first versions (we adopted Dremio a couple of years ago). Now I’m testing with the latest community release, 4.6.1. The Parquet files are written with pyarrow (version >=0.14) and are snappy compressed. For the same data, the metadata refresh takes 15 seconds on disk and about 30 minutes on S3. This is the reason why we are still using EBS as storage, but we must move to S3 soon. Disk storage has another problem anyway: if you modify a Parquet file inside the PDS, you must refresh the metadata before querying, otherwise Dremio may fail the query with a “magic number” error. The same does not happen on S3 (as stated in my previous post).
The S3 source format is Parquet, partitioned by a column. We are using Enterprise Edition 4.3.1.
Is there a way for a Dremio PDS to automatically read all the files under a folder, something like <folder_name>/*, so that whenever a new file is added we don’t have to do a refresh?
What we observed is that when files are overwritten with the same file names, we don’t need to refresh the metadata. A refresh is needed only when new file names appear.
I understand that a metadata refresh also collects some stats, such as counts, which takes more time.
Is there a service that just recognizes the newly added files?
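The observation in this post (same-name overwrites did not need a refresh, new file names did) together with the earlier point about deleted files can be sketched on the pipeline side as a listing comparison. This is only an illustration under stated assumptions, not a Dremio feature: the function names and object keys are hypothetical, and the rule "refresh only when keys are added or removed" reflects what posters reported in this thread rather than documented Dremio behavior.

```python
from typing import Iterable, NamedTuple, Set


class ListingDiff(NamedTuple):
    """Object keys that appeared or disappeared between two listings."""
    added: Set[str]
    removed: Set[str]


def diff_listings(previous: Iterable[str], current: Iterable[str]) -> ListingDiff:
    """Compare two listings of object keys (hypothetical helper).

    A key present in both listings counts as unchanged here, even if its
    content was overwritten, because only the names are compared.
    """
    prev, curr = set(previous), set(current)
    return ListingDiff(added=curr - prev, removed=prev - curr)


def needs_metadata_refresh(diff: ListingDiff) -> bool:
    """Assumption from this thread: refresh when file names were added or removed."""
    return bool(diff.added or diff.removed)
```

For example, comparing a listing taken before a write with one taken after it tells the pipeline whether the expensive refresh can be skipped for that run. Note that the earlier post about local caching suggests overwritten files can still return stale data, so skipping the refresh may not be safe in every setup.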
S3-compatible source: IBM Cloud Object Storage
Dremio version: 4.3.1 (Enterprise Edition)
We don’t have other sources, so I didn’t test with other sources.
The time taken for a reflection refresh also increases if it is done after executing
ALTER PDS <PHYSICAL-DATASET-PATH> REFRESH METADATA
Is there any other way to update the metadata without increasing the reflection refresh time?
Metadata refresh and reflection refresh are two different things. For Dremio to know about new data on the lake, we need to refresh the metadata. For the reflection to know about the new data, we need to refresh the reflection. We are working on optimizing metadata refresh speeds, and the improvements will be available later this year.
Has there been any progress on this? Any ETA? We are using S3 storage, with asynchronously mutating data, and AWS Glue as a metadata catalog. Dremio’s metadata management performance and behavior is a blocker to Dremio adoption. If Dremio were able to auto-demote (forget about) removed files/S3 objects, that, in conjunction with auto-promotion of new files/S3 objects, might solve our issues.
If files in a particular folder are removed, can you add an “ALTER PDS <PHYSICAL-DATASET-PATH> REFRESH METADATA” step to the pipeline? Then you should be fine.
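A pipeline step like the one suggested here could be sketched as follows. This is a minimal sketch, not an official client: it assumes the Dremio REST endpoints `/api/v3/sql` and `/api/v3/job/{id}` and the `_dremio<token>` Authorization header as I understand them for Dremio releases of this period, so check them against the documentation for your version. The base URL, token, and dataset path are hypothetical placeholders.

```python
import json
import time
import urllib.request

# Job states assumed to be terminal; verify against your Dremio version.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def refresh_statement(dataset_path: str) -> str:
    """Build the refresh statement quoted in this thread for one PDS."""
    return f"ALTER PDS {dataset_path} REFRESH METADATA"


def build_sql_request(base_url: str, token: str, sql: str) -> urllib.request.Request:
    """Prepare a POST to the (assumed) SQL endpoint without sending it."""
    return urllib.request.Request(
        f"{base_url}/api/v3/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"_dremio{token}",  # assumed header format
        },
        method="POST",
    )


def submit_and_wait(base_url: str, token: str, sql: str,
                    poll_seconds: float = 10.0,
                    timeout_seconds: float = 3600.0) -> str:
    """Submit the statement, then poll the job until it reaches a terminal state."""
    with urllib.request.urlopen(build_sql_request(base_url, token, sql)) as resp:
        job_id = json.load(resp)["id"]
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status_req = urllib.request.Request(
            f"{base_url}/api/v3/job/{job_id}",
            headers={"Authorization": f"_dremio{token}"},
        )
        with urllib.request.urlopen(status_req) as resp:
            state = json.load(resp)["jobState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_seconds} s")


# Hypothetical usage at the end of a write job:
#   sql = refresh_statement("my_s3_source.my_bucket.events")
#   state = submit_and_wait("http://dremio.example:9047", "<token>", sql)
```

Polling with a generous timeout matters here because, as reported earlier in the thread, the refresh itself can run for 15 minutes or more on S3. The sketch only automates the step; it does not make the refresh any faster.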
We could, but that takes way too long for Glue/S3 datasets (the original pain point in this thread). Add/delete is really trying to work around update (much like how Postgres implements updates).
Dremio’s performance is really impressive, but not being able to mutate & consistently query the underlying data is a blocker for us.
Postgres is a transactional OLTP database. Dremio is a SQL engine on data lakes. I agree the modern trend is real-time querying on data lakes. That is the reason we are working on an Apache Iceberg integration, so we can achieve the same.
Sounds exciting. Does the integration pertain to leveraging Iceberg as a data source, or to using its tech as a foundational technology for improving Dremio’s engine so that it can manage metadata in near real time?
That is correct, and it is also meant to get around the overhead of metadata collection time and storage.
As we adopt Dremio, we are also interested in fast metadata refreshes. Is there a ticket or an issue I could follow to stay up to date on the progress?
Thank you, Igor
It’s tracked via an internal story. We will be bringing support in phases and will share release notes.