I’ve read other threads where users have complained of extremely slow S3 metadata refresh rates. Has this issue been resolved? If not, it seems to also affect AWS Glue.
I have also attempted to prevent metadata refresh by running:
ALTER PDS REFRESH METADATA AVOID PROMOTION
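(The actual statement includes the full dataset path; roughly the following, where the source, folder, and table names are just placeholders:)

ALTER PDS "my-s3-source"."my_folder"."my_table" REFRESH METADATA AVOID PROMOTION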
I’m not sure what you mean by the “glue source metadata tab”. If you mean the tab showing the datasets (tables), then I cannot share it since it contains proprietary data.
I’ve attached two screenshots of the job’s profile.
Judging by the Metadata Retrieval time, it appears the query performed a full metadata refresh on the table being queried (when I run an ALTER PDS REFRESH METADATA manually, it takes about the same amount of time).
In general, our use case is a data lake with about 20 partitioned datasets (tables), each containing hundreds to thousands of files (S3 objects). These objects are replaced asynchronously by frequent batch jobs, and, less frequently, new objects are created.
Hence, Dremio needs to be able to query data that is being refreshed & created asynchronously.
For our purposes, it is not acceptable to have a query hang for minutes awaiting a metadata refresh.
It is not clear from the Dremio documentation how to implement these types of use cases. Specifically, it is not clear how the system behaves when datasets mutate in between periodic metadata refresh cycles:
Can new/updated data be queried accurately, or at all?
Will some queries perform a metadata refresh?
What triggers a metadata refresh by a query? Can it be prevented, and still produce valid results?
As other users have commented in various discussion topics, it seems that a mechanism for on-demand metadata updates is needed. That is, an API that would allow a client to “index” specific S3 objects.
Does the Dremio roadmap include such a capability (client-initiated, incremental metadata updates)? If so, when do you expect it will be released?
Just to clarify: most queries have very good latencies. It’s only the occasional query that hangs on Metadata Retrieval.
Performing an explicit metadata refresh on an entire dataset does not seem like a scalable approach, since datasets can grow for months or even years (say, a table partitioned by date, with daily partitions).
On-demand metadata indexing, either implicitly by a query (promotion) or explicitly via a client-initiated request, seems much more viable.
In any case, my main concern right now is that query response times are unpredictable because some query/job runs trigger very expensive metadata retrieval.
Btw, we have another Glue Catalog for which the fetch mode is set to Only Queried Datasets, and we are experiencing the exact same issue.
Is it the case that a Metadata Refresh is a global, stop-the-world event? That is, if a query, or the background refresh mechanism, is performing Metadata Retrieval, does that block concurrent queries on the same dataset? Or would concurrent queries trigger separate refresh events?
Metadata collection is not a “stop-the-world” event. I can think of three reasons why you are seeing an inline refresh:
The first reason could be that the background metadata refresh takes more than 3 hours. To validate, add a logger like the one sketched below to logback.xml under the loggers section and restart the coordinator. This should print every dataset name and the time, in milliseconds, it took to collect its metadata. It should also show how much time the source as a whole took; look for “sync ended” for each source. The logger will write to metadata_refresh.log under the Dremio log folder.
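The snippet below is only an illustration of that logger entry; the appender name, file locations, and logger package (assumed here to be com.dremio.exec.catalog, where the catalog sync messages are emitted) may need adjusting for your Dremio version:

<!-- appender that writes metadata refresh timings to metadata_refresh.log -->
<appender name="metadata_refresh" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>${dremio.log.path}/metadata_refresh.log</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
    <fileNamePattern>${dremio.log.path}/archive/metadata_refresh.%d{yyyy-MM-dd}.log.gz</fileNamePattern>
    <maxHistory>30</maxHistory>
  </rollingPolicy>
  <encoder>
    <pattern>%date{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
  </encoder>
</appender>

<!-- under the loggers section: debug-level logging for the catalog metadata sync (package name is an assumption) -->
<logger name="com.dremio.exec.catalog" additivity="false">
  <level value="debug"/>
  <appender-ref ref="metadata_refresh"/>
</logger>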
The second reason could be that the command “ALTER PDS FORGET METADATA” was run on that dataset (highly unlikely).