Metadata Retrieval at query time (AWS Glue)

Have set up Dremio, latest community edition, on an EKS cluster with 3 executor pods on r5d.4xlarge nodes.

Successfully onboarded an AWS Catalog.

However, each and every query takes > 10 mins due to Metadata Retrieval, compounded by failed attempts.

Here’s an excerpt from a query’s 3rd attempt:


Job Summary

State: COMPLETED
Coordinator: dremio-master-0.dremio-cluster-pod.default.svc.cluster.local
Threads: 34
Command Pool Wait: 0ms
Total Query Time: 390,819ms

State Durations

Pending: 0ms
Metadata Retrieval: 389,873ms
Planning: 108ms
Engine Start:
Queued: 14ms
Execution Planning: 60ms
Starting: 10ms
Running: 754ms
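To put the numbers from the job summary above in proportion, here is a small Python sketch that computes how much of the total query time went to each state. The durations are copied from the summary; the variable and function names are mine and purely illustrative.

```python
# Durations in milliseconds, copied from the job summary above.
# "Engine Start" is omitted because the summary shows no value for it.
total_query_ms = 390_819
state_durations_ms = {
    "Pending": 0,
    "Metadata Retrieval": 389_873,
    "Planning": 108,
    "Queued": 14,
    "Execution Planning": 60,
    "Starting": 10,
    "Running": 754,
}


def share_of_total(state: str) -> float:
    """Return the fraction of total query time spent in the given state."""
    return state_durations_ms[state] / total_query_ms


metadata_share = share_of_total("Metadata Retrieval")
print(f"Metadata Retrieval: {metadata_share:.2%} of total query time")
print(f"Running: {share_of_total('Running'):.2%} of total query time")
```

The listed states add up exactly to the reported total, and Metadata Retrieval accounts for roughly 99.76% of it, while actual execution (Running) is well under one second.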


I've read other threads where users have been complaining of extremely slow S3 metadata refresh rates. Has this issue been resolved? If not, it seems to also affect AWS Glue.

I have also attempted to prevent metadata refresh by running:
ALTER PDS REFRESH METADATA AVOID PROMOTION

Which did not seem to help.

Please advise.

kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
dremio-executor-0   1/1     Running   0          11h
dremio-executor-1   1/1     Running   0          11h
dremio-executor-2   1/1     Running   0          11h
dremio-master-0     1/1     Running   0          11h
zk-0                1/1     Running   0          11h
zk-1                1/1     Running   0          11h
zk-2                1/1     Running   0          11h

@shragj

Please send us a screenshot of your Glue source's metadata tab, and also the job profile of the slow job that does a metadata refresh during query runtime.

Thanks
Bali

I’m not sure what you mean by the "glue source metadata tab". If you mean the tab showing the datasets (tables), then I cannot, since it contains proprietary data.

I’ve attached two screen shots of the job’s profile.

Judging by the Metadata Retrieval time, it appears the query did a full metadata refresh on the table being queried (when I run ALTER PDS to REFRESH METADATA it takes about the same time).

In general, our use-case is that of a Data Lake with about 20 partitioned datasets (tables), each containing 100s to 1000s of files (S3 objects). These objects are replaced asynchronously by frequent batch jobs. Less frequently, new objects are also created.

Hence, Dremio needs to be able to query data that is being refreshed & created asynchronously.

For our purposes, it is not acceptable to have a query hang for minutes awaiting metadata refresh.

It is not clear from the Dremio documentation how to implement these types of use-cases. Specifically, it is not clear how the system behaves when data sets mutate in between periodic metadata refresh cycles:

  1. Can new/updated data be queried accurately, or at all?
  2. Will some queries perform a metadata refresh?
  3. What triggers a metadata refresh by a query? Can it be prevented, and still produce valid results?

As other users have commented on various discussion topics, it seems that a mechanism for on-demand metadata update is needed. That is, an API that will allow a client to “index” specific S3 objects.

  1. Does the Dremio roadmap include such a capability (client-initiated, incremental metadata updates)? If so, when do you expect it will be released?

Cheers.

Just to clarify. Most queries have very good latencies. It’s now the occasional query that hangs on Metadata Retrieval.

Performing an explicit metadata refresh on an entire dataset does not seem to be a scalable approach since datasets can grow for months, or even years: say a table partitioned by date, i.e., daily partitions.

On-demand metadata indexing, either implicitly by a query (promotion) or explicitly using a client-initiated request, seems to be much more viable.
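As a stopgap until something finer-grained exists, a batch job could at least limit explicit refreshes to the tables it has just rewritten. The Python sketch below only builds the SQL text; it does not talk to Dremio. The source, database and table names are hypothetical, and placing the quoted dataset path right after ALTER PDS is my reading of the Dremio SQL reference, so please verify it against your version. Note that this still refreshes a whole table, not individual S3 objects.

```python
from typing import List


def quote_path(parts: List[str]) -> str:
    """Join path components into a double-quoted, dot-separated dataset path."""
    for part in parts:
        if '"' in part:
            # Escaping rules are not covered here, so refuse rather than guess.
            raise ValueError(f"unsupported character in path component: {part!r}")
    return ".".join(f'"{part}"' for part in parts)


def refresh_statement(dataset_path: List[str], avoid_promotion: bool = False) -> str:
    """Build an ALTER PDS ... REFRESH METADATA statement for one dataset."""
    statement = f"ALTER PDS {quote_path(dataset_path)} REFRESH METADATA"
    if avoid_promotion:
        statement += " AVOID PROMOTION"
    return statement


# Hypothetical list of tables that the latest batch job rewrote.
updated_tables = [
    ["glue_source", "analytics_db", "events"],
    ["glue_source", "analytics_db", "daily_totals"],
]

statements = [refresh_statement(path) for path in updated_tables]
for statement in statements:
    print(statement)
```

Each generated statement would then be submitted through whatever SQL client the batch job already uses, right after it finishes writing.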

In any case, my main concern right now is that query response times are unpredictable because some query/job runs trigger very expensive metadata retrieval.

@shragj

Please send us a screenshot of the tab below when you edit the AWS Glue source in the Dremio UI

Initially it was “Only Queried Datasets”, then I changed it in an attempt to have metadata available for all queries.

Btw, we have another Glue Catalog for which the fetch mode is set to Only Queried Datasets, and we are experiencing the exact same issue.

Is it the case that a Metadata Refresh is a global, stop-the-world event? That is, if a query, or the background refresh mechanism, is performing Metadata Retrieval does that block concurrent queries on the same dataset? Or would concurrent queries trigger separate refresh events?

@shragj

Metadata collection is not a “stop-the-world” event. I can think of three reasons why you are seeing an inline refresh:

  • The background metadata refresh takes more than 3 hours. To validate, add the attached logger to logback.xml under the loggers section and restart the coordinator. This should print every dataset name and the time it took, in milliseconds, to collect its metadata. It should also tell you how much time the whole source took; look for “sync ended” for each source. The logger will write to metadata_refresh.log under the Dremio log folder

  • The second reason could be that the command “ALTER PDS FORGET METADATA” is run on that dataset (highly unlikely)

http://docs.dremio.com/sql-reference/sql-commands/datasets.html#forgetting-physical-dataset-metadata

  • Finally the tables are getting dropped and recreated

Note: When you run the query a second time, does the metadata part go away?
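Once the per-dataset timings are in metadata_refresh.log, the first reason can be checked with simple arithmetic. Here is a rough Python sketch, assuming you have already extracted dataset names and durations in milliseconds from the log by hand or with your own script. The sample values are made up, and adding the durations up assumes the datasets are refreshed one after another, which may not hold.

```python
from typing import Dict, Tuple

# The 3-hour figure comes from the first reason listed above.
REFRESH_WINDOW_MS = 3 * 60 * 60 * 1000


def exceeds_window(durations_ms: Dict[str, int],
                   window_ms: int = REFRESH_WINDOW_MS) -> Tuple[int, bool]:
    """Return the summed refresh time and whether it is longer than the window."""
    total_ms = sum(durations_ms.values())
    return total_ms, total_ms > window_ms


# Made-up durations; replace with the values printed in metadata_refresh.log.
sample_durations_ms = {
    "analytics_db.events": 390_000,
    "analytics_db.daily_totals": 5_400_000,
    "analytics_db.sessions": 6_000_000,
}

total_ms, too_slow = exceeds_window(sample_durations_ms)
slowest = max(sample_durations_ms, key=sample_durations_ms.get)
print(f"Total refresh time: {total_ms / 60_000:.1f} minutes")
print(f"Longer than 3 hours: {too_slow}")
print(f"Slowest dataset: {slowest}")
```

If the total comes out above the window, the slowest datasets are the first place to look.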

metadata-logger.zip (383 Bytes)