Dataset refresh is very slow

Hello guys,

I have a serious problem with dataset refresh in Dremio.

My version is

With configuration

We have been using Dremio since the beginning of this year, but in the last few weeks we have been facing multiple coordinator failures. We identified that the refresh routines are taking longer than usual, but we have not found the cause of this problem.

I need help, because Dremio is very important for our daily work.
Examples of the problem:

The coordinator is failing, with CPU usage above 90%.

Maybe we are facing the same issue. Some refresh jobs never complete, and the only thing we can do is restart the engine.
Support said they have fixed this issue in version 21.1.6.

Hi guys, the exact same problem happened to us.

We are using Dremio via AWS Marketplace, and this started last Monday (2022-09-19) out of nowhere; we did not change any configuration.

We were also using Dremio 21.0. We found the release note saying this was fixed, so we updated to 21.5, but the problem was still there. After one week of investigating and trying to find the cause, we went back to Dremio 15.0.0, which is stable and working fine.

Our production environment relies on Dremio, and it is very frustrating to know that this can happen to it. After one week without data, the business is now questioning whether Dremio is a good choice for us.

We are using and paying for Dremio Enterprise Edition via AWS Marketplace, which seems unstable, and we do not have any kind of official support when this happens. We will now evaluate other options.

@tiibra Sorry you had to go through this rough experience, please give me one more chance to help you. Let us first make sure you are hitting the same issue. Do you have a profile that you can share?

Here is an example. (16.2 KB)

The query execution remains in the planning state, waiting for a metadata retrieval that never ends. Here are the last entries found in server.log for that metadata retrieval job:

2022-10-06 17:49:35,200 [1cc0ead2-ec07-b5ac-3a4f-8fc6ab0dfd00:foreman] INFO - 1cc0ead2-ec07-b5ac-3a4f-8fc6ab0dfd00/1: Starting new attempt because of INVALID_DATASET_METADATA
2022-10-06 17:49:35,664 [1cc0ead2-ec07-b5ac-3a4f-8fc6ab0dfd00/1:foreman-planning] INFO - New job submitted. Job Id: JobId{id=1cc0eacf-c6ee-f345-e9ab-378216840600, name=null, sessionId=null} - Type: METADATA_REFRESH - Query: REFRESH DATASET "datalake_pottencial"."datalake_insurance_refined_prd"."policy_emissions_coverages"
2022-10-06 17:49:35,668 [1cc0eacf-c6ee-f345-e9ab-378216840600/0:foreman-planning] INFO  c.d.e.p.s.h.RefreshDatasetHandler - Initialised com.dremio.exec.planner.sql.handlers.RefreshDatasetHandler
2022-10-06 17:49:35,863 [1cc0eacf-c6ee-f345-e9ab-378216840600/0:foreman-planning] INFO  c.d.e.p.s.h.r.UnlimitedSplitsMetadataProvider - Table metadata found for datalake_pottencial.datalake_insurance_refined_prd.policy_emissions_coverages, at s3://dremio-me-2eaee806-07bb-4ada-a4d8-24c3cc0a7985-7526715a16fac352/dremio/metadata/e52abffa-8804-4ff5-bacf-a1c027beb515/metadata/00005-feb15738-d41a-40af-87dd-02ae1e4966df.metadata.json
2022-10-06 17:49:37,686 [1cc0eacf-c6ee-f345-e9ab-378216840600/0:foreman-planning] INFO  c.d.e.p.s.h.r.AbstractRefreshPlanBuilder - Writing metadata for datalake_pottencial.datalake_insurance_refined_prd.policy_emissions_coverages at /dremio-me-2eaee806-07bb-4ada-a4d8-24c3cc0a7985-7526715a16fac352/dremio/metadata/e52abffa-8804-4ff5-bacf-a1c027beb515
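For anyone who wants to check their own server.log for the same pattern, here is a minimal Python sketch that lists every METADATA_REFRESH job submission. The function name `find_metadata_refresh_jobs` is hypothetical, and the regular expression is based only on the "New job submitted" line format in the excerpt above, so other Dremio versions may log this differently.

```python
import re

# Assumption: the line layout matches the "New job submitted" entry in the
# excerpt above. This is not an official Dremio log format specification.
SUBMIT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"New job submitted\. Job Id: JobId\{id=(?P<job_id>[0-9a-f-]+),.*?\}"
    r" - Type: METADATA_REFRESH - Query: (?P<query>.+)$"
)


def find_metadata_refresh_jobs(lines):
    """Return (timestamp, job_id, query) for each METADATA_REFRESH submission."""
    jobs = []
    for line in lines:
        match = SUBMIT_RE.match(line.rstrip("\n"))
        if match:
            jobs.append(
                (match.group("ts"), match.group("job_id"), match.group("query"))
            )
    return jobs
```

You could call it with an open file, for example `find_metadata_refresh_jobs(open("server.log"))`, and then look up each returned job ID in the Jobs page to see which refreshes never finished.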

@tiibra It looks like you are on Dremio 21.5, and we backported a fix to 21.6.

In your profile, you can see an empty ICEBERG_COMMIT_TIME:

This is the symptom of the Avro reader hang.