SOURCE_BAD_STATE ERROR: The source ["__metadata"] is currently unavailable

Hi guys, on our AWS Enterprise Edition I'm getting this kind of error in server.log when trying to refresh metadata on an S3 source. It is continuously spamming server.log, and in the UI I see the error "Unable to refresh dataset." when running ALTER TABLE "cbs-etl-parquet-dremio" REFRESH METADATA FORCE UPDATE.

Dremio version is 21.0.0-202204051402590848-b4cdafb2

2022-04-29 10:20:01,117 [Fabric-RPC-Offload19] INFO  c.d.s.s.LocalSchedulerService - Cancelling task com.dremio.resource.wlm.scheduler.DremioWLMAllocator$QueryRunTimeWatcher@694b2835
2022-04-29 10:20:01,193 [Fabric-RPC-Offload23] WARN - Dropping request to move to COMPLETED state as query 1d94442f-7c88-1a38-0369-b8745cf53600 is already at FAILED state (which is terminal).
2022-04-29 10:20:01,194 [metadata-refresh-modifiable-scheduler-23:JobId{id=1d94442f-7c88-1a38-0369-b8745cf53600, name=null, sessionId=null}] INFO - Submitted job (JobID JobId{id=1d94442f-7c88-1a38-0369-b8745cf53600, name=null, sessionId=null}) has failed
2022-04-29 10:20:01,194 [out-of-band-observer] INFO  query.logger - Query: 1d94442f-7c88-1a38-0369-b8745cf53600; outcome: FAILED
2022-04-29 10:20:01,207 [metadata-refresh-modifiable-scheduler-23:JobId{id=1d94442f-7c88-1a38-0369-b8745cf53600, name=null, sessionId=null}] INFO - Submitted job (JobID JobId{id=1d94442f-7c88-1a38-0369-b8745cf53600, name=null, sessionId=null}) has failed
2022-04-29 10:20:01,750 [metadata-refresh-modifiable-scheduler-23] INFO - The SQL query REFRESH DATASET "dremio-campaigns"."51189"."1573"."12-03-2022"."vins_for_targetset"."1648752545" will be submitted on the same thread
2022-04-29 10:20:01,753 [1d94442d-a834-1713-2b11-a4b5c0005200/0:foreman-planning] INFO  c.d.e.p.s.h.RefreshDatasetHandler - Initialised com.dremio.exec.planner.sql.handlers.RefreshDatasetHandler
2022-04-29 10:20:01,753 [1d94442d-a834-1713-2b11-a4b5c0005200/0:foreman-planning] INFO  c.d.e.p.s.h.r.UnlimitedSplitsMetadataProvider - Table metadata found for "dremio-campaigns"."51189"."1573"."12-03-2022".vins_for_targetset."1648752545", at s3://dremio-me-704b4030-c493-438a-b4c5-ffa816f05ae1-adef02cd770ee811/dremio/metadata/1a1d3f25-34e2-4392-9d48-9db2eef6be98/metadata/00000-1a45e952-fea3-49d7-8e78-3a9ff3751afc.metadata.json
2022-04-29 10:20:02,084 [1d94442d-a834-1713-2b11-a4b5c0005200/0:foreman-planning] INFO  c.d.e.p.s.h.r.AbstractRefreshPlanBuilder - Writing metadata for "dremio-campaigns"."51189"."1573"."12-03-2022".vins_for_targetset."1648752545" at /dremio-me-704b4030-c493-438a-b4c5-ffa816f05ae1-adef02cd770ee811/dremio/metadata/1a1d3f25-34e2-4392-9d48-9db2eef6be98
2022-04-29 10:20:02,158 [metadata-refresh-modifiable-scheduler-23] INFO - New job submitted. Job Id: JobId{id=1d94442d-a834-1713-2b11-a4b5c0005200, name=null, sessionId=null} - Type: METADATA_REFRESH - Query: REFRESH DATASET "dremio-campaigns"."51189"."1573"."12-03-2022"."vins_for_targetset"."1648752545"
2022-04-29 10:20:02,160 [Fabric-RPC-Offload23] INFO  c.d.exec.maestro.FragmentTracker - Fragment 1d94442d-a834-1713-2b11-a4b5c0005200:0:0 failed, cancelling remaining fragments.
2022-04-29 10:20:02,161 [Fabric-RPC-Offload22] INFO  c.d.exec.maestro.FragmentTracker - Fragment 1d94442d-a834-1713-2b11-a4b5c0005200:1:22 failed, cancelling remaining fragments.
2022-04-29 10:20:02,162 [Fabric-RPC-Offload20] INFO - 1d94442d-a834-1713-2b11-a4b5c0005200: State change requested RUNNING --> FAILED, Exception com.dremio.common.exceptions.UserRemoteException: SOURCE_BAD_STATE ERROR: The source ["__metadata"] is currently unavailable. Info: []

Can you help me to fix this issue?


@VahagnBleyan Are you seeing a lot of "SOURCE_BAD_STATE ERROR: The source ["__metadata"] is currently unavailable" errors? If so, the storage location that Dremio reads metadata from and writes metadata to is unavailable. Where is metadata configured to be written? Are we able to write files to the same location using another method?
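One way to verify the metadata location independently is a quick write/delete round trip with the AWS CLI. This is only a sketch: the bucket name below is copied from the log lines in the original post, and the marker key under dremio/metadata/ is a made-up object name for the test, not a path Dremio itself uses.

```shell
# Smoke-test write access to the Dremio metadata bucket.
# BUCKET is copied from the log output above; KEY is a hypothetical marker object.
BUCKET="dremio-me-704b4030-c493-438a-b4c5-ffa816f05ae1-adef02cd770ee811"
KEY="dremio/metadata/_connectivity_check_$(date +%s)"

# Helper that builds the s3:// URI for a bucket/key pair.
s3_uri() { echo "s3://$1/$2"; }

# Only attempt the round trip if the AWS CLI is installed on this node.
if command -v aws >/dev/null 2>&1; then
  echo ok > /tmp/_dremio_marker.txt
  aws s3 cp /tmp/_dremio_marker.txt "$(s3_uri "$BUCKET" "$KEY")" \
    && echo "write OK" \
    || echo "write FAILED (check IAM policy and the network path to S3)"
  aws s3 rm "$(s3_uri "$BUCKET" "$KEY")" >/dev/null 2>&1
else
  echo "aws CLI not installed; skipping"
fi
```

Running this from an executor node (not just the coordinator) is the more telling check, since the executors are the ones writing metadata splits.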

Yes, I see a lot of messages like that. The storage is configured on an AWS S3 bucket, and it is accessible from the Dremio instance.
! While writing this I realized that I had disabled public IPs on the engine nodes, and those errors were most likely caused by that. Enabling public IPs solved the problem; thanks for your reply.
Is there any way to start elastic engine instances in a different subnet than the coordinator?
Also, it would be great to have clearer error descriptions that make it obvious this is a connectivity problem, e.g. a timeout or host-unreachable error.


@VahagnBleyan A different subnet should be fine as long as the nodes can communicate on the inter-node communication port; latency might be an issue.
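For reference, that port is set in dremio.conf on each node. A minimal sketch, assuming the default fabric port of 45678 (adjust to your deployment, and make sure the executors' security groups allow this port from the coordinator's subnet and vice versa):

```hocon
# dremio.conf (per node) -- inter-node (fabric) RPC settings.
# 45678 is Dremio's default fabric port; shown here as an illustration, not a required change.
services: {
  fabric: {
    port: 45678
  }
}
```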

And how can I configure that?

@VahagnBleyan Are you able to ping and telnet back and forth between the coordinator and the executor?
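To check that in a scriptable way, a plain TCP probe to the fabric port works even on hosts where telnet isn't installed. A minimal bash sketch; "executor-1.internal" is a placeholder hostname, and 45678 is assumed to be the default Dremio fabric port:

```shell
# Probe a TCP port using bash's /dev/tcp redirection; prints OK or FAIL.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "OK: $host:$port reachable"
  else
    echo "FAIL: $host:$port unreachable"
  fi
}

# Run from the coordinator against each executor, and from each executor
# back to the coordinator. The hostname below is a placeholder.
check_port executor-1.internal 45678
```

If the probe fails in one direction only, that usually points at an asymmetric security-group or route-table rule between the two subnets.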