Job freezes on metadata retrieval or planning phase

Several times I have run into a problem where a job in Dremio hangs in the metadata retrieval or planning status, and new requests submitted during that time are not added to the jobs list.

Restarting the corresponding pod helps to solve the problem.

I can’t find a description of these query execution phases, so my question is: does Dremio send any queries to the database at these stages? Or do these phases only involve the metadata store / memory, without depending on the database in any way?

We are trying to find out whether these problems are caused by locking at the DB level or by problems with our network drive / metadata store (yes, we do have network problems at this level in our infrastructure).

We hit this problem from time to time and, unfortunately, we will not be able to retrieve the query plan. When hovering over the metadata retrieval status, the job profile was empty, if I remember correctly.

@Arol Does this usually happen on new datasets?

No, I do not think so.

I tried to reproduce the problem locally, but even with a lock set on the table at the database level, the metadata retrieval and planning phases complete without problems; because of the lock, the job stops at the running status (once the lock is removed, the job runs).

Therefore, I am more inclined to believe that the problem is related to our metadata store. Nevertheless, the question still stands: before the running phase of query execution, does Dremio send queries to the database (in the planning or metadata retrieval phase) or not? If yes, maybe we just need to add an execution time limit for that request in the code; if not, then the problem is not caused by locks at the database level.
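The idea of limiting how long a pre-execution database call may block can be sketched roughly as follows. This is a minimal Python illustration, not Dremio code: `call_with_timeout` and `slow_metadata_lookup` are hypothetical names, and in practice a timeout setting of the driver or the database itself would probably be the better place for such a limit.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a worker thread and wait at most timeout_s seconds.

    Returns (True, result) on success or (False, None) on timeout.
    The worker thread is not killed on timeout; the caller only stops waiting.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block on a worker that may still be running.
        pool.shutdown(wait=False)


def slow_metadata_lookup(delay_s):
    """Hypothetical stand-in for a database call that is held up by a lock."""
    time.sleep(delay_s)
    return "schema"


if __name__ == "__main__":
    print(call_with_timeout(slow_metadata_lookup, 1.0, 0.1))  # finishes in time
    print(call_with_timeout(slow_metadata_lookup, 0.2, 1.0))  # gives up waiting
```

Note that this only bounds how long the caller waits; the blocked call itself keeps running until the database releases it.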

@Arol
Yes, both planning and metadata retrieval can query the database, but the metadata retrieval should have happened ahead of query runtime. Do you have a completed job profile?

Sorry, I didn’t get it. During metadata retrieval and planning, can Dremio send a query to the database or not?
I cannot tell whether a hang at these stages can be caused by problems in the database (locks, structure violations, etc.), or whether these stages do not access the database at all, so that it cannot affect the state of the job.

I’m trying to reconstruct the history; it’s a little difficult, and we won’t be able to dig up the logs anymore.

  1. There was a hang in metadata retrieval: the query plan was not available, new requests did not appear in the jobs list, and the interface kept working.

    [Screenshot: Выделение_077]

At that moment, we had problems in the database with the table that was being accessed (a lot of partitions, something broke); a direct SQL query through other tools also hung for a long time.

We could not find an example job that hung in the planning phase.

But in general, we observed the following behavior: one job hangs in the planning or metadata retrieval status; the interface works, but new jobs (even SELECT 1) do not appear in the jobs list.

We ask the analysts to check the state of the database, the analysts stop their data loads (which can hold a lock), and everything works again. Or we restart Dremio and everything works.

I have two options in my head:

  1. These phases involve the metadata store, and we fail at that level (independent of Dremio).
  2. These phases involve access to the database and cannot be executed in parallel, but it is not clear why new jobs do not appear in the jobs list.

@Arol
"With metadata retrieval and planning, dremio can send a query to the database, or not?" Yes.

It looks like the query is having to do an inline metadata refresh. Are you able to send a screenshot of the metadata tab of this source?

We observed the following behavior, which seems to me to be the same problem:

  1. We sent a request; its job has the running status.
    pending_0af26ae8-455a-4778-a154-d9f96ecea342.zip (123.9 KB)
  2. A SELECT 1 request to Dremio works and appears in jobs.
  3. We sent another request to the same space, and it got stuck in the metadata retrieval status. If this request is sent while no others are running, it works fine.
    retrieval_ddcc409b-af7f-47f2-8b89-d376f05f7592.zip (5.4 KB)
  4. After that, new requests (SELECT 1) do not appear in jobs.
  5. If we cancel the first request, all the submitted requests appear in jobs and the request from step 3 completes normally.

@Arol The issue is that your coordinator has only one core, so the command pool size is 0, which means that if one query is doing metadata retrieval, all other queries have to wait. Can you please try increasing to an 8-core box and repeat the same test?
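The waiting described here is what one would expect from any pool with too few worker slots: one long task occupies the slot and everything submitted later queues behind it. Below is a minimal Python sketch of that effect. It is not Dremio's actual command pool; `run_jobs`, `job`, and the pool sizes are assumptions chosen for illustration, and a pool of size 0 is not modelled directly (a single slot is used instead).

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_jobs(pool_size, durations):
    """Submit one job per duration to a fixed-size thread pool.

    Returns a dict mapping job index to the delay in seconds between
    submission and the moment the job actually started running.
    """
    start_delay = {}
    lock = threading.Lock()

    def job(index, submitted_at, duration):
        with lock:
            start_delay[index] = time.monotonic() - submitted_at
        time.sleep(duration)  # stand-in for metadata retrieval or planning

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for index, duration in enumerate(durations):
            pool.submit(job, index, time.monotonic(), duration)
    # Leaving the with-block waits for all submitted jobs to finish.
    return start_delay


if __name__ == "__main__":
    # Job 0 is slow; job 1 is a trivial "SELECT 1"-style request.
    print("1 slot :", run_jobs(1, [0.5, 0.01]))
    print("8 slots:", run_jobs(8, [0.5, 0.01]))
```

With a single slot, the start delay of the second job is roughly the duration of the first one; with several slots it starts almost immediately, which is the difference the 8-core test is meant to show.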