Exception on 90% of queries

Hello, community.
We’ve been running Dremio for a while now, and since the weekend (no changes or logins during this period) we’ve been having a strange issue. Almost every query results in the following error:

FileNotFoundException: /data/nas/results/20c51ec7-a8e4-77e6-31fd-e09cf37c0c00/1_1_0.dremarrow1 (Remote I/O error)

The filename and query ID change, but the structure remains the same. Repeating the query several times sometimes yields results, but most of the time the job is marked as cancelled in the logs. The master node logs are not informative at all, and the executor nodes, even at debug level, only show the job as cancelled.
Memory and CPU usage during these queries look ordinary, and so do the disk space and the shared NAS storage.

We had a chance to restore the virtual machines from a backup to their state as of Friday, and everything seems to be working fine now, but we would like to know what the root of the problem is and whether there are less drastic measures we could take to prevent or solve it.

Thanks in advance.

I have noticed that stale result sets disappear once they have been replaced by newer result sets from running the same query, even if the data hasn’t changed. In the jobs list I could try to view the results directly, and it would say the file does not exist. My guess is that the executor recognizes the results as identical to those of the previous job and then cancels the query. It should definitely say that it thought the results were the same, and perhaps show a unique hash of each result set to prove that it’s not lying.

Otherwise, perhaps look at a metadata backup exported as JSON from before the problem, then compare it to a metadata backup from after the problem? I’m thinking of including that for logging metadata if an error arises. I’m not sure what the metadata store will have beyond the job profile for the offending query, which is retrievable from the UI, but it would probably store some logic or hash outputs.
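As a rough sketch of that before/after comparison, something like the following could list which key paths differ between two parsed JSON documents. It assumes each export can be loaded into ordinary Python dicts; it does not assume anything about how Dremio actually lays out its JSON backup, and the sample values in the demo are made up for illustration.

```python
import json


def diff_paths(before, after, prefix=""):
    """Return the key paths whose values differ between two parsed JSON values."""
    if isinstance(before, dict) and isinstance(after, dict):
        changed = []
        for key in sorted(set(before) | set(after)):
            path = f"{prefix}/{key}"
            if key not in before or key not in after:
                # Key was added or removed between the two exports.
                changed.append(path)
            else:
                changed.extend(diff_paths(before[key], after[key], path))
        return changed
    # Leaves (and lists) are compared as whole values.
    return [] if before == after else [prefix or "/"]


if __name__ == "__main__":
    # Made-up sample documents, not real Dremio metadata.
    before = json.loads('{"job": {"state": "COMPLETED", "rows": 10}}')
    after = json.loads('{"job": {"state": "CANCELED", "rows": 10}, "extra": 1}')
    print(diff_paths(before, after))
```

Lists are compared as a whole here, so a change anywhere inside a list is reported at the list's own path.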

It happened to me as well. I made a few fixes, described below, which are under observation for now. My guess is that Dremio periodically cleans up old reflections, scratch, uploads, downloads, etc. folders. The periodic cleanup of reflections seems fine, but there appears to be a bug in the cleanup of the other folders such as scratch and downloads: it seems to delete everything in the immediate parent directory of the scratch, downloads, uploads, etc. folder (I don’t know exactly which one it is trying to clean). For example, if you have:

foo/reflections
foo/uploads
foo/download
foo/scratch
foo/results

Then, periodically you will find only:

foo/

everything deleted.

If you have db folder under foo:

foo/db

you will wonder why it did not disappear; db holds all of Dremio’s metadata, which is why you still see all of the objects in the UI.

foo/db doesn’t get dropped because Dremio has a lock on the db folder by way of having open handles to the underlying files.

So, I have set things up so that all of the Dremio folders like reflections, downloads, uploads, scratch, etc. are on separate paths:

foo1/reflections
foo2/uploads
foo3/downloads
foo4/scratch
foo5/results
fooX/whatever

Try that and observe it over your cleanup schedule. See: https://docs.dremio.com/advanced-administration/job-results-cleanup.html
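To make the observation part less manual, a small check like this could be run on a schedule (cron or similar) to report which expected folders have gone missing. It is only a sketch: the folder names mirror the foo example above, and the demo runs against a throwaway temporary directory rather than a real Dremio data path.

```python
import os
import tempfile


def missing_subdirs(parent, expected):
    """Return the expected subdirectory names that are absent under parent."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(parent, name))]


if __name__ == "__main__":
    names = ["reflections", "uploads", "download", "scratch", "results"]
    # Demo only: build the layout in a temporary directory standing in for foo/.
    with tempfile.TemporaryDirectory() as foo:
        for name in names:
            os.mkdir(os.path.join(foo, name))
        os.rmdir(os.path.join(foo, "scratch"))  # simulate one folder vanishing
        print(missing_subdirs(foo, names))
```

Logging the output with a timestamp would show whether the folders disappear at the same time as the scheduled cleanup or at some unrelated moment.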

@desi @datocrats-org @mdouda

A few things:

  • Dremio cleans results every 24 hours
  • Dremio cleans jobs every 30 days
  • Dremio does not delete uploads, downloads, or scratch
  • Accelerator files are dropped 4 hours after reflections are dropped
  • Stored results are only used when we click “open results” in the job details pane or click preview a second time. Hint: reading from results does not create a job

Now coming to the issues, one by one:

@mdouda

The error below suggests that we are unable to write results to the NAS folder. One thing you can try is to run the same query via a JDBC tool like DBeaver. If that works, then the problem is writing to the NAS, as JDBC queries do not write results.

FileNotFoundException: /data/nas/results/20c51ec7-a8e4-77e6-31fd-e09cf37c0c00/1_1_0.dremarrow1 (Remote I/O error)
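As a complementary, lower-level check, a short write probe against the results directory can show whether the NAS accepts writes from a given node at all. This is a minimal sketch rather than anything Dremio ships: `probe_write` is a hypothetical helper name, and it should be run as the same OS user that runs Dremio so that permissions match.

```python
import errno
import os
import uuid


def probe_write(directory):
    """Try to create, read back, and delete a small file in directory.

    Returns (True, "") on success or (False, reason) on failure.
    """
    path = os.path.join(directory, "probe-" + uuid.uuid4().hex + ".tmp")
    payload = b"results-write-probe"
    try:
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the write through to the storage
        with open(path, "rb") as handle:
            if handle.read() != payload:
                return False, "read-back mismatch"
        os.remove(path)
        return True, ""
    except OSError as exc:
        # Report the symbolic errno name alongside the OS message.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name}: {exc.strerror}"
```

Calling `probe_write("/data/nas/results")` on each coordinator and executor (the path is taken from the error message) should return `(True, "")` everywhere. A node that returns a failure is a good place to start looking.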

@desi,

I am sorry to hear that your folders are getting deleted. As stated above, this should not happen unless the OS or some other job is clearing them.
Kindly send me your dremio.conf for review.

Also, storing reflections/scratch/uploads on local storage is not recommended if you plan to use Dremio in a production environment.

Kindly let us know if you have any questions

Thanks
Bali

Hello, Balaji.
The problem persisted for a couple of days until we noticed the shared storage on a standby coordinator was not being mounted properly. Reviewing mount options and remounting the storage after restoring a VM backup of the full cluster fixed the issue.
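For anyone who hits the same thing, a quick check along these lines, run on every node including standby coordinators, would have pointed us at the bad mount sooner. It is only a sketch: the `/data/nas` path comes from the error message above and should be adjusted to wherever the shared storage is mounted, and it only checks that something is mounted there, not that the mount options are right.

```python
import os


def check_mounts(paths):
    """Map each path to 'mounted', 'not a mount point', or 'missing'."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
        elif os.path.ismount(path):
            report[path] = "mounted"
        else:
            # The directory exists but nothing is mounted on it, so writes
            # would land on the local disk instead of the shared storage.
            report[path] = "not a mount point"
    return report


if __name__ == "__main__":
    # Assumed mount location; change it to match your environment.
    for path, status in check_mounts(["/data/nas"]).items():
        print(f"{path}: {status}")
```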
Thanks for the debugging alternative and for your time. We’ll keep that option in mind if something new comes up.