Dremio 17 + ES7.8: IllegalStateException: Invalid response

Hello,

My Dremio 17 on ES7.8 was working fine and stable, with reflections refreshing every hour. Today, reflections stop refreshing and queries stop working on 2 of the 3 promoted ES indexes and all dependent datasets and reflections. No changes have been made in Dremio or ES lately. Indexes are OK in ES.

Error message is: IllegalStateException: Invalid response

Versions:
dremio-community-17.0.0-202107060524010627_31b5222b_1.noarch on CentOS 7
elasticsearch-7.8.0-1.x86_64 on CentOS 7

/var/log/dremio/queries.json:

> {"queryId":"1ebdfd6d-cfc2-16a9-fed3-0b68c5b53700","schema":"[***, al*****e]","queryText":"SELECT * FROM \"_doc\"","start":1631715985527,"finish":1631715987281,"outcome":"FAILED","outcomeReason":"SYSTEM ERROR: IllegalStateException: Invalid response\n\nSqlOperatorImpl ELASTICSEARCH_SUB_SCAN\nLocation 1:1:5\nFragment 1:0\n\n[Error Id: 6af94f73-8458-4720-8026-e9f94b79faff on gr*****.es:0]\n\n (java.lang.IllegalStateException) Invalid response\n com.google.common.base.Preconditions.checkState():508\n com.dremio.plugins.elastic.execution.ElasticSearch7JsonReader.getScrollAndTotalSizeThenSeekToHits():51\n com.dremio.plugins.elastic.execution.ElasticsearchRecordReader.getFirstPage():227\n com.dremio.plugins.elastic.execution.ElasticsearchRecordReader.next():272\n com.dremio.sabot.op.scan.ScanOperator.outputData():323\n com.dremio.sabot.driver.SmartOp$SmartProducer.outputData():521\n com.dremio.sabot.driver.StraightPipe.pump():56\n com.dremio.sabot.driver.Pipeline.doPump():108\n com.dremio.sabot.driver.Pipeline.pumpOnce():98\n com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():345\n com.dremio.sabot.exec.fragment.FragmentExecutor.run():294\n com.dremio.sabot.exec.fragment.FragmentExecutor.access$1600():94\n com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():747\n com.dremio.sabot.task.AsyncTaskWrapper.run():112\n com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop():243\n com.dremio.sabot.task.slicing.SlicingThread.run():171\n","username":"admin"}

Restarted ES and Dremio, but the problem remains.

Any help would be appreciated

Thank you
Marcos

@marcosllama Are you able to send the profile when this was working and failing?

da77f504-eda1-4297-8a79-77f1d6f8efcc.zip (30,0 KB)
935f6eb0-156f-44dd-848c-0e0ec7729d70.zip (30,5 KB)

Hello, this is a reflection job working on sep 14th ~3AM and failing on sep 15th ~3AM

Thank you
Marcos

Update: I installed a fresh instance of Dremio on a different machine, and I can query the promoted ES indexes just fine, while queries on the same ES indexes keep failing on my production Dremio.

@marcosllama From an Elasticseach client, are you able to run the below ES command, like say from Kibana and see if you get results back?

{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "range" : {
      "@timestamp" : {
        "from" : "2021-09-14T02:29:02.351Z",
        "to" : null,
        "include_lower" : false,
        "include_upper" : true,
        "format" : "date_time",
        "boost" : 1.0
      }
    }
  }
}

ES.zip (107,4 KB)

Here it is. Sensitive data has been “Obfuscated”.

Thank you,
Marcos

@marcosllama Seems like intermittent, is this failure very consistent? are there multiple elasticsearch servers that one is probably giving back the “Invalid response”?

Hi,

No, it’s not intermittent. It fails every time I try to query 2 of the 3 promoted ES indexes. The remaining index and all dependent datasets and reflections work fine.

There is only one ES server.

Marcos

One particular thing about the indexes: My ES indexes are created and named on a monthly basis (i.e. indexABC-2021.08, indexABC-2021.09…) so in order to keep name consistency, I have created rolling aliases (alias-indexABC) containing the two last indexes: It means every month the aliases are modified in order to add the last index and delete the index from two months back. ¿Maybe this operation can affect Dremio?

Update: I performed a test on the remaining working index: I re-added an index from the previous month to the alias (which had been previously removed as per the procedure described above), and the alias index started failing on Dremio, exactly with the same error messages: VALIDATION ERROR: Table ‘json’ not found and IllegalStateException: Invalid response. I see now that on the other two failing indexes, something similar might have happened (previously removed
months were readded to the alias).

2nd update: Confirmed & SOLVED: In the original failing aliases, indexes from previous months (that had been initially correctly removed) had been erroneously re-added to the aliases, causing them to fail (does this make sense to you? HINT: Reflection refresh is set to “Incremental” using the ES @timestamp field). I removed these indexes from the aliases and THEY STARTED WORKING AGAIN. Though not immediately: I had to wait over 12 hours for the queries to start responding again in Dremio (I don’t completely understand this, as both reflection&metadata refresh periods are set to 1 hour.

@marcosllama It could be your background metadata refresh is taking > 1 hour?

That’s possibly the reason. Thank you for your help!