Failure executing Elastic request get next search result: HTTP 404 Not Found

When I create a reflection from a physical dataset backed by Elasticsearch, I sometimes get a 404, which causes the reflection to stop refreshing and forces all virtual dataset reflections to fail as well. I have tried making the dataset smaller and creating tens of reflections, but nothing quite works. I have to manually disable and re-enable the reflection, and then it may work the first time. Sometime after that first refresh it fails again.

Dremio version: 4.0.4-20191021053580380-773b665
Elasticsearch version: 5.6.10

Verbose error:
INVALID_DATASET_METADATA ERROR: Failure executing Elastic request get next search result: HTTP 404 Not Found.

Request: http:///_search/scroll
Response Status 404
Response Reason Not Found
Response Body [unreadable binary content]
SqlOperatorImpl ELASTICSEARCH_SUB_SCAN
Location 1:0:2
Fragment 1:0

[Error Id: 13a53ee9-052b-4621-94f8-1b9287efa11a on ]

(javax.ws.rs.NotFoundException) HTTP 404 Not Found
org.glassfish.jersey.client.JerseyInvocation.convertToException():1020
org.glassfish.jersey.client.JerseyInvocation.translate():819
org.glassfish.jersey.client.JerseyInvocation.access$700():92
org.glassfish.jersey.client.JerseyInvocation$2.call():701
org.glassfish.jersey.internal.Errors.process():315
org.glassfish.jersey.internal.Errors.process():297
org.glassfish.jersey.internal.Errors.process():228
org.glassfish.jersey.process.internal.RequestScope.runInScope():444
org.glassfish.jersey.client.JerseyInvocation.invoke():697
com.dremio.plugins.elastic.ElasticConnectionPool$ElasticConnection.execute():638
com.dremio.plugins.elastic.execution.ElasticsearchRecordReader.getNextPage():240
com.dremio.plugins.elastic.execution.ElasticsearchRecordReader.next():287
com.dremio.sabot.op.scan.ScanOperator.outputData():231
com.dremio.sabot.driver.SmartOp$SmartProducer.outputData():521
com.dremio.sabot.driver.StraightPipe.pump():56
com.dremio.sabot.driver.Pipeline.doPump():109
com.dremio.sabot.driver.Pipeline.pumpOnce():99
com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():320
com.dremio.sabot.exec.fragment.FragmentExecutor.run():273
com.dremio.sabot.exec.fragment.FragmentExecutor.access$1200():87
com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():658
com.dremio.sabot.task.AsyncTaskWrapper.run():104
com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop():226
com.dremio.sabot.task.slicing.SlicingThread.run():156

org.glassfish.jersey.client.JerseyInvocation(JerseyInvocation.java:1020)
org.glassfish.jersey.client.JerseyInvocation(JerseyInvocation.java:819)
org.glassfish.jersey.client.JerseyInvocation(JerseyInvocation.java:92)
org.glassfish.jersey.client.JerseyInvocation$2(JerseyInvocation.java:701)
org.glassfish.jersey.internal.Errors(Errors.java:315)
org.glassfish.jersey.internal.Errors(Errors.java:297)
org.glassfish.jersey.internal.Errors(Errors.java:228)
org.glassfish.jersey.process.internal.RequestScope(RequestScope.java:444)
org.glassfish.jersey.client.JerseyInvocation(JerseyInvocation.java:697)
com.dremio.plugins.elastic.ElasticConnectionPool$ElasticConnection(ElasticConnectionPool.java:638)
com.dremio.plugins.elastic.execution.ElasticsearchRecordReader(ElasticsearchRecordReader.java:240)
com.dremio.plugins.elastic.execution.ElasticsearchRecordReader(ElasticsearchRecordReader.java:287)
com.dremio.sabot.op.scan.ScanOperator(ScanOperator.java:231)
com.dremio.sabot.driver.SmartOp$SmartProducer(SmartOp.java:521)
com.dremio.sabot.driver.StraightPipe(StraightPipe.java:56)
com.dremio.sabot.driver.Pipeline(Pipeline.java:109)
com.dremio.sabot.driver.Pipeline(Pipeline.java:99)
com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper(FragmentExecutor.java:320)
com.dremio.sabot.exec.fragment.FragmentExecutor(FragmentExecutor.java:273)
com.dremio.sabot.exec.fragment.FragmentExecutor(FragmentExecutor.java:87)
com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl(FragmentExecutor.java:658)
com.dremio.sabot.task.AsyncTaskWrapper(AsyncTaskWrapper.java:104)
com.dremio.sabot.task.slicing.SlicingThread(SlicingThread.java:226)
com.dremio.sabot.task.slicing.SlicingThread(SlicingThread.java:156)

@jdingler

Would you be able to upload the profile for the failed job?

Share A Dremio Query Profile

@balaji.ramaswamy Here’s the query profile:

42849b33-a67c-444a-9cdc-05f9780e932a.zip (64.4 KB)

@jdingler

From the Dremio coordinator, can you try to curl the index on server "10.4.8.122" on port 9200 with the following request body?
{code}
{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "match_all" : {
      "boost" : 1.0
    }
  }
}
{code}
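For anyone reproducing this outside Dremio: the failing call in the stack trace is the follow-up `_search/scroll` request, not the initial search, so a plain one-page curl may succeed even when scrolling fails. Elasticsearch commonly answers a scroll request with 404 when the scroll context has expired or been cleared, although the thread does not confirm that this is the cause here. The sketch below is a rough Python illustration of a scrolled read using only the standard library. The index name `my_index` and the `1m` keep-alive are assumptions, not values taken from the thread or from Dremio's configuration.

```python
import json
import urllib.request

ES_HOST = "http://10.4.8.122:9200"  # server and port mentioned in the thread
INDEX = "my_index"                  # hypothetical index name; substitute the real one


def build_search_request(host, index, scroll="1m", size=4000):
    """Build the initial search request, which opens a scroll context."""
    # No "from" here: paging is driven by the scroll cursor instead.
    body = {"size": size, "query": {"match_all": {"boost": 1.0}}}
    return urllib.request.Request(
        f"{host}/{index}/_search?scroll={scroll}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_scroll_request(host, scroll_id, scroll="1m"):
    """Build the follow-up request that fetches the next page of a scroll."""
    body = {"scroll": scroll, "scroll_id": scroll_id}
    return urllib.request.Request(
        f"{host}/_search/scroll",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON response (raises HTTPError on 404)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def count_hits_via_scroll(host, index, scroll="1m"):
    """Walk every page of the scroll and return the number of hits seen."""
    page = send(build_search_request(host, index, scroll))
    total = 0
    while page["hits"]["hits"]:
        total += len(page["hits"]["hits"])
        page = send(build_scroll_request(host, page["_scroll_id"], scroll))
    return total


if __name__ == "__main__":
    print(count_hits_via_scroll(ES_HOST, INDEX))
```

If this loop raises an `HTTPError` with status 404 partway through, that would point at the scroll context going away between pages rather than at connectivity to the index.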

@balaji.ramaswamy

I was able to successfully receive results using this curl command for both the IP address and the URL we use.

@jdingler, was the curl command executed from the Dremio Coordinator?

@balaji.ramaswamy Yes it was executed from the Dremio Coordinator.

Sometimes the reflection refresh will work, and other times it will not. A refresh normally takes ~50 minutes, and it sometimes works the first time. It is set to refresh every hour, so the next refresh starts about 10 minutes after the previous one finishes, and that one fails roughly 20-30 minutes after it starts.

@jdingler

Does the data change every hour?

@balaji.ramaswamy

The data changes once a day at varying times between 9am-6pm EST.

What impact could the refresh time interval have on periodically failing refreshes?

@jdingler

If the data changes once a day, is there any reason you have to refresh every hour?

@balaji.ramaswamy

Because the stakeholder SLA is to make the data available for analytics as soon as possible. The shortest refresh interval Dremio provides is 1 hour.

How does this impact the periodically failing refreshes?

@jdingler

You had mentioned that the refresh takes ~50 minutes, so I was wondering whether it is a timing issue. Can you try spacing the refreshes to once every 2 hours and see if you get the same frequency of errors?

Sure, I will give this a try.

@balaji.ramaswamy

It looks like this resolved the issue as there were no failures from any reflection overnight.

Is it documented anywhere that refresh durations close to the refresh interval cause reflection refreshes to fail?

@jdingler

Generally, if your refresh takes 1 hour, it does not make sense to refresh every hour, as your system is going to be doing refreshes all day and will not leave resources for other workloads.
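
To put numbers on the point above, here is a small back-of-the-envelope sketch in Python. It assumes each refresh is scheduled one interval after the previous one started, which is an assumption about scheduling and not something the thread confirms. The ~50 minute duration and the 1 hour and 2 hour intervals are the figures quoted earlier in the thread.

```python
def refresh_duty_cycle(refresh_minutes, interval_minutes):
    """Fraction of each interval spent refreshing, capped at 1.0."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return min(refresh_minutes / interval_minutes, 1.0)


def idle_minutes(refresh_minutes, interval_minutes):
    """Minutes left in each interval after the refresh finishes."""
    return max(interval_minutes - refresh_minutes, 0)


if __name__ == "__main__":
    refresh = 50  # approximate refresh duration reported in the thread
    for interval in (60, 120):
        busy = refresh_duty_cycle(refresh, interval)
        idle = idle_minutes(refresh, interval)
        print(f"interval={interval} min: busy {busy:.0%}, idle {idle} min")
```

With a 60 minute interval the system is refreshing about 83% of the time with only 10 minutes of slack. With a 120 minute interval that drops to about 42% with 70 minutes of slack.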