IndexWriter is closed error -> checksum failed for internal file, Dremio/Lucene

For the past 2+ weeks, Dremio has been giving the following error during CREATE TABLE queries (and some SELECT queries):

VALIDATION ERROR: this IndexWriter is closed

SQL Query CREATE TABLE etl_out."/folder1/folder2/folder3" AS SELECT * FROM etl_in."/folder4/results/400/data.parquet"  (org.apache.lucene.store.AlreadyClosedException) this IndexWriter is closed
    org.apache.lucene.index.IndexWriter.ensureOpen():749
    org.apache.lucene.index.IndexWriter.ensureOpen():763
    org.apache.lucene.index.IndexWriter.updateDocument():1567
    com.dremio.datastore.indexed.LuceneSearchIndex.update():296
    com.dremio.datastore.indexed.CoreIndexedStoreImpl.index():215
    com.dremio.datastore.indexed.CoreIndexedStoreImpl.put():209
    com.dremio.datastore.indexed.CoreIndexedStoreImpl.put():52
    com.dremio.datastore.CoreBaseTimedStore.put():77
    com.dremio.datastore.CoreBaseTimedStore$TimedIndexedStoreImplCore.put():143
    com.dremio.datastore.indexed.LocalIndexedStore.put():103
    com.dremio.service.namespace.NamespaceServiceImpl$DatasetMetadataSaverImpl.savePartitionChunk():538
    com.dremio.exec.catalog.SafeNamespaceService$1.lambda$savePartitionChunk$2():304
    com.dremio.exec.catalog.ManagedStoragePlugin$SafeRunner.doSafe():981
    com.dremio.exec.catalog.SafeNamespaceService$1.savePartitionChunk():304
    com.dremio.exec.catalog.DatasetSaver.save():107
    com.dremio.exec.catalog.DatasetSaver.save():154
    com.dremio.exec.catalog.DatasetManager.getTableFromPlugin():349
    com.dremio.exec.catalog.DatasetManager.getTable():209
    com.dremio.exec.catalog.CatalogImpl.getTable():130
    com.dremio.exec.catalog.SourceAccessChecker.lambda$getTable$3():103
    com.dremio.exec.catalog.SourceAccessChecker.checkAndGetTable():82
    com.dremio.exec.catalog.SourceAccessChecker.getTable():103
    com.dremio.exec.catalog.DelegatingCatalog.getTable():66
    com.dremio.exec.catalog.CachingCatalog.getTable():93
    com.dremio.exec.catalog.DremioCatalogReader.getTable():94
    com.dremio.exec.catalog.DremioCatalogReader.getTable():71
    org.apache.calcite.sql.validate.EmptyScope.getTableNamespace():76
    org.apache.calcite.sql.validate.DelegatingScope.getTableNamespace():197
    org.apache.calcite.sql.validate.IdentifierNamespace.resolveImpl():102
    org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl():120
    org.apache.calcite.sql.validate.AbstractNamespace.validate():84
    org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():943
    org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():924
    org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2971
    org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2956
    org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect():3197
    org.apache.calcite.sql.validate.SelectNamespace.validateImpl():60
    org.apache.calcite.sql.validate.AbstractNamespace.validate():84
    org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():943
    org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():924
    org.apache.calcite.sql.SqlSelect.validate():226
    org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression():899
    org.apache.calcite.sql.validate.SqlValidatorImpl.validate():609
    com.dremio.exec.planner.sql.SqlConverter.validate():229
    com.dremio.exec.planner.sql.handlers.PrelTransformer.validateNode():184
    com.dremio.exec.planner.sql.handlers.PrelTransformer.validateAndConvert():173
    com.dremio.exec.planner.sql.handlers.PrelTransformer.validateAndConvert():169
    com.dremio.exec.planner.sql.handlers.query.CreateTableHandler.getPlan():84
    com.dremio.exec.planner.sql.handlers.commands.HandlerToPreparePlan.plan():89
    com.dremio.exec.work.foreman.AttemptManager.plan():421
    com.dremio.exec.work.foreman.AttemptManager.lambda$run$0():324
    com.dremio.service.commandpool.CommandWrapper.run():62
    java.util.concurrent.ThreadPoolExecutor.runWorker():1149
    java.util.concurrent.ThreadPoolExecutor$Worker.run():624
    java.lang.Thread.run():748
  Caused By (org.apache.lucene.index.CorruptIndexException) checksum failed (hardware problem?) : expected=c72546fd actual=afb25909 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/opt/dremio/data/db/search/metadata-dataset-splits/core/_6r_Lucene54_0.dvd")))
    org.apache.lucene.codecs.CodecUtil.checkFooter():419
    org.apache.lucene.codecs.CodecUtil.checksumEntireFile():526
    org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer.checkIntegrity():474
    org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.checkIntegrity():366
    org.apache.lucene.codecs.DocValuesConsumer.merge():137
    org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge():153
    org.apache.lucene.index.SegmentMerger.mergeDocValues():167
    org.apache.lucene.index.SegmentMerger.merge():111
    org.apache.lucene.index.IndexWriter.mergeMiddle():4356
    org.apache.lucene.index.IndexWriter.merge():3931
    org.apache.lucene.index.ConcurrentMergeScheduler.doMerge():624
    org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run():661

The usual fix was to restart Dremio manually, after which queries worked for a while. Lately, though, it has gotten worse (or I messed something up while trying to fix the issue :smile: ).

After searching the Internet, I found this thread: https://lucene.472066.n3.nabble.com/checksum-failed-hardware-problem-td4407328.html . It suggests that the cause may be a disk health problem.

I have 3 disks mounted on a single Linux machine. A smartctl short test shows no sign of bad sectors or other problems. I also have no other issues reading from or writing to those disks. So hardware health doesn’t seem to be the problem.

From the error message above, you can see that Dremio (Lucene, to be precise) is trying to read a .dvd file at the path /opt/dremio/data/db/search/metadata-dataset-splits/core/_6r_Lucene54_0.dvd. Every query that fails with the IndexWriter error fails because of a checksum mismatch for this same file, and the expected and actual checksum values in the error are identical across queries.
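To check whether the mismatch is really stored on disk, independently of Dremio, the footer check that Lucene performs can be roughly reproduced outside the JVM. The sketch below is not Lucene's own code. It assumes the codec footer layout as I understand it for Lucene 6.x: the last 16 bytes are a big-endian magic number (0xC02893E8), an algorithm ID, and an 8-byte CRC32 value, and the CRC32 covers every byte before the stored checksum. The function name footer_checksums is made up for illustration.

```python
import os
import struct
import zlib

# Assumed footer layout (Lucene 6.x, big-endian):
#   4 bytes magic, 4 bytes algorithm id, 8 bytes CRC32 stored as a long.
FOOTER_MAGIC = 0xC02893E8
FOOTER_LENGTH = 16


def footer_checksums(path, chunk_size=1 << 20):
    """Return (expected, actual): the CRC32 stored in the footer and the
    CRC32 recomputed over all bytes that precede the stored value."""
    size = os.path.getsize(path)
    if size < FOOTER_LENGTH:
        raise ValueError("file is too short to contain a codec footer")

    actual = 0
    remaining = size - 8  # everything except the stored checksum itself
    with open(path, "rb") as f:
        # Stream the file in chunks so large segment files fit in memory.
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                raise IOError("unexpected end of file")
            actual = zlib.crc32(chunk, actual)
            remaining -= len(chunk)
        stored = f.read(8)
        # Re-read the first half of the footer to validate the magic number.
        f.seek(size - FOOTER_LENGTH)
        magic, _algorithm_id = struct.unpack(">II", f.read(8))

    if magic != FOOTER_MAGIC:
        raise ValueError("footer magic mismatch: 0x%08x" % magic)
    (expected,) = struct.unpack(">Q", stored)
    return expected, actual & 0xFFFFFFFF
```

Calling footer_checksums on the .dvd path from the error and printing both values in hex (for example with format(value, "08x")) should, if the layout assumption holds, show the same expected and actual values as the exception. Getting the same pair on every run would point to corruption that is persistently stored in the file rather than to transient read errors.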

I also spent some time reading the Lucene Java code but didn’t find any clues (I am not a Java dev).

Has anyone had a similar issue? Do you have any ideas on how I can investigate it further? Could it be caused by using multiple disks together or, maybe, by the filesystem? Or is it a bug in Lucene?

Thank you very much!

Apache Arrow is used for querying.
Dremio version: 4.1.7
Lucene version: 6.6.0 (https://github.com/apache/lucene-solr/tree/branch_6_6/lucene/core/src/java/org/apache/lucene)

@vmois

Next time this happens, log on to the Dremio coordinator host and run:

cd <DREMIO_HOME>/bin
./dremio-admin clean -i

If this fixes your issue, your index is probably getting corrupted due to disk issues or external factors.

The -i option reindexes the data.

Thanks
@balaji.ramaswamy

Thanks for the response!

Before I ran the reindexing, we started to get a new (quite weird) error message when using the Python Arrow client with Dremio. Maybe you have an idea? Example:

SYSTEM ERROR: CompileException: Line 109, Column 30: No applicable constructor/method found for actual parameters "org.apache.arrow.vector.holders.UnionHolder"; candidates are: "public void com.dremio.exec.vector.complex.fn.JsonWriter.write(org.apache.arrow.vector.complex.reader.FieldReader) throws com.fasterxml.jackson.core.JsonGenerationException, java.io.IOException"

The web UI showed only the IndexWriter error.

After reindexing, this new error still occurs, but for now the IndexWriter error is gone.