Parquet File error

Wondering if anyone has seen an error like the one below. I am writing a Parquet file to a NAS using a Python script and the pyarrow library, and Dremio reads the file from there. At first Dremio has no issue reading the file, but once I update the file via the scheduled Python script, this error pops up:

Any ideas if there is a way to fix this?

IOException: file:/data_store/customer_base.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [24, 9, 54, 48]

SYSTEM ERROR: IOException: file:/data_store/customer_base.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [24, 9, 54, 48] Fragment 0:0 [Error Id: 390917e6-d60d-4b0b-a749-7800db1d416f on 0148f7f101bb:31010] (java.lang.RuntimeException) Failed to read parquet footer for file file:/data_store/customer_base.parquet com.dremio.exec.store.parquet.SingletonParquetFooterCache.getFooter():55 com.dremio.exec.store.parquet.ParquetOperatorCreator$1.apply():120 com.dremio.exec.store.parquet.ParquetOperatorCreator$1.apply():99 com.google.common.collect.Iterators$7.transform():750 com.google.common.collect.TransformedIterator.next():47 com.dremio.sabot.op.scan.ScanOperator.<init>():142 com.dremio.exec.store.parquet.ParquetOperatorCreator.create():152 com.dremio.exec.store.parquet.ParquetOperatorCreator.create():57 com.dremio.sabot.driver.OperatorCreatorRegistry.getProducerOperator():94 com.dremio.sabot.driver.UserDelegatingOperatorCreator$4.run():89 com.dremio.sabot.driver.UserDelegatingOperatorCreator$4.run():86 java.security.AccessController.doPrivileged():-2 javax.security.auth.Subject.doAs():422 org.apache.hadoop.security.UserGroupInformation.doAs():1836 com.dremio.sabot.driver.UserDelegatingOperatorCreator.getProducerOperator():86 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitSubScan():210 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitSubScan():115 com.dremio.exec.physical.base.AbstractSubScan.accept():77 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitLimit():109 com.dremio.exec.physical.config.Limit.accept():55 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitProject():84 
com.dremio.exec.physical.config.Project.accept():53 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitProject():84 com.dremio.exec.physical.config.Project.accept():53 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitWriter():69 com.dremio.exec.physical.base.AbstractWriter.accept():37 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitWriterCommiter():59 com.dremio.exec.physical.config.WriterCommitterPOP.accept():74 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitProject():84 com.dremio.exec.physical.config.Project.accept():53 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitScreen():234 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitScreen():115 com.dremio.exec.physical.config.Screen.accept():43 com.dremio.sabot.driver.PipelineCreator.get():107 com.dremio.sabot.driver.PipelineCreator.get():101 com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution():333 com.dremio.sabot.exec.fragment.FragmentExecutor.run():234 com.dremio.sabot.exec.fragment.FragmentExecutor.access$800():86 com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():591 com.dremio.sabot.task.AsyncTaskWrapper.run():107 com.dremio.sabot.task.slicing.SlicingThread.run():102 Caused By (java.io.IOException) file:/data_store/customer_base.parquet is not a Parquet file. 
expected magic number at tail [80, 65, 82, 49] but found [24, 9, 54, 48] com.dremio.exec.store.parquet.SingletonParquetFooterCache.checkMagicBytes():73 com.dremio.exec.store.parquet.SingletonParquetFooterCache.readFooter():114 com.dremio.exec.store.parquet.SingletonParquetFooterCache.readFooter():100 com.dremio.exec.store.parquet.SingletonParquetFooterCache.getFooter():53 com.dremio.exec.store.parquet.ParquetOperatorCreator$1.apply():120 com.dremio.exec.store.parquet.ParquetOperatorCreator$1.apply():99 com.google.common.collect.Iterators$7.transform():750 com.google.common.collect.TransformedIterator.next():47 com.dremio.sabot.op.scan.ScanOperator.<init>():142 com.dremio.exec.store.parquet.ParquetOperatorCreator.create():152 com.dremio.exec.store.parquet.ParquetOperatorCreator.create():57 com.dremio.sabot.driver.OperatorCreatorRegistry.getProducerOperator():94 com.dremio.sabot.driver.UserDelegatingOperatorCreator$4.run():89 com.dremio.sabot.driver.UserDelegatingOperatorCreator$4.run():86 java.security.AccessController.doPrivileged():-2 javax.security.auth.Subject.doAs():422 org.apache.hadoop.security.UserGroupInformation.doAs():1836 com.dremio.sabot.driver.UserDelegatingOperatorCreator.getProducerOperator():86 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitSubScan():210 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitSubScan():115 com.dremio.exec.physical.base.AbstractSubScan.accept():77 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitLimit():109 com.dremio.exec.physical.config.Limit.accept():55 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitProject():84 com.dremio.exec.physical.config.Project.accept():53 
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitProject():84 com.dremio.exec.physical.config.Project.accept():53 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitWriter():69 com.dremio.exec.physical.base.AbstractWriter.accept():37 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitWriterCommiter():59 com.dremio.exec.physical.config.WriterCommitterPOP.accept():74 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115 com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitProject():84 com.dremio.exec.physical.config.Project.accept():53 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitScreen():234 com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitScreen():115 com.dremio.exec.physical.config.Screen.accept():43 com.dremio.sabot.driver.PipelineCreator.get():107 com.dremio.sabot.driver.PipelineCreator.get():101 com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution():333 com.dremio.sabot.exec.fragment.FragmentExecutor.run():234 com.dremio.sabot.exec.fragment.FragmentExecutor.access$800():86 com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():591 com.dremio.sabot.task.AsyncTaskWrapper.run():107 com.dremio.sabot.task.slicing.SlicingThread.run():102

Hi @gates_ma,

What version of Dremio CE are you using? A similar issue was fixed in 2.0.10. If your version is >= 2.0.10, here are a few things to try.

Are you able to run the following on the file?

parquet-tools cat /data_store/customer_base.parquet
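If parquet-tools is not installed, the tail check Dremio performs can be approximated in a few lines of Python (a sketch only; it verifies the 4-byte magic, not the whole footer):

```python
import os

def has_parquet_magic(path: str) -> bool:
    """Return True if the file ends with the Parquet magic bytes
    b'PAR1' ([80, 65, 82, 49] in the error message above)."""
    if os.path.getsize(path) < 4:
        return False
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # seek to the last 4 bytes
        return f.read(4) == b"PAR1"
```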

Another thing you can try:

Create a Hive external table mapping to this Parquet file and see if you can read it through the Hive CLI (Hive shell or Beeline). Then, as an extra step, add the Hive source to Dremio and see if it works.

Please keep us updated.

Thanks
@balaji.ramaswamy

On 2.1.4-201808302048550610-0981242. It seems that when I remove the format and re-add it in the UI, it works, but I don't want to have to do that each time I update the Parquet file. I'd rather Dremio just recognize that I'm using a newer version of the input file with the same format, without my having to remove and re-add the formatting.

Sorry to hear you are having issues.
A couple of questions:

  1. Is the script that generates the file the same one that updates it?
  2. Did you try to clean/refresh the metadata after updating the files?

This is something that bothered our team a lot too: if you change a Parquet file, Dremio will fail to query it, since it caches metadata that is no longer valid.
We tried alter pds <data_source> refresh metadata, but often this command does nothing and returns Table ‘<data_source>’ read signature reviewed but source stated metadata is unchanged, no refresh occurred., probably because it only looks for new files with new filenames.
We tried alter pds <data_source> forget metadata before refreshing and found that it often works, but not always: sometimes reflections depending on the data source start hanging, and queries on those files take forever and never finish.
So, to update a file's content, our best solution for now is to delete the old file and write a new file with a different name. When replacing a file, we append the current timestamp to the file name.
After this, you just need to refresh the metadata (without "forgetting"). If you change the file but do not refresh the metadata, your queries will fail almost immediately until the automatic metadata refresh runs, but at least they will not hang (hanging queries can make client applications hang too).

ps: this works if you format a directory containing Parquet file(s), not if you format a single file (whose name would change every time).
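The replace-with-a-new-name workaround above can be sketched as follows (a sketch under assumptions: replace_parquet and write_fn are my names, and write_fn stands in for whatever actually writes the Parquet file, e.g. a call to pyarrow.parquet.write_table):

```python
import glob
import os
import time

def replace_parquet(write_fn, directory: str,
                    basename: str = "customer_base") -> str:
    """Write a new timestamp-suffixed Parquet file into the formatted
    directory, then delete the older versions, so Dremio always sees
    a fresh file name."""
    new_path = os.path.join(
        directory, f"{basename}_{time.time_ns()}.parquet")
    # Collect the old versions before writing, so the new file
    # is not caught by the glob below.
    old_files = glob.glob(os.path.join(directory, f"{basename}_*.parquet"))
    write_fn(new_path)           # the new file lands under a new name
    for old in old_files:        # remove only the superseded versions
        os.remove(old)
    return new_path
```

Format the containing directory in Dremio rather than a single file, then refresh the metadata after each run.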

Hi @balaji.ramaswamy ,

I am facing a similar issue as the OP: a Parquet file was updated externally by a Python script. The file is valid, and after removing and re-adding the format, I can read it. Is there a way to avoid having to re-add the format each time? Refreshing the metadata doesn't work in this case.

Hi @mystic

We do not recommend replacing files with the same name. Every time you update the Parquet file, the Parquet footer changes even though the file name remains the same, which requires learning the Parquet schema again. If files are written with the same name, the dataset must be promoted again for Dremio to recognize it as a new file with a new schema.
@Venugopal_Menda