Regression in parquet reader in version 3.3.1

Hi,

I’m currently testing Dremio CE version 3.3.1 and just found a regression in the Parquet reader.
Some of my Parquet datasets cannot be read, and I get the following message:

Error in parquet reader (complex). Message: Failure in setting up reader Parquet Metadata: ParquetMetaData{FileMetaData{schema: message schema { optional binary

These Parquet files were created with Python, using snappy compression and dictionary encoding.
They could be read successfully with Dremio 3.1.11.

Any advice?

Thanks
David

Stack trace from the log:

com.dremio.common.exceptions.UserException: ParquetCompressionCodecException

at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:776) ~[dremio-common-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.SmartOp.contextualize(SmartOp.java:140) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.SmartOp$SmartProducer.setup(SmartOp.java:567) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer(Pipe.java:79) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer(Pipe.java:63) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.SmartOp$SmartProducer.accept(SmartOp.java:533) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.driver.Pipeline.setup(Pipeline.java:68) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution(FragmentExecutor.java:381) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.exec.fragment.FragmentExecutor.run(FragmentExecutor.java:265) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.exec.fragment.FragmentExecutor.access$1300(FragmentExecutor.java:92) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run(FragmentExecutor.java:671) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.task.AsyncTaskWrapper.run(AsyncTaskWrapper.java:104) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop(SlicingThread.java:226) [dremio-ce-sabot-scheduler-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

at com.dremio.sabot.task.slicing.SlicingThread.run(SlicingThread.java:156) [dremio-ce-sabot-scheduler-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]

Caused by: org.apache.parquet.hadoop.DirectCodecFactory$DirectCodecPool$ParquetCompressionCodecException: null

@dfleckinger,

Can you supply a sample file?

Or, failing that, the output of:
$ parquet-tools meta <problematic file>

Also, what python library are you using to write the Parquet files?

hi @ben,

Here is the output of parquet-tools meta for a file that Dremio fails to open:
20190806-bt-KO_DREMIO_snappy_parquet_metadata.txt.zip (3.3 KB)

And here is the output for another file, generated the same day in the same way but with different data, that Dremio manages to open:
20190806-wtb-OK_DREMIO_snappy-parquet-metadata.txt.zip (3.8 KB)

Both files were written with pyarrow version 0.13, using the write_table method of the pyarrow.parquet module.
Both of them can be read with Dremio 3.1.11.

I would be happy to share the files with you privately.
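
For anyone comparing a readable file against an unreadable one, a plain text diff of the two parquet-tools meta dumps can help spot which columns, encodings or codecs differ. The sketch below is only an illustration using Python's standard difflib; the function names and the file names in the usage note are hypothetical, and it makes no assumption about the layout of the parquet-tools output.

```python
import difflib
from pathlib import Path
from typing import List


def diff_metadata(ok_text: str, ko_text: str,
                  ok_name: str = "ok_meta.txt",
                  ko_name: str = "ko_meta.txt") -> List[str]:
    """Return unified-diff lines between two metadata dumps.

    The result is empty when the two dumps are identical.
    """
    return list(
        difflib.unified_diff(
            ok_text.splitlines(),
            ko_text.splitlines(),
            fromfile=ok_name,
            tofile=ko_name,
            lineterm="",
        )
    )


def diff_metadata_files(ok_path: str, ko_path: str) -> List[str]:
    """Read two dump files (hypothetical paths) and diff their contents."""
    ok_text = Path(ok_path).read_text(encoding="utf-8", errors="replace")
    ko_text = Path(ko_path).read_text(encoding="utf-8", errors="replace")
    return diff_metadata(ok_text, ko_text, ok_path, ko_path)
```

Usage would look like print("\n".join(diff_metadata_files("ok_meta.txt", "ko_meta.txt"))), with the two paths replaced by the unzipped metadata dumps. Lines starting with "-" appear only in the readable file's dump and lines starting with "+" only in the failing one.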

Hi @dfleckinger

Just sent you an email

Thanks
@balaji.ramaswamy

Yeah. All my parquet data sources are showing the same error after the upgrade.

Hi @balaji.ramaswamy,
have you found anything that could explain the problem reading Parquet files?
Thanks!

Hi @dfleckinger, not yet. We are looking into the issue and will get back to you shortly.

Is this bug confirmed? We have the same problem with all our Parquet files, no matter the content of the file or the pyarrow version (tried all major versions from 0.11 to 0.14.1). Since Dremio 3.3.1 we get the same error that the original poster reported. 3.2.8 works fine, and since our data warehouse is based on Parquet, we definitely can’t move to 3.3.

Thanks

@Luca,

We are working on a fix, but it would be helpful to know if this is the same issue. Do you have a sample file that you could share?

As I said before, this happens with all parquet files, and only since version 3.3.1. You can try replicating the problem just by creating a parquet file with pyarrow and pandas:
pd.DataFrame({'a':[1,2,3], 'b':[10,20,30]}).to_parquet('data.parquet')
The output of this command is not readable by Dremio 3.3, but works fine with older releases.
Tried with pyarrow 0.11.1, 0.12.1, 0.13.0 and 0.14.1, with or without compression, on Intel and AMD machines.

Anyway, just after writing the first half of this post, I had an idea and hopefully I found the problem, at least in our case. We updated our machines to Ubuntu 18 in the last few months and the new default JRE is version 11. I installed OpenJDK 8 and forced Dremio to use it and guess what, it works! Looks like Java 11 + Dremio 3.3 are not a good match after all, even though Dremio 3.2 works just fine with Java 11 (that’s why I didn’t think about downgrading Java at first). I’ll do more tests in the next hours…

Maybe this is also the solution to @dfleckinger’s problem?

Hi @Luca, thanks for your interesting comment.
My case is different from yours: within the same dataset, some of my Parquet files are readable and others are not, even when one was generated only a few minutes after the other by the same program.
So in my case I don’t think the execution environment plays a role; it seems to be related to the data content.

Thanks for the update @Luca. We’ll try the JDK 11 setup and see whether it produces similar behavior. With Dremio 3.3.1 + JDK 8, I have no problem with the simple file created from the DataFrame.

@Luca, I should add that we do not officially support JDK 11 (though we are working towards compatibility), so for the time being, JDK 8 should be used with Dremio (as you’ve already done).
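
Since the Java version turned out to matter here, it may be worth confirming which JVM is on the PATH before digging into the files themselves. The following is a minimal sketch, not an official Dremio check: the function names are made up for illustration, and it only inspects the java binary found on the PATH, which is not necessarily the one Dremio is configured to launch with.

```python
import re
import subprocess
from typing import Optional


def java_major_version(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output.

    Handles the legacy "1.8.0_222" scheme (major version 8) as well as
    the newer "11.0.4" scheme (major version 11).
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.x" means Java x.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major(java_bin: str = "java") -> Optional[int]:
    """Run `java -version` and return the major version, or None."""
    try:
        result = subprocess.run(
            [java_bin, "-version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    return java_major_version(result.stderr + result.stdout)
```

If installed_java_major() returns anything other than 8, the JDK 11 incompatibility described above is a plausible suspect for this class of error.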

I’m on RedHat 7.6 and Java 8 and I’ve seen both the MetaData exception and this new one:

SYSTEM ERROR: ParquetCompressionCodecException

Seems like we have problems with both dictionary encoding and snappy compression.

        org.apache.parquet.hadoop.DirectCodecFactory$FullDirectDecompressor(DirectCodecFactory.java:233)
        com.dremio.parquet.pages.BaseReaderIterator(BaseReaderIterator.java:152)
        com.dremio.parquet.pages.BaseReaderIterator(BaseReaderIterator.java:78)
        com.dremio.parquet.pages.IncrementalPageReaderIterator(IncrementalPageReaderIterator.java:162)
        com.dremio.parquet.pages.IncrementalPageReaderIterator(IncrementalPageReaderIterator.java:148)
        com.dremio.parquet.pages.MemoizingPageIterator(MemoizingPageIterator.java:49)
        com.dremio.parquet.pages.PageIterator(PageIterator.java:90)
        com.dremio.parquet.reader.column.generics.Float4SimpleReader(Float4SimpleReader.java:59)
        com.dremio.parquet.reader.column.ColumnReaderFactory(ColumnReaderFactory.java:745)
        com.dremio.parquet.reader.column.ColumnReaderFactory(ColumnReaderFactory.java:804)
        com.dremio.parquet.reader.SimpleReader(SimpleReader.java:52)
        com.dremio.extra.exec.store.dfs.parquet.ParquetVectorizedReader(ParquetVectorizedReader.java:190)
        com.dremio.exec.store.parquet.UnifiedParquetReader(UnifiedParquetReader.java:145)
        com.dremio.sabot.op.scan.ScanOperator$1(ScanOperator.java:225)
        com.dremio.sabot.op.scan.ScanOperator$1(ScanOperator.java:221)
        ...(:0)
        org.apache.hadoop.security.UserGroupInformation(UserGroupInformation.java:1836)
        com.dremio.sabot.op.scan.ScanOperator(ScanOperator.java:221)
        com.dremio.sabot.op.scan.ScanOperator(ScanOperator.java:194)
        com.dremio.sabot.op.scan.ScanOperator(ScanOperator.java:180)
        com.dremio.sabot.driver.SmartOp$SmartProducer(SmartOp.java:563)
        com.dremio.sabot.driver.Pipe$SetupVisitor(Pipe.java:79)
        com.dremio.sabot.driver.Pipe$SetupVisitor(Pipe.java:63)
        com.dremio.sabot.driver.SmartOp$SmartProducer(SmartOp.java:533)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.Pipeline(Pipeline.java:68)
        com.dremio.sabot.exec.fragment.FragmentExecutor(FragmentExecutor.java:381)
        com.dremio.sabot.exec.fragment.FragmentExecutor(FragmentExecutor.java:265)
        com.dremio.sabot.exec.fragment.FragmentExecutor(FragmentExecutor.java:92)
        com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl(FragmentExecutor.java:671)
        com.dremio.sabot.task.AsyncTaskWrapper(AsyncTaskWrapper.java:104)
        com.dremio.sabot.task.slicing.SlicingThread(SlicingThread.java:226)
        com.dremio.sabot.task.slicing.SlicingThread(SlicingThread.java:156)

@david.lee, can you share any of these files?

Hello,

I just want to piggyback onto this thread and say that we are experiencing the same issues. We did not experience this in 3.2.3, but after upgrading to 3.3.2, Dremio stopped being able to read half of our Parquet files while others are seemingly unaffected. We are seeing a lot of both “Failure in setting up reader Parquet Metadata” and “ParquetCompressionCodecException” errors.
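
To get a rough idea of how often each of the two errors occurs, one option is to count the lines in a log that mention them. This is a small sketch under the assumption that each error message appears as plain text in the log; the marker strings are taken from the messages quoted in this thread, and the log path in the usage note is a placeholder.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the two error messages quoted in this thread.
ERROR_MARKERS = (
    "Failure in setting up reader",
    "ParquetCompressionCodecException",
)


def tally_errors(lines: Iterable[str]) -> Counter:
    """Count how many lines mention each known error marker."""
    counts: Counter = Counter()
    for line in lines:
        for marker in ERROR_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

For example, with open("server.log") as f: print(tally_errors(f)) would print the per-error line counts, where "server.log" stands in for whatever log file is being inspected. Note that it counts matching lines rather than distinct queries, so a single failure logged across several lines is counted more than once.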