Regression in parquet reader in version 3.3.1

Hi,

I’m currently testing Dremio CE version 3.3.1 and have just found a regression in the Parquet reader.
Some of my Parquet datasets cannot be read, and I get the following message:

Error in parquet reader (complex). Message: Failure in setting up reader Parquet Metadata: ParquetMetaData{FileMetaData{schema: message schema { optional binary

These Parquet files were created with Python, using Snappy compression and dictionary encoding.
They could be read successfully with Dremio 3.1.11.

Any advice?

Thanks
David

Log stack trace:

com.dremio.common.exceptions.UserException: ParquetCompressionCodecException
	at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:776) ~[dremio-common-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.SmartOp.contextualize(SmartOp.java:140) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.SmartOp$SmartProducer.setup(SmartOp.java:567) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer(Pipe.java:79) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer(Pipe.java:63) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.SmartOp$SmartProducer.accept(SmartOp.java:533) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.driver.Pipeline.setup(Pipeline.java:68) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution(FragmentExecutor.java:381) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.exec.fragment.FragmentExecutor.run(FragmentExecutor.java:265) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.exec.fragment.FragmentExecutor.access$1300(FragmentExecutor.java:92) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run(FragmentExecutor.java:671) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.task.AsyncTaskWrapper.run(AsyncTaskWrapper.java:104) [dremio-sabot-kernel-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop(SlicingThread.java:226) [dremio-ce-sabot-scheduler-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
	at com.dremio.sabot.task.slicing.SlicingThread.run(SlicingThread.java:156) [dremio-ce-sabot-scheduler-3.3.1-201907291852280797-df23756.jar:3.3.1-201907291852280797-df23756]
Caused by: org.apache.parquet.hadoop.DirectCodecFactory$DirectCodecPool$ParquetCompressionCodecException: null


@dfleckinger,

Can you supply a sample file?

Or, failing that, the output of:
$ parquet-tools meta <problematic file>

Also, what Python library are you using to write the Parquet files?

Hi @ben,

Here is the output of parquet-tools meta for a file that Dremio fails to open:
20190806-bt-KO_DREMIO_snappy_parquet_metadata.txt.zip (3.3 KB)

And here is another file, generated the same day in the same way with different data, which Dremio does manage to open:
20190806-wtb-OK_DREMIO_snappy-parquet-metadata.txt.zip (3.8 KB)

These files were written with pyarrow 0.13, using the write_table method of the pyarrow.parquet module.
Both of them can be read with Dremio 3.1.11.
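
For reference, the write path looks roughly like this (a minimal sketch; the actual columns and data are illustrative, not the real dataset):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative data only; the real datasets are larger.
df = pd.DataFrame({'value': [1.0, 2.0, 3.0]})
table = pa.Table.from_pandas(df)
# Snappy compression and dictionary encoding, as described above.
pq.write_table(table, 'example.parquet', compression='snappy', use_dictionary=True)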

I would be happy to share those files with you privately.

Hi @dfleckinger

Just sent you an email

Thanks
@balaji.ramaswamy

Yeah. All my parquet data sources are showing the same error after the upgrade.

Hi @balaji.ramaswamy,
have you found anything that could cause this issue with reading Parquet files?
Thanks!

Hi @dfleckinger, not yet. We are looking into the issue and will get back to you shortly.

Is this bug confirmed? We have the same problem with all our Parquet files, no matter the file’s content or the pyarrow version (we tried all major versions from 0.11 to 0.14.1). Since Dremio 3.3.1 we get the same error reported by the thread opener. 3.2.8 works fine, and since our data warehouse is based on Parquet, we definitely can’t move to 3.3.

Thanks

@Luca,

We are working on a fix, but it would be helpful to know if this is the same issue. Do you have a sample file that you could share?

As I said before, this happens with all Parquet files, and only since version 3.3.1. You can replicate the problem just by creating a Parquet file with pyarrow and pandas:
import pandas as pd
pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]}).to_parquet('data.parquet')
The resulting file is not readable by Dremio 3.3 but works fine with older releases.
Tried with pyarrow 0.11.1, 0.12.1, 0.13.0 and 0.14.1, with and without compression, on both Intel and AMD machines.

Anyway, just after writing the first half of this post, I had an idea, and I think I found the problem, at least in our case. We updated our machines to Ubuntu 18 in the last few months, and the new default JRE is version 11. I installed OpenJDK 8, forced Dremio to use it, and guess what: it works! It looks like Java 11 and Dremio 3.3 are not a good match after all, even though Dremio 3.2 works just fine with Java 11 (that’s why I didn’t think of downgrading Java at first). I’ll do more tests in the next few hours…

Maybe this is the solution to @dfleckinger’s problem as well?


Hi @Luca, thanks for your interesting comment.
My case is different from yours: within the same dataset, some of my Parquet files are readable and others are not, even though they were generated a few minutes apart by the same program.
So in my case I don’t think the execution environment plays a role; it seems to be related to the data content.

Thanks for the update @Luca. We’ll try the JDK 11 setup and see if it produces similar behavior. With Dremio 3.3.1 + JDK 8, I see no problem with the simple file created from the DataFrame.

@Luca, I should add that we do not officially support JDK 11 (though we are working towards compatibility), so for the time being, JDK 8 should be used with Dremio (as you’ve already implemented).

I’m on Red Hat 7.6 with Java 8, and I’ve seen both the Parquet Metadata exception and this new one:

SYSTEM ERROR: ParquetCompressionCodecException

Seems like we have problems with both dictionary encoding and Snappy compression; one way to isolate the two is sketched after the trace below.

        org.apache.parquet.hadoop.DirectCodecFactory$FullDirectDecompressor(DirectCodecFactory.java:233)
        com.dremio.parquet.pages.BaseReaderIterator(BaseReaderIterator.java:152)
        com.dremio.parquet.pages.BaseReaderIterator(BaseReaderIterator.java:78)
        com.dremio.parquet.pages.IncrementalPageReaderIterator(IncrementalPageReaderIterator.java:162)
        com.dremio.parquet.pages.IncrementalPageReaderIterator(IncrementalPageReaderIterator.java:148)
        com.dremio.parquet.pages.MemoizingPageIterator(MemoizingPageIterator.java:49)
        com.dremio.parquet.pages.PageIterator(PageIterator.java:90)
        com.dremio.parquet.reader.column.generics.Float4SimpleReader(Float4SimpleReader.java:59)
        com.dremio.parquet.reader.column.ColumnReaderFactory(ColumnReaderFactory.java:745)
        com.dremio.parquet.reader.column.ColumnReaderFactory(ColumnReaderFactory.java:804)
        com.dremio.parquet.reader.SimpleReader(SimpleReader.java:52)
        com.dremio.extra.exec.store.dfs.parquet.ParquetVectorizedReader(ParquetVectorizedReader.java:190)
        com.dremio.exec.store.parquet.UnifiedParquetReader(UnifiedParquetReader.java:145)
        com.dremio.sabot.op.scan.ScanOperator$1(ScanOperator.java:225)
        com.dremio.sabot.op.scan.ScanOperator$1(ScanOperator.java:221)
        ...(:0)
        org.apache.hadoop.security.UserGroupInformation(UserGroupInformation.java:1836)
        com.dremio.sabot.op.scan.ScanOperator(ScanOperator.java:221)
        com.dremio.sabot.op.scan.ScanOperator(ScanOperator.java:194)
        com.dremio.sabot.op.scan.ScanOperator(ScanOperator.java:180)
        com.dremio.sabot.driver.SmartOp$SmartProducer(SmartOp.java:563)
        com.dremio.sabot.driver.Pipe$SetupVisitor(Pipe.java:79)
        com.dremio.sabot.driver.Pipe$SetupVisitor(Pipe.java:63)
        com.dremio.sabot.driver.SmartOp$SmartProducer(SmartOp.java:533)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.StraightPipe(StraightPipe.java:102)
        com.dremio.sabot.driver.Pipeline(Pipeline.java:68)
        com.dremio.sabot.exec.fragment.FragmentExecutor(FragmentExecutor.java:381)
        com.dremio.sabot.exec.fragment.FragmentExecutor(FragmentExecutor.java:265)
        com.dremio.sabot.exec.fragment.FragmentExecutor(FragmentExecutor.java:92)
        com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl(FragmentExecutor.java:671)
        com.dremio.sabot.task.AsyncTaskWrapper(AsyncTaskWrapper.java:104)
        com.dremio.sabot.task.slicing.SlicingThread(SlicingThread.java:226)
        com.dremio.sabot.task.slicing.SlicingThread(SlicingThread.java:156)
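
To isolate the two suspects, the same data can be rewritten with one of them disabled at a time (a sketch, assuming pyarrow wrote the files; use_dictionary is pyarrow’s write option, which pandas passes through to write_table):

import pandas as pd
# Hypothetical file name; substitute one of the failing files.
df = pd.read_parquet('failing.parquet')
# Same data, Snappy kept, dictionary encoding disabled.
df.to_parquet('failing_nodict.parquet', compression='snappy', use_dictionary=False)
# Same data, dictionary encoding kept, compression disabled.
df.to_parquet('failing_uncompressed.parquet', compression=None)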

@david.lee, can you share any of these files?

Hello,

I just want to piggyback on this thread and say that we are experiencing the same issues. We did not see this in 3.2.3, but after upgrading to 3.3.2, Dremio stopped being able to read about half of our Parquet files, while the others seem unaffected. We’re seeing a lot of both “Failure in setting up reader Parquet Metadata” and “ParquetCompressionCodecException” errors.

So… after all, it looks like we’ve hit the same, or a similar, problem:

com.dremio.common.exceptions.UserException: ParquetCompressionCodecException
	at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:776) ~[dremio-common-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.SmartOp.contextualize(SmartOp.java:140) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.SmartOp$SmartProducer.setup(SmartOp.java:567) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer(Pipe.java:79) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer(Pipe.java:63) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.SmartOp$SmartProducer.accept(SmartOp.java:533) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.StraightPipe.setup(StraightPipe.java:102) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.driver.Pipeline.setup(Pipeline.java:68) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution(FragmentExecutor.java:381) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.exec.fragment.FragmentExecutor.run(FragmentExecutor.java:265) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.exec.fragment.FragmentExecutor.access$1300(FragmentExecutor.java:92) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run(FragmentExecutor.java:671) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.task.AsyncTaskWrapper.run(AsyncTaskWrapper.java:104) [dremio-sabot-kernel-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop(SlicingThread.java:226) [dremio-ce-sabot-scheduler-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
	at com.dremio.sabot.task.slicing.SlicingThread.run(SlicingThread.java:156) [dremio-ce-sabot-scheduler-3.3.2-201908142136370993-d60145d.jar:3.3.2-201908142136370993-d60145d]
Caused by: org.apache.parquet.hadoop.DirectCodecFactory$DirectCodecPool$ParquetCompressionCodecException: null

We also partially identified the problem (or, once again, maybe just “our” problem). Apparently Dremio 3.3 fails to read Parquet files (a single file or partitioned data) if one or more columns are entirely NaN. We worked around it by removing the NaN columns or, in the case of partitioned data, by adding a dummy Parquet file with at least one non-NaN value in those columns.
This is the relevant parquet-tools line for the problematic file, which Dremio can’t open:

threshold_3: DOUBLE SNAPPY DO:377504 FPO:377519 SZ:51/48/0.94 VC:48362 ENC:RLE,PLAIN,PLAIN_DICTIONARY ST:[num_nulls: 48362, min/max not defined]

This is the same line for the “corrected” Parquet file (first value set to zero), which Dremio can open:

threshold_3: DOUBLE SNAPPY DO:377492 FPO:377516 SZ:104/100/0.96 VC:48362 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: -0.0, max: 0.0, num_nulls: 48361]

Except for the first value of the “threshold_3” column, the two files are identical, generated with the same Python script (pyarrow 0.14.1, pandas 0.25.0).
Hope this helps.
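
For reference, a minimal sketch that should reproduce the two cases above (the column name comes from the parquet-tools output; the rest of the data is illustrative, not our actual script):

import numpy as np
import pandas as pd

# One ordinary column plus one column that is entirely NaN.
df = pd.DataFrame({
    'value': [1.0, 2.0, 3.0],
    'threshold_3': [np.nan, np.nan, np.nan],
})
# Snappy is the pandas/pyarrow default; Dremio 3.3 fails on this file.
df.to_parquet('ko_dremio.parquet', compression='snappy')

# "Corrected" version: at least one non-NaN value in the column.
df.loc[0, 'threshold_3'] = 0.0
df.to_parquet('ok_dremio.parquet', compression='snappy')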

The problem seems related to Snappy compression combined with fully NaN columns, as suggested by @Luca.

I ran a test with this simple code fragment, starting from a Snappy-compressed Parquet file with NaN columns (test.parquet):

import pandas
df = pandas.read_parquet('test.parquet')
df.to_parquet('test_gz.parquet', compression='gzip')
df.to_parquet('test_snappy.parquet', compression='snappy')
df.to_parquet('test_uncompressed.parquet', compression=None)

The gzip and uncompressed files are read correctly; the Snappy one is not.

We are using Dremio version 3.3.2.

Thanks for bringing this to our attention @Luca,

Can you provide a code snippet for how you are generating the NaN-populated column?