Regression in parquet reader in version 3.3.1

Luca · August 28, 2019, 9:05am

Sure. I ran some more tests, and it looks like the problem occurs when a column that is entirely NaN is encoded using a dictionary. Here’s the code:

import pandas as pd
import numpy as np
import string

df1 = pd.DataFrame({'a':np.random.random(size=10), 
                    'b':np.random.random(size=10), 
                    'c':np.random.choice(list(string.ascii_lowercase), 10)})
print(df1)

      a         b  c
0 0.458737 0.219086 k
1 0.545871 0.680314 w
2 0.698508 0.369771 y
3 0.160838 0.617914 y
4 0.305686 0.738564 s
5 0.990604 0.603486 p
6 0.563542 0.784747 b
7 0.132095 0.687255 i
8 0.028130 0.148199 y
9 0.674921 0.045764 w

df1.to_parquet('df1.parquet') #WORKS

df2 = df1.copy()
df2.loc[0, 'b'] = np.nan
df2.to_parquet('df2.parquet') #WORKS

df3 = df1.copy()
df3.loc[0, 'c'] = None
df3.to_parquet('df3.parquet') #WORKS

df4 = df1.copy()
df4['b'] = np.nan
df4.to_parquet('df4_a.parquet') #FAIL
df4.to_parquet('df4_b.parquet', use_dictionary=False) #WORKS
df4.to_parquet('df4_c.parquet', use_dictionary=[b'a',b'c']) #WORKS

df1, df2 and df3 have no problems. df4 is the dataframe with a full-NaN column. If I export it to parquet with no explicit options (df4_a), dictionary encoding is used on all columns and Dremio cannot open the file. If I disable dictionary encoding completely (df4_b), Dremio can read the file. In fact, it is enough to disable dictionary encoding on the offending column only (df4_c) to prevent Dremio from failing.
I also have to correct something I said in one of my last posts: if there is a faulty file in your collection, the whole collection mapping fails (or fails later, when a query needs to access that specific file). If I try to map the directory containing all (and only) the parquet files created by the script, Dremio fails. But if I remove df4_a.parquet (the offending file), Dremio maps the folder and works just fine.
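
The df4_c workaround (listing by hand the columns that may use dictionary encoding) can be generalised with a small helper. This is only a minimal sketch and not part of the script I ran: dictionary_safe_columns is a hypothetical name, and it assumes the input exposes .items() yielding (column name, values) pairs, which holds for a plain dict of lists and should also hold for a pandas DataFrame. The sample dict is a stand-in for df4, not real data.

```python
import math


def dictionary_safe_columns(table):
    """Return the names of columns that contain at least one non-null value.

    `table` is anything with an .items() method yielding (name, values)
    pairs, e.g. a dict of lists (assumed to also work for a DataFrame).
    Columns that are entirely None/NaN, or empty, are left out.
    """
    def is_null(value):
        # None, or a float NaN (numpy.float64 is a subclass of float)
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [name for name, values in table.items()
            if not all(is_null(v) for v in values)]


# Small stand-in for df4: column 'b' is entirely NaN.
sample = {
    'a': [0.1, 0.2, 0.3],
    'b': [float('nan')] * 3,
    'c': ['k', 'w', None],
}
safe = dictionary_safe_columns(sample)
print(safe)  # ['a', 'c']
```

The returned list is what would go into use_dictionary in place of the hand-written [b'a',b'c']. My script passes the names as bytes, so on that pyarrow version an extra encode step may be needed; I have not verified that here.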

ben · August 28, 2019, 9:46pm

@luca, upon re-examining the error message you provided earlier, it looks like we have a patch that will address the issue. That patch should be available in our next release. I will update this thread when a release date is known.

dfleckinger · September 17, 2019, 2:01pm

Hi @ben, I just saw that Dremio 4.0 was released. Do you think the patch for the ParquetReader is included in it?

Thanks

ben · September 17, 2019, 4:19pm

@dfleckinger,

Yes, it should be resolved in Dremio 4.0. Please update this thread if you find that is not the case.

Luca · September 23, 2019, 7:51am

Hi, I just installed Dremio 4.0.1 to check the bug and unfortunately, not only can I still not open “df4_a.parquet”, but I can’t open ANY parquet file at all.
I always get errors like this:

Error in parquet reader (complex). Message: Failure in setting up reader Parquet Metadata: ParquetMetaData{FileMetaData{schema: message schema { optional double a; optional double b; optional binary c (STRING); } , metadata: {pandas={“index_columns”: [{“kind”: “range”, “name”: null, “start”: 0, “stop”: 10, “step”: 1}], “column_indexes”: [{“name”: null, “field_name”: null, “pandas_type”: “unicode”, “numpy_type”: “object”, “metadata”: {“encoding”: “UTF-8”}}], “columns”: [{“name”: “a”, “field_name”: “a”, “pandas_type”: “float64”, “numpy_type”: “float64”, “metadata”: null}, {“name”: “b”, “field_name”: “b”, “pandas_type”: “float64”, “numpy_type”: “float64”, “metadata”: null}, {“name”: “c”, “field_name”: “c”, “pandas_type”: “unicode”, “numpy_type”: “object”, “metadata”: null}], “creator”: {“library”: “pyarrow”, “version”: “0.14.1”}, “pandas_version”: “0.25.1”}}}, blocks: [BlockMetaData{10, 302 [ColumnMetaData{SNAPPY [a] optional double a [RLE, PLAIN_DICTIONARY, PLAIN], 103}, ColumnMetaData{SNAPPY [b] optional double b [RLE, PLAIN], 258}, ColumnMetaData{SNAPPY [c] optional binary c (STRING) [RLE, PLAIN_DICTIONARY, PLAIN], 370}]}]}

EDIT: NVM, it was a configuration problem on our side.

dfleckinger · September 23, 2019, 8:08am

Hi,
On my side, with 4.0 I have no more issues reading parquet files.

Luca · September 23, 2019, 8:18am

Oh nvm, I found the problem… I can also confirm the bug is resolved, thank you.

Praveen_Kumar · October 1, 2019, 10:32am

Hi,

4.0 has the fix for this issue; upgrading should solve it.

As noted by others - the issue was with handling a column that was all NaN and was snappy compressed.

Apologies for the regression!
