Sure. I ran some more tests, and it looks like the problem occurs when a full-NaN column is encoded using a dictionary. Here’s the code:
import pandas as pd
import numpy as np
import string
df1 = pd.DataFrame({'a': np.random.random(size=10),
                    'b': np.random.random(size=10),
                    'c': np.random.choice(list(string.ascii_lowercase), 10)})
print(df1)
          a         b  c
0  0.458737  0.219086  k
1  0.545871  0.680314  w
2  0.698508  0.369771  y
3  0.160838  0.617914  y
4  0.305686  0.738564  s
5  0.990604  0.603486  p
6  0.563542  0.784747  b
7  0.132095  0.687255  i
8  0.028130  0.148199  y
9  0.674921  0.045764  w
df1, df2 and df3 have no problems. df4 is the dataframe with a full-NaN column. If I export it to parquet with no explicit options (df4_a), dictionary encoding is used on all columns and Dremio cannot open the file. If I disable dictionary encoding completely (df4_b), Dremio can read the file. But disabling the dictionary on the offending column alone (df4_c) is enough to prevent Dremio from failing.
I also have to correct something I said in one of the last posts: if there is a faulty file in your collection, mapping the whole collection fails (or fails later, when a query needs to access that specific file). If I try to map the directory containing all (and only) the parquet files created by the script, Dremio fails. But if I remove df4_a.parquet (the offending file), Dremio maps the folder and works just fine.
@luca, upon re-examining the error message you provided earlier, it looks like we have a patch that will address the issue. That patch should be available in our next release. I will update this thread once a release date is known.
Hi, I just installed Dremio 4.0.1 to check the bug and, unfortunately, not only can I still not open “df4_a.parquet”, but I can’t open ANY parquet file at all.
I always get errors like these: