Parquet datasets/files created by CTAS (CREATE TABLE AS SELECT) are not readable by pyarrow

Hi,
I'm not able to read a Parquet dataset that was written to the $scratch directory by a CTAS command.

When I try to read the Parquet file, I always get the following error when running:
import pyarrow.parquet as pq
pq.read_table(parquet_path)

ArrowIOError: Couldn’t deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

I can read the metadata through pq.read_metadata, e.g.:

<pyarrow._parquet.FileMetaData object at 0x312c30628>
created_by: parquet-mr version 1.8.1-fast-201712141648170019-ab0622b (build ab0622b4470fd0fdd2889f21f2d962077d35dfee)
num_columns: 53
num_rows: 122274
num_row_groups: 1
format_version: 1.0
serialized_size: 71561

I can also read the schema through pq.read_schema; only pq.read_table is failing.
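
To summarize the behaviour (the file path below is illustrative; it points at a file produced by CTAS in $scratch):

import pyarrow.parquet as pq

parquet_path = "0_0_0.parquet"   # illustrative: a file written by CTAS

pq.read_metadata(parquet_path)   # works
pq.read_schema(parquet_path)     # works
pq.read_table(parquet_path)      # raises ArrowIOError: Couldn't deserialize thrift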

I'm using the latest pyarrow release, version 0.9.
Any advice?


Can you say why you are trying to access data this way rather than going over ODBC against a VDS?

There are a number of people in our company who would be likely to access data this way as well. One simple example would be if we generate something useful with a CTAS and then we want to export the file(s) to another environment that doesn’t know what Dremio is or isn’t able to query it for whatever reason.

It’s pretty easy to recreate this, but if it helps, here’s a more complete stack trace for the parquet read:

>>> foo = pq.read_table("/Users/desidero/Library/Dremio/data/pdfs/scratch/foo/0_0_0.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/desidero/tmp/pywhatever/lib/python2.7/site-packages/pyarrow/parquet.py", line 941, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/desidero/tmp/pywhatever/lib/python2.7/site-packages/pyarrow/parquet.py", line 150, in read
    nthreads=nthreads)
  File "_parquet.pyx", line 734, in pyarrow._parquet.ParquetReader.read_all
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

Hi @kelly,
I use Parquet datasets extensively. All the log files from the last 4 years are stored in Parquet format on S3, and I'm happy that Dremio lets me query them.
I expect all the Parquet files in my data lake to be compatible, i.e. readable by Dremio, PyArrow, AWS Athena, Apache Impala, Hive and others.
Also, at the moment I still face some issues with Dremio that prevent some queries from using the accelerations, so I need some aggregated datasets in Parquet format in order to get acceptable latency on reporting queries.
So for now, to generate a Parquet file matching my needs, I use turbodbc to fetch data from a VDS in Arrow format, which I then save as Parquet on S3.
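
Here's a minimal sketch of that workflow (the DSN, query and bucket/key are placeholders; I'm assuming turbodbc's Arrow support and s3fs for the S3 write):

import pyarrow.parquet as pq
import s3fs
import turbodbc

# Connect to Dremio over ODBC (the DSN name is a placeholder)
connection = turbodbc.connect(dsn="Dremio")
cursor = connection.cursor()

# Query the VDS and fetch the result directly as a pyarrow.Table
cursor.execute("SELECT * FROM my_space.my_vds")
table = cursor.fetchallarrow()

# Write the Arrow table to S3 as Parquet (bucket and key are placeholders)
fs = s3fs.S3FileSystem()
with fs.open("my-bucket/aggregates/my_vds.parquet", "wb") as f:
    pq.write_table(table, f)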

I am also hitting this. Why would you use ODBC for REPL work?

And I just tried upgrading to the latest pandas and pyarrow, but it's still the same issue so far. There are sometimes issues with the deprecated timestamp formats etc. in Parquet writes, but usually that shows up going from pyarrow -> Spark readers.
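
For reference, on the pyarrow side that timestamp compatibility is controlled by write options along these lines (a hedged example; table is any pyarrow.Table and the file name is a placeholder):

import pyarrow.parquet as pq

# Write int96 timestamps for older Spark/Impala readers; flavor="spark"
# also applies Spark-compatible schema adjustments.
pq.write_table(
    table,
    "compat_example.parquet",
    use_deprecated_int96_timestamps=True,
    flavor="spark",
)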

And for reference, this is data produced by Dremio's download-as-Parquet function.

Any chance you could share an example so we can reproduce here?

I just saved some data as Parquet from the UI, then tried to load it via:

import pandas as pd
pd.read_parquet('filename.parquet')

This is using the pyarrow engine. Is there a test suite in Dremio? It could be a good time to add a UI -> to_parquet -> read_parquet -> to Dremio round-trip test.
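
The Dremio-facing steps would need to be driven by Dremio itself, but a rough sketch of the local half of that round trip (the file name is a placeholder) could be:

import pandas as pd

# Write a small frame with the pyarrow engine and read it back;
# the UI download and the re-upload to Dremio are out of scope here.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df.to_parquet("roundtrip.parquet", engine="pyarrow")
df2 = pd.read_parquet("roundtrip.parquet", engine="pyarrow")
assert df.equals(df2)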

@doron can you take a look at this?

Running into the same issue.
Would you know if this is fixed in 3.0 with the new CTAS syntax?

With regards,
Dorian

Still experiencing the same issue

Apparently, as per the 3.2.1 release notes, all the Parquet files generated by Dremio 3.2 are readable by PyArrow.
That's a huge one! Thanks.