Parquet datasets/files created by CTAS (CREATE TABLE AS SELECT) are not readable by pyarrow

Hi,
I'm not able to read a Parquet dataset that was written to the $scratch directory by a CTAS command.

When I try to read the Parquet file, I always get the following error when running:
import pyarrow.parquet as pq
pq.read_table(parquet_path)

ArrowIOError: Couldn’t deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

I can read the metadata through pq.read_metadata, e.g.:

<pyarrow._parquet.FileMetaData object at 0x312c30628>
created_by: parquet-mr version 1.8.1-fast-201712141648170019-ab0622b (build ab0622b4470fd0fdd2889f21f2d962077d35dfee)
num_columns: 53
num_rows: 122274
num_row_groups: 1
format_version: 1.0
serialized_size: 71561

I can also read the schema through pq.read_schema; only pq.read_table is failing.
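
To summarize the behaviour (the file path below is illustrative; it points at a file produced by CTAS in $scratch):

import pyarrow.parquet as pq

parquet_path = "0_0_0.parquet"   # illustrative: a file written by CTAS

pq.read_metadata(parquet_path)   # works
pq.read_schema(parquet_path)     # works
pq.read_table(parquet_path)      # raises ArrowIOError: Couldn't deserialize thrift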

I'm using the latest pyarrow release, version 0.9.
Any advice?


Can you say why you are trying to access data this way rather than going over ODBC against a VDS?

There are a number of people in our company who would be likely to access data this way as well. One simple example would be if we generate something useful with a CTAS and then we want to export the file(s) to another environment that doesn’t know what Dremio is or isn’t able to query it for whatever reason.

It’s pretty easy to recreate this, but if it helps, here’s a more complete stack trace for the parquet read:

>>> foo = pq.read_table("/Users/desidero/Library/Dremio/data/pdfs/scratch/foo/0_0_0.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/desidero/tmp/pywhatever/lib/python2.7/site-packages/pyarrow/parquet.py", line 941, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/desidero/tmp/pywhatever/lib/python2.7/site-packages/pyarrow/parquet.py", line 150, in read
    nthreads=nthreads)
  File "_parquet.pyx", line 734, in pyarrow._parquet.ParquetReader.read_all
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

Hi @kelly,
I use Parquet datasets extensively. All the log files from the last 4 years are stored in Parquet format on S3, and I'm happy that Dremio lets me query them.
I expect all the Parquet files in my data lake to be compatible, i.e. readable by Dremio, PyArrow, AWS Athena, Apache Impala, Hive and others.
Also, at the moment I still face some issues with Dremio that prevent some queries from using the accelerations, so I need some aggregated datasets in Parquet format in order to get acceptable latency on reporting queries.
So for now, to generate a Parquet file matching my needs, I use turbodbc to fetch data from a VDS in Arrow format, which I then save as Parquet on S3.
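
Here's a minimal sketch of that workflow (the DSN, query and bucket/key are placeholders; I'm assuming turbodbc's Arrow support and s3fs for the S3 write):

import pyarrow.parquet as pq
import s3fs
import turbodbc

# Connect to Dremio over ODBC (the DSN name is a placeholder)
connection = turbodbc.connect(dsn="Dremio")
cursor = connection.cursor()

# Query the VDS and fetch the result directly as a pyarrow.Table
cursor.execute("SELECT * FROM my_space.my_vds")
table = cursor.fetchallarrow()

# Write the Arrow table to S3 as Parquet (bucket and key are placeholders)
fs = s3fs.S3FileSystem()
with fs.open("my-bucket/aggregates/my_vds.parquet", "wb") as f:
    pq.write_table(table, f)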

I am also hitting this. Why would you use ODBC for REPL work?

And I just tried upgrading to the latest pandas and pyarrow, but it's still the same issue so far. There are sometimes issues with the deprecated timestamp formats etc. in Parquet writes, but usually that shows up going from pyarrow -> Spark readers.
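
For reference, on the pyarrow side that timestamp compatibility is controlled by write options along these lines (a hedged example; table is any pyarrow.Table and the file name is a placeholder):

import pyarrow.parquet as pq

# Write int96 timestamps for older Spark/Impala readers; flavor="spark"
# also applies Spark-compatible schema adjustments.
pq.write_table(
    table,
    "compat_example.parquet",
    use_deprecated_int96_timestamps=True,
    flavor="spark",
)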

And for reference, this is data produced by Dremio's download-as-Parquet function.

Any chance you could share an example so we can reproduce here?

I just saved some data as Parquet from the UI, then tried to load it via:

import pandas as pd
pd.read_parquet('filename.parquet')

This is using the pyarrow engine. Is there a test suite in Dremio? It could be a good time to add a UI -> to_parquet -> read_parquet -> to Dremio round-trip test.
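
The Dremio-facing steps would need to be driven by Dremio itself, but a rough sketch of the local half of that round trip (the file name is a placeholder) could be:

import pandas as pd

# Write a small frame with the pyarrow engine and read it back;
# the UI download and the re-upload to Dremio are out of scope here.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df.to_parquet("roundtrip.parquet", engine="pyarrow")
df2 = pd.read_parquet("roundtrip.parquet", engine="pyarrow")
assert df.equals(df2)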

@doron can you take a look at this?

Running into the same issue.
Would you know if this is fixed in 3.0 with the new CTAS syntax?

With regards,
Dorian

Still experiencing the same issue

Apparently, as per the 3.2.1 release notes, all the Parquet files generated by Dremio 3.2 are readable by PyArrow.
That's a huge one! Thanks.