There are a number of people in our company who would likely access data this way as well. One simple example: we generate something useful with a CTAS and then want to export the file(s) to another environment that doesn't know what Dremio is or can't query it for whatever reason.
It’s pretty easy to recreate this, but if it helps, here’s a more complete stack trace for the parquet read:
>>> foo = pq.read_table("/Users/desidero/Library/Dremio/data/pdfs/scratch/foo/0_0_0.parquet")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/desidero/tmp/pywhatever/lib/python2.7/site-packages/pyarrow/parquet.py", line 941, in read_table
use_pandas_metadata=use_pandas_metadata)
File "/Users/desidero/tmp/pywhatever/lib/python2.7/site-packages/pyarrow/parquet.py", line 150, in read
nthreads=nthreads)
File "_parquet.pyx", line 734, in pyarrow._parquet.ParquetReader.read_all
File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
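If it helps narrow things down: the footer apparently still parses (it's a page header that fails), so something like this (reusing the path above) can show which row group trips the error:

import pyarrow.parquet as pq

path = "/Users/desidero/Library/Dremio/data/pdfs/scratch/foo/0_0_0.parquet"
pf = pq.ParquetFile(path)
print(pf.metadata)   # file-level metadata from the footer
print(pf.schema)     # schema is usually readable even when a page is not
for i in range(pf.num_row_groups):
    try:
        pf.read_row_group(i)
        print("row group %d: OK" % i)
    except Exception as exc:  # the thrift error should surface here
        print("row group %d: %s" % (i, exc))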
hi @kelly,
I use Parquet datasets extensively. All our log files from the last 4 years are stored in Parquet format on S3, and I'm happy that Dremio makes it possible to query them.
I expect all the Parquet files on my data lake to be compatible, i.e. readable by Dremio, PyArrow, AWS Athena, Apache Impala, Hive, and others.
Also, at the moment I still face some issues with Dremio that prevent some queries from using accelerations, so I need some aggregated datasets in Parquet format to get acceptable latency on reporting queries.
So for now, to generate a Parquet file that matches my needs, I use turbodbc to fetch data from a VDS in Arrow format, which I then save as Parquet on S3, along the lines of the sketch below.
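Roughly like this, for what it's worth (the DSN, query, and bucket path are placeholders, this assumes a turbodbc build with Arrow support, and s3fs is just one way to handle the S3 write):

import pyarrow.parquet as pq
import s3fs
import turbodbc

# Connect to Dremio through its ODBC driver (DSN name is made up).
connection = turbodbc.connect(dsn="Dremio")
cursor = connection.cursor()
cursor.execute("SELECT * FROM my_space.my_vds")  # the VDS to export

# turbodbc can return the whole result set as a pyarrow.Table.
table = cursor.fetchallarrow()

# Write it straight to S3 as Parquet (bucket/key are placeholders).
fs = s3fs.S3FileSystem()
with fs.open("my-bucket/exports/my_vds.parquet", "wb") as f:
    pq.write_table(table, f)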
I just tried upgrading to the latest pandas and pyarrow, but so far I see the same issue. There are sometimes issues with the deprecated timestamp formats etc. in Parquet writes, but those usually show up going from pyarrow to Spark readers.
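For reference, when the timestamp incompatibility does come from the pyarrow side, write_table has options to coerce timestamps for older readers; a minimal sketch (the data and output path are placeholders):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(
    pd.DataFrame({"ts": pd.to_datetime(["2018-01-01 12:00:00"])}))

# coerce_timestamps='ms' drops sub-millisecond precision that some
# older readers can't handle; flavor='spark' (not shown) instead
# writes the deprecated INT96 representation Spark expects.
pq.write_table(table, "out.parquet",
               coerce_timestamps="ms",
               allow_truncated_timestamps=True)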
I just saved some data as Parquet from the UI, then tried to load it via:
pd.read_parquet('filename.parquet')
This is using the pyarrow engine. Is there a test suite in Dremio? It could be a good time to add a UI -> to_parquet -> read_parquet -> to Dremio round-trip test.
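The Dremio leg would need a live server, but the pandas half of that round trip is easy to sketch as a pytest-style test (everything here is hypothetical):

import pandas as pd
import pandas.testing as tm

def test_parquet_round_trip(tmp_path):
    # Stands in for a dataset saved as Parquet from the Dremio UI.
    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    path = str(tmp_path / "roundtrip.parquet")
    df.to_parquet(path, engine="pyarrow")
    result = pd.read_parquet(path, engine="pyarrow")
    tm.assert_frame_equal(result, df)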