Reading Dremio's parquet files from python


Over the last year, I’ve been successfully generating parquet files from Python and querying them with Dremio. All of this works perfectly.

However, some of these tables are large denormalized files and take forever to create in Python. I was considering delegating the file creation to Dremio, and was hoping to be able to read the files back both from Dremio and from Python.

Hence I’ve created a table from Dremio using

This generates a folder CTAS_V_COUNTRY on the filesystem, which contains two files: 0_0_0.parquet and 0_0_0.parquet.crc

I was hoping to read back the file using pyarrow. I can access the schema and the metadata using
import os
import pyarrow.parquet as pq
f = pq.ParquetFile(os.path.join(ctas_path, 'CTAS_V_COUNTRY', '0_0_0.parquet'))
f.metadata, f.schema

However, reading the file
pq.read_table(os.path.join(ctas_path, 'CTAS_V_COUNTRY', '0_0_0.parquet'))
fails and yields the following error
ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

Of course, I could read the file using pyodbc, but this is painfully slow. I was hoping that the newer Flight API could help, but I couldn’t get the example from your blog to work:

import pyarrow.flight as flt
c = flt.FlightClient.connect("localhost", 47470)

TypeError: expected bytes, int found

I’m not sure how to pass the port, or which port I should give. In any case I have guessed wrong, since using
c = flt.FlightClient.connect("localhost", (47470).to_bytes(2, byteorder="big"))
later fails on
fi = c.get_flight_info(fd)
ArrowIOError: gRPC failed with error code 14 and message: DNS resolution failed

What would you recommend for creating parquet files with Dremio and reading them back quickly in Python (both Dremio and my Python code run on the same server)?

Thanks for your help,

PS : I use Dremio 3.1.11-201904261857420193-c674472 and pyarrow 0.14.0

You should try

c = flight.FlightClient.connect('grpc+tcp://localhost:47470')
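
The TypeError in the question appears to come from passing the host and the port as two separate arguments, whereas the client takes a single location string that carries the scheme, host and port together. A minimal sketch of building that string is below. Note that flight_location is a hypothetical helper and not part of pyarrow, and 47470 is simply the port quoted in this thread, so it may differ on another installation.

```python
def flight_location(host: str, port: int, scheme: str = "grpc+tcp") -> str:
    """Build a Flight location URI of the form scheme://host:port.

    Hypothetical helper for illustration only; it is not a pyarrow function.
    """
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"{scheme}://{host}:{port}"


# Host and port are the values quoted in this thread (assumed, not verified).
location = flight_location("localhost", 47470)
print(location)

# With pyarrow installed, the whole URI goes in as one argument:
#   import pyarrow.flight as flight
#   c = flight.FlightClient.connect(location)
```

Keeping the port inside the URI avoids the second positional argument altogether, which is where the integer was being rejected.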