Reading Dremio's parquet files from python

romain · July 18, 2019, 9:18am

Hi

Over the last year, I’ve been successfully generating parquet files from python and querying them with Dremio. All of this works perfectly.

However, some of these tables are large denormalized files that take forever to create in python. I was considering delegating the file creation to Dremio, and was hoping to be able to read the files back from both Dremio and python.

So I created a table from Dremio using
CREATE TABLE "TESTS_CTAS"."CTAS_V_COUNTRY"
AS SELECT * FROM PATH_TO_VDS.V_COUNTRY

This generates a folder CTAS_V_COUNTRY on the filesystem, which contains two files: 0_0_0.parquet and 0_0_0.parquet.crc

I was hoping to read back the file using pyarrow. I can access the schema and the metadata using
import os
import pyarrow.parquet as pq
f = pq.ParquetFile(os.path.join(ctas_path, 'CTAS_V_COUNTRY', '0_0_0.parquet'))
f.metadata, f.schema

However, reading the file
pq.read_table(os.path.join(ctas_path, 'CTAS_V_COUNTRY', '0_0_0.parquet'))
fails and yields the following error
ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

Of course, I could read the data using pyodbc, but this is painfully slow. I was hoping that the newer Flight API could help, but I couldn’t get the approach described in your blog to work: https://www.dremio.com/is-time-to-replace-odbc-jdbc/

import pyarrow.flight as flt
c = flt.FlightClient.connect("localhost", 47470)

yields
TypeError: expected bytes, int found

I’m not sure how to pass the port, or which port I should give. In any case I have guessed wrong, since using
c = flt.FlightClient.connect("localhost", (47470).to_bytes(2, byteorder='big'))
later fails on
fi = c.get_flight_info(fd)
->
ArrowIOError: gRPC failed with error code 14 and message: DNS resolution failed

What would you recommend for creating parquet files using Dremio and reading them back fast in python (both Dremio and my python code run on the same server)?

Thanks for your help,
Romain

PS: I use Dremio 3.1.11-201904261857420193-c674472 and pyarrow 0.14.0

dfleckinger · July 29, 2019, 1:37pm

You should try

c = flight.FlightClient.connect('grpc+tcp://localhost:47470')
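
To spell out what changes here: the host and port go into a single location URI with a scheme, rather than being passed as two separate arguments. Below is a minimal sketch of that idea, not a tested Dremio recipe. The helper names `flight_uri` and `connect_to_dremio` are made up for illustration, port 47470 is simply the one from the question, and `flight.connect` is assumed to be available (it is in recent pyarrow releases; on 0.14 the call above is the equivalent).

```python
def flight_uri(host: str, port: int, scheme: str = "grpc+tcp") -> str:
    """Build a Flight location URI, e.g. grpc+tcp://localhost:47470."""
    return f"{scheme}://{host}:{port}"


def connect_to_dremio(host: str = "localhost", port: int = 47470):
    """Hypothetical helper: open a Flight client against the given host and port."""
    # Imported lazily so that flight_uri() can be used without pyarrow installed.
    import pyarrow.flight as flight

    # Assumption: a recent pyarrow, where flight.connect() accepts a URI string.
    # On pyarrow 0.14, use flight.FlightClient.connect(uri) instead.
    return flight.connect(flight_uri(host, port))


if __name__ == "__main__":
    print(flight_uri("localhost", 47470))
```

Whether 47470 is the right port depends on how the Flight endpoint is configured on your Dremio install, and the server may also require authentication before `get_flight_info` succeeds.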
