Over the last year, I’ve been successfully generating parquet files from python and querying them with Dremio. All of this works perfectly.
However, some of these tables are large denormalized files that take a very long time to create in python. I was considering delegating the file creation to Dremio, and was hoping to be able to read the files back from both Dremio and python.
Hence I’ve created a table from Dremio using
CREATE TABLE "TESTS_CTAS"."CTAS_V_COUNTRY"
AS SELECT * FROM PATH_TO_VDS.V_COUNTRY
This generates a folder CTAS_V_COUNTRY on the filesystem, which contains two files: 0_0_0.parquet and 0_0_0.parquet.crc
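For reference, here is a minimal sketch of how the data files in that folder could be enumerated from python while skipping the .crc checksum files. The helper name list_parquet_parts is hypothetical, and the idea that a larger table might be split into several part files is an assumption on my side (my test table only produced one).

```python
import os


def list_parquet_parts(table_dir):
    """Return the sorted paths of the .parquet files found in a CTAS output folder."""
    return sorted(
        os.path.join(table_dir, name)
        for name in os.listdir(table_dir)
        if name.endswith(".parquet")  # leaves out the .parquet.crc checksum files
    )
```

With the folder described above, list_parquet_parts(os.path.join(ctas_path, "CTAS_V_COUNTRY")) would return only the path of 0_0_0.parquet.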
I was hoping to read back the file using pyarrow. I can access the schema and the metadata using
f = pq.ParquetFile(os.path.join(ctas_path, 'CTAS_V_COUNTRY', '0_0_0.parquet'))
However, reading the file
pq.read_table(os.path.join(ctas_path, 'CTAS_V_COUNTRY', '0_0_0.parquet'))
fails and yields the following error
ArrowIOError: Couldn’t deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
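Since the schema and metadata load fine but the page headers do not, the footer of the file is presumably intact and the problem sits in the data pages. As a sanity check, here is a minimal sketch that only verifies the outer framing of a Parquet file: the 4-byte PAR1 magic at both ends and the little-endian footer length stored just before the trailing magic. The function name check_parquet_framing is hypothetical, and this does not decode any page header, so it cannot explain the error by itself.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def check_parquet_framing(path):
    """Return the footer length in bytes, or raise ValueError if the framing is wrong."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)  # jump to the end to get the file size
        size = f.tell()
        if size < 12:  # leading magic + footer length + trailing magic
            raise ValueError("file too small to be a Parquet file")
        f.seek(-8, 2)  # last 8 bytes: footer length, then magic
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

If this passes on 0_0_0.parquet, it would only confirm what pq.ParquetFile already suggests, namely that the footer is readable and the failure is specific to how the pages were written or are being read.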
Of course, I could read the table using pyodbc, but this is painfully slow. I was hoping that the newer Flight API could help, but I couldn’t get the approach described in your blog post to work: https://www.dremio.com/is-time-to-replace-odbc-jdbc/
import pyarrow.flight as flt
c = flt.FlightClient.connect("localhost", 47470)
TypeError: expected bytes, int found
I’m not sure how to pass the port, or which port I should give. In any case I have guessed wrong, since using
c = flt.FlightClient.connect("localhost", (47470).to_bytes(2, byteorder='big'))
later fails on
fi = c.get_flight_info(fd)
ArrowIOError: gRPC failed with error code 14 and message: DNS resolution failed
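My reading of the two errors is that connect no longer takes the host and port as separate arguments, so the second positional argument is interpreted as something else that must be bytes. Here is a minimal sketch of building a single location string instead. It assumes that pyarrow 0.14 accepts a URI of the form grpc+tcp://host:port as the first argument of FlightClient.connect; I have not confirmed this, nor that Dremio 3.1 exposes a Flight endpoint on port 47470. The helper flight_location is hypothetical.

```python
def flight_location(host, port, scheme="grpc+tcp"):
    """Build a location URI string from a host name and an integer port."""
    return "{}://{}:{}".format(scheme, host, port)


location = flight_location("localhost", 47470)

# Assumption, not verified against a running server:
#   import pyarrow.flight as flt
#   c = flt.FlightClient.connect(location)
```

Passing the port inside the URI would also avoid the to_bytes workaround, which most likely just ends up being treated as unrelated binary data rather than as a port number.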
What would you recommend for creating parquet files with Dremio and reading them back quickly in python? Both Dremio and my python code run on the same server.
Thanks for your help,
PS: I use Dremio 3.1.11-201904261857420193-c674472 and pyarrow 0.14.0