Hey there! I posted about this under another topic I started last week, but I want to make sure it gets noticed.
I’m running into an issue where I’m unable to retrieve the results of a query that returns a relatively large number of records (~1.9 million) from our Dremio Cloud instance when using Arrow Flight in Python.
I’ve tried the available functions (read_all(), read_pandas(), read_chunk()), but no matter which one I use, they eventually fail with the same basic error message. The output below is from my last attempt to retrieve results using read_chunk() in a loop; it failed 170 chunks in. Other runs have made it nearly 500 chunks in before failing, and read_pandas() has returned the entire dataset exactly once.
---------------------------------------------------------------------------
FlightServerError Traceback (most recent call last)
Input In [11], in <cell line: 4>()
4 try:
5 print("Reading chunk #",chunk_num)
----> 6 batch, _ = reader.read_chunk()
7 batches.append(batch)
8 chunk_num += 1
File /lib/python3.8/site-packages/pyarrow/_flight.pyx:903, in pyarrow._flight._MetadataRecordBatchReader.read_chunk()
File /lib/python3.8/site-packages/pyarrow/_flight.pyx:60, in pyarrow._flight.check_flight_status()
FlightServerError: Flight RPC failed with message: Stream removed. gRPC client debug context: {"created":"@1658325230.254288460","description":"Error received from peer ipv4:34.149.92.66:443","file":"/opt/vcpkg/buildtrees/grpc/src/85a295989c-6cf7bf442d.clean/src/core/lib/surface/call.cc","file_line":903,"grpc_message":"Stream removed","grpc_status":2}. Client context: OK
In our Dremio Cloud instance, I can see the jobs that were kicked off to handle the query, and they are largely successful. A couple of jobs failed with “Connection reset by peer” errors, but far more jobs succeed on the Dremio side than retrievals succeed in the Python script.
The code I’m running is more or less example.py but tweaked to accommodate whichever function I’m using to retrieve records.
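For context, here is a minimal sketch of the kind of read_chunk() loop the traceback above comes from. It is not the exact script. collect_chunks and FakeReader are hypothetical names made up for illustration, and FakeReader only stands in for the reader that the Flight client returns so the snippet runs without a server. The sketch assumes that read_chunk() raises StopIteration at the normal end of the stream and that each chunk exposes the record batch as .data.

```python
from types import SimpleNamespace


def collect_chunks(reader):
    """Read chunks until the stream ends or fails.

    Returns (batches, error). error is None when the stream ended normally,
    otherwise it is the exception that interrupted the read.
    """
    batches = []
    error = None
    while True:
        try:
            chunk = reader.read_chunk()
        except StopIteration:
            break  # normal end of stream
        except Exception as exc:  # e.g. the FlightServerError shown above
            error = exc
            break
        batches.append(chunk.data)
    return batches, error


class FakeReader:
    """Hypothetical stand-in for a Flight stream reader (illustration only)."""

    def __init__(self, batches, fail_after=None):
        self._batches = list(batches)
        self._fail_after = fail_after
        self._served = 0

    def read_chunk(self):
        # Simulate a mid-stream failure after `fail_after` chunks.
        if self._fail_after is not None and self._served >= self._fail_after:
            raise RuntimeError("Stream removed")
        if self._served >= len(self._batches):
            raise StopIteration
        data = self._batches[self._served]
        self._served += 1
        return SimpleNamespace(data=data, app_metadata=None)


batches, error = collect_chunks(FakeReader(["b0", "b1", "b2"], fail_after=2))
print(f"Got {len(batches)} chunks before stopping; error: {error!r}")
```

With the real reader, this at least keeps whatever batches arrived before the failure and reports how far the read got, but it does not address why the stream is being dropped.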
So what’s happening here? Can I do something to prevent this “Stream removed” error and retrieve my full dataset?