Connect to Dremio Cloud using Arrow Flight, from Jupyter Notebook?

Hi, all. First time using the new Arrow Flight driver to connect to Dremio instead of ODBC and I’m having some problems. Hoping someone can help me understand what’s going on.

I was able to download and run the example script, no problem:

$ ./example.py -host data.dremio.cloud -port 443 -tls -user '$token' -pat $(cat ~/Documents/pwvault/dremio_pat.txt) -query 'SELECT * FROM "Jeremy Experiments"."UJJ-QuestionAnswer_decoding"'
[INFO] Enabling TLS connection
[INFO] Trusted certificates provided
[INFO] Authentication skipped until first request
[INFO] Query: SELECT * FROM "Jeremy Experiments"."UJJ-QuestionAnswer_decoding"
[INFO] GetSchema was successful
[INFO] Schema: <pyarrow._flight.SchemaResult object at 0x7f98cc5beb90>
[INFO] GetFlightInfo was successful
[INFO] Ticket: <Ticket b'\n@SELECT * FROM "Jeremy Experiments"."UJJ-QuestionAnswer_decoding"\x12Z\nX\n@SELECT * FROM "Jeremy Experiments"."UJJ-QuestionAnswer_decoding"\x10\x04\x1a\x12\t\xca\xef\xdd\xa9\x94\xea0\x1d\x11\x00hz\xdb/\xb1\x1f\x95'>
[INFO] Reading query results from Dremio
QuestionId Answer Decoded_Answer LanguageId … Sequence AnsPrecode ShortAnswer AnswerDesc
[data redacted :slight_smile: ]
[247892 rows x 9 columns]

But I’m not able to import this script and use these functions, either in a Jupyter Notebook (my preference) or from an interactive Python session. The Jupyter Notebook kernel crashes when it tries to run the connect function, with no output or error messages. From an interactive Python session, running the same code, I get:

$ python
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import certifi
>>> # Downloaded from https://github.com/dremio-hub/arrow-flight-client-examples/blob/main/python/example.py
>>> import example
>>> # Confirmed these parameters by printing some debugging output from example.py
>>> hostname = 'data.dremio.cloud'
>>> port = 443
>>> username = '$token'
>>> password = 'dremio123' # default; this setup will use PAT instead
>>> with open('/home/jeremy/Documents/pwvault/dremio_pat.txt','r') as pat_file:
...     pat_or_auth_token = pat_file.read()
...
>>> tls = True
>>> trusted_certificates = certifi.where() # path to certifi's CA bundle, cacert.pem
>>> disable_server_verification = False
>>> engine = None
>>> session_properties = None
>>> query = 'SELECT * FROM "Jeremy Experiments"."UJJ-QuestionAnswer_decoding"'
>>> # Connect to Dremio Arrow Flight server endpoint.
>>> example.connect_to_dremio_flight_server_endpoint(hostname, port, username, password,
...     query, tls, trusted_certificates,
...     disable_server_verification, pat_or_auth_token,
...     engine, session_properties)
[INFO] Enabling TLS connection
[INFO] Trusted certificates provided
[INFO] Authentication skipped until first request
[INFO] Query: SELECT * FROM "Jeremy Experiments"."UJJ-QuestionAnswer_decoding"
E0713 13:54:39.115737425 4281 call.cc:783] validate_metadata: {"created":"@1657738479.115716429","description":"Illegal header value","file":"/opt/vcpkg/buildtrees/grpc/src/85a295989c-6cf7bf442d.clean/src/core/lib/surface/validate_metadata.cc","file_line":55,"offset":71,"raw_bytes":"42 65 61 72 65 72 20 57 64 69 7a 68 48 58 70 51 63 69 4f 68 47 61 75 34 6e 71 35 6d 49 4b 70 65 69 7a 4f 39 6a 65 46 35 65 34 6e 67 69 75 32 45 36 4e 52 57 62 53 32 4d 39 69 34 31 51 55 6e 45 70 6f 66 4e 41 3d 3d 0a 'Bearer WdizhHXpQciOhGau4nq5mIKpeizO9jeF5e4ngiu2E6NRWbS2M9i41QUnEpofNA==.'\u0000"}
E0713 13:54:39.115793665 4281 call_op_set.h:980] assertion failed: false
Aborted

Does it not like how I’m defining the variables directly in the interactive version vs. the example script, which is parsing values from the command line? Otherwise I’m lost…

Just glancing at your output, it looks like it might be related to the token value being used. Perhaps the value of pat_or_auth_token is malformed (note the trailing . and \u0000 after the base64 string in the error).
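One way to check a header value for this kind of problem is to scan it for bytes outside printable ASCII. The sketch below assumes, based on the gRPC metadata rules as generally documented, that ASCII header values may only contain bytes 0x20 through 0x7E. The helper name find_illegal_header_bytes and the sample token are made up for illustration. In the log above, raw_bytes ends in 0a (a newline) and the reported offset of 71 is the position of that byte.

```python
def find_illegal_header_bytes(value: bytes):
    """Return (offset, byte) pairs that fall outside printable ASCII 0x20-0x7E."""
    return [(i, b) for i, b in enumerate(value) if not 0x20 <= b <= 0x7E]


# Hypothetical header value with the same shape as the one in the log:
# "Bearer " + token + trailing newline (the token here is invented).
header = b"Bearer abc123==\n"

for offset, byte in find_illegal_header_bytes(header):
    print(f"offset {offset}: 0x{byte:02x} ({chr(byte)!r})")
# prints: offset 15: 0x0a ('\n')
```

An empty result means the value contains no control characters or non-ASCII bytes, at least by this check.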

Thanks, that was it: a trailing newline in the file containing my PAT. Apparently that doesn’t hurt the Python script when run from a command line, but does affect it elsewhere. I created a new file for my (new) PAT with no trailing newline, and everything runs just fine in my notebook now.
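A likely explanation for the difference: shell command substitution, $(cat file), removes trailing newlines before the value reaches the script, while pat_file.read() in Python returns the file contents unchanged. A minimal sketch of reading the token so that the file can keep its trailing newline follows. The helper name read_pat is hypothetical and not part of example.py.

```python
from pathlib import Path


def read_pat(path):
    """Read a personal access token from a file, dropping surrounding whitespace."""
    return Path(path).read_text().strip()


# Usage, with the path from the session above:
# pat_or_auth_token = read_pat('/home/jeremy/Documents/pwvault/dremio_pat.txt')
```

Stripping both ends also covers a stray carriage return or leading space, which would be just as illegal in the header as the newline.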

Also, thank you for pointing out I accidentally sent my PAT to the entire Dremio community. :scream: That PAT is no longer valid!

Next issue–I am often receiving a “Stream removed” error when I submit queries. I don’t have a single query or notebook in which this error consistently happens.

I adapted Dremio’s example.py into a script I can import and use in a Jupyter Notebook. This works well, unless I receive this error:

[INFO] Enabling TLS connection
[INFO] Trusted certificates provided
[INFO] Authentication skipped until first request
[INFO] Query:  [query]
[INFO] GetSchema was successful
[INFO] Schema:  <pyarrow._flight.SchemaResult object at 0x7efbaf78bb50>
[INFO] GetFlightInfo was successful
[INFO] Ticket:  <Ticket b'\n4[query]\x12N\nL\n4[query]\x10\x04\x1a\x12\t\xeeM\x8c\xc3\x06G)\x1d\x11\x00x\x87\xdd~rL\xc8'>
[INFO] Reading query results from Dremio
[ERROR] Exception: FlightServerError('Flight RPC failed with message: Stream removed. gRPC client debug context: {"created":"@1658239277.030677703","description":"Error received from peer ipv4:34.149.92.66:443","file":"/opt/vcpkg/buildtrees/grpc/src/85a295989c-6cf7bf442d.clean/src/core/lib/surface/call.cc","file_line":903,"grpc_message":"Stream removed","grpc_status":2}. Client context: OK')

(query text replaced with “[query]”)

What’s happening here?

This query would return a large amount of data–do I need to chunk it out?
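Nothing in this thread confirms what causes "Stream removed", so the following is only something to try, not a known fix. Instead of materializing the whole result in one call, the stream can be consumed one record batch at a time. The sketch assumes the reader behaves like the object returned by pyarrow.flight.FlightClient.do_get(), whose read_chunk() returns an object with a .data attribute and raises StopIteration at the end of the stream. The helper name iter_record_batches is made up, and the names client, flight_info, and options in the usage comment are assumed to match the variables in example.py.

```python
def iter_record_batches(reader):
    """Yield record batches one at a time from a Flight stream reader.

    Assumes reader.read_chunk() returns an object with a .data attribute
    and raises StopIteration when the stream is exhausted.
    """
    while True:
        try:
            chunk = reader.read_chunk()
        except StopIteration:
            return
        # Skip chunks that carry no record batch (for example, metadata only).
        if chunk.data is not None:
            yield chunk.data


# Usage sketch (variable names are assumptions, not confirmed by this thread):
# reader = client.do_get(flight_info.endpoints[0].ticket, options)
# total_rows = 0
# for batch in iter_record_batches(reader):
#     total_rows += batch.num_rows  # or write each batch to disk here
```

Processing batch by batch keeps memory use bounded on the client and shows how far the stream got before any failure, which may help narrow down whether the error depends on result size.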