Arrow Flight protocol

According to @tshiran's blog post (https://www.dremio.com/is-time-to-replace-odbc-jdbc/), it’s time to replace ODBC/JDBC with Arrow Flight.
Any insight into when we will be able to call Dremio using this protocol instead of ODBC, for instance using pyarrow as shown in the example?

Thanks

@rymurr implemented Arrow Flight access to Dremio here: https://github.com/dremio-hub/dremio-flight-connector

Try it out and let us know what you think.


@dfleckinger thanks for your interest! Please let us know how you get on with the Flight plugin. It is still in beta so any feedback is extremely useful!

Thanks, I saw this implementation.

However, I could not make it work, and I was wondering which versions of Dremio implement the Arrow Flight server protocol and on which port the Flight server is supposed to be listening.
I have tested on Dremio CE versions 3.1.11 and 3.3.1, and I don’t see 47470 in the list of listening ports.

The flight plugin currently works on 3.3.1 and above only. You would have to manually:

  • check out the repo: git clone https://github.com/dremio-hub/dremio-flight-connector.git
  • build the plugin: mvn clean install
  • move the plugin (the shaded jar that Maven creates) to dremio/jars

After those steps you should see the flight connector listening on 47470.

This manual process is only needed while the Flight connectivity is in beta. We hope to make it a part of the product by the end of this year.
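One quick way to confirm that something is listening on 47470 after these steps is to attempt a plain TCP connection. The following is a minimal sketch using only the Python standard library; the function name `is_port_open` is hypothetical and not part of Dremio or the connector. A successful connection only shows that a listener exists, not that it speaks Flight.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures
        return False
```

For example, `is_port_open("localhost", 47470)` could be run on the coordinator host itself, where "localhost" is an assumption about where the check is run from.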

Hi @rymurr, thanks a lot for your contribution!
After reading the blog post, I thought the Flight server was included in the Dremio build out of the box, which is apparently not the case.
So thanks for clarifying that.

I gave this a try today and it is failing because Dremio isn’t supporting an HTTP/2 listener. Is there a way to turn that on?

---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-2-55d446eff957> in <module>
     51 
     52 client = flight.FlightClient.connect('grpc+tcp://' + dremio_coordinator + ':47470')
---> 53 client.authenticate(HttpDremioClientAuthHandler(username, password))
     54 start_time = time.time()
     55 cmd = Command(query=sql, parallel=False, coalesce=False, ticket=b'')

/usr/local/Anaconda3-5.3.1-Linux-x86_64/envs/pyspark-3-6/lib/python3.6/site-packages/pyarrow/_flight.pyx in pyarrow._flight.FlightClient.authenticate()

/usr/local/Anaconda3-5.3.1-Linux-x86_64/envs/pyspark-3-6/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: gRPC failed with error code 14 and message: Trying to connect an http1.x server

Hey David,

Thanks for trying out Flight! Are you by any chance running behind a firewall or proxy? Perhaps on a kubernetes cluster with ingress set up? Not all reverse proxies and ingress controllers play nice with http2.

By default the Flight listener (which uses gRPC under the hood) is set up to listen on the same port for both HTTP/2 and HTTP/1.x, but not all reverse proxies handle that well. If you share a bit of info on your environment I can point you towards the relevant documentation.

I have a use case where I want to construct my own Apache Arrow Table and then give this table to Dremio. Is this use case one that will be supported? If this request is in scope, is there an example that I can look at?

Hello, David,

I ran into the same problem. For me it was the SQL statement. I had the VDS in my private space and didn’t address the table correctly.

Then I created a new space (e.g. analytics) and copied the VDS to it. After that it worked as expected: select * from analytics.table

Could you try that?

Best regards
Oliver

Hey Joseph. Welcome!

Currently the best way to do that is to write the Arrow Table to a parquet file (https://arrow.apache.org/docs/python/parquet.html) in your chosen storage location (data lake, NAS, etc). You can then manually add it as a Datasource to Dremio or alternatively use the REST API (docs here: https://docs.dremio.com/rest-api/) to let Dremio know you added some data.

Adding support to the Flight connector for creating datasets (doPut in flight notation) is on the roadmap. That will save the steps above but first we want to get a GA version of flight released.

Thanks for a quick response. I was trying to save the parquet step if at all possible.

Would you be willing to take a pull request that implements the feature? And would such a change even be possible without digging into https://github.com/dremio/dremio-oss?

The Dremio flight connector is currently Apache-2 licensed on our Dremio Hub (https://github.com/dremio-hub/dremio-flight-connector) and is distinct from the dremio-oss project. Technically speaking, one should be able to implement doPut fairly easily on that connector; however, it will require some knowledge of the CTAS functionality in Dremio (https://docs.dremio.com/sql-reference/sql-commands/tables.html) to make sure the Arrow buffer from doPut ends up in the right place. Note also that the signature for doPut is changing: https://www.mail-archive.com/dev@arrow.apache.org/msg13191.html

Thank you for the overview.

I’ll start with the parquet path you suggested and, if I can get time allocated, I’ll work on the creation of Arrow Tables, as this unblocks a large number of our stories.

Hi @rymurr
I have a similar issue, running Dremio in a k8s cluster with ingress set up.
I do see the Flight connector running in the dremio-master container on port 47470, but it’s not accessible from an external Python client.
I did port-forward the k8s dremio-master pod as below:

  • created tcp/47470 of dremio-master-pod in service discovery as a NodePort
  • configured the cluster TCP ingress service for the above service/47470

When I connect from an external Python client I get the error below:

pyarrow._flight.FlightUnavailableError: gRPC returned unavailable error, with message: Connect Failed

Hey smora,

Have you made sure Flight is listening on the external interface for the Docker container? I realised I have introduced a bug which requires you to set -e 'JAVA_EXTRA_OPTS="$JAVA_EXTRA_OPTS -Ddremio.flight.host=0.0.0.0"'. This will be fixed in the next release.

For reference, the Dockerfile I test with is:

FROM dremio/dremio-oss:latest

COPY target/dremio-flight-connector-0.11.0-SNAPSHOT-shaded.jar /opt/dremio/jars/

ENV JAVA_EXTRA_OPTS="$JAVA_EXTRA_OPTS -Ddremio.flight.host=0.0.0.0"
ENTRYPOINT ["bin/dremio", "start-fg"]

Hi @rymurr

Thanks for the suggestion. Now I see the port is accessible via telnet, but the Python client is still not able to connect. Any suggestions for further debugging?

dremio@dremio-master-0:/opt/dremio$ ss -tuln
Netid  State   Recv-Q  Send-Q  Local Address:Port       Peer Address:Port
tcp    LISTEN  0       128     :::31010                 :::*
tcp    LISTEN  0       1       ::ffff:127.0.0.1:41610   :::*
tcp    LISTEN  0       128     :::47470                 :::*
tcp    LISTEN  0       128     :::45678                 :::*
tcp    LISTEN  0       50      :::9047                  :::*
dremio@dremio-master-0:/opt/dremio$

python client error:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2018.1\helpers\pydev\pydevd.py", line 1664, in <module>
    main()
  File "C:\Program Files\JetBrains\PyCharm 2018.1\helpers\pydev\pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm 2018.1\helpers\pydev\pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2018.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/smora/PycharmProjects/o2p/flighttest.py", line 45, in <module>
    client.authenticate(HttpDremioClientAuthHandler(username, password))
  File "pyarrow\_flight.pyx", line 660, in pyarrow._flight.FlightClient.authenticate
  File "pyarrow\error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: gRPC failed with error code 16 and message: 

on linux

Traceback (most recent call last):
  File "./flighttest.py", line 47, in <module>
    client.authenticate(HttpDremioClientAuthHandler(username, password))
  File "pyarrow/_flight.pyx", line 944, in pyarrow._flight.FlightClient.authenticate
  File "pyarrow/_flight.pyx", line 68, in pyarrow._flight.check_flight_status
pyarrow._flight.FlightUnavailableError: gRPC returned unavailable error, with message: Connect Failed

Hey @smora it looks like an authentication problem. How are you authenticating?

I recommend using the python client https://github.com/rymurr/dremio_client:

from dremio_client.flight import query
query(sql, hostname=hostname, port=port, username=username, password=password)
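As a side note on the tracebacks in this thread: the numeric codes in the pyarrow messages are standard gRPC status codes. Code 16 is UNAUTHENTICATED, which is consistent with an authentication problem, and code 14 is UNAVAILABLE, which usually points at connectivity. A small sketch for decoding them, where the helper name `describe_grpc_error` is hypothetical and the message format is assumed to match the tracebacks shown here:

```python
import re

# Standard gRPC status codes and their canonical names
GRPC_STATUS_NAMES = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED", 9: "FAILED_PRECONDITION",
    10: "ABORTED", 11: "OUT_OF_RANGE", 12: "UNIMPLEMENTED", 13: "INTERNAL",
    14: "UNAVAILABLE", 15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}


def describe_grpc_error(message):
    """Extract 'error code N' from an error message and return (N, name), or None."""
    match = re.search(r"error code (\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, GRPC_STATUS_NAMES.get(code, "UNRECOGNIZED")
```

Passing the text of a caught exception, for example `describe_grpc_error(str(exc))`, gives a name to search for in the gRPC documentation.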

Yes @rymurr,
thank you for looking into it. I did use the class provided by you in dremio_client; the connection fails, but I can’t get any more error details other than “error code 16”.

Log from the k8s ingress:

[28/Oct/2019:15:55:44 +0000]TCP2001304490.073

code:

from pyarrow import flight
import pyarrow as pa
import base64
from pyarrow.flight import ClientAuthHandler
from pyarrow.compat import tobytes

class HttpDremioClientAuthHandler(ClientAuthHandler):

    def __init__(self, username, password):
        ClientAuthHandler.__init__(self)
        self.username = tobytes(username)
        self.password = tobytes(password)
        self.token = None

    def authenticate(self, outgoing, incoming):
        outgoing.write(base64.b64encode(self.username + b':' + self.password))
        self.token = incoming.read()
        print(self.token)

    def get_token(self):
        return self.token

username = 'user'
password = 'pwd'
sql = '''select * from sysobj.options'''
hostname = 'host'
port = 47470

# Build the Flight location from the hostname and port defined above
client = flight.FlightClient.connect('grpc+tcp://{}:{}'.format(hostname, port))
client.authenticate(HttpDremioClientAuthHandler(username, password))
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql))
reader = client.do_get(info.endpoints[0].ticket)
batches = []
while True:
    try:
        batch, metadata = reader.read_chunk()
        batches.append(batch)
    except StopIteration:
        break
data = pa.Table.from_batches(batches)
df = data.to_pandas()

using dremio_client library:

from dremio_client.flight import query
output = query(sql=sql, hostname=hostname, port=port, username=username, password=password)
print(output)

I am using dremio_client, but get this error:
ArrowException: Unknown error: gRPC returned unknown error, with message: Stream removed

I have no idea how to further debug this.