Arrow Flight protocol

According to @tshiran’s blog post (https://www.dremio.com/is-time-to-replace-odbc-jdbc/), it’s time to replace ODBC/JDBC with Arrow Flight.
Any insight into when we will be able to query Dremio over this protocol instead of ODBC, for instance using pyarrow as shown in the example?

Thanks

@rymurr implemented Arrow Flight access to Dremio here: https://github.com/dremio-hub/dremio-flight-connector

Try it out and let us know what you think.

@dfleckinger thanks for your interest! Please let us know how you get on with the Flight plugin. It is still in beta so any feedback is extremely useful!

Thanks, I saw this implementation.

However, I could not make it work. Which versions of Dremio implement the Arrow Flight server protocol, and on which port is the Flight server supposed to listen?
I have tested Dremio CE versions 3.1.11 and 3.3.1, and I don’t see 47470 in the list of listening ports.

The flight plugin currently works on 3.3.1 and above only. You would have to manually:

  • check out the repo: git clone https://github.com/dremio-hub/dremio-flight-connector.git
  • build the plugin: mvn clean install
  • move the plugin (the shaded jar that is created by maven) to dremio/jars

After those steps you should see the flight connector listening on 47470.
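A quick way to confirm the last step is to try a plain TCP connection to the port. The following is only a minimal sketch: the helper name is_port_open is made up for illustration, and it only shows that something accepts connections on the port, not that it is actually the Flight service.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration only. It does not speak gRPC or
    Flight, so a True result only means that some process is listening.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, is_port_open("localhost", 47470), run on the coordinator node, should return True once the plugin jar has been picked up after a restart ("localhost" is a placeholder for your coordinator host).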

This manual process is only needed while the Flight connectivity is in beta. We hope to make it a part of the product by the end of this year.

Hi @rymurr, thanks a lot for your contribution!
After reading the blog post, I thought that the Flight server was included in the Dremio build out of the box, which is apparently not the case.
So thanks for the clarification on that.

I gave this a try today and it fails, apparently because Dremio doesn’t support an HTTP/2 listener. Is there a way to turn that on?

---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-2-55d446eff957> in <module>
     51 
     52 client = flight.FlightClient.connect('grpc+tcp://' + dremio_coordinator + ':47470')
---> 53 client.authenticate(HttpDremioClientAuthHandler(username, password))
     54 start_time = time.time()
     55 cmd = Command(query=sql, parallel=False, coalesce=False, ticket=b'')

/usr/local/Anaconda3-5.3.1-Linux-x86_64/envs/pyspark-3-6/lib/python3.6/site-packages/pyarrow/_flight.pyx in pyarrow._flight.FlightClient.authenticate()

/usr/local/Anaconda3-5.3.1-Linux-x86_64/envs/pyspark-3-6/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: gRPC failed with error code 14 and message: Trying to connect an http1.x server

Hey David,

Thanks for trying out Flight! Are you by any chance running behind a firewall or proxy? Perhaps on a Kubernetes cluster with ingress set up? Not all reverse proxies and ingress controllers play nicely with HTTP/2.

By default the Flight listener (which uses gRPC under the hood) is set up to listen on the same port for both HTTP/2 and HTTP/1.x, but not all reverse proxies handle that well. If you share a bit of info on your environment, I can point you towards the relevant documentation.

I have a use case where I want to construct my own Apache Arrow Table and then give this table to Dremio. Will this use case be supported? If this request is in scope, is there an example that I can look at?

Hello, David,

I ran into the same problem. For me it was the SQL statement. I had the VDS in my private space and didn’t address the table correctly.

Then I created a new space (e.g. analytics) and copied the VDS to it. After that the query worked as expected: select * from analytics.table

Could you try that?
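To make the path point concrete, here is a small sketch of building a fully qualified, quoted dataset path for the SQL string. The helper name quote_path is made up for illustration, and it assumes that Dremio accepts double-quoted identifiers, that a literal double quote is escaped by doubling it, and that a home space is addressed as "@username". Please check those assumptions against the Dremio SQL documentation.

```python
def quote_path(*parts: str) -> str:
    """Join path components into a double-quoted, dot-separated SQL path.

    Hypothetical helper: assumes double-quoted identifiers and that an
    embedded double quote is escaped by doubling it.
    """
    if not parts:
        raise ValueError("at least one path component is required")
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


# Dataset in a shared space, as in the example above.
sql_space = "select * from " + quote_path("analytics", "table")

# Dataset in a home space; "@oliver" is a placeholder user name.
sql_home = "select * from " + quote_path("@oliver", "table")
```

Quoting every component also avoids problems with names that contain dots, spaces or reserved words.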

Best regards
Oliver

Hey Joseph. Welcome!

Currently the best way to do that is to write the Arrow Table to a parquet file (https://arrow.apache.org/docs/python/parquet.html) in your chosen storage location (data lake, NAS, etc). You can then manually add it as a Datasource to Dremio or alternatively use the REST API (docs here: https://docs.dremio.com/rest-api/) to let Dremio know you added some data.

Adding support to the Flight connector for creating datasets (doPut in Flight terminology) is on the roadmap. That will remove the need for the steps above, but first we want to release a GA version of Flight.

Thanks for a quick response. I was trying to save the parquet step if at all possible.

Would you be willing to take a pull request that implements the feature? And would such a change even be possible without digging into https://github.com/dremio/dremio-oss?

The Dremio Flight connector is currently Apache 2.0 licensed on our Dremio Hub (https://github.com/dremio-hub/dremio-flight-connector) and is distinct from the dremio-oss project. Technically speaking, one should be able to implement doPut fairly easily on that connector. However, it will require some knowledge of the CTAS functionality in Dremio (https://docs.dremio.com/sql-reference/sql-commands/tables.html) to make sure the Arrow buffer from doPut ends up in the right place. Note also that the signature for doPut is changing: https://www.mail-archive.com/dev@arrow.apache.org/msg13191.html

Thank you for the overview.

I’ll start with the Parquet path you suggested and, if I can get time allocated, I’ll work on the creation of Arrow Tables, as this unblocks a large number of our stories.