How to use Arrow Flight to read data from a distributed Dremio cluster

anupam · December 24, 2020, 1:25am

We are trying to create a Python client that executes a SQL query and reads the result from a Dremio cluster. We were able to fetch the desired result successfully from a single endpoint.

We referred to the Python client program in this Git repo: https://github.com/dremio-hub/arrow-flight-client-examples/tree/main/python

However, we need some insight into how to read the data from multiple endpoints, i.e. the Non-Distributed Client + Distributed Server scenario.

Could you please share some details on this?

balaji.ramaswamy · December 26, 2020, 7:30pm

@anupam

I am not able to follow your question. Even though Dremio is distributed, you always hit the coordinator.

anupam · December 29, 2020, 10:07pm

As in this client example which you shared, line 117 [ https://github.com/dremio-hub/arrow-flight-client-examples/blob/main/python/example.py ],

reader = client.do_get(flight_info.endpoints[0].ticket)

we are reading the test result from one endpoint.

I was referring to this scenario: Message Flow: Non-Distributed Client + Distributed Server [ https://www.dremio.com/is-time-to-replace-odbc-jdbc/ ]

If the data is available at multiple endpoints, do we need to iterate through all the “flight_info.endpoints”, get the tickets, and fetch the result using each ticket? If so, how do we do that? Any reference for that scenario would be helpful.
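
A minimal sketch of what iterating over every endpoint could look like with pyarrow, assuming `client` is an already authenticated `pyarrow.flight.FlightClient` and `flight_info` was obtained from `client.get_flight_info(...)` as in the linked example. The helper name `read_all_endpoints` is hypothetical. It relies only on `FlightInfo.endpoints`, `FlightEndpoint.ticket`, `FlightClient.do_get` and the reader's `read_all`, and it sends every ticket to the same client, which is only appropriate when the endpoints do not name other locations.

```python
def read_all_endpoints(client, flight_info):
    """Fetch each endpoint's stream in order and return one result per endpoint.

    Assumption: every ticket can be redeemed on `client` itself, i.e. the
    endpoints do not point at other servers.
    """
    tables = []
    for endpoint in flight_info.endpoints:
        # One do_get call per ticket; each call returns a stream reader.
        reader = client.do_get(endpoint.ticket)
        # read_all() drains the stream (a pyarrow.Table with a real client).
        tables.append(reader.read_all())
    return tables
```

With a real client, the returned tables could then be combined into one with `pyarrow.concat_tables(tables)`, provided they share a schema.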

anupam · January 4, 2021, 10:15pm

Could you please provide an update on this.

Tiffany_Lam · January 6, 2021, 12:28am

@anupam When you referred to “multiple endpoints”, did you mean the Dremio cluster has multiple coordinators? Could you briefly describe what your Dremio cluster setup is like? Thank you.

anupam · January 6, 2021, 1:41am

Our Dremio cluster has only a single coordinator.

In my current use case, the Python client is reading the query result from a single endpoint which is returned by the FlightInfo object. This is working fine.

However I want to explore the scenario described in this section --> Message Flow: Non-Distributed Client + Distributed Server [ https://www.dremio.com/is-time-to-replace-odbc-jdbc/ ].
As per the description, the FlightInfo object returns multiple Endpoint locations and their respective tickets.

So, in the context of this scenario:

Does the client connect to multiple endpoint locations using the respective tickets?
Should the client take care of reading the streams in parallel from each location?
Or should the client only talk to the coordinator and read the stream from one endpoint?

Tiffany_Lam · January 6, 2021, 7:00pm

In the scenario where there are multiple coordinators, you do not need to iterate through all the endpoints. If the query is executed on a coordinator different from the one that generated the query plan, the query plan will be generated again before the query is executed. This incurs double query planning work but does not prevent the query from being executed.

To avoid the double query planning work, we recommend setting session affinity to true when Dremio is deployed on Kubernetes with multiple coordinators, so that all TCP connections from a particular client IP are routed to a specific Dremio coordinator. This prevents the query plan from being generated twice. You can refer to the recommendation in the dremio-cloud-tools GitHub repository: https://github.com/dremio/dremio-cloud-tools/blob/master/charts/dremio_v2/docs/setup/Important-Setup-Considerations.md. Hope this helps.

anupam · January 6, 2021, 7:47pm

Thanks for your response.
So the client will always connect to a single coordinator and read the result from that coordinator's endpoint.
Does that mean:

The client will always read the streams sequentially from the coordinator endpoint, OR
The flight can be internally composed of parallel streams, OR
The client can read the streams in parallel? If so, how?

Tiffany_Lam · January 8, 2021, 1:15am

We have not implemented stream parallelization in Dremio’s Flight Server Endpoint yet, but we plan to add this feature in the future.

anupam · January 8, 2021, 5:46pm

It seems there is some confusion here.
What I understood from your comment is that we can’t read the streams in parallel from a single endpoint. This part is clear.

But I was asking about reading the result streams in parallel from different executor locations with the Python client application, as depicted in the attached screenshot. Please confirm whether this scenario is supported today. If it is, I would like to get some additional reference or a sample client program.
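
For reference, a hedged sketch of the generic Arrow Flight pattern for reading endpoints in parallel. This is not a confirmed Dremio feature: per the reply above, Dremio's Flight endpoint did not implement stream parallelization at the time of this thread. The names `fetch_endpoint` and `read_endpoints_parallel` are hypothetical, and `connect` is a caller-supplied factory (for example `pyarrow.flight.connect`) that opens a client for a location. Authentication for those extra connections is not shown, and sharing `default_client` across threads is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_endpoint(endpoint, default_client, connect):
    """Read one endpoint's stream, from its advertised location if it has one."""
    if endpoint.locations:
        # Assumption: the first advertised location is reachable from the client.
        client = connect(endpoint.locations[0])
    else:
        # No location listed: redeem the ticket on the original service.
        client = default_client
    return client.do_get(endpoint.ticket).read_all()


def read_endpoints_parallel(flight_info, default_client, connect, max_workers=4):
    """Fetch all endpoints concurrently; results keep the endpoint order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(fetch_endpoint, endpoint, default_client, connect)
            for endpoint in flight_info.endpoints
        ]
        # result() blocks until each fetch finishes and re-raises any error.
        return [future.result() for future in futures]
```

Threads are used here because each fetch is I/O-bound. If the server returns a single endpoint, the function simply degenerates to one sequential read.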
