How to use Arrow Flight to read data from a distributed Dremio cluster

We are trying to create a Python client that executes a SQL query and reads the result from a Dremio cluster. We were able to fetch the desired result successfully from a single endpoint.

We referred to the Python client program in this Git repo:

However, we need some insight into how to read the data from multiple endpoints, i.e., the Non-Distributed Client + Distributed Server scenario.

Could you please share some details on this?


I am not able to follow your question. Even though Dremio is distributed, you always hit the coordinator.

In the client example you shared, at line 117 [ ],

reader = client.do_get(flight_info.endpoints[0].ticket)

we are reading the query result from one endpoint.

I was referring to the scenario: Message Flow: Non-Distributed Client + Distributed Server [ ]

If the data is available at multiple endpoints, do we need to iterate through all the “flight_info.endpoints”, get the tickets, and fetch the result using each ticket? If so, how do we do that? Any reference for that scenario would be helpful.

Could you please provide an update on this?

@anupam When you referred to “multiple endpoints”, did you mean the Dremio cluster has multiple coordinators? Could you briefly describe what your Dremio cluster setup is like? Thank you.

Our Dremio cluster has only a single coordinator.

In my current use case, the Python client is reading the query result from a single endpoint which is returned by the FlightInfo object. This is working fine.

However, I want to explore the scenario described in this section --> Message Flow: Non-Distributed Client + Distributed Server [ ].
As per the description, the FlightInfo object returns multiple endpoint locations and their respective tickets.

So, in the context of this scenario:

  • does the client connect to multiple endpoint locations using the respective tickets?
  • should the client take care of reading the streams in parallel from each location?
  • or should the client only talk to the coordinator and read the stream from one endpoint?

In the scenario where there are multiple coordinators, you do not need to iterate through all the endpoints. If the query is executed on a coordinator different from the one that generated the query plan, the query plan will be generated again before the query is executed. This incurs double query planning work but does not prevent the query from being executed.

To avoid the double query planning work, we recommend setting session affinity when Dremio is deployed on Kubernetes with multiple coordinators, so that all TCP connections from a particular client IP are routed to the same Dremio coordinator and the query plan is not generated twice. You can refer to the recommendation in the dremio-cloud-tools GitHub repository: Hope this helps.
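At the Kubernetes level, session affinity is configured on the Service that fronts the coordinators. A hedged sketch of what that looks like — the service name, selector, and labels below are illustrative and must be matched to your own deployment (see the dremio-cloud-tools charts for the actual manifests):

```yaml
# Illustrative only: a Service fronting the Dremio coordinators with
# client-IP session affinity, so each client IP sticks to one coordinator.
apiVersion: v1
kind: Service
metadata:
  name: dremio-client          # illustrative name; match your deployment
spec:
  type: LoadBalancer
  sessionAffinity: ClientIP    # pin each client IP to one coordinator pod
  selector:
    app: dremio-coordinator    # illustrative selector
  ports:
    - name: flight
      port: 32010              # Dremio's default Arrow Flight port
```

`sessionAffinity: ClientIP` is standard Kubernetes Service behavior; only the Dremio-specific names above are assumptions.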

Thanks for your response.
So the client will always connect to a single coordinator and read the result from that coordinator's endpoint.
Does that mean:

  • the client will always read the streams sequentially from the coordinator endpoint, OR
  • the flight can internally be composed of parallel streams, OR
  • the client can read the streams in parallel; if so, then how?

We have not implemented stream parallelization in Dremio’s Flight server endpoint yet; we plan to add this feature in the future, though.

It seems there is some confusion here.
What I understood from your comment is that we can’t read streams in parallel from a single endpoint; that part is clear.

But I was asking about reading the result streams in parallel from different executor locations in the Python client application, as depicted in the attached screenshot. Please confirm whether this scenario is supported today; if so, additional references or a sample client program would be helpful.