ODBC vs Arrow flight for python?

Hello everyone,
I was thinking about replacing some pyODBC connection by Arrow Flight used in dremio_client, I am not sure I setup that correctly.

The pyodbc returns 10k rows in ~1s, while the one passing through dremio_client last 20~ sec…
I was kind of expecting the reverse result :slight_smile:

Have you any idea? the code is basically copy/paste from the sample:

Many thanks!

import argparse

import sys

from pyarrow import flight

import dremio_client

import pandas as pd

import streamlit as st

import pyodbc

import time

class DremioClientAuthMiddlewareFactory(flight.ClientMiddlewareFactory):

"""A factory that creates DremioClientAuthMiddleware(s)."""

def __init__(self):

    self.call_credential = []

def start_call(self, info):

    return DremioClientAuthMiddleware(self)

def set_call_credential(self, call_credential):

    self.call_credential = call_credential

class DremioClientAuthMiddleware(flight.ClientMiddleware):

def __init__(self, factory):

    self.factory = factory

def received_headers(self, headers):

    auth_header_key = 'authorization'

    authorization_header = []

    for key in headers:

      if key.lower() == auth_header_key:

        authorization_header = headers.get(auth_header_key)

    self.factory.set_call_credential([

        b'authorization', authorization_header[0].encode("utf-8")])

def connectDremioODBC():

port = 31010

driver = "Dremio Connector"

cnxn = pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver, hostname,port,username,password),autocommit=True)

df = pd.read_sql(sqlquery,cnxn)

print(df)

def connectDremio():

flightport = "32010"

scheme = "grpc+tcp"

connection_args = {}

initial_options = flight.FlightCallOptions(headers=[

        (b'routing-tag', b'test-routing-tag'),

        (b'routing-queue', b'Low Cost User Queries')

    ])

client_auth_middleware = DremioClientAuthMiddlewareFactory()

client = flight.FlightClient("{}://{}:{}".format(scheme, hostname, flightport),

        middleware=[client_auth_middleware], **connection_args)

bearer_token = client.authenticate_basic_token(username, password, initial_options)

flight_desc = flight.FlightDescriptor.for_command(sqlquery)

options = flight.FlightCallOptions(headers=[bearer_token])

schema = client.get_schema(flight_desc, options)

flight_info = client.get_flight_info(flight.FlightDescriptor.for_command(sqlquery), options)


reader = client.do_get(flight_info.endpoints[0].ticket, options)

print(reader.read_pandas())

if name == “main”:

print('Begin')

start = time.time()

username = "yen"

password = "secret"

sqlquery = "SELECT * FROM sql1.SQL1.dbo.my_great_table"

hostname = "localhost"

connectDremio()

# connectDremioODBC()

end = time.time()

print(end - start)

Hi, @YEN !
We’re seeing a similar behavior, but not as bad as your results (20 seconds). We found that both the Flight Java Client as well as the Python client takes “ages” to do a handshake with Dremio (tested with 11, 12, 13 OSS and Enterprise). Ages means in the order of 180-250 ms only for handshaking.
Especially for queries with small result sets, the handshake time often takes longer than actual query execution and fetching the results. Have you tried to analyze in details which of the step(s) in connectDremio() take so much time?

We’ve tried to re-use existing connections, but it turns out that the Load Balancer (Azure LB in that case) kills idle connections after 4 minutes.
Have you found any way to let the Flight client perform a keep-alive (gRPC supports that, but it seems that the Flight client doesn’t expose it)? Is there a way to check the current state of a Flight connection, i.e. if it is still valid or not?

Best, Tim

Thanks a lot for sharing @tid

Hi Tid,

Thanks for your feedback too.

My issue was because I was trying to use the dremio_client, this one didn’t successfully connect to Flight then rollback to ODBC (And I don’t know which one but a slow one).

If I use the flight from pyarrow it seems to works fine:
image

My comparative is not perfect because I only duplicate the random first 10k rows but at least the result is in the right direction…

I will still try to have a workable dremio_client with Flight, it seems to be the flightport who change from 47470 to 32010 but I am not able to specify it in the config file. will keep you updated!

Kind regards,
Y.

2 Likes

One last feedback, my issues of performance was not directly related to the driver but was due to my setup…

(Dremio 13 on docker, a Windows server on HyperV as client and a MSSQL on docker as an external connection)

Sometimes I had results in 1.5s for 10k rows, sometimes in more than 20s … exact same code…
The issue was in fact some DNS resolution problem through the different virtual components. Passing through the direct IP resolved my issue.

Regarding the 2 Flights:

image

Hope it may help others.

Many thanks to Viktor!

Y.

Many thanks, Yen!

We so far have just used Dremio’s sample clients for Java and Python (with legacy auth in that case). We found no statistically relevant performance difference between the two auth modes.
The handshake in our setup consistently takes 180-250ms against Dremio 13 on Azure (deployed via AKS, Azure LB in front).

I think we should repeat the same exercise as you and try a localhost Docker setup… We have a support ticket for our Enterprise subscription open in parallel, but I still wanted to share our experience here :wink:

Best, Tim