ODBC Client unaware that not all rows were returned when a timeout occurs

Hey all, this feels like a bug report or enhancement request – if this isn’t the right place to discuss it, let me know.

The problem I see is that the Dremio RunQueryResponseHandler (link) manages a timeout and stops making data available to consumers once the timeout has occurred. As far as I can tell, this is completely opaque to ODBC clients (and maybe others?): once a timeout has occurred, no new data will be made available to the client, but the client can continue reading and receives no error. As a client, it feels strange (if not wrong) to issue a query, not get all the rows, and also not receive any error.
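To make the symptom concrete, here is a minimal sketch of what a client in our position observes, written against Erlang’s built-in :odbc application from Elixir (the DSN, credentials, and table name below are placeholders, not our real setup):

```elixir
# Minimal sketch of the client side, using Erlang's built-in :odbc application
# from Elixir. DSN, credentials, and table name are placeholders.
:odbc.start()
{:ok, conn} = :odbc.connect(~c"DSN=Dremio;UID=user;PWD=secret", binary_strings: :on)

# Ask for every row in a large table.
{:selected, _cols, rows} = :odbc.sql_query(conn, ~c"SELECT * FROM source.big_table")

# If the server-side readiness timeout fires mid-stream, this count can be far
# below the table's real row count, yet no {:error, _} tuple is ever returned.
IO.puts("rows received: #{length(rows)}")

:odbc.disconnect(conn)
```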

At work I’m using Dremio to query some fairly large datasets – on the order of 200k rows up to 1 million. We’re using an ODBC driver with Elixir – a language that promotes high availability for real-time systems.

What I observed this week is that one of our processes that queries Dremio was stuck – it was issuing the same query in a loop. It was querying a large-ish dataset (200k rows) and keeping track of the last value of an updated_at column that we use as a cursor. However, because of a bulk update on the underlying table, we had 122k rows with the same updated_at time, and Dremio was marking our job as failed after roughly 80k rows had been read, never giving us more than those ~80k rows. Because we never managed to read the full 122k rows with the same value, our process’s cursor stayed the same, and it tried over and over again to query these 122k rows without ever succeeding. This was a very strange and difficult bug to track down.
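For context, the loop driving this looks roughly like the sketch below (module, table, and column names are illustrative, not our actual code, and it assumes at least one row comes back). The key point is that the cursor can only advance past a given updated_at once every row carrying that value has been read:

```elixir
defmodule CursorSync do
  # One pass of the sync: fetch everything at or after the cursor and return
  # the newest updated_at seen, which the caller uses as the next cursor.
  # Rows come back from :odbc as tuples, e.g. {id, updated_at}.
  def sync_once(conn, cursor) do
    sql =
      ~c"SELECT id, updated_at FROM source.big_table " ++
        ~c"WHERE updated_at >= '#{cursor}' ORDER BY updated_at"

    {:selected, _cols, rows} = :odbc.sql_query(conn, sql)

    # ... process the rows here ...

    # When 122k rows share one updated_at but Dremio stops sending after ~80k
    # of them (surfacing no error), this value never moves past that
    # timestamp, so the next call issues exactly the same query again.
    rows |> List.last() |> elem(1)
  end
end
```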

What was surprising was that even after the job was cancelled, our ODBC client was still able to continue reading data. On closer investigation I could see that the TCP socket was still open and that we were still reading data from Dremio.

Are there any recommendations for how we can address this, or work around it?

@leftsaidtim The default timeout is 50 seconds.

Is it possible to investigate what is happening on the client side first? By arbitrarily changing this value without understanding the client-side root cause, we could be sweeping something important under the rug.

50 seconds is a long time to process each batch returned by the server. Increasing flight.client.readiness.timeout.millis delays the notification of the timeout and the release of the connection should something be amiss on the client side.

Do you have the same issue even on an aggregated query that returns fewer than, say, 10k rows?

Thanks
Bali

Thanks for your response @balaji.ramaswamy, I do appreciate it. To answer your question, we do not observe the same issue for aggregated queries that return fewer than 10k rows.

Regarding the time to process a batch, I agree that 50 seconds seems quite long. Perhaps there’s a problem in how the underlying Apache Arrow Flight query handling and our ODBC driver are interacting that’s causing this.

From the perspective of ODBC I don’t believe there is a notion of batches, or if there is, it would be just one big batch. It seems overly optimistic to expect us to stream a workload of hundreds of thousands of rows within 50 seconds.

@leftsaidtim As a troubleshooting step, you can increase the timeout and see if the query succeeds.
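For example (a sketch only; this assumes the key is exposed as a system option that ALTER SYSTEM can change on your Dremio version, 300000 ms is just an illustrative value, and the Support settings in the Dremio UI are the other place to set such keys):

```elixir
# Hedged sketch: bump the readiness timeout to 5 minutes for a test run.
# Whether this key can be changed via ALTER SYSTEM (vs. the Support settings
# in the Dremio UI) should be verified for your Dremio version.
result =
  :odbc.sql_query(
    conn,
    ~c"ALTER SYSTEM SET \"flight.client.readiness.timeout.millis\" = 300000"
  )

IO.inspect(result)
```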