Hey all, this feels like a bug report or enhancement request – if this isn’t the right place to discuss this, let me know.
The problem I see is that Dremio’s RunQueryResponseHandler
(link) manages a timeout and stops making data available to consumers once the timeout has occurred. As far as I can tell, this is completely opaque to ODBC clients (and maybe others?) – once the timeout has fired, no new data is made available to the client, but the client can keep reading and receives no error. As a client, it feels strange (if not wrong) to issue a query, not get all the rows back, but also not receive any error.
At work I’m using Dremio to query some fairly large datasets – on the order of 200k rows up to 1 million. We’re using an ODBC driver with Elixir, a language that promotes high availability for real-time systems.
What I observed this week is that one of our processes querying Dremio was stuck, issuing the same query in a loop. It was querying a large-ish dataset (200k rows) and keeping track of the last value of an updated_at column that we use as a cursor. However, because of a bulk update on the underlying table, we had 122k rows with the same updated_at time, and Dremio was marking our job as failed after roughly 80k rows read, never giving us more than those ~80k rows. Because we never managed to read the full 122k rows with the same value, our process’s cursor stayed the same, and it tried over and over to query those 122k rows without ever succeeding. This was a very strange and difficult bug to track down.
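To make the failure mode concrete, here’s a minimal sketch of that polling loop over Erlang’s :odbc application. The table and column names (events, id, updated_at) are hypothetical stand-ins for ours, and the timestamp handling is simplified:

```elixir
defmodule CursorPoll do
  def run(dsn) do
    {:ok, _} = Application.ensure_all_started(:odbc)
    {:ok, conn} = :odbc.connect(to_charlist(dsn), [])
    loop(conn, "1970-01-01 00:00:00")
  end

  defp loop(conn, cursor) do
    # Interpolation for brevity only; real code should bind the value
    # with :odbc.param_query/3.
    sql =
      "SELECT id, updated_at FROM events " <>
        "WHERE updated_at >= '#{format_ts(cursor)}' ORDER BY updated_at"

    case :odbc.sql_query(conn, to_charlist(sql)) do
      {:selected, _cols, [_ | _] = rows} ->
        # If the server stops sending rows partway through, `rows` simply
        # ends early with no error. When all 122k matching rows share one
        # updated_at, the cursor below never changes, so we retry forever.
        {_id, last_updated_at} = List.last(rows)
        loop(conn, last_updated_at)

      {:selected, _cols, []} ->
        :done

      {:error, reason} ->
        {:error, reason}
    end
  end

  # Many ODBC drivers return timestamps as {{y, m, d}, {h, mi, s}} tuples;
  # converting back to a SQL literal is driver-specific and simplified here.
  defp format_ts(ts) when is_binary(ts), do: ts

  defp format_ts({{y, mo, d}, {h, mi, s}}) do
    :io_lib.format(~c"~4..0B-~2..0B-~2..0B ~2..0B:~2..0B:~2..0B", [y, mo, d, h, mi, s])
    |> List.to_string()
  end
end
```

With a loop of this shape, a silently truncated result is indistinguishable from a complete one: the row list just ends early, the last updated_at is unchanged, and the loop repeats indefinitely.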
What was surprising was that even after the job was cancelled, our ODBC client was still able to keep reading data. On closer investigation I could see that the TCP socket was still open and that we were still receiving data from Dremio.
Are there any recommendations for how we can address this, or work around it?
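In case it’s useful context for any answer: one client-side direction I can think of is a composite cursor on (updated_at, id), sketched below with the same hypothetical names as above. It keeps the cursor advancing even when many rows share one updated_at, though it doesn’t fix the missing error itself:

```elixir
defmodule CompositeCursor do
  def run(dsn) do
    {:ok, _} = Application.ensure_all_started(:odbc)
    {:ok, conn} = :odbc.connect(to_charlist(dsn), [])
    loop(conn, {"1970-01-01 00:00:00", 0})
  end

  defp loop(conn, {cursor_ts, cursor_id}) do
    ts = format_ts(cursor_ts)
    # Interpolation for brevity only; real code should bind the values
    # with :odbc.param_query/3.
    sql =
      "SELECT id, updated_at FROM events " <>
        "WHERE updated_at > '#{ts}' " <>
        "OR (updated_at = '#{ts}' AND id > #{cursor_id}) " <>
        "ORDER BY updated_at, id"

    case :odbc.sql_query(conn, to_charlist(sql)) do
      {:selected, _cols, [_ | _] = rows} ->
        # Even if the response is cut off mid-stream, the (updated_at, id)
        # of the last row we did receive is strictly greater than the old
        # cursor, so the next iteration resumes past it instead of stalling.
        {last_id, last_ts} = List.last(rows)
        loop(conn, {last_ts, last_id})

      {:selected, _cols, []} ->
        :done

      {:error, reason} ->
        {:error, reason}
    end
  end

  defp format_ts(ts) when is_binary(ts), do: ts

  defp format_ts({{y, mo, d}, {h, mi, s}}) do
    :io_lib.format(~c"~4..0B-~2..0B-~2..0B ~2..0B:~2..0B:~2..0B", [y, mo, d, h, mi, s])
    |> List.to_string()
  end
end
```

This would let our process make progress through the 122k duplicate-timestamp rows across retries, but it only works around the stalled cursor – the handler silently truncating results with no error still seems like a bug to me.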