Failed to deserialize on-disk vector batch

Querying a job's results fails once the offset exceeds roughly ?offset=1200000

The job input is about 73GB and 432M rows…
The job completed successfully.

It is very similar to UI Run - Why does the SELECT * row count not match COUNT(*)?, where a 1M result limit for REST API calls is mentioned…

Yet I’m not convinced that we have a truncated-output problem: on dremio-executor I can see a ‘result’ file of 73GB, which I assume is correct and contains all query records (judging by its size, it is not truncated, or at least shouldn’t be…).

So we use the REST API to fetch result data with an offset:

curl -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' -H "Authorization: Bearer $TOKEN" -v "${IP}:9047/api/v3/job/17b76125-2cbb-2d8c-b48a-56085d87d800/results?offset=5600000&limit=500"

It works well until something like offset=1200000.

Then dremio-master throws:

Caused by: java.io.IOException: Failed to deserialize on-disk vector batch

com.google.protobuf.InvalidProtocolBufferException$InvalidWireTypeException: Protocol message tag had invalid wire type.

Why do I think the problem could be something other than the 1M result truncation?

  • The file is 73GB, which is consistent with all the expected data being present.
  • I doubled planner.output_limit_size from 1M to 2M (2000000), but it had no effect on the offset=1200000 cut-off; the fetch still fails at the same data size…

So what other issue (or limit) could cause the error while fetching the results of a dremio job?

Tested on Dremio 25.x and 26.0.0 (with mostly default settings).

Workaround
We run multiple SQL queries with OFFSET and LIMIT and glue the results together…
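A minimal Python sketch of that workaround, assuming Dremio's REST SQL endpoint (`/api/v3/sql`), LIMIT/OFFSET pagination, and a hypothetical table name; the page size and helper names are illustrative, and polling the submitted job and collecting its results are elided:

```python
import json
import urllib.request


def page_windows(total_rows: int, page_size: int):
    """Yield (offset, limit) pairs that cover total_rows in page_size chunks."""
    for offset in range(0, total_rows, page_size):
        yield offset, min(page_size, total_rows - offset)


def submit_page(base_url: str, token: str, table: str, offset: int, limit: int):
    """Submit one paged SELECT as a new job via the REST SQL API.

    Returns the HTTP response for the job submission; fetching the job's
    results once it completes is left out of this sketch.
    """
    sql = f"SELECT * FROM {table} LIMIT {limit} OFFSET {offset}"
    req = urllib.request.Request(
        f"{base_url}/api/v3/sql",
        data=json.dumps({"sql": sql}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    return urllib.request.urlopen(req)
```

Each page is its own job, so each stays well below any per-job result limit; the client then concatenates the pages.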

To be considered:
Would a non-truncated call through the ODBC/JDBC driver work at all? Since planner.output_limit_size did not affect the current behavior (while it should have, if the thesis of a 1M result truncation for REST API calls were correct), maybe the issue is somewhere else…

Have a nice day
Jakub

Hi Jakub

Both REST and UI queries spool their results to disk so that you can use the REST API to go back and query the job results later. There’s also a background cleanup for these job results. If your use case is to query 1M+ records, and especially wide records, I think using Flight SQL would be faster and more efficient. You don’t have the serialization overhead of legacy ODBC/JDBC, and more importantly you aren’t spooling any intermediary results to disk.
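A minimal Python sketch of such a Flight client, assuming Dremio’s Flight endpoint on the default port 32010 and basic-auth token exchange (host, credentials, and SQL are placeholders; adjust for your deployment):

```python
# Guarded import so the sketch can be read/loaded without pyarrow installed.
try:
    import pyarrow.flight as flight
except ImportError:
    flight = None


def flight_endpoint(host: str, port: int = 32010) -> str:
    """Build the gRPC location string for Dremio's Flight service."""
    return f"grpc://{host}:{port}"


def fetch_all(host: str, user: str, password: str, sql: str):
    """Stream the full result set over Flight; no intermediary spool to disk."""
    client = flight.FlightClient(flight_endpoint(host))
    # Exchange user/password for a bearer token (Flight basic-auth handshake).
    token_header = client.authenticate_basic_token(user, password)
    options = flight.FlightCallOptions(headers=[token_header])
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
    # Read every record batch from the (single) endpoint into an Arrow table.
    reader = client.do_get(info.endpoints[0].ticket, options)
    return reader.read_all()
```

Because the batches stream back directly, there is no 1M-record spool and no offset-based paging on the client side.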

You can also see details about the job-results spooling in the execution plans. For the REST job, you’ll see an Arrow writer that writes multiple files up to around 1M total records. With ODBC/JDBC, by contrast, the results are streamed back through the top-level screen operator without the Arrow writer.