Pagination query does not return the requested records

The result has 2677 records, but when I try to get rows 1000 to 1500, only 91 records are returned.


@WJustHM this is a known issue with the API when loading results from some datasets. Using the developer tools in your browser, do you see the same behavior when you run the same query from the Dremio UI?

@ben The result is correct in the browser; the problem only occurs in code.

Hello @ben,

How should I solve this problem?

Best regards

@WJustHM, if it is a physical dataset, you might try forgetting the metadata and reformatting, but this will affect downstream virtual datasets.

Could you share the profile for the job you are returning the results from?

@ben Sorry, the problem has been there all along and I don’t know why. Please have a look.

Best regards

db2f5277-fda7-4c20-acf8-55bf8b407ff8.zip (41.4 KB)

This bug is still a problem. Running the SQL in the Dremio web UI returns 6700 rows, but running the same SQL through the REST API using row offsets only returns 4000 rows in total.

Is there any way to change the REST API limit from 500? The documentation describes the parameters as:

  • offset - Sets where you want to start from. Default: 0
  • limit - Determines how many rows are returned (maximum of 500). Default: 100
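
As documented, the limit maxes out at 500, so larger result sets have to be paged with an offset loop. A minimal Python sketch of that loop (the base URL, token placeholder, and helper are illustrative assumptions, not Dremio-provided code):

import requests

BASE = "http://localhost:9047/api/v3"
HEADERS = {"Authorization": "_dremio<your-token>"}  # placeholder personal access token

def fetch_all_rows(job_id, limit=500):
    """Page through job results in batches of `limit` (500 is the documented max)."""
    rows, offset = [], 0
    while True:
        resp = requests.get(
            f"{BASE}/job/{job_id}/results",
            params={"offset": offset, "limit": limit},
            headers=HEADERS,
        )
        resp.raise_for_status()
        body = resp.json()
        rows.extend(body["rows"])
        offset += limit
        if offset >= body["rowCount"]:
            return rows

Note that this naive loop is exactly the pattern the bug breaks: a batch can silently come back short even though rowCount reports more rows, so the loop finishes with rows missing.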

I am also facing the same issue.

I have a query like this:

select distinct columnA "value" from "spaceA"."datasetA" where columnA is not null and columnA not in (select distinct "value" from "spaceA"."datasetB" where "type" = 'someType')

This query returns 27,000+ results.

I am using the SQL REST API to execute this.

Then I use the jobs API to retrieve results in batches of 500 each (the maximum limit), explicitly setting offset and limit in each REST call like this:

I start from offset 0 and keep incrementing it with each call.

http://localhost:9047/api/v3/job/21365044-f881-d591-7e91-3dc1284fb400/results?offset=2000&limit=500

The rows list returned after offset 2000 is empty. In total, I don’t get more than about 2.5k results.

Response with empty result:

{
    "rowCount": 27527,
    "schema": [
        {
            "name": "value",
            "type": {
                "name": "VARCHAR"
            }
        }
    ],
    "rows": []
}

I expect to retrieve result sets approaching millions of records sequentially like this.

What should the resolution for this be? Please advise.

This was fixed in Dremio 4.5.0.

Hi, my team has recently upgraded our Dremio to version 4.7.2 (Community Edition).

However, we are still facing this issue when retrieving results using the REST API, even after clearing/refreshing the metadata of the physical datasets.

As shown in the screenshot below, although there are 2341 results, only 1871 are retrieved.

We hope that this bug can be resolved soon.

For those who are facing the same issue, the current workaround my team used is to check the number of results retrieved. If the number of results is less than expected, re-retrieve the results in that offset range using smaller limits. If the number of results retrieved in that offset range is still less than expected, re-retrieve the results using limit=1. So far, we have managed to retrieve all the results with this implementation (although it increases the number of Dremio API calls needed and slows down retrieval).

The screenshot below contains a snippet of the Python code and the results of the above-mentioned implementation.
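
Since the screenshot itself is not reproduced here, below is a rough Python reconstruction of the workaround as described. The endpoint details, the fetch_page helper, and the default 100/10/1 schedule (mentioned later in the thread) are illustrative assumptions, not the exact code from the screenshot:

import requests

BASE = "http://localhost:9047/api/v3"
HEADERS = {"Authorization": "_dremio<your-token>"}  # placeholder token

def fetch_page(job_id, offset, limit):
    """Fetch one batch of job results and return its rows."""
    resp = requests.get(
        f"{BASE}/job/{job_id}/results",
        params={"offset": offset, "limit": limit},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["rows"]

def fetch_range_with_fallback(job_id, offset, size, limits=(100, 10, 1)):
    """Retrieve rows [offset, offset + size), re-retrieving any short batch
    with progressively smaller limits; limit=1 reportedly always succeeds."""
    rows, pos = [], offset
    while pos < offset + size:
        want = min(limits[0], offset + size - pos)
        batch = fetch_page(job_id, pos, want)
        if len(batch) < want and limits[1:]:
            # Short batch: redo this sub-range with the next, smaller limit.
            batch = fetch_range_with_fallback(job_id, pos, want, limits[1:])
        rows.extend(batch)
        pos += want
    return rows

# Usage: take rowCount from any results call, then walk the whole set:
# all_rows = fetch_range_with_fallback(job_id, 0, row_count)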

Hope this helps!


Thanks for pointing this out @edksk,

How many executors are you running, and where are you storing your results? The latter is configured in dremio.conf under results.

Hi @ben,

We are running Dremio in a cluster deployment (on AWS EC2) with a single coordinator node and a single executor node, and the results are stored locally (/var/lib/dremio/pdfs/results).

Below is the config we are using for the executor node:

paths: {
  # the local path for dremio to store data.
  local: "/var/lib/dremio",
  spilling: ["/var/ephemeral/dremio_spill"]

  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
}

Thank you for sharing the workaround.

I think the issue comes from storing the job results in the locally stored PDFS.

Essentially, your code searches for the boundary offset by increasing it one at a time. Maybe we can do better, with fewer API calls, by implementing a binary search for that offset.
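
To make that concrete, here is one possible sketch of such a binary search. It assumes, as the observations in this thread suggest, that completeness is monotone in the limit (a batch spanning the problem point comes back short, smaller batches before it come back complete) and that limit=1 always succeeds; the fetch_page helper and endpoint details are illustrative:

import requests

BASE = "http://localhost:9047/api/v3"
HEADERS = {"Authorization": "_dremio<your-token>"}  # placeholder token

def fetch_page(job_id, offset, limit):
    """Fetch one batch of job results and return its rows."""
    resp = requests.get(
        f"{BASE}/job/{job_id}/results",
        params={"offset": offset, "limit": limit},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["rows"]

def largest_complete_batch(job_id, offset, hi):
    """Binary-search the largest limit that still returns a complete batch at `offset`."""
    batch = fetch_page(job_id, offset, hi)
    if len(batch) == hi:                          # common case: whole chunk is complete
        return batch
    lo, best = 1, fetch_page(job_id, offset, 1)   # limit=1 reportedly always succeeds
    hi -= 1                                       # `hi` itself came back short
    while lo < hi:
        mid = (lo + hi + 1) // 2
        batch = fetch_page(job_id, offset, mid)
        if len(batch) == mid:
            lo, best = mid, batch                 # complete: boundary is at or beyond mid
        else:
            hi = mid - 1                          # short: boundary is below mid
    return best

def fetch_all_binary(job_id, row_count, chunk=500):
    """Walk the result set, taking the largest complete batch at each step."""
    rows, offset = [], 0
    while offset < row_count:
        batch = largest_complete_batch(job_id, offset, min(chunk, row_count - offset))
        if not batch:                             # safety net against an empty batch
            break
        rows.extend(batch)
        offset += len(batch)
    return rows

This costs roughly log2(500) ≈ 9 calls around each problem boundary instead of one call per row, while a clean 500-row chunk still costs only a single call.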


@edksk, @dli16,

We have reproduced this issue internally and we are working on a fix that should be available in the near future.

Hi @dli16,

You are welcome. Thanks for your suggestion. Indeed, using a binary search method could likely reduce the number of API calls performed (e.g. setting the limits for retrieval/re-retrieval to be 100, 50, 25, 12, 6, 3, 1).

When we were debugging the issue on our end, we observed that setting a limit of 1 returns exactly 1 result (true in all our observations so far), while any limit > 1 may return fewer results than intended. However, it would not be practical to retrieve all the results one by one.

Thus, my team fixed the limits for retrieval/re-retrieval at 100, 10 and 1 for ease of implementation, as our actual data retrieval is not done in Python but through another software tool. The Python code snippet I shared previously was to illustrate the bug and the workaround.

For those facing the same issue, you can choose other values as your limits that would suit your requirements. Just note that your chosen limit for the final re-retrieval attempt should be 1 to ensure that all your results can be successfully retrieved.
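
As a side note, a halving schedule like the 100, 50, 25, 12, 6, 3, 1 example above is easy to generate; a tiny illustrative Python helper:

def halving_limits(start=100):
    """Yield a decreasing limit schedule ending in 1, e.g. 100, 50, 25, 12, 6, 3, 1."""
    n = start
    while n > 1:
        yield n
        n //= 2
    yield 1

print(list(halving_limits(100)))  # [100, 50, 25, 12, 6, 3, 1]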

Hi @ben,

Thank you for your prompt response. We look forward to the bug fix and other new features. Thank you!