Why Is Dremio Only Returning a Subset of Results?


Is there a way to change this behavior?

only result subset returned

The UI is only for previewing or testing your queries. If you need to fetch all records, please use ODBC or JDBC, or any IDE such as DataGrip.

Thanks, and I agree with you. Still, it seems natural to me, while building and curating my VDS, to keep validating my work incrementally by observing the actual number of rows returned from queries, so returning a subset might be misleading in this case.

Also, what if I want to export (download) the results locally? I want the whole set to work with, not just part of it. Why does Dremio offer this download feature in the first place if they expect you to go for an external tool?

I understand your point. I can suggest you use IntelliJ, DataGrip, or DBeaver; you can connect directly to Dremio and run your queries there.
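For anyone who wants the whole result set as a file rather than through an IDE, here is a minimal sketch of streaming a query to CSV in batches. It uses Python's built-in sqlite3 purely as a stand-in so that it runs anywhere; the function name `export_query_to_csv`, the demo table `my_vds`, and the batch size are made up for illustration. The same cursor pattern (`execute`, `description`, `fetchmany`) should carry over to other DB-API style connections, for example one opened over ODBC, but that part is an assumption and is not shown here.

```python
import csv
import sqlite3


def export_query_to_csv(conn, sql, path, batch_size=10_000):
    """Stream every row returned by `sql` into a CSV file.

    Rows are fetched in batches so the full result never has to sit
    in memory at once. Returns the number of data rows written.
    """
    cur = conn.cursor()
    cur.execute(sql)
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header row taken from the cursor metadata.
        writer.writerow([col[0] for col in cur.description])
        while True:
            batch = cur.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)
            written += len(batch)
    return written


if __name__ == "__main__":
    # Stand-in data source; a real setup would connect to the query engine instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_vds (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO my_vds VALUES (?, ?)",
        ((i, i * 1.5) for i in range(50_000)),
    )
    n = export_query_to_csv(conn, "SELECT * FROM my_vds", "my_vds_export.csv")
    print(f"wrote {n:,} rows")
```

The resulting CSV can be opened in Excel, keeping in mind that a worksheet holds at most 1,048,576 rows.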

Thank you, dacopan, for sharing that. I think the Dremio team will consider this at some point, because they know a well-designed semantic layer is a key entry point to the hearts and minds of their customers (especially those with low data maturity).

I also hope they give me strong grounds to promote this to my customers in turn :smiley:

@yalmasri For data correctness from the UI, please use count instead of select *. The UI truncates results for a few reasons, and there are currently no plans to change this behavior:

  • As @dacopan mentioned, populating over a million rows on the UI is not very useful
  • For data correctness, use count
  • UI queries generate Arrow result files, which can fill up the local disk as the number of rows increases
  • If the local disk is slow, then writes can be slow
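A minimal sketch of the "count instead of select *" advice above. It uses Python's built-in sqlite3 as a stand-in for a Dremio connection so that it runs anywhere; the table name `my_vds`, the demo row count, and the `PREVIEW_LIMIT` cap are hypothetical and do not reflect Dremio's actual truncation settings. The point is only that the number of rows in a truncated preview says nothing about the true total, while a count query does.

```python
import sqlite3

PREVIEW_LIMIT = 1000  # hypothetical cap, standing in for UI truncation


def preview_rows(conn, table, limit=PREVIEW_LIMIT):
    """Return at most `limit` rows, like a truncated UI preview."""
    return conn.execute(f"SELECT * FROM {table} LIMIT ?", (limit,)).fetchall()


def true_row_count(conn, table):
    """Ask the engine for the real row count instead of counting preview rows."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def build_demo(n_rows=5000):
    """Create an in-memory table with `n_rows` rows to query against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_vds (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO my_vds VALUES (?, ?)",
        ((i, i * 1.5) for i in range(n_rows)),
    )
    return conn


if __name__ == "__main__":
    conn = build_demo()
    shown = len(preview_rows(conn, "my_vds"))
    total = true_row_count(conn, "my_vds")
    print(f"A subset ({shown:,}) of total {total:,} rows is shown")
```

Against Dremio itself, the equivalent check would be running a `SELECT COUNT(*)` over the VDS rather than reading the row count off the preview grid.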

Thank you Balaji again.

Then in this case, you might want to change the behavior in two ways:

  1. When you return truncated results, mention how large the subset is relative to the total, for example: A subset (170,624) of total 1,000,000 rows has been…
  2. When I want to download the results, it should download the whole set

@yalmasri: why would you like to download such a huge number of rows? Would you like to process them further using Excel, Python, or other tooling? Then you can use one of the suggested connectors. Or are you checking query results using Notepad?

Thank you @MrJava.

We have a lot of legacy-minded customers (data/business analysts) who are familiar with Excel only, and are very good at it. Until they get “modernized”, they demand no change to their existing processes at the beginning, which tells me that my analytics engineers need to prepare data and push it to them in Excel format for analysis. I believe this is in line with the non-invasive approach Dremio is adopting.

Creating a VDS, versioning it, annotating it, and collaborating over it is a very advanced stage for them and won’t come in a day. They cannot, for example, connect from Excel to Dremio over views that are context-less for them. Someone has to resolve that for them first.

I also want to add a third reason, which is query optimization. I’m not sure how business domain owners (we are talking about data mesh here) will meet their SLAs without actually measuring the total query time (as opposed to the subset query time); as you know, the relation between the two is not linear.

OK, not sure if I understood it correctly, but we use Excel too.
Our users use the Excel Data Connection (aka Power Query using ODBC) to retrieve the data coming from a VDS in Dremio, and we provide the VDSs as they request them.

BTW: be aware that Excel will get really slow when presenting more than 500 mil. rows.