How to increase download limit

I tried to export 1.5 million records from a CSV dataset as JSON, but I get a pop-up like the one in the screenshot below.

[Screenshot: csv_ - Dremio]

Hi,

Dremio currently has a hardcoded download limit of one million records, and there is no way to change it yet (it is on our list of things to do).

In the meantime, you could create two virtual datasets and split the data between them.
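As a rough sketch of that split, the snippet below generates one query per slice so that no slice exceeds the limit; each query could then be saved as its own virtual dataset and downloaded separately. The dataset name `myspace.csv_export` and the ordering column `id` are made up for illustration, and it assumes a stable sort column exists so the slices do not overlap.

```python
def split_queries(dataset, order_col, total_rows, max_rows=1_000_000):
    """Return one SELECT per slice so that no slice exceeds max_rows."""
    queries = []
    for offset in range(0, total_rows, max_rows):
        # The last slice may be smaller than max_rows.
        limit = min(max_rows, total_rows - offset)
        queries.append(
            f"SELECT * FROM {dataset} ORDER BY {order_col} "
            f"LIMIT {limit} OFFSET {offset}"
        )
    return queries


# Hypothetical names: replace with your own dataset and sort column.
for query in split_queries("myspace.csv_export", "id", 1_500_000):
    print(query)
```

For 1.5 million rows this yields two queries, one for the first million rows and one for the remaining 500,000. Filtering on a natural key range (for example a date column) would work just as well and avoids the sort.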

Could you say more about why you want to export data? One of our goals with Dremio is removing the need to make copies of data and instead to query any dataset through Dremio with any tool.

Does the 1 million restriction also apply to third-party tools connected over JDBC, such as Talend, which another team wants to use to connect to this virtualization layer?

The limit only applies to downloads via the browser.

Connections via ODBC, JDBC, and REST do not have this limit.
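For anyone pulling a large result set through one of those interfaces from a script, a minimal sketch of the paging loop might look like this. `fetch_page` is a hypothetical stand-in for whatever your client offers (a JDBC/ODBC cursor fetch or a REST call); it is not a Dremio function, and the page size of 500 is just an assumed default.

```python
from typing import Callable, Dict, Iterator, List


def iter_rows(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 500,
) -> Iterator[Dict]:
    """Yield rows one by one, requesting them page by page.

    fetch_page(offset, limit) is a hypothetical callable that returns
    at most `limit` rows starting at `offset`.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch.
            return
        offset += len(page)


# Stand-in data source so the sketch runs without a server.
data = [{"id": i} for i in range(1200)]


def fake_fetch(offset, limit):
    return data[offset:offset + limit]


rows = list(iter_rows(fake_fetch))
print(len(rows))  # 1200
```

Because rows are yielded lazily, the full result never has to sit in memory at once.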


Kelly, my (specific) need was to create a Parquet file where I would materialize some things, e.g. the “rowNumber”, to create a primary key. I was wondering if you guys have it on your roadmap at some point to turn a classic Dremio query into an actual saved source table.

Other people we talk to need the full export for analysis in specialized tools (e.g. for advanced numerical analysis, machine learning, etc.). Valid use case, I guess? :slight_smile: Cheers

@Giovanni_Tummarello “Turn a classic Dremio query into an actual saved source table” = https://docs.dremio.com/sql-reference/sql-commands/create-table-as.html

Another option I would actually recommend first is to create and use our Reflections (basically an accelerated physically optimized materialized representation of the data). More about that here - https://www.dremio.com/tutorials/getting-started-with-data-reflections/
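For the CREATE TABLE AS route, here is a small sketch of how a script could build a statement that also materializes a row number as a key, as asked above. The table names, the `row_id` column, and the sort column are assumptions for illustration; check the linked CTAS page for the exact syntax and the locations your Dremio version can write to.

```python
def ctas_with_row_id(target, source, order_col, id_col="row_id"):
    """Build a CTAS statement that adds a sequential key column.

    All identifiers are passed in as-is; quote them yourself if they
    contain special characters.
    """
    return (
        f"CREATE TABLE {target} AS "
        f"SELECT ROW_NUMBER() OVER (ORDER BY {order_col}) AS {id_col}, t.* "
        f"FROM {source} AS t"
    )


# Hypothetical names: a table in $scratch built from a dataset in a space.
sql = ctas_with_row_id(
    "$scratch.staged_customers", "myspace.customers", "customer_name"
)
print(sql)
```

The resulting string can be submitted like any other query, for example over JDBC or ODBC.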

Ah right, yes, I saw that, but I wasn’t sure about it given that it had no security etc. Is this going to become secure (e.g. per user)? If so, do you see it as a short/medium-term thing? Thanks!

Users don’t access reflections directly. They would access the anchor dataset, which you can secure like any other dataset in Dremio Enterprise Edition. In the Community Edition everyone is an admin.

Regarding other tools, why wouldn’t they be able to access the data via ODBC/JDBC/REST?

Kelly, I was referring to securing the $scratch space. I don’t feel reflections would work as a replacement for staging, as you would do after a geocoding or remote service enrichment (to have full control over how many times the remote service is invoked).

Agreed. Reflections are not for staging. See the other thread for an example of using files for staging per the geocode example.

Files… which are limited to 1M rows, so we have a great new use case here for unlimited row download :slight_smile:

Files are not limited to 1M. Only downloads are.

You can write files of any size to S3, ADLS, HDFS, NAS, etc., and then read them through Dremio.

Does that make sense? Maybe I don’t understand what you’re trying to do?

For a moment there I thought Dremio could SAVE a materialized table of any size to S3, HDFS, NAS, etc. What you mean instead is that one can put a file of any size there and read it through Dremio. OK, I see that.

Anyway, my use case is still unresolved, I guess.

If I could download a file of arbitrary size, I could put it BACK into Dremio (e.g. via S3, HDFS, or NAS), and this would effectively give me the “staging” I need, e.g. when I want to do operations like “create a primary key from a row count”, do an expensive field computation (e.g. a UDF for NLP to extract entities), or enrich via a remote service lookup (via a UDF again, I guess).

At the moment the only way to do this staging would be via the $scratch space. I guess if this functionality gets powered up, e.g. with security, you could have the best of both late/last-mile ETL (or ELT) and classic pipelines. Thanks for the interaction :slight_smile:

Maybe the confusion is regarding what is meant by download.

In Dremio you can only download via the browser. You have to click a button. This doesn’t work for any kind of automated processing.

I’m suggesting your script would save intermediate results back to a shared file system or object store, in a folder called staging if you like. You would then call the REST API to add it as a new dataset that Dremio understands, and after that you can access this new file as a dataset through Dremio.
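A minimal sketch of the first half of that flow, using only the Python standard library: the script streams rows (from wherever it got them, e.g. a JDBC/ODBC query plus an enrichment step) into a CSV file inside a staging folder. The folder, file name, and columns are invented for the example, and a local temp directory stands in for the shared file system. The REST call that registers the file with Dremio is left out, since its exact shape is not given in this thread.

```python
import csv
import os
import tempfile


def stage_rows(rows, staging_dir, name, fieldnames):
    """Write an iterable of dict rows to <staging_dir>/<name>.csv.

    Returns the file path and the number of data rows written.
    """
    os.makedirs(staging_dir, exist_ok=True)
    path = os.path.join(staging_dir, f"{name}.csv")
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            # Rows are written as they arrive, so memory use stays flat.
            writer.writerow(row)
            count += 1
    return path, count


# Demo with a temp directory standing in for the shared staging area.
staging = os.path.join(tempfile.mkdtemp(), "staging")
rows = ({"row_id": i, "value": f"v{i}"} for i in range(5))
path, written = stage_rows(rows, staging, "enriched", ["row_id", "value"])
print(path, written)
```

Since the script controls this step, it also controls how often any remote enrichment service gets called, which was the concern with doing the same work inside a query.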

You don’t want to perform big, complex ETL through Dremio because the system is designed for low-latency workloads. Jobs that run for many hours, for example, may experience node failures or network partitions that cause the job to fail, and you would need to start over. Dremio doesn’t perform any checkpointing or other measures to recover from mid-job failures like this.

Thanks Kelly, I get it: use a script + JDBC to save a file, and then use it as a staged file via the API. Could work.

I think a UI action for that would be useful in many cases, e.g. for analysts working on one-off operations who don’t want to write a script. For example, an issue like: ability to “save to” HDFS/NAS directly in the UI.

It would be cool if it were possible to open issues on GitHub for Dremio :slight_smile:
Cheers
