Dataset split limits errors

Number of splits (345476) in dataset <dataset.path> exceeds dataset split limit of 300000

Does anyone have an explanation of what dataset splitting means in the above context? I get this error when trying to format a large number of directories that each contain a large number of JSON data structures.

Dremio Build 4.0.5-201911202046080257-19b10938

@noah

If the dataset is Parquet files, this refers to the number of Row Groups across all files in the dataset.

If the dataset is some other kind of file, then it is just the number of files.
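
For non-Parquet data such as JSON, a rough way to see how many splits a directory tree would produce is to count the files in it before formatting. The following is only a minimal sketch, not anything Dremio provides: it assumes the data is reachable on a local or mounted filesystem, that every file counts as exactly one split as described above, and that the limit is the 300000 shown in the error message. The function names `count_file_splits` and `exceeds_limit` are made up for illustration.

```python
import os
from collections import Counter

# Limit taken from the error message in this thread; it may differ by Dremio version.
SPLIT_LIMIT = 300_000


def count_file_splits(root):
    """Count files under `root`, grouped by top-level subdirectory.

    Assumes one split per file (the non-Parquet case described above).
    Files sitting directly in `root` are counted under the key ".".
    """
    per_dir = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        per_dir[top] += len(filenames)
    return per_dir


def exceeds_limit(per_dir, limit=SPLIT_LIMIT):
    """Return True if the total file count is above the split limit."""
    return sum(per_dir.values()) > limit
```

Calling `count_file_splits("/path/to/dataset")` and looking at the largest entries shows which subdirectories contribute most of the files, which may help decide where to break the dataset into smaller pieces.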

Can the dataset split limit be increased and if so, what resources need to be considered when increasing them? (e.g. RAM, CPU, disk, OS limits?)

In newer versions of Dremio (4.1.x or higher) this value has been increased to 60 000:
https://docs.dremio.com/advanced-administration/limits.html

Mainly this affects RAM on the Dremio coordinator, as on-heap objects are created for each split while Dremio plans queries.

I am hitting the 300k limit in my case. Is that because the files in my case are on an S3-like object store source?

Is the 60k/300k limit a fixed limit or can it be changed in dremio.conf?

It cannot be changed with dremio.conf. It would be better to alter the query so as to avoid the split error in the first place.
