noah
April 19, 2020, 6:04am
1
Number of splits (345476) in dataset <dataset.path> exceeds dataset split limit of 300000
Does anyone have an explanation of what dataset splitting means in the above context? I get this error when trying to format a large number of directories that each contain a large number of JSON data structures.
noah
April 19, 2020, 7:01pm
2
Dremio Build 4.0.5-201911202046080257-19b10938
ben
April 20, 2020, 2:33am
3
@noah
If the dataset is made up of Parquet files, this refers to the number of row groups across all files in the dataset.
If the dataset is any other kind of file, then it is just the number of files.
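For the non-Parquet case, a rough way to see how close a directory tree is to the limit is to count the files under it. This is only a sketch: it assumes the data is reachable as a local or mounted path, and the names `count_files` and `exceeds_split_limit` are made up for illustration and are not part of Dremio. The 300000 default is taken from the error message above, and Dremio's own count may differ (for example, if it skips hidden or metadata files).

```python
import os

# Limit quoted in the error message above; other Dremio versions may differ.
SPLIT_LIMIT = 300_000


def count_files(root):
    """Count regular files under `root`, recursively.

    For non-Parquet datasets (e.g. JSON) this approximates the number
    of splits, since each file is described above as one split.
    """
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def exceeds_split_limit(root, limit=SPLIT_LIMIT):
    """Return True if the file count under `root` is above `limit`."""
    return count_files(root) > limit
```

Running `count_files` on each top-level directory separately shows which ones contribute most of the files.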
noah
April 20, 2020, 4:53am
5
Can the dataset split limit be increased and if so, what resources need to be considered when increasing them? (e.g. RAM, CPU, disk, OS limits?)
ben
April 20, 2020, 5:52pm
6
In newer versions of Dremio (4.1.x or higher) this value has been increased to 60 000:
https://docs.dremio.com/advanced-administration/limits.html
Mainly this affects RAM on the Dremio coordinator, as on-heap objects are created for each split while Dremio plans queries.
noah
April 20, 2020, 9:02pm
7
I am hitting the 300k limit in my case. Is that because my files are on an S3-like object store source?
Is the 60k/300k limit a fixed limit or can it be changed in dremio.conf?
ben
April 22, 2020, 1:57pm
8
It cannot be changed with dremio.conf. It would be better to alter the query so as to avoid the split error in the first place.