Hit a system limit… how do I work around it?

I’ve added a datasource that is an Azure/S3 bucket

This bucket hierarchy is structured as

Data (this is selected as the ‘root’ of the datasource)

and I have multiple files under each hh folder (the hierarchy is year/month/day/hh).

The files are type .json

When I do this, Dremio gives me this error…

Number of splits (1044559) in dataset exceeds dataset split limit of 300000
I realise that 1044559 is the number of json files.

How can I work around this?

Is it simply a case of creating a dataset per year-month, and then trying to ‘union’ them into a VDS?


What is the size of each JSON file? You could create one dataset per year. I assume your queries will have a filter and only query one year at a time, since querying multiple years might return too many records.


Hi Bali,

Well, here’s the thing… I have multiple JSON files per hour, so the number I quoted at the beginning turns out to be just one month’s worth. These come from a third-party system, so I have no control over them. I had hoped Dremio would give me a quick win here.

For what it’s worth, average size is 80KB.

I suspect my only option here is going to be to preprocess them in something else before accessing them from Dremio.
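One simple preprocessing step that would bring the split count under the limit is compacting the many small per-hour files into one file per day. Here is a minimal sketch using only the Python standard library; it assumes each .json file holds a single record (the function name and the newline-delimited output format are illustrative, not part of any existing pipeline):

```python
import json
from pathlib import Path

def compact_json_files(root: Path, out_dir: Path) -> int:
    """Merge every small .json file under root/year/month/day/hh into one
    newline-delimited JSON file per day, drastically reducing file count.
    Returns the number of daily files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Group source files by their year/month/day path prefix.
    by_day = {}
    for f in sorted(root.rglob("*.json")):
        day = f.relative_to(root).parts[:3]  # (year, month, day)
        by_day.setdefault(day, []).append(f)
    # Write one .jsonl file per day containing all of that day's records.
    for day, files in by_day.items():
        target = out_dir / ("-".join(day) + ".jsonl")
        with target.open("w") as out:
            for f in files:
                record = json.loads(f.read_text())
                out.write(json.dumps(record) + "\n")
    return len(by_day)
```

With roughly a million 80 KB files per month, this would collapse each day's thousands of files into a single one, at the cost of a batch job running ahead of Dremio.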


Have you considered moving them to Parquet?

They do get migrated to Parquet later in the pipeline, but I had hoped to work with the data a little sooner than that. But if that’s the only way I can do it…

Thanks mate.