Hit a system limit… how do I work around it?

I’ve added a datasource that is an Azure/S3 bucket

This bucket hierarchy is structured as Data/YYYY/MM/DD/HH, where Data is selected as the ‘root’ of the datasource, and I have multiple files under each HH folder for every year/month/day/hour.

The files are of type .json.

When I do this, Dremio gives me this error…

Number of splits (1044559) in dataset exceeds dataset split limit of 300000

I realise that 1044559 is the number of JSON files.

How can I work around this?

Is it simply a case of creating a dataset per year-month, and then trying to ‘union’ them into a VDS?
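For reference, something like this could count the files per year/month prefix first, to check whether a per-month dataset would even stay under the limit (a rough sketch assuming S3-style access via boto3; the bucket name and key layout are placeholders for my setup):

```python
# Rough sketch: count objects per YYYY/MM prefix to estimate how many
# splits a per-month dataset would have. Assumes S3-style access via
# boto3; the bucket name and key layout are placeholders.
from collections import Counter

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
counts = Counter()

for page in paginator.paginate(Bucket="my-bucket", Prefix="Data/"):
    for obj in page.get("Contents", []):
        # Key layout assumed to be Data/YYYY/MM/DD/HH/<file>.json
        parts = obj["Key"].split("/")
        if len(parts) >= 3:
            counts[(parts[1], parts[2])] += 1  # (YYYY, MM)

for (year, month), n in sorted(counts.items()):
    print(f"{year}-{month}: {n} files")
```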

@surreynorthern

What is the size of each JSON file? You can create one dataset per year, and I assume your queries will have a filter and only query one year at a time, since querying multiple years might return too many records.

Thanks
Bali

Hi Bali,

Well here’s the thing… I have multiple JSON files per hour, so the number I quoted at the beginning turns out to be just one month’s worth. These are coming from a third-party system, so I have no control over them. I had hoped Dremio would give me a quick win here.

For what it’s worth, average size is 80KB.

I suspect my only option here is going to be to preprocess them in something else before accessing them from Dremio.
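If I go that route, even just merging each day’s small JSON files into one larger newline-delimited file would cut the file count (and therefore the split count) dramatically. A minimal sketch, assuming the files have already been synced locally and each one holds a single JSON object (paths are placeholders):

```python
# Minimal sketch: merge all small JSON files under one day's directory into
# a single newline-delimited JSON file, so the day contributes one split
# instead of thousands. Paths are placeholders; assumes one JSON object
# per source file, synced to local disk beforehand.
import json
from pathlib import Path

def compact_day(day_dir: Path, out_file: Path) -> None:
    with out_file.open("w", encoding="utf-8") as out:
        for src in sorted(day_dir.rglob("*.json")):
            with src.open("r", encoding="utf-8") as f:
                record = json.load(f)
            out.write(json.dumps(record))
            out.write("\n")

# e.g. compact_day(Path("Data/2020/01/15"), Path("compacted/2020-01-15.json"))
```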

@surreynorthern

Have you considered moving them to Parquet?
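For example, a day’s worth of those small JSON files could be rolled up into a single Parquet file with pyarrow, which reduces both the split count and the scan cost. A rough sketch, not your actual pipeline; it assumes each file holds one flat JSON object and the paths are placeholders:

```python
# Sketch: read one day's small JSON files and write them out as a single
# Parquet file. Assumes each source file holds one flat JSON object;
# paths are placeholders.
import json
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

def day_to_parquet(day_dir: Path, out_file: Path) -> None:
    records = []
    for src in sorted(day_dir.rglob("*.json")):
        with src.open("r", encoding="utf-8") as f:
            records.append(json.load(f))
    table = pa.Table.from_pylist(records)  # schema inferred from the records
    pq.write_table(table, out_file)

# e.g. day_to_parquet(Path("Data/2020/01/15"), Path("parquet/2020-01-15.parquet"))
```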

They do get migrated to Parquet later in the pipeline, but I had hoped to work with the data a little sooner, before it gets that far. But if that’s the only way I can do it…

Thanks mate.