I’ve added a datasource that is an Azure/S3 bucket
The bucket hierarchy is structured as:
Data (this is selected as the ‘root’ of the datasource)
  YYYY
    MM
      DD
        HH
I have multiple files under each HH folder (files exist only at that year/month/day/hour level), and they are all .json files.
When I do this, Dremio gives me this error…
Number of splits (1044559) in dataset exceeds dataset split limit of 300000
I realise that 1044559 is the number of JSON files.
How can I work around this?
Is it simply a case of creating a dataset per year-month, and then trying to ‘union’ them into a VDS?
@surreynorthern
What is the size of each JSON file? You can create one dataset per year, and I assume your queries will have a filter and only query one year at a time, since querying multiple years might return too many records.
Thanks
Bali
Hi Bali,
Well, here’s the thing… I have multiple JSON files per hour, so the number I quoted at the beginning turns out to be just one month’s worth. These are coming from a third-party system, so I have no control over them. I had hoped Dremio would give me a quick win here.
For what it’s worth, the average size is 80 KB.
I suspect my only option here is going to be to preprocess them in something else before accessing them from Dremio.
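The kind of preprocessing I have in mind would just merge each day’s many small JSON files into one newline-delimited JSON file per day, so the file (and therefore split) count drops well below the limit. This is only a rough sketch under my own assumptions: the paths, output layout and example date are made up, and it assumes the bucket contents are mounted or synced to local disk.

```python
import json
from pathlib import Path

# Illustrative paths only -- point these at wherever the bucket is mounted or synced.
SRC_ROOT = Path("Data")            # the Data/YYYY/MM/DD/HH hierarchy described above
OUT_ROOT = Path("Data-compacted")  # hypothetical output location

def compact_day(year: str, month: str, day: str) -> None:
    """Merge every small JSON file under one YYYY/MM/DD into a single daily file."""
    records = []
    day_dir = SRC_ROOT / year / month / day
    for hour_dir in sorted(day_dir.iterdir()):            # the HH folders
        for json_file in sorted(hour_dir.glob("*.json")):
            with json_file.open() as f:
                doc = json.load(f)
            # A source file may hold a single object or a list of objects.
            records.extend(doc if isinstance(doc, list) else [doc])

    if not records:
        return

    out_dir = OUT_ROOT / year / month
    out_dir.mkdir(parents=True, exist_ok=True)
    # One newline-delimited JSON file per day instead of many per hour,
    # which keeps the number of files (splits) well under the 300000 limit.
    with (out_dir / f"{year}-{month}-{day}.json").open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# Example call for a single (made-up) day.
compact_day("2021", "01", "15")
```

The same loop could just as easily write Parquet instead of JSON (e.g. via pandas/pyarrow), if that turns out to be the better target format.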
@surreynorthern
Have you considered moving them to Parquet?
They do get migrated to Parquet in the pipeline eventually, but I had hoped to process the data a little sooner before I got that far. But if that’s the only way I can do it…
Thanks mate.