I’ve added a datasource that is an Azure/S3 bucket
The bucket hierarchy is structured as:
Data (this is selected as the ‘root’ of the datasource)
  YYYY
    MM
      DD
        HH
I have multiple files under each HH folder (files exist only at that year/month/day/hour level), and they are all .json files.
When I do this, Dremio gives me this error…
Number of splits (1044559) in dataset exceeds dataset split limit of 300000
I realise that 1044559 is the number of JSON files.
How can I work around this?
Is it simply a case of creating a dataset per year-month, and then trying to ‘union’ them into a VDS?
@surreynorthern
What is the size of each JSON file? You can create one dataset per year, and I assume your queries will have a filter and only query one year at a time, since querying multiple years might return too many records.
Thanks
Bali
Hi Bali,
Well, here’s the thing… I have multiple JSON files per hour, so the number I quoted at the beginning turns out to be just one month’s worth. These are coming from a third-party system, so I have no control over them. I had hoped Dremio would give me a quick win here.
For what it’s worth, the average size is 80 KB.
I suspect my only option here is going to be to preprocess them in something else before accessing them from Dremio.
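The kind of preprocessing I have in mind would just merge each day’s many small JSON files into one newline-delimited JSON file per day, so the file (and therefore split) count drops well below the limit. This is only a rough sketch under my own assumptions: the paths, output layout and example date are made up, and it assumes the bucket contents are mounted or synced to local disk.

```python
import json
from pathlib import Path

# Illustrative paths only -- point these at wherever the bucket is mounted or synced.
SRC_ROOT = Path("Data")            # the Data/YYYY/MM/DD/HH hierarchy described above
OUT_ROOT = Path("Data-compacted")  # hypothetical output location

def compact_day(year: str, month: str, day: str) -> None:
    """Merge every small JSON file under one YYYY/MM/DD into a single daily file."""
    records = []
    day_dir = SRC_ROOT / year / month / day
    for hour_dir in sorted(day_dir.iterdir()):            # the HH folders
        for json_file in sorted(hour_dir.glob("*.json")):
            with json_file.open() as f:
                doc = json.load(f)
            # A source file may hold a single object or a list of objects.
            records.extend(doc if isinstance(doc, list) else [doc])

    if not records:
        return

    out_dir = OUT_ROOT / year / month
    out_dir.mkdir(parents=True, exist_ok=True)
    # One newline-delimited JSON file per day instead of many per hour,
    # which keeps the number of files (splits) well under the 300000 limit.
    with (out_dir / f"{year}-{month}-{day}.json").open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# Example call for a single (made-up) day.
compact_day("2021", "01", "15")
```

The same loop could just as easily write Parquet instead of JSON (e.g. via pandas/pyarrow), if that turns out to be the better target format.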
@surreynorthern
Have you considered moving them to Parquet?
They do get migrated to Parquet in the pipeline eventually, but I had hoped to process the data a little sooner before I got that far. But if that’s the only way I can do it…
Thanks mate.