Dremio Cloud: JSON to Iceberg materialization fails on planning (max_splits)

Hi there,

We’re having issues converting a raw table (jsonlines.gz) into an Iceberg table. I’ll sketch the situation below so you can see whether we’re on the right track.

Question

  • How do we increase or disable the max_splits=300000 limit in the Dremio Cloud configuration? I’ve found documentation on how to do so in Dremio Software, but not in Dremio Cloud; see this link for the dbt recommendation, and the sketch just below for what we found.
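
For Dremio Software, the documentation we found points at planner support keys. A minimal sketch of what we understand that to look like there; the ALTER SYSTEM syntax and whether these particular keys (named in the error and the reply further down) can actually be raised are assumptions on our part, and Dremio Cloud does not seem to expose an equivalent setting:

  -- Hedged sketch, Dremio Software only; syntax and settability of these keys are assumptions.
  ALTER SYSTEM SET "planner.dataset_max_split_limit" = 1000000;
  ALTER SYSTEM SET "planner.query_max_split_limit" = 1000000;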

Context

From an external party, we import about 60 GB per day (±800 million rows) into S3, spread over 4 partition levels. For one day of data, we import about 50k files. The folder structure of the raw data looks as follows:

/year=2023/month=01/day=10/source1/data_file_US.json.gz
/year=2023/month=01/day=10/source1/data_file_CA.json.gz
...
/year=2023/month=01/day=10/source2/data_file_US.json.gz
/year=2023/month=01/day=10/source2/data_file_GB.json.gz
...

We want to make this data queryable as a table in Dremio. For this, we’re attempting to use dbt to create an incremental materialization of this raw table (json.gz) into an Iceberg table. To avoid exceeding the ‘max_splits’ limit, we prune partitions dynamically down to a single day, so only about 50k files should be accessed. However, we get the following error:

Number of splits (2376473) in dataset S3 Data Lake Staging.some_bucket_raw.streams exceeds dataset split limit of 300000

After googling a bit, we found the following explanation:

As partition pruning is done during planning, partition values have to be known before execution.

We use the following query:

create table "S3 Data Lake Staging"."some_bucket_processed"."streams"
partition by ("dir0","dir1","dir2","dir3")
as (
    
  with 
​
  streams as (
​
      select * from "S3 Data Lake Staging"."some_bucket_raw"."streams" 
      
  ),
​
  incremental_selection as (
​
      select * 
      from streams
      where to_date(concat(split_part(dir0, '=', 2), split_part(dir1, '=', 2), split_part(dir2, '=', 2)), 'YYYYMMDD') > (select max(to_date(concat(split_part(dir0, '=', 2), split_part(dir1, '=', 2), split_part(dir2, '=', 2)), 'YYYYMMDD')) from "S3 Data Lake Staging"."some_bucket_processed"."streams")
​
  )
​
  select * from incremental_selection
​
)

On Dremio Cloud, the three support keys mentioned in the dbt link are enabled by default.

You may be running into the planner.dataset_max_split_limit or the planner.query_max_split_limit. Unfortunately, we don’t support increasing these beyond the default limits as it can cause significant instability.

The calculation of the number of splits seems to work as expected when providing literal values, like where partition_column = '2023'.

However, when we make this dynamic, for example with where partition_column = extract(year from current_date()), the planner does not evaluate the extract expression before calculating the number of splits.
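
To illustrate the difference, here is a minimal sketch against our raw table (the literal value is just an example):

  -- Pruned during planning: the partition value is a literal the planner can see.
  select count(*)
  from "S3 Data Lake Staging"."some_bucket_raw"."streams"
  where dir0 = 'year=2023';

  -- Not pruned during planning: the right-hand side is only evaluated at execution,
  -- so the planner counts splits across all partitions and hits the 300000 limit.
  select count(*)
  from "S3 Data Lake Staging"."some_bucket_raw"."streams"
  where dir0 = concat('year=', cast(extract(year from current_date()) as varchar));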

Wouldn’t it make more sense to perform this evaluation first? The error raised is not accurate in this scenario, and this behaviour is a real limiting factor for us.
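
In the meantime, a workaround we are considering is to let dbt render literal partition values into the predicate before Dremio plans the query, so pruning can happen during planning. A minimal sketch, assuming the load day is passed in as dbt variables (the model and variable names are made up):

  -- models/streams_incremental.sql (sketch): Jinja is rendered by dbt before planning,
  -- so Dremio only sees literal dir values and can prune down to the ~50k files of one day.
  select *
  from "S3 Data Lake Staging"."some_bucket_raw"."streams"
  where dir0 = 'year={{ var("load_year") }}'
    and dir1 = 'month={{ var("load_month") }}'
    and dir2 = 'day={{ var("load_day") }}'

invoked with something like: dbt run --vars '{load_year: "2023", load_month: "01", load_day: "10"}'. That loses the nice max()-based incremental selection, but it keeps the partition values known at planning time.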