Hi there,
We have issues with converting a raw table (jsonlines.gz) into an Iceberg table. I’ll try to sketch the situation to see if we’re on the right track.
Question
- How do we increase or disable the max_splits=300000 value in the Dremio Cloud configuration? I’ve found documentation on how to do so in Dremio Software, but not in Dremio Cloud. See this link for the DBT recommendation.
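For reference, what I found for Dremio Software looks like the sketch below; the support key name is my assumption based on forum posts, and I couldn’t find an equivalent for Dremio Cloud:

-- Dremio Software only (sketch; key name "dremio.store.dfs.max_splits" is
-- assumed from forum posts, please correct me if it's wrong). Dremio Cloud
-- doesn't seem to expose support keys, hence this question.
alter system set "dremio.store.dfs.max_splits" = 1000000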
Context
From an external party, we import about 60 GB (± 800 million rows) into S3 on a daily basis, spread over 4 partition levels. For one day of data, we import about 50k files. The folder structure of the raw data looks as follows:
/year=2023/month=01/day=10/source1/data_file_US.json.gz
/year=2023/month=01/day=10/source1/data_file_CA.json.gz
...
/year=2023/month=01/day=10/source2/data_file_US.json.gz
/year=2023/month=01/day=10/source2/data_file_GB.json.gz
...
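For reference, Dremio exposes each folder level as an implicit dirN column, which is what the partitioning and pruning below rely on: dir0 = 'year=2023', dir1 = 'month=01', dir2 = 'day=10', dir3 = 'source1' / 'source2'. A quick sanity check restricted to one day, so it stays under the split limit (sketch; the literal values are examples):

-- sketch: inspect the implicit dirN columns for a single day of data
select dir0, dir1, dir2, dir3, count(*) as row_count
from "S3 Data Lake Staging"."some_bucket_raw"."streams"
where dir0 = 'year=2023' and dir1 = 'month=01' and dir2 = 'day=10'
group by dir0, dir1, dir2, dir3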
We want to make this data queryable as a table in Dremio. For this, we’re attempting to use DBT to create an incremental materialization of this raw table (json.gz) into an Iceberg table. To stay under the ‘max_splits’ limit, we prune partitions dynamically down to a single day, and thus only access about 50k files. However, we get the following error:
Number of splits (2376473) in dataset S3 Data Lake Staging.some_bucket_raw.streams exceeds dataset split limit of 300000
After googling a bit, we found the following explanation:
“As partition pruning is done during planning, partition values have to be known before execution.”
We use the following query:
create table "S3 Data Lake Staging"."some_bucket_processed"."streams"
partition by ("dir0","dir1","dir2","dir3")
as (
with
streams as (
select * from "S3 Data Lake Staging"."some_bucket_raw"."streams"
),
incremental_selection as (
select *
from streams
where to_date(concat(split_part(dir0, '=', 2),
                     split_part(dir1, '=', 2),
                     split_part(dir2, '=', 2)), 'YYYYMMDD')
      > (select max(to_date(concat(split_part(dir0, '=', 2),
                                   split_part(dir1, '=', 2),
                                   split_part(dir2, '=', 2)), 'YYYYMMDD'))
         from "S3 Data Lake Staging"."some_bucket_processed"."streams")
)
select * from incremental_selection
)
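Given the explanation above, I assume the subquery over the processed table is the problem: its result isn’t known at planning time, so the planner can’t prune partitions and counts all 2,376,473 splits. A sketch of what we think the generated SQL would need to look like instead, with the cutoff resolved to a literal beforehand (e.g. by a separate query that DBT runs first; the date value is just a placeholder):

-- sketch: cutoff date resolved to a literal before planning, so partition
-- pruning can happen during planning. '2023-01-09' is a placeholder that a
-- preceding query would fill in. This assumes Dremio can prune through the
-- to_date/concat expression on dirN; comparing dir0/dir1/dir2 directly to
-- string literals might prune more reliably.
create table "S3 Data Lake Staging"."some_bucket_processed"."streams"
partition by ("dir0","dir1","dir2","dir3")
as (
select *
from "S3 Data Lake Staging"."some_bucket_raw"."streams"
where to_date(concat(split_part(dir0, '=', 2),
                     split_part(dir1, '=', 2),
                     split_part(dir2, '=', 2)), 'YYYYMMDD') > date '2023-01-09'
)

Does this approach make sense, or is there a recommended pattern for incremental loads like this?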