Hi there,
We have issues with converting a raw table (jsonlines.gz) into an Iceberg table. I’ll try to sketch the situation to see if we’re on the right track.
Question
- How do we increase or disable the max_splits=300000 value in the Dremio Cloud configuration? I’ve found documentation on how to do so in Dremio Software, but not in Dremio Cloud. See this link for the DBT recommendation.
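For reference, what I found for Dremio Software looks like the sketch below; the support key name is my assumption based on forum posts, and I couldn’t find an equivalent for Dremio Cloud:

-- Dremio Software only (sketch; key name "dremio.store.dfs.max_splits" is
-- assumed from forum posts, please correct me if it's wrong). Dremio Cloud
-- doesn't seem to expose support keys, hence this question.
alter system set "dremio.store.dfs.max_splits" = 1000000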
Context
From an external party, we import about 60 GB (± 800 million rows) into S3 on a daily basis, spread over 4 partition levels. For one day of data, we import about 50k files. The folder structure of the raw data looks as follows:
/year=2023/month=01/day=10/source1/data_file_US.json.gz
/year=2023/month=01/day=10/source1/data_file_CA.json.gz
...
/year=2023/month=01/day=10/source2/data_file_US.json.gz
/year=2023/month=01/day=10/source2/data_file_GB.json.gz
...
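For reference, Dremio exposes each folder level as an implicit dirN column, which is what the partitioning and pruning below rely on: dir0 = 'year=2023', dir1 = 'month=01', dir2 = 'day=10', dir3 = 'source1' / 'source2'. A quick sanity check restricted to one day, so it stays under the split limit (sketch; the literal values are examples):

-- sketch: inspect the implicit dirN columns for a single day of data
select dir0, dir1, dir2, dir3, count(*) as row_count
from "S3 Data Lake Staging"."some_bucket_raw"."streams"
where dir0 = 'year=2023' and dir1 = 'month=01' and dir2 = 'day=10'
group by dir0, dir1, dir2, dir3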
We want to make this data queryable as a table in Dremio. For this, we’re attempting to use DBT to create an incremental materialization of this raw table (json.gz) into an Iceberg table. To stay under the ‘max_splits’ limit, we prune partitions dynamically down to a single day, and thus only access about 50k files. However, we get the following error:
Number of splits (2376473) in dataset S3 Data Lake Staging.some_bucket_raw.streams exceeds dataset split limit of 300000
After googling a bit, we found the following explanation:
“As partition pruning is done during planning, partition values have to be known before execution.”
We use the following query:
create table "S3 Data Lake Staging"."some_bucket_processed"."streams"
partition by ("dir0","dir1","dir2","dir3")
as (
with
streams as (
select * from "S3 Data Lake Staging"."some_bucket_raw"."streams"
),
incremental_selection as (
select *
from streams
where to_date(concat(split_part(dir0, '=', 2),
                     split_part(dir1, '=', 2),
                     split_part(dir2, '=', 2)), 'YYYYMMDD')
      > (select max(to_date(concat(split_part(dir0, '=', 2),
                                   split_part(dir1, '=', 2),
                                   split_part(dir2, '=', 2)), 'YYYYMMDD'))
         from "S3 Data Lake Staging"."some_bucket_processed"."streams")
)
select * from incremental_selection
)
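Given the explanation above, I assume the subquery over the processed table is the problem: its result isn’t known at planning time, so the planner can’t prune partitions and counts all 2,376,473 splits. A sketch of what we think the generated SQL would need to look like instead, with the cutoff resolved to a literal beforehand (e.g. by a separate query that DBT runs first; the date value is just a placeholder):

-- sketch: cutoff date resolved to a literal before planning, so partition
-- pruning can happen during planning. '2023-01-09' is a placeholder that a
-- preceding query would fill in. This assumes Dremio can prune through the
-- to_date/concat expression on dirN; comparing dir0/dir1/dir2 directly to
-- string literals might prune more reliably.
create table "S3 Data Lake Staging"."some_bucket_processed"."streams"
partition by ("dir0","dir1","dir2","dir3")
as (
select *
from "S3 Data Lake Staging"."some_bucket_raw"."streams"
where to_date(concat(split_part(dir0, '=', 2),
                     split_part(dir1, '=', 2),
                     split_part(dir2, '=', 2)), 'YYYYMMDD') > date '2023-01-09'
)

Does this approach make sense, or is there a recommended pattern for incremental loads like this?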