Iceberg file size on Dremio

Hi,

We generated a Parquet file of size 1 GiB and uploaded it to Azure storage.
The file holds approximately 8,460,000 records with 30 columns.

After moving this data to an Iceberg table, I can see that 20 files of 256 MB each were generated (about 5 GB in total).

I understand the default file size on Dremio is 256 MB, but the volume of data after moving to Iceberg is 5x the original.

@balaji.ramaswamy: Am I missing a compression config? I need some help on this.
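For example, I was looking for a writer-side support key along these lines; I am not sure this is the exact key name or whether it applies to our Dremio version:

-- Assumption: the support key name and accepted values may differ by Dremio version
ALTER SYSTEM SET "store.parquet.compression" = 'gzip';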

Thanks
Ajay

@ajay098 Is data + metadata 5 GB?

Yes. The file has 10 days of data, and the Iceberg table is partitioned by day.
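Roughly, the DDL was of this shape (the table and column names here are illustrative, not our exact statement):

-- Illustrative DDL only; real table and column names differ
CREATE TABLE lake.prod.my_table
  PARTITION BY (event_day)
AS SELECT * FROM source_stage.parquet_file;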

Parquet file (screenshot):

After creating the Iceberg table (screenshot):

@ajay098 Validating; will get back to you. Please ping if you do not hear back soon.

@ajay098 How did you create the Iceberg tables? Via Dremio, or via an external tool? If via Dremio, are you able to send us the profile for the DDL/DML job?

Hi,

Unfortunately, jobs have a retention of 7 days, so that profile is no longer available.

We created table via Dremio.

The files were created with the following info:
library version: 1.5.0
parquet-avro version: 1.13.1

Parquet version: V1
Compression technique: gzip
Encoding: PLAIN_DICTIONARY

Thanks
Ajay

@ajay098 Is it possible to run the CREATE TABLE again and send us the profile?

f0ff05c3-0d09-45e6-931f-06c211596f1d.zip (30.0 KB)

Hi,

Thanks for looking into this.

I have attached the job profile.

Regards
Ajay

Thanks, will review this and get back to you

I have a similar situation. I run a MERGE statement every 15 minutes, and each run creates a new snapshot. I have already run OPTIMIZE and VACUUM many times, but the snapshots are not expired; we currently have 400 GB of disk used, while I measured the real size, which should be at most 3 GB with 13M rows.
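For context, the MERGE is shaped roughly like this (the staging source and column names below are placeholders, not our real ones):

-- Placeholder names; the real source table and columns differ
MERGE INTO lake.prod.posts t
USING lake.staging.posts_updates s
ON t.post_id = s.post_id
WHEN MATCHED THEN UPDATE SET title = s.title
WHEN NOT MATCHED THEN INSERT VALUES (s.post_id, s.title);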

And this can be validated by running these queries:

select SUM(record_count) from TABLE( table_files( 'lake.prod.posts' ) );

select SUM(file_size_in_bytes) from TABLE( table_files( 'lake.prod.posts' ) );

So I don't understand why the snapshots are not expired and the files are not deleted.

@dacopan What version of Dremio is this?

We are using 25.0.0-202404051521110861-ed9515a8 and 24.1.0-202306130653310132-d30779f6.

@balaji.ramaswamy Some data to help understand the situation:

SELECT COUNT(*) from lake.prod.posts;  
--returns 13522392

select * from TABLE( table_history( 'lake.prod.posts' ) );
-- returns 2117 snapshots, all with is_current_ancestor = true

select * from TABLE( table_snapshot( 'lake.prod.posts' ) );
-- returns the same 2117 snapshots

select * from TABLE( table_manifests( 'lake.prod.posts' ) );
-- returns 16 rows, all with the same added_snapshot_id = 5075613704781762078

select * from TABLE( table_files( 'lake.prod.posts' ) );
-- returns 4018 rows, all pointing to folder file:///opt/dremio/data/prod/posts/19b739d4-15fd-eba8-5cb6-76777c0c0400
-- so I think they all point to the last snapshot

select SUM(record_count) from TABLE( table_files( 'lake.prod.posts' ) );
-- returns 13522392

select SUM(file_size_in_bytes) from TABLE( table_files( 'lake.prod.posts' ) );
-- returns 2297772830 bytes, approx 2.29 GB

As you can see in the last query, the data is approximately 2.29 GB, but disk usage is 479 GB for data and 2.1 GB for metadata.

Folder structure of the data:

tree.zip (249.3 KB)

@dacopan When you ran the VACUUM command, what values did you give to expire, as in the SQL below?

VACUUM TABLE s3.SF_Incidents2016
    EXPIRE SNAPSHOTS older_than '2023-04-20 00:00:00.000' retain_last 20;
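For example, against your table it could look like this (the timestamp is only an illustration):

VACUUM TABLE lake.prod.posts
    EXPIRE SNAPSHOTS older_than '2024-06-01 00:00:00.000' retain_last 1;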

This command returns zero in all columns.
I even tried running OPTIMIZE, then refreshing metadata, and then VACUUM again.
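Concretely, the sequence I ran was along these lines (the VACUUM arguments here are illustrative; actual values varied across runs):

OPTIMIZE TABLE lake.prod.posts;
ALTER TABLE lake.prod.posts REFRESH METADATA;
-- illustrative arguments only
VACUUM TABLE lake.prod.posts EXPIRE SNAPSHOTS retain_last 1;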

I tried creating a new table again and ran some MERGE statements, and the behavior is similar: the snapshots are retained. I debugged Dremio and found that the snapshots are kept because each one is a current ancestor.

And, even stranger, if I create another table using a CTAS from this table, the size of the new table (as returned by table_files, not the real size on disk) is approximately 0.5x the original, so I think each snapshot is a copy of all the data.
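For reference, the copy was made with a plain CTAS like this (the target name is just an example):

-- Example target name; a straight copy of the table's current data
CREATE TABLE lake.prod.posts_copy AS SELECT * FROM lake.prod.posts;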

@dacopan If you don't mind, are you able to send the profile?

We have many runs of this; I hope this is the correct profile (if not, we can set up a new scenario and test again), because we already dropped the table, created it again, and the behavior continues.
Vacuum:
4a71b7ff-e0e6-4226-adbe-c887bb31ebca.zip (12.4 KB)
Optimize:
9722d694-6851-421c-a47b-063ecddabda0.zip (31.5 KB)

Hi @balaji.ramaswamy, we ran a new test reproducing the same behavior.
I am attaching more information that may help. We ran the following queries (before and after the vacuum); the vacuum query profile is also attached:

select * from TABLE( table_history( 'lake.prod.posts' ) );

select * from TABLE( table_snapshot( 'lake.prod.posts' ) );

select * from TABLE( table_files( 'lake.prod.posts' ) );

Before the vacuum: 233 GB on disk, table_files returned > 4030 rows.

After the vacuum: 209 GB on disk.

select SUM(file_size_in_bytes) from TABLE( table_files( 'lake.prod.posts' ) );

-- returns 2151709359

As you can see, after the vacuum we have 209 GB on disk, which does not correspond to the real data size, and does not even correspond to the sum of file_size_in_bytes.

Also attached is the folder hierarchy before (test3.txt) and after the vacuum (test3_after.txt).

vacuum3.zip (3.8 MB)

Thank you for your help; I think this is a good case for optimizing Dremio.

I am available for anything you need, including access to our environment and data, or a remote session if necessary.

@dacopan

Which catalog do you use to create your Iceberg table?

Thanks
Bali