Iceberg file size on Dremio

Hi,

We generated a Parquet file of size 1 GiB and uploaded it to Azure storage.
The file holds approximately 8,460,000 records with 30 columns.

After moving this data to an Iceberg table, I can see that 20 files of 256 MB each were generated (about 5 GB in total).

I understand the default file size on Dremio is 256 MB, but the volume of data after moving to Iceberg is 5x the original.

@balaji.ramaswamy: Am I missing a compression config? I need some help on this.
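For example, I was looking for a writer-side support key along these lines; I am not sure this is the exact key name or whether it applies to our Dremio version:

-- Assumption: the support key name and accepted values may differ by Dremio version
ALTER SYSTEM SET "store.parquet.compression" = 'gzip';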

Thanks
Ajay

@ajay098 Is data + metadata 5 GB?

Yes. The file has 10 days of data, and the Iceberg table is partitioned by day.
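Roughly, the DDL was of this shape (the table and column names here are illustrative, not our exact statement):

-- Illustrative DDL only; real table and column names differ
CREATE TABLE lake.prod.my_table
  PARTITION BY (event_day)
AS SELECT * FROM source_stage.parquet_file;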

Parquet file (screenshot):

After creating the Iceberg table (screenshot):

@ajay098 Validating; will get back to you. Please ping if you do not hear back soon.

@ajay098 How did you create the Iceberg tables? Via Dremio, or via an external tool? If via Dremio, are you able to send us the profile for the DDL/DML job?

Hi,

Unfortunately, jobs have a retention of 7 days, so that profile is no longer available.

We created table via Dremio.

The files were created with the following info:
library version: 1.5.0
parquet-avro version: 1.13.1

Parquet version: V1
Compression technique: gzip
Encoding: PLAIN_DICTIONARY

Thanks
Ajay

@ajay098 Is it possible to run the CREATE TABLE again and send us the profile?

f0ff05c3-0d09-45e6-931f-06c211596f1d.zip (30.0 KB)

Hi,

Thanks for looking into this.

I have attached the job profile.

Regards
Ajay

Thanks, will review this and get back to you

I have a similar situation. I run a MERGE statement every 15 minutes, and each run creates a new snapshot. I have already run OPTIMIZE and VACUUM many times, but the snapshots are not expired; we currently have 400 GB of disk used, while I measured the real size, which should be at most 3 GB with 13M rows.
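For context, the MERGE is shaped roughly like this (the staging source and column names below are placeholders, not our real ones):

-- Placeholder names; the real source table and columns differ
MERGE INTO lake.prod.posts t
USING lake.staging.posts_updates s
ON t.post_id = s.post_id
WHEN MATCHED THEN UPDATE SET title = s.title
WHEN NOT MATCHED THEN INSERT VALUES (s.post_id, s.title);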

And this can be validated by running these queries:

select SUM(record_count) from TABLE( table_files( 'lake.prod.posts' ) );

select SUM(file_size_in_bytes) from TABLE( table_files( 'lake.prod.posts' ) );

So I don't understand why the snapshots are not expired and the files are not deleted.

@dacopan What version of Dremio is this?

We are using 25.0.0-202404051521110861-ed9515a8 and 24.1.0-202306130653310132-d30779f6.

@balaji.ramaswamy Some data to help understand the situation:

SELECT COUNT(*) from lake.prod.posts;  
--returns 13522392

select * from TABLE( table_history( 'lake.prod.posts' ) );
-- returns 2117 snapshots, all with is_current_ancestor = true

select * from TABLE( table_snapshot( 'lake.prod.posts' ) );
-- returns the same 2117 snapshots

select * from TABLE( table_manifests( 'lake.prod.posts' ) );
-- returns 16 rows, all with the same added_snapshot_id = 5075613704781762078

select * from TABLE( table_files( 'lake.prod.posts' ) );
-- returns 4018 rows, all pointing to folder file:///opt/dremio/data/prod/posts/19b739d4-15fd-eba8-5cb6-76777c0c0400
-- so I think they all point to the last snapshot

select SUM(record_count) from TABLE( table_files( 'lake.prod.posts' ) );
-- returns 13522392

select SUM(file_size_in_bytes) from TABLE( table_files( 'lake.prod.posts' ) );
-- returns 2297772830 bytes, approx 2.29 GB

As you can see in the last query, the data is approximately 2.29 GB, but disk usage is 479 GB for data and 2.1 GB for metadata.

Folder structure of the data:

tree.zip (249.3 KB)

@dacopan When you ran the VACUUM command, what values did you give to expire, as in the SQL below?

VACUUM TABLE s3.SF_Incidents2016
    EXPIRE SNAPSHOTS older_than '2023-04-20 00:00:00.000' retain_last 20;
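For example, against your table it could look like this (the timestamp is only an illustration):

VACUUM TABLE lake.prod.posts
    EXPIRE SNAPSHOTS older_than '2024-06-01 00:00:00.000' retain_last 1;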

This command returns zero in all columns.
I even tried running OPTIMIZE, then refreshing metadata, and then VACUUM again.
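Concretely, the sequence I ran was along these lines (the VACUUM arguments here are illustrative; actual values varied across runs):

OPTIMIZE TABLE lake.prod.posts;
ALTER TABLE lake.prod.posts REFRESH METADATA;
-- illustrative arguments only
VACUUM TABLE lake.prod.posts EXPIRE SNAPSHOTS retain_last 1;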

I tried creating a new table again and ran some MERGE statements, and the behavior is similar: the snapshots are retained. I debugged Dremio and found that the snapshots are kept because each one is a current ancestor.

And, even stranger, if I create another table using a CTAS from this table, the size of the new table (as returned by table_files, not the real size on disk) is approximately 0.5x the original, so I think each snapshot is a copy of all the data.
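For reference, the copy was made with a plain CTAS like this (the target name is just an example):

-- Example target name; a straight copy of the table's current data
CREATE TABLE lake.prod.posts_copy AS SELECT * FROM lake.prod.posts;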

@dacopan If you don't mind, are you able to send the profile?

We have many runs of this; I hope this is the correct profile (if not, we can set up a new scenario and test again), because we already dropped the table, created it again, and the behavior continues.
Vacuum:
4a71b7ff-e0e6-4226-adbe-c887bb31ebca.zip (12.4 KB)
Optimize:
9722d694-6851-421c-a47b-063ecddabda0.zip (31.5 KB)

Hi @balaji.ramaswamy, we ran a new test reproducing the same behavior.
I am attaching more information that may help. We ran the following queries (before and after the vacuum); the vacuum query profile is also attached:

select * from TABLE( table_history( 'lake.prod.posts' ) );

select * from TABLE( table_snapshot( 'lake.prod.posts' ) );

select * from TABLE( table_files( 'lake.prod.posts' ) );

Before the vacuum: 233 GB on disk, table_files returned > 4030 rows.

After the vacuum: 209 GB on disk.

select SUM(file_size_in_bytes) from TABLE( table_files( 'lake.prod.posts' ) );

-- returns 2151709359

As you can see, after the vacuum we have 209 GB on disk, which does not correspond to the real data size, and does not even correspond to the sum of file_size_in_bytes.

Also attached is the folder hierarchy before (test3.txt) and after the vacuum (test3_after.txt).

vacuum3.zip (3.8 MB)

Thank you for your help; I think this is a good case for optimizing Dremio.

I am available for anything you need, including access to our environment and data, or a remote session if necessary.

@dacopan

Which catalog do you use to create your Iceberg table?

Thanks
Bali