Iceberg query performance with many parquet files

kyleahn · June 29, 2023, 4:05pm

Hi Dremio community,

I have an Iceberg table with a Glue catalog. The table has about 60 records per file (hourly partitioned), and the number of files has grown to about 300 over the last couple of weeks. To be exact, each record was originally written as a single parquet file, and the files were later compacted.

The query is currently taking about 20s for what seems to be a very simple SELECT statement.

It seems weird to me that Dremio cannot keep up with just a few hundred parquet files.
I currently have 3 executors with 3 CPU cores and 16 GB memory each.

Where should I look first in order to significantly improve the query latency, so that queries at this scale are reliably under 1s?

balaji.ramaswamy · July 2, 2023, 5:59pm

@kyleahn Are you able to send us the job profile, so we can see where in execution the time was spent?

How To Share A Query Profile | Dremio.

kyleahn · July 3, 2023, 6:47am

here! Thanks a lot!
c933dd01-e4e7-48e6-8c10-c3da2f639cba.zip (21.7 KB)

balaji.ramaswamy · July 5, 2023, 12:57am

@kyleahn Here is the problem:

There are only 6043 records, so the query is planned single-threaded: one thread is planned for every 100K estimated rows. But these 6043 records are spread across 291 files (see NUM_READERS in the operator metrics of TABLE_FUNCTION 00-00-09), and all 16 seconds are spent in IO wait reading those 291 files. I also see that NUM_CACHE_HITS is zero in the operator metrics.
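As a rough back-of-the-envelope check of those numbers, here is a minimal Python sketch. The `max(1, ceil(rows / 100_000))` formula is only an assumed reading of the "one thread per 100K estimated rows" rule of thumb, not Dremio's actual planner logic, and all variable names are made up for illustration.

```python
import math

# Figures quoted from the job profile discussed above.
ROWS_PER_THREAD = 100_000   # rule of thumb: one thread per 100K estimated rows
estimated_rows = 6043
num_files = 291             # NUM_READERS
io_wait_seconds = 16.0

# Assumed reading of the rule of thumb (not Dremio's real planner code).
threads = max(1, math.ceil(estimated_rows / ROWS_PER_THREAD))

# With a single thread, the files are read one after another,
# so the wait is roughly the per-file latency times the file count.
avg_wait_ms = io_wait_seconds / num_files * 1000
rows_per_file = estimated_rows / num_files

print(threads)                  # 1
print(round(avg_wait_ms))       # 55
print(round(rows_per_file, 1))  # 20.8
```

In other words, roughly 55 ms of wait per file for about 21 records each, which suggests the time goes to per-file overhead rather than to the volume of data.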

What happens if you run the query 3 times in a row? Does the third one run fast?

kyleahn · July 5, 2023, 4:40pm

The third one does not necessarily run much faster. No matter how many times the query is run, it takes about 19s ± 3s.

Are there additional query optimizations I can implement, such as a raw data reflection? Those didn’t seem to help much.

balaji.ramaswamy · July 6, 2023, 4:49am

@kyleahn Subsequent runs should have used the C3 cache. Can you please run it 3 times in a row and send the 3 profiles?

kyleahn · July 6, 2023, 5:32am

Hi @balaji.ramaswamy

Ah, this isn’t exactly the same number of records anymore, as more records have been added, but here are the 3 profiles for the same Iceberg table.

It took more than 3 minutes the first time, and then about 15s for the second and third queries.

first
8314bc18-f5a5-449a-80ee-148d0dc118cb.zip (34.3 KB)
second
1cbd8ec2-ebde-48f8-87b8-ec4968aa2d7b.zip (34.4 KB)
third
3d6bd372-d0af-4f48-a736-07dbedb0314b.zip (33.6 KB)

I have a separate question though. These Iceberg tables are partitioned by a “timestamp” column. If I filter by some date range on this “timestamp” column, shouldn’t that reduce the number of files scanned, and thus the overall query execution time? It does not seem to achieve that.

thanks!

balaji.ramaswamy · July 6, 2023, 5:47am

@kyleahn Partition pruning should happen if you filter on a partition column. Is this the same profile in which you think pruning is not happening?
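To illustrate what pruning on an hourly partition column means in principle, here is a small Python sketch. It is a toy model only: the file list, the paths, and the `prune` helper are hypothetical and do not represent Dremio's or Iceberg's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical manifest: one entry per data file, tagged with the hourly
# partition value it belongs to (3 files per hour over one day).
day_start = datetime(2023, 7, 10, 0, tzinfo=timezone.utc)
files = [
    {"path": f"data/hour={h:02d}/part-{i}.parquet",
     "hour": day_start + timedelta(hours=h)}
    for h in range(24)
    for i in range(3)
]

def prune(files, lower, upper):
    """Keep only files whose partition hour lies in [lower, upper)."""
    return [f for f in files if lower <= f["hour"] < upper]

# A filter on the partition column for a single hour.
lower = datetime(2023, 7, 10, 20, tzinfo=timezone.utc)
upper = lower + timedelta(hours=1)
selected = prune(files, lower, upper)

print(len(files), len(selected))  # 72 3
```

Only the files in the matching partition need to be opened, so the number of files read drops with the width of the filter range.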

kyleahn · July 6, 2023, 5:51am

I wasn’t seeing much performance improvement when a WHERE clause was added to filter on a partition column.

Technically, if I filter on only the latest hour, which contains fewer than 20 files, shouldn’t it take under 10 seconds?

by the way thank you so much for helping out!

balaji.ramaswamy · July 11, 2023, 5:07am

@kyleahn Maybe the time is spent elsewhere. Are you able to send the profile where you filtered on the partition column and read only 20 files?

kyleahn · July 11, 2023, 7:38am

Sure! The column name is “timestamp”. Does this read all 20 files or not?

d8fc1121-6e80-45a8-a4ab-d50903705143.zip (18.6 KB)

kyleahn · July 17, 2023, 8:09pm

@balaji.ramaswamy Bump, in case you haven’t seen my reply!

balaji.ramaswamy · July 22, 2023, 7:00am

@kyleahn It reads 3 data files. The column “timestamp” is a partition column and is being partition pruned. Click on the Planning tab and look at the last line under “Final Physical Transformation”: the epoch time 1689019200000000 is pushed down, which is Monday, July 10, 2023 8:00:00 PM UTC.
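For anyone who wants to verify that conversion themselves, a short Python sketch follows. It assumes the pushed-down value is in microseconds since the Unix epoch, which is consistent with the date quoted above.

```python
from datetime import datetime, timezone

# Value from the "Final Physical Transformation" line, assumed to be
# microseconds since 1970-01-01 00:00:00 UTC.
epoch_micros = 1689019200000000

# Integer division avoids any floating-point rounding.
dt = datetime.fromtimestamp(epoch_micros // 1_000_000, tz=timezone.utc)

print(dt.isoformat())  # 2023-07-10T20:00:00+00:00
```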
