Iceberg query performance with many parquet files

Hi Dremio community,

I have an Iceberg table with Glue catalog. This Iceberg table has about 60 records per file (hourly partitioned), and the number of files have grown to about 300 files over the last couple of weeks. To be very exact, each record was written as a single parquet file, but was later compacted.

The query is currently taking about 20s for what seems to be very simple SELECT statement

It seems weird to me that Dremio cannot keep up with just a few hundreds of parquet files.
I currently have 3 executors with 3 CPU cores and 16 GB memory each

Where should I look into first in order to significantly improve the query latency so queries at this scale is definitely under 1s?

@kyleahn Are you able to send us the job profile, so we can see where in execution the time was spent>

How To Share A Query Profile | Dremio.

here! Thanks a lot!
c933dd01-e4e7-48e6-8c10-c3da2f639cba.zip (21.7 KB)

@kyleahn Here is the problem

There are only 6043 records and hence Dremio is single threaded. For every 100K rows estimated, there is one thread planned. But these 6043 files are spread across 291 files, see NUM_READERS in operator metrics of TABLE_FUNCTION 00-00-09 and all the 16 seconds are spent waiting (IO wait) reading the 291 files. I see NUM_CACHE_HITS is zero, see operator metrics

What happens if you run the query 3 times in a row? Does the third one run fast?

The third one does not necessarily run much faster. No matter how many queries are run, they all run in about 19s with += 3s.

Is there additional query optimization that I can implement such as raw data reflection? They didn’t seem to be helping much.

@kyleahn Subsequent runs should have used C3 cache, can you please run it about 3 times in a row and send the 3 profiles?

Hi @balaji.ramaswamy

Ah, this isn’t exactly the same number of records anymore as there has been more records added, but here are the 3 profiles of the same iceberg table

It took more than 3 minutes the first time, and then about 15s in the second and the third query.

first
8314bc18-f5a5-449a-80ee-148d0dc118cb.zip (34.3 KB)
second
1cbd8ec2-ebde-48f8-87b8-ec4968aa2d7b.zip (34.4 KB)
third
3d6bd372-d0af-4f48-a736-07dbedb0314b.zip (33.6 KB)

I have a separate question though. These iceberg tables are partitioned by a “timestamp” column. If I were to filter by some date range on this “timestamp” column, shouldn’t this reduce the number of files scanned, and thus should reduce the overall query execution time? It does not seem to achieve that.

thanks!

@kyleahn Partition pruning should happen if you filter on a partition column, is it in the same profile you think pruning is not happening?

I wasn’t seeing much performance improvement when WHERE clause was added to filter on a partitioned column.

Technically, if I were to filter on the only the latest hour which only contains less than 20 files, shouldn’t it take under 10 seconds?

by the way thank you so much for helping out!

@kyleahn Maybe the time is spent elsewhere, are you able to send the profile where you used the partition column and read only 20 files?

Sure! The column name is “timestamp”. Does this read all 20 files or not?

d8fc1121-6e80-45a8-a4ab-d50903705143.zip (18.6 KB)

@balaji.ramaswamy bump! if you haven’t seen my reply!

@kyleahn Reads 3 datafiles, the column “timestamp” is a partition column and is getting partitioned pruned. Click on the planning tab and see last line in “Final Physical Transformation”, EPOCH time 1689019200000000 is pushed which is Monday, July 10, 2023 8:00:00 PM UTC

1 Like