Iceberg query performance

Hi,

I have an Iceberg table (call it T) cataloged in AWS Glue and stored on S3.
The table has a simple schema (10 columns) and only a single partitioning column, call it k, using identity mapping.

I'm using a local Dremio engine (not a cluster) built from the master branch of the CE (Community Edition).

When I run the following query

SELECT *
FROM T
WHERE k = some-value

It takes about 3 seconds. Looking at the plan, I noticed it used a hash-partitioning operator.

Then, when I run

SELECT COUNT(*)
FROM T
WHERE k = some-value

It also takes about 3 seconds.

Here I would expect it to take less than 1 second, computing the result from metadata alone (specifically, from cached metadata).
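For comparison, in engines that expose Iceberg's metadata tables (Spark SQL syntax shown below, not Dremio syntax; the catalog and namespace names are placeholders), the same count can be computed from manifest statistics alone:

-- each row of the partitions metadata table carries a record_count
-- aggregated from the manifests, so no data files are scanned
SELECT SUM(record_count)
FROM glue_catalog.db.T.partitions
WHERE partition.k = 'some-value'

(This assumes the table has no row-level deletes that would make the manifest counts inexact.)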

Finally, when I run

SELECT *
FROM T
WHERE k = some-value
LIMIT 1

It takes about 16 seconds. Looking at the plan, I noticed it didn't use a hash-partitioning operator at all.
Here I would expect it to use hash partitioning (like in the original query) and take less than 3 seconds, because it only needs to return a single result record.

Please advise.

@shragj Are you able to share the 3 job profiles?

I uploaded the 3 profiles

fdc4a669-0cdc-4a4c-9e8c-8113dddd529b.zip (22.5 KB)
af1b131b-0e24-4873-8911-5c97c7ac53d3.zip (12.0 KB)
a5c0ed8f-dcc1-4a53-8ffa-67152faba3d3.zip (28.7 KB)

P.S.

On a local (single-node) deployment, does Dremio auto-configure the cloud cache? If so, where should I look for the cached data (presumably under the installation folder)? If not, what is the minimal configuration needed?
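Is it something like the following in dremio.conf on the executor? (These key names are my understanding from the configuration reference; the path and quota values are placeholders.)

services.executor.cache.path.db: "/opt/dremio/cloudcache"
services.executor.cache.path.fs: ["/opt/dremio/cloudcache"]
services.executor.cache.pctquota.db: 70
services.executor.cache.pctquota.fs: [100]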

Another example: when running:

SELECT DISTINCT k
FROM T

I had to cancel the query after more than 4 minutes.

It seems the query planner is not using the Iceberg metadata effectively; given that k is the [only] partitioning key, this query could be answered entirely from the latest snapshot's metadata.
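For what it's worth, the Spark SQL equivalent against Iceberg's partitions metadata table (again, catalog and namespace names are placeholders, and this is not Dremio syntax) returns exactly one row per distinct partition value without touching any data files:

-- one row per partition, read from the manifests only
SELECT partition.k
FROM glue_catalog.db.T.partitions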

Hi,

Any input on why the Iceberg queries are behaving the way they are?

Also, I’ve looked more closely at the jobs and the ICEBERG scan operator does not seem to use any cached data.

I'm trying to understand whether that's because a local, single-node setup does not use the Cloud Cache, or something else.

@shragj Out of the 3 profiles, 2 completed in 2.6 s while one took 16 s. The 16 s one is a single-threaded scan due to the LIMIT 1.

Caching only helps reduce wait time, but the bulk of the time was spent in processing. Also, the cloud cache is only supported for Parquet files, while the TABLE_FUNCTION you see that took 15 s is an Avro scan (reading Iceberg metadata files). Would you be able to run the LIMIT 1 query again and send us the profile and the server.log from the executor it ran on?