I have an Iceberg table (call it T) cataloged in AWS Glue and stored on S3.
The table has a simple schema (10 columns) and a single partition column, call it k, which uses the identity transform.
I'm using a local Dremio engine (not a cluster) built from the master branch of the CE.
When I run the following query:
SELECT *
FROM T
WHERE k = some-value
It takes about 3 seconds. Looking at the plan, I noticed it uses a hash partitioning operator.
Then, when I run
SELECT COUNT(*)
FROM T
WHERE k = some-value
It also takes about 3 seconds.
Here I would expect it to take under 1 second, computing the result from metadata alone (specifically, cached metadata).
Finally, when I run
SELECT *
FROM T
WHERE k = some-value
LIMIT 1
It takes about 16 seconds. Looking at the plan, I noticed it didn't use a hash partitioning operator at all.
Here I would expect it to use hash partitioning (like in the original query) and take less than 3 seconds, because it only needs to return a single record.
On a local (single-node) deployment, does Dremio auto-configure the cloud cache? If so, where should I look for the cached data (presumably under the installation folder)? If not, what is the minimal configuration needed?
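For reference, the cloud cache (C3) is configured in dremio.conf on the executor. The sketch below uses the key names as I recall them from the Dremio configuration docs; please verify them against the docs for your build, and note the path and quota values are only illustrative:

```hocon
# dremio.conf on the executor: cloud cache (C3) settings (sketch).
# Key names recalled from the Dremio docs; path and quotas are illustrative.
services.executor.cache.path.db: "/var/lib/dremio/c3"    # cache metadata store
services.executor.cache.path.fs: ["/var/lib/dremio/c3"]  # cached data blocks
services.executor.cache.pctquota.db: 5                   # percent of the volume for the db
services.executor.cache.pctquota.fs: [70]                # percent of the volume for cached data
```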
It seems the query planner is not using the Iceberg metadata effectively; given that k is the [only] partition key, this query could be answered entirely from the latest snapshot's metadata.
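To illustrate the point above with a toy sketch (plain Python, not Dremio's actual planner code, and the entries are made up): Iceberg manifests record each data file's partition value and record count, so a COUNT(*) with a predicate on an identity-partition column reduces to summing record_count over the matching manifest entries, without opening any data files.

```python
# Toy model of Iceberg manifest entries: each data file carries its
# partition tuple (identity transform on k) and a record count.
# The entries below are invented for illustration.
manifest_entries = [
    {"path": "s3://bucket/T/k=a/f1.parquet", "partition": {"k": "a"}, "record_count": 1200},
    {"path": "s3://bucket/T/k=a/f2.parquet", "partition": {"k": "a"}, "record_count": 800},
    {"path": "s3://bucket/T/k=b/f3.parquet", "partition": {"k": "b"}, "record_count": 500},
]

def count_rows(entries, k_value):
    """Answer SELECT COUNT(*) FROM T WHERE k = k_value purely from
    manifest metadata: sum record_count over the files whose identity
    partition value matches the predicate."""
    return sum(e["record_count"] for e in entries
               if e["partition"]["k"] == k_value)

print(count_rows(manifest_entries, "a"))  # 2000 (1200 + 800)
```

This only works because k uses the identity transform, so the partition value equals the column value and the predicate can be evaluated exactly against the manifest entries.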
@shragj Out of the 3 profiles, 2 completed in 2.6s while one took 16s. The 16s one is a single-threaded scan due to the LIMIT 1.
Caching only helps reduce wait time, but the bulk of the time was spent in processing. Also, the cloud cache is only supported for Parquet files, while the table_function you see that took 15s is an Avro scan. Would you be able to run the LIMIT 1 query and send us the profile and the server.log from the executor it ran on?