VACUUM CATALOG failing with various errors

Hi there,

We’re attempting to utilize VACUUM CATALOG with our Dremio/Nessie/MinIO lakehouse, and we cannot get it to reliably succeed.

We don’t have a very big data lakehouse - about 250GB with a little under 600k objects in our S3 bucket. VACUUM CATALOG runs anywhere from 5 minutes to up to 2.5 hours and then fails.

We’ve had the following errors so far:

  • ExecutionSetupException: One or more nodes lost connectivity during query. (after repeated Full GCs)
  • Timeout waiting for an S3 connection from the pool
  • Query was cancelled because it exceeded the memory limits set by the administrator.

We’ve tried:

  • Increasing heap size in increments up to 48GB
  • Increasing fs.s3a.connection.maximum to 10000
  • Increasing fs.s3a.threads.max to 100
  • Adjusting JVM GC tuning starting with Dremio recs, can provide details if needed
  • Adjusting other S3 connection parameters

The documentation isn’t much help because VACUUM CATALOG’s processing profile seems very different than queries, as it’s entirely on-heap. We also noticed that one executor is much more heavily taxed than the others since it’s responsible for both Phase 5 and part of Phase 3 on the operation.

We also don’t understand why there’s so much memory usage for this operation, and we’re not sure how this is going to scale as the lakehouse grows. This operation seems to gobble up as much heap memory as we throw at it, at a much higher proportion than the data and metadata set it’s operating on.

Any help you can provide would be appreciated, thanks!

Can you please share the profile and a heap dump of the executors.

I can share a profile to start. We’re still working on getting a representative heap dump generated.

89c16077-42ef-4520-87d3-7f793e7a177b.zip (360.2 KB)

EDIT: We were finally able to get a vacuum to complete overnight, so here’s a profile of a success for comparison.

095d251c-6f59-4ccb-8856-7de066428024.zip (636.7 KB)

Thanks!

I’ve following observations so far

  • Both profiles processed almost same amount of data.
  • The operator in VACUUM, which parsed over commit history and identified live snapshots completed it’s task successfully.
  • Lot of instances of slow IO recorded while reading metadata json files from S3. Definitely some IO issues exist wrt S3 interactions.
  • We’ve noticed in past, that pooled S3 client can build up heap pressure on executors on high number of concurrent requests. The heap dump would help in confirming this theory. Reducing the parallelism and increasing the timeout is generally a measure for such cases. Increasing parallelism might worsen the condition if it is heap pressure.
    • Did you try setting fs.s3a.connection.timeout , fs.s3a.connection.establish.timeout , fs.s3a.attempts.maximum to more liberal values?
    • Were there any other queries running in parallel, which dealt with same S3 account?
  • There’s a debug handle to reduce parallelism solely on the expiry action operator. Set vacuum.expiry_action.parallelism in the S3 source > Source Settings > Advanced Options > Connection Properties to a lower value (eg 4). This defaults to 12. A tradeoff is that expiry might take longer to complete.

We can analyse the heap dump once it is available.

Thanks for the detailed response.

The heap dump is large at ~250MB, and it looks like the file upload limit here is 8 MB. Is there an alternate way to deliver it?

dremio-exec-2.zip (1.9 MB)

I’ll include the profile and our Grafana metrics as well. This run happened to be successful but it is still failing most of the time with heap exhaustion issues.

I’m glad you caught the slow metadata IO times, we’re seeing the same thing but are unsure where those are originating other than potentially parallelism via the Dremio S3 IO client operators. Specifically we see every next metadata IO time take slightly longer than previous, until something catches up. We’re seeing this happen on every run no matter what our s3a pool connection settings are. If fs.s3a.connection.maximum is set too low we timeout on waiting for an available pool connection during a Full GC or during when this IO latency is happening.

MinIO seems to have no trouble at all serving up these files, and we’re not seeing any latency on that side at all.

Answers to questions in-line:

  • Did you try setting fs.s3a.connection.timeout , fs.s3a.connection.establish.timeout , fs.s3a.attempts.maximum to more liberal values? Not yet but willing to do so.
  • Were there any other queries running in parallel, which dealt with same S3 account? No, all other queries are off while we’re completing vacuum catalog testing.

We tried setting vacuum.expiry_action.parallelism to 4 but didn’t see a substantial difference. As you mentioned the expiry action isn’t taking much time and is always succeeding. It seems to be the orphan file scan operations that are causing the time and memory pressure.

Apologies for delay.

From the heap dump, I wanted to take a look at the dominators. Is it possible to get a raw jmap, that we can inspect?

MinIO seems to have no trouble at all serving up these files, and we’re not seeing any latency on that side at all.

Can we rule-out network related aspects as well?

We’re running JFR via Cryostat on the executors, so I can send one of those over if that would be helpful. Can I send that via email rather than post it publicly?

From a network perspective, we’re running both Dremio execs and MinIO in the same kubernetes cluster, and the nodes are right next to each other network-wise. We’re still looking into this but nothing conclusive yet, we have not been able to get the metadata IO times to come down.

We’ve tried:

  • Scaling MinIO resources and pool members up
  • Scaling Dremio execs up in heap
  • Going highly parallel or highly serial with planner.slice_target - This primarily affects overall runtime by parallelizing Phase 3, which contributes the most to overall vacuum run time. But no effect on Phase 5 snapshot retrieval. The executor that runs Phase 5 and part of Phase 3 is slammed with heap pressure in highly parallel configs via the combined effects, and the memory monitor will kill the query or the exec will drop from ZK.

Reviewing the JFR we’ve noticed that all of the heap pressure comes from the Nessie Commit Reader phase (Phase 5 in the above profile) at the point where we’re retrieving the snapshot metadata files with 2k+ fragments. If we get past this phase, the vacuum will usually complete successfully. Often what will happen is, at some point, a full GC on the exec running this phase cascades into failure. It seems there’s a feedback loop between GC → IO latency → GC, etc until the executor drops the zookeeper connection.

Some more data from investigating further today.

We were able to reduce IO times by ~25-30% by re-architecting the MinIO nodes to protect network bandwidth and compute resources, so that’s a step in the right direction.

We’re still investigating the reason for this heap pressure on the early phases, and what we’re seeing is that it’s coming from heavy allocations by hadoop.fs.Path and its string manipulation via the Iceberg TableMetadataParser. TableMetadataParser is called from the NessieCommitRecordReader and IcebergLocationTableFinder classes in early phases.

The high fragmentation of this phase leads to extremely heavy heap allocations overall, leading to rapid GCs and eventually multiple full GC pauses even with GC tuned to mark extremely low at 15% and a 30 GB heap.

As I mention above, if we can get past this phase without the executor dropping out, then the rest of the vacuum proceeds normally as TableMetadataParser is not used for the Manifest Scan.

A couple of questions from here:

  1. How does Dremio run VACUUM CATALOG in your cloud offering? Wouldn’t you run into this same issue? We don’t really understand how it could function with out-of-the-box heap sizes, such as those specified in the Helm charts. We also don’t understand how this could scale to larger data lake sizes with terabytes of data and millions of files.

  2. If the root issue is potentially in an upstream dependency, are there any configuration options we can use to work around this? Maybe by constraining fragmentation in these phases? The fragments expect the S3 connection pool to keep up so we aren’t seeing any options yet for reducing parallelism to keep memory allocation in check.

Any updates?

We’re now at a 90 GB heap and still running out of memory, so I’m not sure we can outrun this scan with more resources. We may be forced to turn vacuuming off until there’s a way to reduce the allocation rate.

@batemanst Let me check internally and get back to you

Thanks
Bali

@batemanst Is it possible to attach the JFR?

I can send it over but would prefer to email it rather than post it publicly here. Is there a support email I can send it to?

Sent an email to you