RFC: Elastic executor auto-scaling for Kubernetes deployments
What this does
Elastic executor auto-scaling lets a Dremio coordinator dynamically increase Kubernetes executor pods when a query needs more compute. Scale-down is left to external tooling (KEDA, operator), so Dremio never interrupts running queries by scaling down under its feet.
I’ve implemented and tested this on a 3-node k3s cluster (1 control-plane + 2 workers). CLA signed. Branch ready. 10 unit tests green for the elastic module. Live end-to-end test verified.
The problem
In a Kubernetes deployment today, you must pre-provision executors statically. A cluster serving a mix of dashboard queries (cheap, low concurrency) and heavy analytical workloads (expensive, need several executors) must either:
- Over-provision permanently — paying for idle executors most of the time, or
- Under-provision — and let heavy queries queue or degrade
There is no mechanism in OSS Dremio to say “spin up N executors when a large query arrives.”
Relationship to Dremio Cloud Engines
Dremio Cloud offers a rich engine model: named compute pools (2XSmall–3XLarge), per-engine replica autoscaling managed by the control plane, and SQL-based routing rules evaluated before query planning (by user, group, job type, time-of-day, query label, and query attributes). This PR is a lightweight analog for self-managed OSS Kubernetes deployments—not a competing design, but a practical subset of the same idea.
| Dimension | Dremio Cloud | This PR |
|---|---|---|
| Tier selection | SQL routing rules, pre-planning | Plan cost threshold, post-planning |
| Scaling unit | Engine replica (multi-node group) | Individual executor pod |
| Scale-up | Async; query queues with Enqueued Time Limit | Synchronous; query blocks in waitForExecutors() |
| Scale-down | Control plane (Last Replica Auto-Stop, Drain Time Limit) | KEDA + metrics-exporter |
| Admin surface | Full UI + REST API | dremio.conf only |
| Multi-engine | Unlimited named engines per project | Two StatefulSets (small/large) |
| Concurrency cap | Per-replica, configurable | Pod-level resource limits only |
The ResourcePlatform interface maps naturally to Cloud’s engine concept: each implementation corresponds to one compute tier. The current PR ships two tiers driven by plan cost. A natural follow-up would add a RoutingPolicy SPI for rule-based routing (user, group, query type) that more closely mirrors Cloud’s WLM—making OSS a closer peer to the Cloud engine model over time.
Scaling hierarchy
Elastic executor scaling on Kubernetes operates in three layers with a clear separation of concerns:
Layer 1 — Dremio Java (imperative scale-up, query-triggered)
K8sPlatform.scaleExecutors(delta, tier) patches .spec.replicas on the
appropriate executor StatefulSet when a query arrives and insufficient
executors are available. The query blocks in waitForExecutors() until
executors register with ZK or the timeout expires.
Active when services.executor.elastic.enabled=true.
NEVER scales down — only adjusts replicas upward.
Layer 2 — KEDA + metrics-exporter (reactive scale-down, metric-driven)
A metrics-exporter sidecar polls Dremio's REST API for active jobs and
computes executor_desired_small/large. KEDA reads these integers and
scales StatefulSets down to 0 after a grace period (default 120s).
This is the ONLY mechanism for scale-down — Java never reduces replicas.
Can operate independently (Layer 1 disabled) or alongside it.
Layer 3 — Cluster Autoscaler (node-level, transparent)
When K8s cannot schedule new executor pods on existing nodes (resource
pressure), the Cluster Autoscaler provisions a new node automatically.
Operates transparently below Layers 1 and 2.
Why Java can’t delegate scale-up to KEDA
The allocate() contract is synchronous — it must not return until resources are assigned or denied. If Java set a metric and waited for KEDA to scale up, the query would add 25+ seconds of polling latency before any progress. KEDA has no mechanism to unblock a waiting thread. The direct K8sPlatform.scaleExecutors() call starts pod creation immediately.
The upstream PR contributes Layer 1. Layers 2 and 3 are reference operational configuration (metrics-exporter, KEDA ScaledObjects, Cluster Autoscaler setup) documented in docs/elastic-scaling-deployment.md.
How it works
Six new classes in services/resourcescheduler, wired into DACDaemonModule via conditional registry.bind().
1. ElasticAdmissionCalculator — pure logic, no I/O, easily testable
// Query cost -> executor count decision (thresholds are config-driven)
public int calculateRequiredExecutors(double planCost) {
if (planCost <= smallQueryThreshold) return 1; // <= 10M cost -> 1 executor
if (planCost <= mediumQueryThreshold) return 2; // <= 30M cost -> 2 executors
return 3; // > 30M cost -> 3 executors
}
Both thresholds are configurable via services.executor.elastic.small_query_threshold and medium_query_threshold.
2. ResourcePlatform — a minimal platform-agnostic interface (4 methods)
public interface ResourcePlatform {
int getReadyPodCount();
int getAvailableExecutors();
boolean waitForExecutors(int required, long timeout, TimeUnit unit) throws InterruptedException;
boolean scaleExecutors(int delta); // +N = scale up, -N = scale down
// tier-aware overload — default delegates to scaleExecutors(delta) for backward compat
default boolean scaleExecutors(int delta, ExecutorTier tier) { return scaleExecutors(delta); }
}
Today’s K8sPlatform implementation calls fabric8 to adjust .spec.replicas on a pre-existing StatefulSet. It does not create StatefulSets or ConfigMaps — the operator provisions those once, the Java code only touches the replica count.
3. ElasticResourceAllocator — extends BasicResourceAllocator, same allocate() contract
Query arrives
-> getQueryCost() from PhysicalPlan (null-safe: defaults to 0.0)
-> calculateRequiredExecutors(cost)
-> scaleDelta = max(0, required - available)
-> if scaleDelta > 0:
scaleExecutors(scaleDelta) // adjust StatefulSet replicas
waitForExecutors(required, timeout) // poll every 2s until ready or timeout
on timeout: log warning, proceed anyway (graceful degradation)
-> delegate to super.allocate() // BasicResourceAllocator
Key behaviors:
- Scale-up only from the Java side —
calculateScaleDeltais clamped tomax(0, ...)so it never returns a negative delta - Graceful timeout — if executors don’t become ready within the configurable timeout (default 5 min), the query still runs with whatever executors are available, rather than failing
- Scale-down is not done imperatively — left to KEDA (Layer 2) or an operator
4. ResourcePlatformProvider — lazy, cached, Closeable
@Singleton
public class ResourcePlatformProvider implements Provider<ResourcePlatform>, Closeable {
// Double-checked locking (volatile + synchronized)
// Returns NoOpResourcePlatform.INSTANCE when elastic.enabled=false
// Creates K8sPlatform on first access when elastic.enabled=true
// Closes KubernetesClient on coordinator shutdown via close()
}
5. K8sPlatform — StatefulSet-based, Closeable, tier-aware
public class K8sPlatform implements ResourcePlatform, Closeable {
// Constructor: (KubernetesClient, namespace, statefulSetNameSmall, statefulSetNameLarge,
// executorSet, maxExecutorsSmall, maxExecutorsLarge)
// scaleExecutors(delta): scales the SMALL StatefulSet, caps at maxExecutorsSmall
// scaleExecutors(delta, tier): routes to SMALL or LARGE StatefulSet, caps at tier-specific max
// getReadyPodCount(): lists pods via fabric8, filters by Running + containers ready
// getAvailableExecutors(): delegates to ClusterCoordinator's executor ListenableSet
// waitForExecutors(): polls both pod readiness and Dremio registration every 2s
// close(): shuts down KubernetesClient
}
Two StatefulSet names are configurable via pod_template (small) and pod_template_large (large). When pod_template_large is empty, it defaults to pod_template + "-large". Each tier has its own max_executors cap — the operator sets these to match node capacity.
Configuration — all off by default
services.executor.elastic {
enabled: false
min_executors: 0
max_executors: 10 # cap for the small tier
max_executors_large: 10 # cap for the large tier
scale_timeout_minutes: 5
kubernetes {
namespace: "dremio"
pod_template: "dremio-executor"
pod_template_large: "" # defaults to pod_template + "-large"
}
small_query_threshold: 10000000
medium_query_threshold: 30000000
}
Note: image, zookeeper_address, and resource requests/limits are not in the Java config — the StatefulSet manifest (created by the operator) already bakes these into the pod spec and ConfigMap. Java only patches .spec.replicas.
Resource allocation
Resources are defined entirely in the StatefulSet YAML — Dremio Java only adjusts .spec.replicas; it has no knowledge of CPU, RAM, or storage values. The pod_template / pod_template_large config keys tell Java which StatefulSet name corresponds to which tier. Operators size each StatefulSet for their node type independently of the Java config.
Example sizing for t3a.xlarge (small) and i4i.2xlarge (large, NVMe-backed):
| Key | small (t3a.xlarge) | large (i4i.2xlarge) |
|---|---|---|
| CPU request / limit | 1500m / 3500m | 6 / 7500m |
| RAM request / limit | 5 Gi / 6 Gi | 48 Gi / 56 Gi |
| JVM heap (auto ~60%) | ~3.6 GB | ~34 GB |
| Ephemeral storage req / lim | 10 Gi / 20 Gi | 200 Gi / 1800 Gi |
max_executors / _large |
2 (one node, 2 pods) | 8 (CA adds nodes) |
On a single-node k3s dev cluster (24 CPU / 118 GB), max_executors: 2 (small) and max_executors_large: 8 (large) are the binding constraints — not node capacity.
Scale-down: how KEDA works with the metrics-exporter
Java handles scale-up (Layer 1) imperatively. Scale-down is handled entirely by the metrics-exporter and KEDA (Layer 2). Java never reduces StatefulSet replicas.
metrics-exporter contract
The metrics-exporter is a lightweight Flask sidecar that:
- Polls
/apiv2/jobsevery 15s for active Dremio jobs, classifying them as user (human + platform service accounts) or system ($dremio$,ACCELERATION,dremio.ops— internal reflection and monitoring queries) - Does not split by tier —
/apiv2/jobshas no plan cost field, so both tiers react identically toactive_user_jobs - Reads current StatefulSet replica counts via the Kubernetes API to track when executors come online
- Exposes
executor_desired_smallandexecutor_desired_largeas integers atGET /json - Does not interact with the coordinator liveness endpoint — Java never emits elastic-specific Micrometer gauges; the exporter relies entirely on job counts and StatefulSet state
Scale-down state machine (per tier)
HOLD ── active_user_jobs > 0 (or active_reflection_jobs > 0 for small)
│ → hold at current replicas; reset idle timer
│
│ (jobs finish)
▼
DRAIN ── no active jobs; idle < SCALE_DOWN_GRACE_SECS (default 120s)
│ → hold current replicas so in-flight fragments can finish
│
│ (idle >= grace)
▼
ZERO ── executor_desired = 0
→ KEDA deactivates ScaledObject; StatefulSet replicas → 0
Grace period anchor: The idle timer resets not just when jobs become active, but also when a StatefulSet transitions from 0 → N replicas. This ensures freshly-scaled executors (brought up by Java’s scaleExecutors()) always receive the full 120s grace window, even if the metrics-exporter process restarted while the StatefulSet was at 0 replicas.
The 120s in-app grace period ensures executor_desired stays ≥ 1 while fragments drain. Combined with KEDA’s 300s stabilization window (active mode), total drain time is ~420s. Once executor_desired stays at 0 for KEDA’s 300s cooldown period, KEDA issues KEDAScaleTargetDeactivated and the StatefulSet scales to 0 immediately.
Concrete scale-down timeline
T+0s Query finishes. metrics-exporter sees active_user_jobs: 0
T+0s _compute_desired_large: user_jobs=0, but grace period started
when StatefulSet went 0→N → "large tier idle for 0s/120s, holding at 2"
T+120s Grace period expires → executor_desired_large = 0
T+120s KEDA sees metric=0; enters inactive mode after cooldownPeriod
T+420s cooldownPeriod (300s) expires → KEDAScaleTargetDeactivated
T+420s StatefulSet replicas: 2 → 0
T+420s Pods enter terminationGracePeriodSeconds (150s), preStop sleep (120s)
T+540s In-flight fragments drain; pods terminate cleanly
Concrete example (tested on 3-node k3s)
- Two executor StatefulSets exist at
replicas: 0(dremio-executor-small,dremio-executor-large) - Query with estimated cost
19,442,773arrives (cost > 10M → LARGE tier) getTier(19.4M)→ LARGE;calculateRequiredExecutors(19.4M)→ 2getAvailableExecutors()→ 0;calculateScaleDelta(2, 0)→ 2scaleExecutors(+2, LARGE)→dremio-executor-largeStatefulSet scaled 0 → 2 replicas- metrics-exporter anchors the 120s grace period to when the StatefulSet came online
- Pods become ready (~60s); executors register with ZooKeeper;
waitForExecutors()returnstrue - Query executes on 2 large executors; metrics-exporter holds
executor_desired_large = 2 - Query finishes → 120s grace →
executor_desired_large = 0→ KEDA 300s cooldown → StatefulSet 2 → 0 - Pods drain in-flight fragments during
preStopsleep (120s), then terminate cleanly
Verified in production: metrics-exporter logs showed the complete cycle: scale-up via Java → hold during query → 120s grace period drain → scale to 0 via KEDA.
What is NOT in the PR
- No cloud-provider implementations (AWS, GCP, Azure) — the
ResourcePlatforminterface is extensible but the PR only shipsK8sPlatformandNoOpResourcePlatform - No scale-down logic in Java — scale-down is handled entirely by the metrics-exporter and KEDA ScaledObjects, which are reference operational configuration outside the codebase; the Java side only scales up
- No changes to the query planner, queue system, or WLM
- No pod resource management — all resource requests/limits are defined by the operator in the StatefulSet manifest; Java only patches
.spec.replicas.
Known gaps
- No unit test for K8sPlatform: The live-cluster integration test was removed because it requires a running K8s API server. A mock-based test (using fabric8’s mock server or a KubernetesClient interface) would be a good follow-up.
- Multi-node shared storage: Solved by using StatefulSets with a shared ReadWriteMany PVC for
paths.dist. The PVC is backed by Longhorn (or NFS/EFS in production), ensuring all executor pods across all nodes read and write the same distributed storage. For production,paths.distcan also be set to an object store URI (s3://,gs://,abfss://).
Files changed
services/resourcescheduler/src/main/java/com/dremio/resource/elastic/
ElasticAdmissionCalculator.java (new — includes ExecutorTier enum and getTier())
ElasticResourceAllocator.java (new — uses getTier() to route scale calls)
ResourcePlatform.java (new — 4 methods + default scaleExecutors(delta, tier))
K8sPlatform.java (new — tier-aware: StatefulSet-based, per-tier max caps)
NoOpResourcePlatform.java (new, singleton)
ResourcePlatformProvider.java (new — reads pod_template_large, max_executors_large)
services/resourcescheduler/src/test/java/com/dremio/resource/elastic/
ElasticAdmissionCalculatorTest.java (new, 10 tests)
common/legacy/src/main/java/com/dremio/config/DremioConfig.java (new K8s-only constants)
common/legacy/src/main/resources/dremio-reference.conf (new elastic config block, disabled by default)
dac/backend/src/main/java/com/dremio/dac/daemon/DACDaemonModule.java
(binds ResourcePlatformProvider; conditionally binds ElasticResourceAllocator or BasicResourceAllocator)
pom.xml (fabric8 6.0.0 in BOM)
services/resourcescheduler/pom.xml (version-less fabric8 + mockito deps)
docs/elastic-scaling-deployment.md (new, step-by-step K8s deployment guide)
Questions for the Dremio team
- OSS appetite — Is there interest in elastic scaling in OSS, or is this planned as an Enterprise-only feature?
- SPI depth — The
ResourcePlatforminterface is intentionally minimal (4 methods). Would you prefer a richer SPI (async, health-check hooks, multi-tier) before merging? - K8s client — Is
fabric8 6.0.0an acceptable dependency, or is there a preferred K8s client already in the tree? - Extension point — Any concerns about conditionally binding
ElasticResourceAllocatorvsBasicResourceAllocatorinDACDaemonModule? - Cloud alignment — The
ResourcePlatforminterface maps to Cloud’s engine concept (one implementation per compute tier). Does the team see value in a follow-upRoutingPolicySPI for rule-based routing (user, group, query type) that would bring OSS closer to Cloud’s WLM model?
Verification and findings
The elastic scaling implementation was verified on a 3-node k3s cluster. The key findings:
-
RBAC is critical: After migrating from Deployments to StatefulSets, the
dremio-elastic-rolerequired explicitstatefulsetsandstatefulsets/scalepermissions. The RBAC was not re-applied after the StatefulSet migration, causing silent failures (queries attempted scale-up but K8s API rejected the PATCH calls). -
StatefulSet scaling: Once RBAC was corrected, the coordinator successfully scaled the
dremio-executor-smallStatefulSet from 0 → 1 when a query arrived. KEDA correctly scaled it down after the 120s grace period (verified byregistered_executors: 1dropping to 0). -
Two-tier routing: Queries with
planCost <= 10Mroute todremio-executor-small, queries withplanCost > 10Mroute todremio-executor-large.
Image registry and repository structure
The Dremio OSS images are hosted on GitHub Container Registry (GHCR):
| Image | URI | Source |
|---|---|---|
| Dremio OSS (coordinator + executor) | ghcr.io/faenx/dremio-oss:2026.05.0 |
Single binary with role determined by ConfigMap |
| KEDA Metrics Exporter | ghcr.io/faenx/dremio-keda-exporter:2026.05.0 |
github.com/FAenX/dremio-keda-exporter |
The metrics-exporter source is maintained in a separate public repository to enable independent development and versioning. The KEDA scaler and the Java elastic scaling implementation (this PR’s Layer 1) are designed to work together but are versioned separately.
Happy to rebase against current master, add more tests, or restructure the SPI — looking forward to the discussion.
PR: (will add link after discussion settles)