RFC: Elastic executor auto-scaling for Kubernetes deployments

What this does

Elastic executor auto-scaling lets a Dremio coordinator add Kubernetes executor pods on demand when a query needs more compute. Scale-down is left to external tooling (KEDA or an operator), so Dremio never interrupts running queries by removing executors they depend on.

I’ve implemented and tested this on a 3-node k3s cluster (1 control-plane + 2 workers). CLA signed. Branch ready. 10 unit tests green for the elastic module. Live end-to-end test verified.


The problem

In a Kubernetes deployment today, you must pre-provision executors statically. A cluster serving a mix of dashboard queries (cheap, low concurrency) and heavy analytical workloads (expensive, need several executors) must either:

  • Over-provision permanently — paying for idle executors most of the time, or
  • Under-provision — and let heavy queries queue or degrade

There is no mechanism in OSS Dremio to say “spin up N executors when a large query arrives.”


Relationship to Dremio Cloud Engines

Dremio Cloud offers a rich engine model: named compute pools (2XSmall–3XLarge), per-engine replica autoscaling managed by the control plane, and SQL-based routing rules evaluated before query planning (by user, group, job type, time-of-day, query label, and query attributes). This PR is a lightweight analog for self-managed OSS Kubernetes deployments—not a competing design, but a practical subset of the same idea.

Dimension | Dremio Cloud | This PR
Tier selection | SQL routing rules, pre-planning | Plan cost threshold, post-planning
Scaling unit | Engine replica (multi-node group) | Individual executor pod
Scale-up | Async; query queues with Enqueued Time Limit | Synchronous; query blocks in waitForExecutors()
Scale-down | Control plane (Last Replica Auto-Stop, Drain Time Limit) | KEDA + metrics-exporter
Admin surface | Full UI + REST API | dremio.conf only
Multi-engine | Unlimited named engines per project | Two StatefulSets (small/large)
Concurrency cap | Per-replica, configurable | Pod-level resource limits only

The ResourcePlatform interface maps naturally to Cloud’s engine concept: each implementation corresponds to one compute tier. The current PR ships two tiers driven by plan cost. A natural follow-up would add a RoutingPolicy SPI for rule-based routing (user, group, query type) that more closely mirrors Cloud’s WLM—making OSS a closer peer to the Cloud engine model over time.


Scaling hierarchy

Elastic executor scaling on Kubernetes operates in three layers with a clear separation of concerns:

Layer 1 — Dremio Java (imperative scale-up, query-triggered)
  K8sPlatform.scaleExecutors(delta, tier) patches .spec.replicas on the
  appropriate executor StatefulSet when a query arrives and insufficient
  executors are available. The query blocks in waitForExecutors() until
  executors register with ZK or the timeout expires.
  Active when services.executor.elastic.enabled=true.
  NEVER scales down — only adjusts replicas upward.

Layer 2 — KEDA + metrics-exporter (reactive scale-down, metric-driven)
  A metrics-exporter sidecar polls Dremio's REST API for active jobs and
  computes executor_desired_small/large. KEDA reads these integers and
  scales StatefulSets down to 0 after a grace period (default 120s).
  This is the ONLY mechanism for scale-down — Java never reduces replicas.
  Can operate independently (Layer 1 disabled) or alongside it.

Layer 3 — Cluster Autoscaler (node-level, transparent)
  When K8s cannot schedule new executor pods on existing nodes (resource
  pressure), the Cluster Autoscaler provisions a new node automatically.
  Operates transparently below Layers 1 and 2.

Why Java can’t delegate scale-up to KEDA

The allocate() contract is synchronous: it must not return until resources are assigned or denied. If Java only set a metric and waited for KEDA to scale up, every query would incur 25+ seconds of polling latency before making any progress, and KEDA has no mechanism to unblock a waiting thread. The direct K8sPlatform.scaleExecutors() call starts pod creation immediately.

The upstream PR contributes Layer 1. Layers 2 and 3 are reference operational configuration (metrics-exporter, KEDA ScaledObjects, Cluster Autoscaler setup) documented in docs/elastic-scaling-deployment.md.


How it works

Six new classes in services/resourcescheduler, wired into DACDaemonModule via conditional registry.bind().

1. ElasticAdmissionCalculator — pure logic, no I/O, easily testable

// Query cost -> executor count decision (thresholds are config-driven)
public int calculateRequiredExecutors(double planCost) {
    if (planCost <= smallQueryThreshold)  return 1;   // <= 10M cost  -> 1 executor
    if (planCost <= mediumQueryThreshold) return 2;   // <= 30M cost  -> 2 executors
    return 3;                                          // > 30M cost   -> 3 executors
}

Both thresholds are configurable via services.executor.elastic.small_query_threshold and medium_query_threshold.
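
To make the threshold boundaries concrete, here is a minimal, self-contained sketch of the calculator. The class name AdmissionCalculatorSketch and its constructor are hypothetical stand-ins rather than the PR's actual code. Only the comparison logic and the 10M / 30M values come from the text above.

```java
// Hypothetical stand-in for ElasticAdmissionCalculator; constructor shape is an assumption.
final class AdmissionCalculatorSketch {
    private final double smallQueryThreshold;
    private final double mediumQueryThreshold;

    AdmissionCalculatorSketch(double smallQueryThreshold, double mediumQueryThreshold) {
        if (smallQueryThreshold > mediumQueryThreshold) {
            throw new IllegalArgumentException("small threshold must not exceed medium threshold");
        }
        this.smallQueryThreshold = smallQueryThreshold;
        this.mediumQueryThreshold = mediumQueryThreshold;
    }

    // Both comparisons are inclusive, so a cost exactly on a threshold stays in the lower band.
    int calculateRequiredExecutors(double planCost) {
        if (planCost <= smallQueryThreshold) return 1;
        if (planCost <= mediumQueryThreshold) return 2;
        return 3;
    }

    public static void main(String[] args) {
        AdmissionCalculatorSketch calc = new AdmissionCalculatorSketch(10_000_000d, 30_000_000d);
        System.out.println(calc.calculateRequiredExecutors(19_442_773d)); // prints 2
    }
}
```

A plan cost of exactly 10,000,000 maps to 1 executor and exactly 30,000,000 maps to 2, because both comparisons use <=.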

2. ResourcePlatform — a minimal platform-agnostic interface (4 methods)

public interface ResourcePlatform {
    int getReadyPodCount();
    int getAvailableExecutors();
    boolean waitForExecutors(int required, long timeout, TimeUnit unit) throws InterruptedException;
    boolean scaleExecutors(int delta);  // +N = scale up, -N = scale down (the allocator never passes a negative delta)
    // tier-aware overload — default delegates to scaleExecutors(delta) for backward compat
    default boolean scaleExecutors(int delta, ExecutorTier tier) { return scaleExecutors(delta); }
}

Today’s K8sPlatform implementation calls fabric8 to adjust .spec.replicas on a pre-existing StatefulSet. It does not create StatefulSets or ConfigMaps — the operator provisions those once, the Java code only touches the replica count.

3. ElasticResourceAllocator — extends BasicResourceAllocator, same allocate() contract

Query arrives
  -> getQueryCost() from PhysicalPlan (null-safe: defaults to 0.0)
  -> calculateRequiredExecutors(cost)
  -> scaleDelta = max(0, required - available)
  -> if scaleDelta > 0:
        scaleExecutors(scaleDelta)          // adjust StatefulSet replicas
        waitForExecutors(required, timeout) // poll every 2s until ready or timeout
        on timeout: log warning, proceed anyway (graceful degradation)
  -> delegate to super.allocate()            // BasicResourceAllocator

Key behaviors:

  • Scale-up only from the Java side — calculateScaleDelta is clamped to max(0, ...) so it never returns a negative delta
  • Graceful timeout — if executors don’t become ready within the configurable timeout (default 5 min), the query still runs with whatever executors are available, rather than failing
  • Scale-down is not done imperatively — left to KEDA (Layer 2) or an operator
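
The flow and key behaviors above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's ElasticResourceAllocator: ScalePlatformSketch is a cut-down stand-in for ResourcePlatform, InstantPlatformSketch is an in-memory fake, and prepareExecutors is a hypothetical helper covering only the steps before super.allocate().

```java
import java.util.concurrent.TimeUnit;

// Cut-down stand-in for the platform calls used by the flow (not the PR's interface).
interface ScalePlatformSketch {
    int getAvailableExecutors();
    boolean scaleExecutors(int delta);
    boolean waitForExecutors(int required, long timeout, TimeUnit unit) throws InterruptedException;
}

final class ElasticFlowSketch {
    // Returns the number of executors the query will proceed with.
    static int prepareExecutors(ScalePlatformSketch platform, int required,
                                long timeout, TimeUnit unit) throws InterruptedException {
        int available = platform.getAvailableExecutors();
        int scaleDelta = Math.max(0, required - available); // scale-up only, never negative
        if (scaleDelta > 0) {
            platform.scaleExecutors(scaleDelta);
            boolean ready = platform.waitForExecutors(required, timeout, unit);
            if (!ready) {
                // Graceful degradation: warn and continue with whatever is available.
                System.err.println("WARN: timed out waiting for " + required + " executors; proceeding");
            }
        }
        // The real allocator would delegate to super.allocate() at this point.
        return platform.getAvailableExecutors();
    }
}

// In-memory fake: requested executors become available immediately.
final class InstantPlatformSketch implements ScalePlatformSketch {
    int available;
    int scaleCalls;
    InstantPlatformSketch(int available) { this.available = available; }
    public int getAvailableExecutors() { return available; }
    public boolean scaleExecutors(int delta) { scaleCalls++; available += delta; return true; }
    public boolean waitForExecutors(int required, long timeout, TimeUnit unit) {
        return available >= required;
    }
}
```

When enough executors are already registered, scaleDelta is 0 and no platform call is made, so the elastic path adds no latency to queries that fit the current cluster.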

4. ResourcePlatformProvider — lazy, cached, Closeable

@Singleton
public class ResourcePlatformProvider implements Provider<ResourcePlatform>, Closeable {
    // Double-checked locking (volatile + synchronized)
    // Returns NoOpResourcePlatform.INSTANCE when elastic.enabled=false
    // Creates K8sPlatform on first access when elastic.enabled=true
    // Closes KubernetesClient on coordinator shutdown via close()
}
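
For readers unfamiliar with the pattern named in the comments, this is a generic sketch of a lazy, cached, closeable holder using double-checked locking. It is not the PR's ResourcePlatformProvider: java.util.function.Supplier stands in for the DI Provider type, and the factory argument and the class name LazyPlatformHolderSketch are assumptions made to keep the example self-contained.

```java
import java.io.Closeable;
import java.util.function.Supplier;

// Hypothetical holder: creates the value on first get(), caches it, closes it on close().
final class LazyPlatformHolderSketch<T extends AutoCloseable> implements Supplier<T>, Closeable {
    private final Supplier<T> factory;
    private volatile T instance; // volatile is required for safe publication

    LazyPlatformHolderSketch(Supplier<T> factory) {
        this.factory = factory;
    }

    @Override
    public T get() {
        T local = instance;           // first check, without the lock
        if (local == null) {
            synchronized (this) {
                local = instance;     // second check, under the lock
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }

    @Override
    public void close() {
        T local;
        synchronized (this) {
            local = instance;
            instance = null;
        }
        if (local != null) {
            try {
                local.close();
            } catch (Exception e) {
                throw new IllegalStateException("failed to close platform", e);
            }
        }
    }
}
```

The local variable avoids a second volatile read on the fast path. Note that calling get() after close() would create a fresh instance in this sketch, and whether the PR guards against that is not stated.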

5. K8sPlatform — StatefulSet-based, Closeable, tier-aware

public class K8sPlatform implements ResourcePlatform, Closeable {
    // Constructor: (KubernetesClient, namespace, statefulSetNameSmall, statefulSetNameLarge,
    //              executorSet, maxExecutorsSmall, maxExecutorsLarge)
    // scaleExecutors(delta): scales the SMALL StatefulSet, caps at maxExecutorsSmall
    // scaleExecutors(delta, tier): routes to SMALL or LARGE StatefulSet, caps at tier-specific max
    // getReadyPodCount(): lists pods via fabric8, filters by Running + containers ready
    // getAvailableExecutors(): delegates to ClusterCoordinator's executor ListenableSet
    // waitForExecutors(): polls both pod readiness and Dremio registration every 2s
    // close(): shuts down KubernetesClient
}

Two StatefulSet names are configurable via pod_template (small) and pod_template_large (large). When pod_template_large is empty, it defaults to pod_template + "-large". Each tier has its own max_executors cap — the operator sets these to match node capacity.
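
The naming default and per-tier cap described above might look like this in isolation. TierConfigSketch and both method names are hypothetical. Treating a null or whitespace-only pod_template_large as empty, and ignoring negative deltas, are assumptions of this sketch.

```java
// Hypothetical helpers illustrating the large-tier name default and the per-tier cap.
final class TierConfigSketch {
    // Empty pod_template_large falls back to pod_template + "-large".
    static String resolveLargeName(String podTemplate, String podTemplateLarge) {
        if (podTemplateLarge == null || podTemplateLarge.trim().isEmpty()) {
            return podTemplate + "-large";
        }
        return podTemplateLarge;
    }

    // Target replica count after a scale-up request, capped at the tier's max_executors.
    static int cappedTarget(int currentReplicas, int delta, int maxExecutors) {
        return Math.min(maxExecutors, currentReplicas + Math.max(0, delta));
    }

    public static void main(String[] args) {
        System.out.println(resolveLargeName("dremio-executor", "")); // prints dremio-executor-large
        System.out.println(cappedTarget(1, 3, 2));                   // prints 2
    }
}
```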

Configuration — all off by default

services.executor.elastic {
  enabled: false
  min_executors: 0
  max_executors: 10          # cap for the small tier
  max_executors_large: 10    # cap for the large tier
  scale_timeout_minutes: 5
  kubernetes {
    namespace: "dremio"
    pod_template: "dremio-executor"
    pod_template_large: ""    # defaults to pod_template + "-large"
  }
  small_query_threshold: 10000000
  medium_query_threshold: 30000000
}

Note: image, zookeeper_address, and resource requests/limits are not in the Java config — the StatefulSet manifest (created by the operator) already bakes these into the pod spec and ConfigMap. Java only patches .spec.replicas.


Resource allocation

Resources are defined entirely in the StatefulSet YAML — Dremio Java only adjusts .spec.replicas; it has no knowledge of CPU, RAM, or storage values. The pod_template / pod_template_large config keys tell Java which StatefulSet name corresponds to which tier. Operators size each StatefulSet for their node type independently of the Java config.

Example sizing for t3a.xlarge (small) and i4i.2xlarge (large, NVMe-backed):

Key | small (t3a.xlarge) | large (i4i.2xlarge)
CPU request / limit | 1500m / 3500m | 6000m / 7500m
RAM request / limit | 5 Gi / 6 Gi | 48 Gi / 56 Gi
JVM heap (auto ~60%) | ~3.6 GB | ~34 GB
Ephemeral storage request / limit | 10 Gi / 20 Gi | 200 Gi / 1800 Gi
max_executors / max_executors_large | 2 (one node, 2 pods) | 8 (CA adds nodes)

On a single-node k3s dev cluster (24 CPU / 118 GB), max_executors: 2 (small) and max_executors_large: 8 (large) are the binding constraints — not node capacity.


Scale-down: how KEDA works with the metrics-exporter

Java handles scale-up (Layer 1) imperatively. Scale-down is handled entirely by the metrics-exporter and KEDA (Layer 2). Java never reduces StatefulSet replicas.

metrics-exporter contract

The metrics-exporter is a lightweight Flask sidecar that:

  1. Polls /apiv2/jobs every 15s for active Dremio jobs, classifying them as user (human + platform service accounts) or system ($dremio$, ACCELERATION, dremio.ops — internal reflection and monitoring queries)
  2. Does not split by tier: /apiv2/jobs has no plan cost field, so both tiers react identically to active_user_jobs
  3. Reads current StatefulSet replica counts via the Kubernetes API to track when executors come online
  4. Exposes executor_desired_small and executor_desired_large as integers at GET /json
  5. Does not interact with the coordinator liveness endpoint — Java never emits elastic-specific Micrometer gauges; the exporter relies entirely on job counts and StatefulSet state

Scale-down state machine (per tier)

 HOLD  ── active_user_jobs > 0 (or active_reflection_jobs > 0 for small)
  │       → hold at current replicas; reset idle timer
  │
  │  (jobs finish)
  ▼
 DRAIN ── no active jobs; idle < SCALE_DOWN_GRACE_SECS (default 120s)
  │       → hold current replicas so in-flight fragments can finish
  │
  │  (idle >= grace)
  ▼
 ZERO  ── executor_desired = 0
          → KEDA deactivates ScaledObject; StatefulSet replicas → 0

Grace period anchor: The idle timer resets not just when jobs become active, but also when a StatefulSet transitions from 0 → N replicas. This ensures freshly-scaled executors (brought up by Java’s scaleExecutors()) always receive the full 120s grace window, even if the metrics-exporter process restarted while the StatefulSet was at 0 replicas.
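
The HOLD / DRAIN / ZERO machine and the grace-period anchor can be expressed compactly. The exporter itself is a Python Flask service, so the Java below is only an illustrative rendition. The names TierIdleTrackerSketch, observe and desired are hypothetical, and timestamps are plain seconds supplied by the caller.

```java
enum ScaleDownState { HOLD, DRAIN, ZERO }

// Hypothetical per-tier tracker mirroring the state machine described above.
final class TierIdleTrackerSketch {
    private final long graceSecs;
    private long idleSinceSecs;  // anchor of the idle timer
    private int lastReplicas;    // starts at 0, so a restart at 0 replicas is handled

    TierIdleTrackerSketch(long graceSecs, long nowSecs) {
        this.graceSecs = graceSecs;
        this.idleSinceSecs = nowSecs;
    }

    // One polling step: returns the state of this tier at nowSecs.
    ScaleDownState observe(long nowSecs, int activeJobs, int currentReplicas) {
        boolean cameOnline = lastReplicas == 0 && currentReplicas > 0; // 0 -> N transition
        lastReplicas = currentReplicas;
        if (activeJobs > 0 || cameOnline) {
            idleSinceSecs = nowSecs; // reset the idle timer
        }
        if (activeJobs > 0) {
            return ScaleDownState.HOLD;
        }
        return (nowSecs - idleSinceSecs) < graceSecs ? ScaleDownState.DRAIN : ScaleDownState.ZERO;
    }

    // executor_desired for the tier: hold current replicas until the state is ZERO.
    static int desired(ScaleDownState state, int currentReplicas) {
        return state == ScaleDownState.ZERO ? 0 : currentReplicas;
    }
}
```

Because lastReplicas starts at 0, a tracker created while the StatefulSet is empty re-anchors the timer on the first poll that sees replicas, which is the restart case the paragraph above calls out.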

The 120s in-app grace period keeps executor_desired ≥ 1 while fragments drain. Combined with KEDA’s 300s stabilization window (active mode), total drain time is ~420s (120s + 300s). Once executor_desired has stayed at 0 for KEDA’s 300s cooldown period, KEDA issues KEDAScaleTargetDeactivated and the StatefulSet scales to 0 immediately.

Concrete scale-down timeline

T+0s     Query finishes. metrics-exporter sees active_user_jobs: 0
T+0s     _compute_desired_large: user_jobs=0, but grace period started
         when StatefulSet went 0→N → "large tier idle for 0s/120s, holding at 2"
T+120s   Grace period expires → executor_desired_large = 0
T+120s   KEDA sees metric=0; enters inactive mode after cooldownPeriod
T+420s   cooldownPeriod (300s) expires → KEDAScaleTargetDeactivated
T+420s   StatefulSet replicas: 2 → 0
T+420s   Pods enter terminationGracePeriodSeconds (150s), preStop sleep (120s)
T+540s   In-flight fragments drain; pods terminate cleanly

Concrete example (tested on 3-node k3s)

  1. Two executor StatefulSets exist at replicas: 0 (dremio-executor-small, dremio-executor-large)
  2. Query with estimated cost 19,442,773 arrives (cost > 10M → LARGE tier)
  3. getTier(19.4M) → LARGE; calculateRequiredExecutors(19.4M) → 2
  4. getAvailableExecutors() → 0; calculateScaleDelta(2, 0) → 2
  5. scaleExecutors(+2, LARGE) → dremio-executor-large StatefulSet scaled 0 → 2 replicas
  6. metrics-exporter anchors the 120s grace period to when the StatefulSet came online
  7. Pods become ready (~60s); executors register with ZooKeeper; waitForExecutors() returns true
  8. Query executes on 2 large executors; metrics-exporter holds executor_desired_large = 2
  9. Query finishes → 120s grace → executor_desired_large = 0 → KEDA 300s cooldown → StatefulSet 2 → 0
  10. Pods drain in-flight fragments during preStop sleep (120s), then terminate cleanly

Verified in production: the metrics-exporter logs showed the complete cycle (scale-up via Java → hold during query → 120s grace period drain → scale to 0 via KEDA).


What is NOT in the PR

  • No cloud-provider implementations (AWS, GCP, Azure) — the ResourcePlatform interface is extensible but the PR only ships K8sPlatform and NoOpResourcePlatform
  • No scale-down logic in Java — scale-down is handled entirely by the metrics-exporter and KEDA ScaledObjects, which are reference operational configuration outside the codebase; the Java side only scales up
  • No changes to the query planner, queue system, or WLM
  • No pod resource management — all resource requests/limits are defined by the operator in the StatefulSet manifest; Java only patches .spec.replicas.

Known gaps

  • No unit test for K8sPlatform: The live-cluster integration test was removed because it requires a running K8s API server. A mock-based test (using fabric8’s mock server or a KubernetesClient interface) would be a good follow-up.
  • Multi-node shared storage: Solved by using StatefulSets with a shared ReadWriteMany PVC for paths.dist. The PVC is backed by Longhorn (or NFS/EFS in production), ensuring all executor pods across all nodes read and write the same distributed storage. For production, paths.dist can also be set to an object store URI (s3://, gs://, abfss://).

Files changed

services/resourcescheduler/src/main/java/com/dremio/resource/elastic/
  ElasticAdmissionCalculator.java      (new — includes ExecutorTier enum and getTier())
  ElasticResourceAllocator.java        (new — uses getTier() to route scale calls)
  ResourcePlatform.java                (new — 4 methods + default scaleExecutors(delta, tier))
  K8sPlatform.java                     (new — tier-aware: StatefulSet-based, per-tier max caps)
  NoOpResourcePlatform.java            (new, singleton)
  ResourcePlatformProvider.java        (new — reads pod_template_large, max_executors_large)

services/resourcescheduler/src/test/java/com/dremio/resource/elastic/
  ElasticAdmissionCalculatorTest.java (new, 10 tests)

common/legacy/src/main/java/com/dremio/config/DremioConfig.java   (new K8s-only constants)
common/legacy/src/main/resources/dremio-reference.conf             (new elastic config block, disabled by default)

dac/backend/src/main/java/com/dremio/dac/daemon/DACDaemonModule.java
  (binds ResourcePlatformProvider; conditionally binds ElasticResourceAllocator or BasicResourceAllocator)

pom.xml                               (fabric8 6.0.0 in BOM)
services/resourcescheduler/pom.xml    (version-less fabric8 + mockito deps)
docs/elastic-scaling-deployment.md    (new, step-by-step K8s deployment guide)


Questions for the Dremio team

  1. OSS appetite — Is there interest in elastic scaling in OSS, or is this planned as an Enterprise-only feature?
  2. SPI depth — The ResourcePlatform interface is intentionally minimal (4 methods). Would you prefer a richer SPI (async, health-check hooks, multi-tier) before merging?
  3. K8s client — Is fabric8 6.0.0 an acceptable dependency, or is there a preferred K8s client already in the tree?
  4. Extension point — Any concerns about conditionally binding ElasticResourceAllocator vs BasicResourceAllocator in DACDaemonModule?
  5. Cloud alignment — The ResourcePlatform interface maps to Cloud’s engine concept (one implementation per compute tier). Does the team see value in a follow-up RoutingPolicy SPI for rule-based routing (user, group, query type) that would bring OSS closer to Cloud’s WLM model?

Verification and findings

The elastic scaling implementation was verified on a 3-node k3s cluster. The key findings:

  • RBAC is critical: After migrating from Deployments to StatefulSets, the dremio-elastic-role required explicit statefulsets and statefulsets/scale permissions. The RBAC was not re-applied after the StatefulSet migration, causing silent failures (queries attempted scale-up but K8s API rejected the PATCH calls).

  • StatefulSet scaling: Once RBAC was corrected, the coordinator successfully scaled the dremio-executor-small StatefulSet from 0 → 1 when a query arrived. KEDA scaled it back down once the 120s grace period and KEDA’s cooldown had elapsed (verified by registered_executors dropping from 1 to 0).

  • Two-tier routing: Queries with planCost <= 10M route to dremio-executor-small, queries with planCost > 10M route to dremio-executor-large.


Image registry and repository structure

The Dremio OSS images are hosted on GitHub Container Registry (GHCR):

Image | URI | Source
Dremio OSS (coordinator + executor) | ghcr.io/faenx/dremio-oss:2026.05.0 | Single binary with role determined by ConfigMap
KEDA Metrics Exporter | ghcr.io/faenx/dremio-keda-exporter:2026.05.0 | github.com/FAenX/dremio-keda-exporter

The metrics-exporter source is maintained in a separate public repository to enable independent development and versioning. The KEDA scaler and the Java elastic scaling implementation (this PR’s Layer 1) are designed to work together but are versioned separately.


Happy to rebase against current master, add more tests, or restructure the SPI — looking forward to the discussion.

PR: (will add link after discussion settles)

RFC: Elastic executor auto-scaling for Kubernetes deployments

What this does

Elastic executor auto-scaling enables a Dremio coordinator to dynamically provision Kubernetes executor pods when queries require more compute. The coordinator signals desired replica counts via Kubernetes annotations; KEDA is the sole authority on StatefulSet spec.replicas. This eliminates the race condition where coordinator and KEDA fight over the replica count.

Implemented and verified in production (May 2026). Two tiers: SMALL (interactive) and LARGE (analytics/ETL).


The problem

In a Kubernetes deployment today, you must pre-provision executors statically. A cluster serving both lightweight dashboard queries and heavy analytical workloads must either:

  • Over-provision permanently — paying for idle executors, or
  • Under-provision — and let heavy queries queue or degrade

Dremio OSS had no mechanism to say “spin up N executors when a large query arrives.”


Relationship to Dremio Cloud Engines

Dremio Cloud offers engine-based compute pools with autoscaling managed by the control plane. This RFC provides an analogous capability for self-managed OSS Kubernetes deployments:

Dimension | Dremio Cloud | This RFC
Tier selection | SQL routing rules, pre-planning | routingQueue contains “large” → LARGE; else plan cost fallback
Scaling unit | Engine replica (multi-node group) | Individual executor pod
Scale-up | Async; query queues | Synchronous; query blocks in waitForExecutors()
Scale-down | Control plane | KEDA + metrics-exporter (30min grace window)
Admin surface | Full UI + REST API | dremio.conf only

Design: KEDA is the Sole Authority

Three-Layer Architecture

Layer 1 — Coordinator (annotates intent)
  ElasticResourceAllocator: determine tier → calculate scaleDelta
  K8sPlatform.scaleExecutors(delta, tier): writes annotations ONLY
    - dremio.io/scale-requested-at = <epoch-ms>
    - dremio.io/scale-requested-count = <newReplicas>
  NEVER writes spec.replicas — only signals via annotations.

Layer 2 — Exporter + KEDA (acts on intent)
  Exporter polls: annotations + ZK endpoints + job history
  Computes executor_desired_small/large from 3 signals (30min grace each)
  KEDA reads metrics → sets spec.replicas on each poll cycle (every 10s)

Layer 3 — Cluster Autoscaler (node-level)
  Transparent provisioning when K8s can't schedule pods

Why This Design

  1. Avoids race condition: Coordinator writes annotations; KEDA reads metrics and sets spec.replicas. No dual-write conflict.

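  2. Synchronous contract: The allocate() contract is synchronous: it must not return until resources are assigned or denied. Writing annotations and then polling ZK via waitForExecutors() honors this contract, because the call returns only once executors have registered or the timeout expires. Pod creation starts as soon as the exporter and KEDA pick up the annotation on their next poll (KEDA polls every 10s).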

  3. Clean failure handling: If executors don’t become available within timeout (5 min), throws ResourceUnavailableException. Query is cancelled cleanly, not degraded to wrong tier.


How it works

1. Tier Detection

Primary signal: routingQueue (from session/ALTER SESSION). If queue name contains “large” (case-insensitive), queries route to LARGE tier regardless of plan cost.

Fallback: Plan cost threshold (≤10M → SMALL, >10M → LARGE). Note: Dremio’s planner often reports planCost = 1.0, so routingQueue is the reliable signal.

public ExecutorTier getTier(double planCost, String routingQueue) {
    if (routingQueue != null && routingQueue.toLowerCase().contains("large")) {
        return ExecutorTier.LARGE;
    }
    return planCost <= smallQueryThreshold ? ExecutorTier.SMALL : ExecutorTier.LARGE;
}
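
A runnable sketch of the same routing rule, useful for checking edge cases such as a null queue or mixed-case queue names. TierSketch and TierRouterSketch are hypothetical names. The one deliberate difference from the snippet above is toLowerCase(Locale.ROOT), which keeps the match independent of the JVM's default locale.

```java
import java.util.Locale;

enum TierSketch { SMALL, LARGE }

// Hypothetical stand-in for the getTier() logic shown above.
final class TierRouterSketch {
    private final double smallQueryThreshold;

    TierRouterSketch(double smallQueryThreshold) {
        this.smallQueryThreshold = smallQueryThreshold;
    }

    TierSketch getTier(double planCost, String routingQueue) {
        // Primary signal: queue name containing "large", case-insensitive.
        if (routingQueue != null && routingQueue.toLowerCase(Locale.ROOT).contains("large")) {
            return TierSketch.LARGE;
        }
        // Fallback: plan cost against the small-query threshold.
        return planCost <= smallQueryThreshold ? TierSketch.SMALL : TierSketch.LARGE;
    }

    public static void main(String[] args) {
        TierRouterSketch router = new TierRouterSketch(10_000_000d);
        System.out.println(router.getTier(1.0, "query.large")); // prints LARGE
        System.out.println(router.getTier(1.0, null));          // prints SMALL
    }
}
```

Since the match is a substring test, any queue whose name happens to contain "large" (for example a hypothetical "enlarged-reports") would also route to the LARGE tier.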

2. Scale-Up Flow

Query arrives with routingQueue="query.large"
    │
    ▼
ElasticAdmissionCalculator.getTier(planCost, routingQueue)
    │
    ├─► routingQueue contains "large" → tier = LARGE
    └─► planCost = 1.0 (ignored)
            │
            ▼
    ElasticResourceAllocator.allocate()
            │
            ├─► requiredExecutors = 3
            ├─► getAvailableExecutors(LARGE) = 0 (from ZooKeeper nodeTag filter)
            └─► scaleDelta = 3 - 0 = 3
                    │
                    ▼
            K8sPlatform.scaleExecutors(3, LARGE)
                    │
                    ├─► Write: dremio.io/scale-requested-at = <now>
                    ├─► Write: dremio.io/scale-requested-count = 3
                    └─► Return (NO spec.replicas write!)
                    │
                    ▼
            waitForExecutors(3, LARGE, 5min)
                    │
                    └─► Poll ZooKeeper endpoints filtered by nodeTag=large
                       until count >= 3 or timeout
                    │
                    ▼
            Query executes on LARGE executors
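
As an illustration of the annotation-only write in the flow above, the sketch below builds the two annotations and renders them as a JSON merge patch body. Only the annotation keys come from this RFC. The class ScaleAnnotationsSketch, its methods, and the choice of a merge patch are assumptions, and the actual K8sPlatform call through fabric8 is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: produces the annotation payload without touching spec.replicas.
final class ScaleAnnotationsSketch {
    static final String REQUESTED_AT = "dremio.io/scale-requested-at";
    static final String REQUESTED_COUNT = "dremio.io/scale-requested-count";

    static Map<String, String> build(long epochMillis, int newReplicas) {
        if (newReplicas < 0) {
            throw new IllegalArgumentException("newReplicas must be >= 0");
        }
        Map<String, String> annotations = new LinkedHashMap<>();
        annotations.put(REQUESTED_AT, Long.toString(epochMillis));
        annotations.put(REQUESTED_COUNT, Integer.toString(newReplicas));
        return annotations;
    }

    // Renders a merge patch that touches only metadata.annotations.
    // Keys and values here are digits and fixed strings, so no JSON escaping is needed.
    static String toMergePatch(Map<String, String> annotations) {
        StringBuilder sb = new StringBuilder("{\"metadata\":{\"annotations\":{");
        String sep = "";
        for (Map.Entry<String, String> e : annotations.entrySet()) {
            sb.append(sep).append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            sep = ",";
        }
        return sb.append("}}}").toString();
    }
}
```

Annotation values are strings in Kubernetes, which is why both the timestamp and the count are serialized with toString().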

3. Scale-Down Flow (30-minute grace window)

Three independent signals, each with 1800s grace:

Signal | Source | Behavior
Job history | /apiv2/jobs | endTime within 1800s → reset timer
Ready-pod window | K8s StatefulSet | readyReplicas > 0 within 1800s → reset timer
Annotation | K8s StatefulSet | scale-requested-at within 1800s → reset timer

When all signals expire: exporter returns desired = 0, KEDA sets spec.replicas = 0, HPA stabilization window (1800s) provides final buffer.
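
One way to read the three-signal rule is sketched below in Java (the exporter itself is Python). The RFC does not spell out how the desired count is chosen while only some signals are live. This sketch assumes the annotation's requested count wins while the annotation is fresh, and that the current replica count is held while either of the other two signals is fresh. ThreeSignalGraceSketch and its method names are hypothetical.

```java
// Hypothetical rendition of the three-signal grace logic; timestamps are in seconds.
final class ThreeSignalGraceSketch {
    static final long GRACE_SECS = 1800;

    // A signal is live while its last-seen time is within the grace window.
    // A lastSeenSecs of -1 means the signal was never observed.
    static boolean live(long nowSecs, long lastSeenSecs) {
        return lastSeenSecs >= 0 && nowSecs - lastSeenSecs < GRACE_SECS;
    }

    static int desired(long nowSecs, long lastJobEndSecs, long lastReadySecs,
                       long lastRequestedAtSecs, int requestedCount, int currentReplicas) {
        if (live(nowSecs, lastRequestedAtSecs)) {
            return requestedCount;      // assumption: the annotation carries the desired count
        }
        if (live(nowSecs, lastJobEndSecs) || live(nowSecs, lastReadySecs)) {
            return currentReplicas;     // assumption: hold what is already running
        }
        return 0;                       // all three signals expired
    }
}
```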


Component Details

ElasticAdmissionCalculator

  • calculateRequiredExecutors(planCost): 1 (≤10M), 2 (10M-30M), 3 (>30M)
  • getTier(planCost, routingQueue): routingQueue primary, cost fallback

ElasticResourceAllocator

  • allocate(): determines tier, checks ZK-available executors, writes annotations, waits for ZK registration, delegates to BasicResourceAllocator
  • On timeout: throws ResourceUnavailableException (query cancelled cleanly)

K8sPlatform

  • Never writes spec.replicas — only annotations
  • Never reads K8s readyReplicas — reads ZooKeeper endpoints filtered by nodeTag
  • Avoids K8s/ZK registration race where pod is K8s-ready but not yet ZK-registered
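
The ZooKeeper-based wait can be pictured as a deadline-bounded poll over tagged endpoints. In this sketch the registry is a plain Supplier of tag strings standing in for the ClusterCoordinator's endpoint set, and TaggedWaitSketch, countTagged and the pollMillis parameter are hypothetical. Returning false on interruption, after restoring the interrupt flag, is an assumption chosen to match the interface's lack of a throws clause.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical poll loop over registered executor tags.
final class TaggedWaitSketch {
    // Counts registered endpoints whose tag matches, e.g. "small" or "large".
    static int countTagged(List<String> endpointTags, String nodeTag) {
        int n = 0;
        for (String tag : endpointTags) {
            if (nodeTag.equals(tag)) {
                n++;
            }
        }
        return n;
    }

    // Polls until enough tagged executors are registered or the timeout elapses.
    static boolean waitForExecutors(Supplier<List<String>> registry, String nodeTag, int required,
                                    long timeout, TimeUnit unit, long pollMillis) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            if (countTagged(registry.get(), nodeTag) >= required) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            long sleepMillis = Math.min(pollMillis,
                    Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

A caller following the RFC would translate a false return into ResourceUnavailableException so the query is cancelled rather than run on the wrong tier.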

ResourcePlatform interface

public interface ResourcePlatform extends AutoCloseable {
    int getAvailableExecutors();
    int getAvailableExecutors(ExecutorTier tier);  // ZK nodeTag filter
    boolean waitForExecutors(int required, long timeout, TimeUnit unit);
    boolean waitForExecutors(int required, ExecutorTier tier, long timeout, TimeUnit unit);
    boolean scaleExecutors(int delta);
    boolean scaleExecutors(int delta, ExecutorTier tier);  // annotation-only
}

ExecutorSelectionServiceImpl

  • Routes queries to executors matching nodeTag (“small” or “large”)
  • If no tagged executors found: falls back to all available endpoints with a warning
  • Does NOT throw RuntimeException
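
The tag-matching fallback described above amounts to a filter with a non-fatal default. The following is a sketch only: TaggedEndpointSketch and TagSelectionSketch are hypothetical types, and the real ExecutorSelectionServiceImpl works with Dremio's own endpoint objects and logger rather than System.err.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical endpoint record: an address plus the executor's nodeTag.
final class TaggedEndpointSketch {
    final String address;
    final String nodeTag;

    TaggedEndpointSketch(String address, String nodeTag) {
        this.address = address;
        this.nodeTag = nodeTag;
    }
}

final class TagSelectionSketch {
    // Returns endpoints matching nodeTag, or every endpoint (with a warning) when none match.
    static List<TaggedEndpointSketch> select(List<TaggedEndpointSketch> all, String nodeTag) {
        List<TaggedEndpointSketch> matched = new ArrayList<>();
        for (TaggedEndpointSketch endpoint : all) {
            if (nodeTag.equals(endpoint.nodeTag)) {
                matched.add(endpoint);
            }
        }
        if (matched.isEmpty()) {
            System.err.println("WARN: no executors tagged '" + nodeTag
                    + "'; falling back to all " + all.size() + " endpoints");
            return new ArrayList<>(all);
        }
        return matched;
    }
}
```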

Metrics Exporter

Python Flask service that:

  1. Polls /apiv2/jobs every 15s for recent jobs
  2. Reads StatefulSet annotations (scale-requested-at, scale-requested-count)
  3. Tracks readyReplicas timestamps per tier
  4. Computes executor_desired_small and executor_desired_large

Three-signal grace logic (1800s each):

  • Signal 1: job history (endTime within grace)
  • Signal 2: ready-pod window (readyReplicas > 0 within grace)
  • Signal 3: annotation (scale-requested-at within grace) — also carries desired count

Configuration

services.executor.elastic {
  enabled: true
  min_executors: 0
  max_executors: 2        // SMALL tier
  max_executors_large: 8  // LARGE tier
  scale_timeout_minutes: 5
  kubernetes {
    namespace: "dremio"
    pod_template: "dremio-executor-small"
    pod_template_large: "dremio-executor-large"
  }
  small_query_threshold: 10000000
  medium_query_threshold: 30000000
}

Exporter env vars:

  • SCALE_DOWN_GRACE_SECS: 1800 (matches KEDA cooldown)
  • DREMIO_URL: coordinator service
  • DREMIO_USERNAME / DREMIO_PASSWORD: for REST API

Verified Behavior (May 2026)

Scale-Up Timeline

  1. Query submitted with LARGE queue
  2. Coordinator writes annotations: scale-requested-at, scale-requested-count=3
  3. Exporter picks up annotation → desired_large = 3
  4. KEDA polls every 10s → sets spec.replicas = 3
  5. Pods start, register with ZooKeeper with nodeTag=large
  6. waitForExecutors() returns true → query runs
  7. Query completes successfully (state: COMPLETED)

Scale-Down Timeline

  1. Query completes, no activity for 30+ minutes
  2. Signal 3 (annotation) expires first
  3. Signal 2 (ready-pod window) takes over for ~30 min
  4. Both expire → exporter logs “scaling to 0”
  5. KEDA cooldown (30 min) + HPA stabilization (30 min)
  6. Pods terminate cleanly via preStop hook

Total cold-start-to-zero: ~45-60 minutes depending on signal timing


Image Registry

Service | Image | Tag
Coordinator/Executor | ghcr.io/rusha-corp/dremio-oss | 2026.05.7
Metrics Exporter | ghcr.io/rusha-corp/dremio-keda-exporter | 2026.05.6

Files Changed

services/resourcescheduler/src/main/java/com/dremio/resource/elastic/
  ElasticAdmissionCalculator.java
  ElasticResourceAllocator.java
  ResourcePlatform.java (interface, AutoCloseable)
  K8sPlatform.java (annotation-only, ZK-based wait)
  NoOpResourcePlatform.java
  ResourcePlatformProvider.java

services/execselector/src/main/java/com/dremio/service/execselector/
  ExecutorSelectionServiceImpl.java (fallback, no RuntimeException)

docs/elastic-scaling-deployment.md
k8s/10-keda-scaledobject.yaml
k8s/04-coordinator.yaml
k8s/11a-executor-small-stub.yaml
k8s/11b-executor-large-stub.yaml

Questions for the Dremio team

  1. OSS appetite — Is there interest in elastic scaling in OSS, or is this planned as an Enterprise-only feature?
  2. Tier routing — Should we expose a RoutingPolicy SPI for rule-based routing (user, group, query type)?
  3. Extension point — Any concerns about conditionally binding ElasticResourceAllocator vs BasicResourceAllocator in DACDaemonModule?

Verified in production on 3-node k3s cluster (May 2026).