Hi,
I am running Dremio OSS 25.0 in a Kubernetes deployment on EKS and wanted to test out the Kubernetes autoscaling feature for executors. The autoscaling configuration I used is below:
```yaml
nodeLifecycleService:
  enabled: true
  scalingMetrics:
    default:
      cpuAverageUtilization: 50
      enabled: true
  scalingBehavior:
    scaleDown:
      defaultPolicy:
        enabled: true
        value: 2
    scaleUp:
      userDefined:
        - type: Percent
          value: 100
          periodSeconds: 60
```
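For reference, my mental model is that these values render an HPA for the executor StatefulSet roughly like the one below. This is just my own reconstruction from the standard `autoscaling/v2` API — the object/target names and the mapping of `defaultPolicy.value` onto a `Pods` policy are my assumptions, not the chart's actual output:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dremio-executor-hpa    # assumed name, not taken from the chart
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: dremio-executor      # assumed target name
  minReplicas: 1
  maxReplicas: 3               # matches the 3 executors I observed; actual cap unknown
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # from scalingMetrics.default.cpuAverageUtilization
  behavior:
    scaleUp:
      policies:
        - type: Percent            # from scalingBehavior.scaleUp.userDefined
          value: 100
          periodSeconds: 60
    scaleDown:
      policies:
        - type: Pods               # my guess at how defaultPolicy.value: 2 maps through
          value: 2
          periodSeconds: 60
```

Please correct me if the chart maps these values differently.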
After deploying the Helm chart with these settings, I noticed the following:
- Initially my executor count was 1. As I started running a reflection refresh on a pretty large table, CPU spiked and the HPA scaled up my executors. But the new executors were not being used to execute the threads of the already-running reflection refresh. Does this mean that new executors won't share the load of existing (already running) jobs? My guess is that executors are assigned when a query is planned, so pods that join later are ignored, but I'd like confirmation.
- When I ran reflection refreshes on multiple tables, I saw the same behaviour; in fact, since only my initial executor was being used, the reflection refresh jobs failed due to memory limits.
- When the cluster had autoscaled up to 3 executors, I submitted another query. At that point one of the executors was scaled down, and the query failed with the error below. Does this mean downscaling will affect running queries and cause them to fail? If so, how can I overcome this? (The only workaround I could think of is sketched after this list.)

  ```
  ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremio-executor-dremioscale-0-3.dremio-cluster-pod-dremioscale-0.dremio-scaling.svc.cluster.local:0
  ```
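For the scale-down failures, the only mitigation I could come up with is slowing scale-down with the HPA's stabilization window, so executors linger after load drops instead of being removed while queries are still running. On the reconstructed HPA above, the `behavior` section would become something like the following — I don't know whether or how `nodeLifecycleService.scalingBehavior` exposes this key, so treat the placement as an assumption:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # wait 10 minutes of low load before removing an executor
    policies:
      - type: Pods
        value: 1                     # remove at most one executor per period
        periodSeconds: 120
```

But that only delays the problem; it doesn't drain running queries from an executor before it is terminated, so I'd still like to know the intended way to handle this.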
Am I missing anything that needs to be configured for autoscaling to work correctly? I hope someone from the Dremio team can help me out with these questions.