We ran stress tests on Dremio (15.7.0) on EKS.
Our cluster is an AWS EKS cluster, deployed via the Helm chart.
We connected CloudWatch to the cluster in order to monitor it and to design an HA, auto-scaled service.
In the middle of the stress test, the master pod reached 100% CPU and the service became unavailable.
I assume that if we can scale the coordinator pods we will be able to handle such stress. Am I right?
After the load subsides, the service becomes available again, but we see several strange behaviors:
Even when there is minimal usage of the service, the CPU usage of the master pod stays at 40% (per CloudWatch).
@motybz It depends on whether the stress test is causing slowness in planning or in metadata; if planning, then a scale-out coordinator would help. Take a look at the GC logs and see if the JVM is doing a full GC, and whether the crash generates a heap dump. If you see full GCs, then add histograms to the JVM options and send us the GC files, which will tell us the root cause of the crash during the stress test.
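For illustration, here is a minimal sketch of what those JVM options could look like in the Helm chart's values.yaml. This assumes the chart exposes an `extraStartParams` key under `coordinator` and that Dremio 15.x runs on a Java 8 JVM (the equivalent flags on Java 9+ use `-Xlog:gc*` instead); the log paths are placeholders:

```yaml
# Hypothetical values.yaml fragment: GC logging, class histograms around
# full GCs, and a heap dump on OOM for the coordinator JVM.
coordinator:
  extraStartParams: >-
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -Xloggc:/opt/dremio/data/gc.log
    -XX:+PrintClassHistogramBeforeFullGC
    -XX:+PrintClassHistogramAfterFullGC
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/opt/dremio/data
```

The before/after histograms are what make a full GC diagnosable: comparing the two shows which classes survive collection and are filling the heap.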
Hi @balaji.ramaswamy, thank you for your answer.
I don't have that data at the moment, but I tried your extra JVM options on staging and they work well.
Our use case is querying MongoDB (flattened VDSs) and Parquet files (petabytes) on S3.
Most of the time there is no stress on the system (10 queries per second), but under stress we see 10K queries per second (simple SELECT/FROM/WHERE queries without any aggregation).
I have a few more questions:
Can an autoscaled Dremio cluster handle such stress and stay resilient?
Do we need to keep any coordinator/executor/ZooKeeper ratio?
Can we use AWS Spot instances for executors and coordinators?
Can an autoscaled Dremio cluster handle such stress and stay resilient? - Even if you implement auto-scaling, queries that are already planned or running will not use the new executors; only queries waiting in the queue or new queries can benefit.
Do we need to keep any coordinator/executor/ZooKeeper ratio? - ZooKeeper needs to be external; on Kubernetes, the Helm charts create 3 external ZooKeeper pods. There is no specific coordinator-to-executor ratio, but if the coordinator shows increased command pool time (see the job profile), then it is time to add a scale-out coordinator. For executors, if you start seeing increased query run times, and the slowness is not in coordinator operations like metadata or planning, then check whether the executors were memory-, CPU-, or I/O-starved and take the right steps.
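As a sketch of where those knobs live, a hedged values.yaml fragment is below. The key names (`coordinator.count`, `executor.count`, `zookeeper.count`) are assumptions based on a typical Dremio Helm chart layout, and `coordinator.count` is taken to mean scale-out coordinators in addition to the master; check your chart's actual schema:

```yaml
# Hypothetical values.yaml fragment showing the sizing knobs discussed above.
coordinator:
  count: 1     # scale-out coordinators (master is created separately);
               # add more only when job profiles show command pool waits
executor:
  count: 12    # scale when query run time grows and the bottleneck is
               # executor memory/CPU/IO rather than planning or metadata
zookeeper:
  count: 3     # 3-node ensemble created by the chart
```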
Can we use AWS Spot instances for executors and coordinators? - Currently not supported.
We deployed a Dremio cluster with 3 coordinators, 1 master, and 12 executors.
When we ran stress tests, only the master and 1 coordinator (always the same coordinator) were planning queries, and the master and that specific coordinator reached 100% utilization (and queries were rejected).
How can we fix the load balancing across all the coordinators? We used this values.yml.
Is it a bug? Are you familiar with it?
@motybz Scale-out coordinators are only used for JDBC/ODBC queries, so if you were running your stress test through something other than a JDBC/ODBC tool, that would explain why the second and third coordinators were not used. Dremio needs scale-out coordinators only if you see high command pool wait times in the job profile. What was the reason behind having 3 coordinators?
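To illustrate that point: on Kubernetes, JDBC/ODBC traffic only spreads across the scale-out coordinators if it enters through a Service whose selector matches them. A minimal sketch follows; the Service name and the `app: dremio-coordinator` label are assumptions, not values taken from the chart, so check the labels your chart actually sets on coordinator pods:

```yaml
# Hypothetical Service spreading JDBC/ODBC clients across coordinator pods.
apiVersion: v1
kind: Service
metadata:
  name: dremio-client-jdbc   # assumed name
spec:
  type: LoadBalancer
  selector:
    app: dremio-coordinator  # assumed pod label; verify against your chart
  ports:
    - name: client
      port: 31010            # Dremio's JDBC/ODBC client port
      targetPort: 31010
```

Note that kube-proxy balances at connection setup and JDBC connections are long-lived, so a stress tool that opens a single connection (or a REST-based tool, which scale-out coordinators do not serve) will still pin one coordinator no matter how many exist.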