Memory and slowness after stress test

We ran stress tests against Dremio (15.7.0) on EKS.
Our cluster is an AWS EKS cluster, deployed from the Helm chart.
We connected CloudWatch to the cluster in order to monitor it and to design an HA, auto-scaled service.

In the middle of the stress test the Master pod reached 100% CPU and the service became unavailable.

  1. I assume that if we are able to scale out the Coordinator pods we will be able to handle such stress. Am I right?

After the load subsided, the service became available again, but we see several strange behaviors:

  1. Even when there is only minor usage of the service, the CPU usage of the Master pod is 40% (per CloudWatch).

  2. I also see Executors with high memory usage even though they have no query load.


    When I restart the pod (by deleting it), the memory usage returns to a reasonable level (1%).

  3. It seems that queries have become slower…

BTW, why do the metrics of the Master and the Coordinators in the Dremio monitor always show 0% (memory and CPU)?

@motybz It depends on whether the stress test is causing slowness due to planning or metadata; if planning, then a scale-out coordinator would help. Take a look at the GC logs and see whether the JVM is doing a Full GC and whether the crash generates a heap dump. If you see Full GCs, add class histograms to the JVM options and send us the GC files, which will tell us the root cause of the crash during the stress test:

DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:+PrintClassHistogramBeforeFullGC -XX:+PrintClassHistogramAfterFullGC -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:MaxGCPauseMillis=500 -XX:InitiatingHeapOccupancyPercent=25"
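
If you also want the GC activity captured in files you can share, the standard JDK 8 GC logging flags can be appended to the same variable. A minimal sketch, assuming JDK 8 and that /opt/dremio/data is a writable path on the pod (both assumptions; adjust the log path, and drop the logging flags if the start scripts already write a GC log):

DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:+PrintClassHistogramBeforeFullGC -XX:+PrintClassHistogramAfterFullGC -Xloggc:/opt/dremio/data/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:MaxGCPauseMillis=500 -XX:InitiatingHeapOccupancyPercent=25"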

Thank you, Balaji

The latest logs of the Master and executor are attached.

Can you help me understand them?
dremio_logs.zip (38.2 KB)

Another question: how can I mitigate stress on the Master node due to planning? Do you have any docs for that? Do reflections help here?

@motybz What date/time did the master or the executor go unresponsive?

Hi @balaji.ramaswamy, thank you for your answer.
I don’t have that date/time, but I tried your extra JVM options on staging and they work well.

Our use case is querying MongoDB (flattened VDSs) and Parquet files (petabytes) on S3.
Most of the time there is no stress on the system (10 queries per second), but at stress time we have 10K queries per second (SELECT … FROM … WHERE, without any aggregation).

I have a few more questions:

  1. Could an autoscaled Dremio cluster handle such stress and be resilient?
  2. Do we need to keep any Coordinator/Executor/ZooKeeper ratio?
  3. Can we use AWS Spot instances for Executors and Coordinators?

Your notes are welcome, thanks

  1. Could an autoscaled Dremio cluster handle such stress and be resilient? - Even if you implement auto-scaling, queries that are already planned or running will not use the new executors; only queries waiting in the queue or new queries can benefit.
  2. Do we need to keep any Coordinator/Executor/ZooKeeper ratio? - ZooKeeper needs to be external, or on Kubernetes the Helm charts will create 3 external ZooKeepers (see the values sketch after this list). There is no specific coordinator-to-executor ratio, but if the coordinator shows increased command pool time (check the job profile), then it is time to add a scale-out coordinator. For executors, if you start seeing increased query run times and the slowness is not in coordinator operations such as metadata or planning, then we check whether the executors were memory/CPU/IO starved and take the right steps.
  3. Can we use AWS Spot instances for Executors and Coordinators? - Currently not supported.
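
A minimal values.yaml sketch of the ZooKeeper piece mentioned in point 2, assuming the chart exposes a count field on the zookeeper block (an assumption based on the dremio-cloud-tools chart; the image and sizes are the ones already used later in this thread):

zookeeper:
  count: 3                             # the chart creates 3 external ZooKeeper pods
  image: .../import/kubernetes-zookeeper
  imageTag: 1.0-3.4.10
  volumeSize: 16Gi
  storageClass: gp2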

Hi @balaji.ramaswamy, thank you for your responses.

We deployed a Dremio cluster with 3 Coordinators, 1 Master, and 12 Executors.
When we ran stress tests, only the Master and 1 Coordinator (always the same Coordinator) were planning queries, even when the Master and that specific Coordinator reached 100% utilization (and queries were rejected).

How can we fix the load balancing across all the Coordinators? We used the values.yaml below.
Is it a bug? Are you familiar with it?

values.yaml content:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: dremio
  namespace: default
spec:
  interval: 5m
  timeout: '10m'
  chart:
    spec:
      chart: ./dremio
      sourceRef:
        kind: GitRepository
        name: cluster
        namespace: default
      interval: 1m
  values:
    image: .../import/dremio-oss
    imageTag: 15.7.0

    # Dremio Coordinator
    coordinator:
      # This count is used for slave coordinators only.
      # The total number of coordinators will always be count + 1.
      count: 3
      memory: 65536
      volumeSize: 256Gi
      storageClass: gp2
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT
      masterNodeSelector:
        eks.amazonaws.com/capacityType: ON_DEMAND
      extraStartParams: >-
        -XX:+PrintClassHistogramBeforeFullGC 
        -XX:+PrintClassHistogramAfterFullGC 
        -XX:+UseG1GC 
        -XX:G1HeapRegionSize=32M 
        -XX:MaxGCPauseMillis=500 
        -XX:InitiatingHeapOccupancyPercent=25
      
    executor:
      cpu: 4
      memory: 16000
      count: 12
      volumeSize: 256Gi
      storageClass: gp2

      cloudCache:
        storageClass: gp2
        volumes:
        - name: "dremio-c3-a"
          size: 256Gi
        - name: "dremio-c3-b"
          size: 256Gi
      
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT
      
      extraStartParams: >-
        -XX:+PrintClassHistogramBeforeFullGC 
        -XX:+PrintClassHistogramAfterFullGC 
        -XX:+UseG1GC 
        -XX:G1HeapRegionSize=32M 
        -XX:MaxGCPauseMillis=500 
        -XX:InitiatingHeapOccupancyPercent=25
        
    # Zookeeper
    zookeeper:
      # The Zookeeper image used in the cluster.
      #image: k8s.gcr.io/kubernetes-zookeeper
      image: .../import/kubernetes-zookeeper
      imageTag: 1.0-3.4.10
      volumeSize: 16Gi
      storageClass: gp2

    distStorage:
      type: "aws"
      aws:
        bucketName: "{name}"
        path: /staging/

    service:
      type: LoadBalancer
      sessionAffinity: ClientIP
      annotations:
        # matching aws-load-balancer-controller:v2.1.3
        service.beta.kubernetes.io/aws-load-balancer-internal: "true"
        service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
        #service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
        service.beta.kubernetes.io/aws-load-balancer-subnets: subnet-0ccaf8e114f38a99a,subnet-004c8172a6a8db363

    storageClass: gp2
    busybox:
      image: .../import/busybox
      imageTag: 1.33.

@motybz Scale-out coordinators are only used for JDBC/ODBC queries, so if you were driving your stress test with something other than a JDBC/ODBC tool, that would explain why the second and third coordinators were not used. Dremio needs scale-out coordinators only if you see high command pool wait times in the job profile. What was the reason behind having 3 coordinators?
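
For reference, a load test would need to submit queries over JDBC/ODBC for the scale-out coordinators to take part. A minimal sketch of a Dremio JDBC URL for such a test, assuming the chart's client service is named dremio-client in the default namespace and the client port is the default 31010 (both assumptions; adjust to your actual endpoint):

jdbc:dremio:direct=dremio-client.default.svc.cluster.local:31010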