Random Hung Query Responses on Nginx-proxied Dremio

Greetings, Dremio friends,

I have a problem where query responses in the web UI sometimes never come back.

Our analysts are getting really excited about the new superpowers we’ve given them with Dremio. But sometimes, almost at random, a query will run and take far longer than expected. The execution timer will count up… and up… never returning anything.

If a user looks at the “jobs” listing when this happens, though, they’ll see the query executed as expected and the server just forgot to send back the results. The user can click “preview” or “run” again and they’ll generally get back what they expected.

Our setup:

I’ve implemented a series of boutique-scale, VMware-backed on-premise Dremio clusters running in Kubernetes. The k8s cluster is managed with Rancher, and services are exposed via a MetalLB load balancer to the client-specific isolated network segment in which the k8s nodes and support VMs are contained. From there, an Ubuntu 16 virtual machine running Nginx proxies all services through a single external port, available to users in the office network.

The Dremio instances are currently running dremio-oss 4.1.3 (4.1.3-202001022113020736-53142377), and primarily accessing Parquet files stored on local S3/Minio data lake storage, with occasional forays to remote SQL, MySQL, and Elasticsearch instances. The system is performing admirably, other than this one issue. Various worker instances are scheduled in Kubernetes to do data ingest, feeding an internal Kafka service, which is consumed by NiFi to write Parquet files that Dremio reads. In-house data automation for the win!

Now, apart from local access, we’ve also set up limited access to Dremio for cloud services to consume, specifically Google Data Studio and Tableau.

To do this, I modified the default Dremio Helm chart and Docker image to establish SSH tunnels from the master node to a remote host on AWS. The tunnelled ODBC connection is visible from the tunnel endpoint host directly, while I proxy the tunnelled HTTP endpoint through AWS load balancers, applying an SSL cert to the connection.

Weirdness:

When I access the Dremio UI via the tunnelled AWS connection, I never, ever get a stalled query response. It performs as expected, every time. However, when I connect through the locally-proxied connection (to the same Dremio cluster) I will see the problem. Thus, I’ve concluded this is an issue with the proxy server. The main reverse-proxy config is:

  location / {
    proxy_pass http://172.16.3.44:9047/;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP  $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_read_timeout 3600;
  }

This would seem to be a pretty standard reverse-proxying setup. Things I have tried:

  • Disabling all nginx caching
  • Querying different sources (hangs happen on S3, SQL, MySQL, imported files, etc.)
  • Fiddling with various keepalive and client timeout values
  • Increasing the number of open files Nginx can access in the operating system
  • Proxying with HTTP/2
  • Disabling compression
  • Rebooting the proxy (interestingly, the results show up on a hung query after a reboot)
  • Accessing Dremio on port 80
  • Updating Nginx to the latest version
  • Updating Dremio to the latest version
  • Burning incense, human sacrifice

But, still, no luck. I’m honestly at a loss at this point. Because we have the remote AWS interface, my users are happy for now, but the whole point of running this system on-premise is so I don’t have to depend on cloud services of any kind to deliver the core functionality of our data platform, and this workaround breaks that.

What am I doing wrong here? Has anybody else experienced this?

Did you also configure Nginx to proxy WebSocket connections? http://nginx.org/en/docs/http/websocket.html

Yes, as shown in the example config in my post, I’m issuing the proper connection upgrade directives to make Nginx do websockets. It works most of the time, just not all of the time.
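
Note for readers comparing configs: $connection_upgrade is not a built-in Nginx variable. The WebSocket proxying page linked above defines it with a map block, and Nginx refuses to load a configuration that references an undefined variable, so some form of that map must already exist in the setup described here. It is not shown in the original post, so the following is only a sketch that mirrors the nginx.org example, not a copy of the poster's actual config:

  # Assumed to live in the http { } context, outside any server block.
  # Not part of the original post; taken from the nginx.org WebSocket example.
  map $http_upgrade $connection_upgrade {
    default upgrade;  # client sent an Upgrade header: pass "Connection: upgrade" upstream
    ''      close;    # no Upgrade header: pass "Connection: close" upstream
  }

With this map in place, ordinary HTTP requests are proxied with "Connection: close", and only requests that carry an Upgrade header are switched over to a WebSocket.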

Bump. Anybody else experiencing this proxying issue in Nginx?

We do. Did you find a solution?

@M.Gross

Since that was almost 4 years back, let us do a fresh baseline. What version of Dremio are you running? What is the exact issue in your case? Random hung query responses?

Sounds like a good idea. We are currently running 24.2.5-202311070743190643-0e5f9039 CE.

After opening the UI and leaving it idle for some time, queries never show a result and just keep running. From the query log you can tell that it is only the UI, and not the query, that is unresponsive. Judging by the changelog for 24.2.6, this issue might be resolved there?

@M.Gross That is a known issue where the query editor says the query is not complete while the jobs page shows it as complete. It should be fixed in 24.2.6 or 24.3.