Greetings, Dremio friends,
I have a problem: query responses in the web UI sometimes never come back.
Our analysts are getting really excited about the new superpowers we’ve given them with Dremio. But sometimes, almost at random, a query will run and take far longer than expected. The execution timer will count up… and up… never returning anything.
If a user looks at the “jobs” listing when this happens, though, they’ll see the query executed as expected and the server just forgot to send back the results. The user can click “preview” or “run” again and they’ll generally get back what they expected.
Our setup:
I’ve implemented a series of boutique-scale, VMware-backed on-premise Dremio clusters running in Kubernetes. The k8s cluster is managed with Rancher, and services are exposed via a MetalLB load balancer to the client-specific isolated network segment that contains the k8s nodes and support VMs. From there, an Ubuntu 16 virtual machine running Nginx proxies all services through a single external port, available to users in the office network.
The Dremio instances are currently running dremio-oss 4.1.3 (4.1.3-202001022113020736-53142377), and primarily accessing Parquet files stored on local S3/MinIO data lake storage, with occasional forays to remote SQL, MySQL, and Elasticsearch instances. The system is performing admirably, other than this one issue. Various worker instances are scheduled in Kubernetes to do data ingest, feeding an internal Kafka service, which is consumed by NiFi to write Parquet files that Dremio reads. In-house data automation for the win!
Now, apart from local access, we’ve also set up limited access to Dremio for cloud services to consume, specifically Google Data Studio and Tableau.
To do this, I modified the default Dremio Helm chart and Docker image to establish SSH tunnels from the master node to a remote host on AWS. The tunnelled ODBC connection is reachable directly from the tunnel endpoint host, while I proxy the tunnelled HTTP endpoint through AWS load balancers, applying an SSL cert to the connection.
Weirdness:
When I access the Dremio UI via the tunnelled AWS connection, I never, ever get a stalled query response. It performs as expected, every time. However, when I connect through the locally-proxied connection (to the same Dremio cluster) I will see the problem. Thus, I’ve concluded this is an issue with the proxy server. The main reverse-proxy config is:
    location / {
        proxy_pass http://172.16.3.44:9047/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 3600;
    }
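For anyone reading the config closely: `$connection_upgrade` is not a built-in Nginx variable, so the location block above depends on a `map` defined elsewhere in the `http` context. I haven't pasted mine, so treat the following as an assumed sketch of the usual definition from the Nginx WebSocket proxying documentation, not a copy of my actual file:

```nginx
# Assumed definition (not copied from the real config): sets the Connection
# header to "upgrade" when the client asks for a protocol upgrade (e.g.
# WebSocket), and to "close" when no Upgrade header is present.
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}
```

With this map, ordinary HTTP requests are proxied with `Connection: close`, while upgrade requests pass the upgrade through to Dremio.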
This would seem to be a pretty standard reverse-proxying setup. Things I have tried:
- Disabling all nginx caching
- Querying different sources (hangs happen on S3, SQL, MySQL, imported files, etc.)
- Fiddling with various keepalive and client timeout values
- Increasing the number of open files Nginx can access in the operating system
- Proxying with HTTP/2
- Disabling compression
- Rebooting the proxy (interestingly, the results show up on a hung query after a reboot)
- Accessing Dremio on port 80
- Updating Nginx to the latest version
- Updating Dremio to the latest version
- Burning incense, human sacrifice
But, still, no luck. I’m honestly at a loss at this point. Because we have the remote AWS interface, my users are happy for now, but the whole point of running this system on-premise is so I don’t have to depend on cloud services of any kind to deliver the core functionality of our data platform, and this workaround breaks that.
What am I doing wrong here? Has anybody else experienced this?