Dremio Cloud unable to start engine instances

chough · February 16, 2023, 8:34pm

When attempting to run EC2 instances for Engines in Dremio Cloud, the instance starts up and then quickly shuts itself down.

The end of the EC2 system log shows the following:
2023/02/16 20:02:06Z: Amazon SSM Agent v3.1.1732.0 is running
2023/02/16 20:02:06Z: OsProductName: Amazon Linux
2023/02/16 20:02:06Z: OsVersion: 2
[ 124.753674] nessie.sh[4747]: /opt/dremio/bin/vendor_specific.sh: line 25: local: `=': not a valid identifier
[ 124.773346] nessie.sh[4747]: Nessie IP (gw.dremio.cloud) is not a valid IP

This exact behavior occurs both when using the CloudFormation “path” in which the CloudFormation stack creates all required resources, as well as when using the “manual path” in which the required resources are created manually and then the ARNs, etc. are configured in the wizard.

This appears to be a bug in Dremio as there is no user-config that would affect the vendor_specific.sh script or the Dremio Cloud IP.
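For readers who hit the same log line: in bash, the message "local: `=': not a valid identifier" is what the `local` builtin typically prints when an assignment is written with spaces around the equals sign, so that `=` is parsed as a variable name. The sketch below reproduces that message. The function and variable names are hypothetical, and this is not the content of vendor_specific.sh, which is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction of the bash error message seen in the log.
# This is NOT the code from /opt/dremio/bin/vendor_specific.sh.

broken() {
  # Spaces around "=" make bash pass "=" to `local` as a name to declare,
  # which it rejects as an invalid identifier.
  local endpoint = "gw.dremio.cloud"
  echo "endpoint=${endpoint}"
}

fixed() {
  # No spaces: this is a normal local assignment.
  local endpoint="gw.dremio.cloud"
  echo "endpoint=${endpoint}"
}

# Capture stderr from the broken variant so the error text can be inspected.
broken_output="$(broken 2>&1)"
fixed_output="$(fixed)"

echo "${broken_output}"
echo "${fixed_output}"
```

In the broken variant the variable is left empty, so anything validated afterwards would see an empty or unexpected value. Whether that is what happened on line 25 of the real script is an assumption.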

cindy.la · February 16, 2023, 8:35pm

Hi @chough !

Welcome to the Dremio Community. I can look into this further for you.

Could you share your organization ID with me via private message?

Cindy

chough · February 16, 2023, 8:45pm

Hi Cindy, sure, I can do that. However, I don’t see a way to private message you. Could you send me a message first?

cindy.la · February 16, 2023, 9:04pm

Hi @chough ,

I have sent you a message.

Cindy

cindy.la · February 16, 2023, 9:10pm

I see that your engines are failing to scale up with the error message “Scaling failed. Replica creation timeout exceeded” on the engine events page (Project Settings → Engines → Engine Events). This typically means that your engine instances do not have outbound connectivity.

https://docs.dremio.com/cloud/appendix/troubleshooting-cloud-resources/#ec2-instances-are-not-launching

I recommend checking the following, depending on how your VPC network is configured:

DNS hostnames and DNS resolution are enabled on your VPC
Auto-assign public IPv4 address is enabled on your public subnets
The route table routes to an internet gateway for public subnets or to a NAT gateway for private subnets
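The three checks above can be sketched with the AWS CLI. This assumes the `aws` command is installed and configured with credentials that can describe EC2 resources. The function names are hypothetical, and the IDs are supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: wraps AWS CLI calls for the three VPC checks listed above.
# Function names are illustrative; pass your own VPC and subnet IDs.

vpc_dns_settings() {
  # Prints the DNS resolution setting, then the DNS hostnames setting.
  local vpc_id="$1"
  aws ec2 describe-vpc-attribute --vpc-id "${vpc_id}" \
    --attribute enableDnsSupport --query 'EnableDnsSupport.Value' --output text
  aws ec2 describe-vpc-attribute --vpc-id "${vpc_id}" \
    --attribute enableDnsHostnames --query 'EnableDnsHostnames.Value' --output text
}

subnet_auto_public_ip() {
  # Prints whether the subnet auto-assigns public IPv4 addresses.
  local subnet_id="$1"
  aws ec2 describe-subnets --subnet-ids "${subnet_id}" \
    --query 'Subnets[0].MapPublicIpOnLaunch' --output text
}

subnet_default_route_targets() {
  # Prints the internet gateway / NAT gateway targets of the 0.0.0.0/0 route
  # in the route table explicitly associated with the subnet.
  # Subnets that rely on the VPC's main route table have no explicit
  # association, so this prints nothing for them.
  local subnet_id="$1"
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=${subnet_id}" \
    --query "RouteTables[].Routes[?DestinationCidrBlock=='0.0.0.0/0'][].[GatewayId,NatGatewayId]" \
    --output text
}
```

For example, `vpc_dns_settings <your-vpc-id>` followed by `subnet_default_route_targets <your-subnet-id>` for each subnet given to Dremio Cloud. A private subnet would be expected to show a NAT gateway ID as the default route target, and a public subnet an internet gateway ID.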

chough · February 16, 2023, 9:17pm

Thank you! I will explore that further and report back.

chough · February 16, 2023, 10:19pm

@cindy.la

I’ve gone through the steps in the document you linked and the issue is not resolved. It appears that the /opt/dremio/bin/vendor_specific.sh script references the Dremio forwarder endpoint “gw.dremio.cloud”; however, the script expects an IP address rather than a DNS hostname. Is it possible that the user data, script, or hostname was updated recently and introduced this bug? I see that the AMI being used was created on 02/07/2023, which seems recent enough that this may not have been detected yet.
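To illustrate the kind of check that would produce “is not a valid IP” for a hostname, here is a minimal sketch of an IPv4 validity test in bash. It is an assumption about the general technique, not the actual logic in vendor_specific.sh; the function names and the sample address 10.0.0.12 are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical IPv4 check; the real validation in vendor_specific.sh is not
# shown in this thread.

is_ipv4() {
  local candidate="$1"
  # One octet: 0-255 without leading zeros.
  local octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
  # Four octets separated by literal dots, anchored at both ends.
  local ipv4_re="^${octet}(\\.${octet}){3}\$"
  [[ "${candidate}" =~ ${ipv4_re} ]]
}

classify_endpoint() {
  # Prints "ip" for a dotted-quad IPv4 literal, "hostname" for anything else.
  local endpoint="$1"
  if is_ipv4 "${endpoint}"; then
    echo "ip"
  else
    echo "hostname"
  fi
}

# 10.0.0.12 is an arbitrary example address.
for endpoint in "10.0.0.12" "gw.dremio.cloud"; do
  printf '%s -> %s\n' "${endpoint}" "$(classify_endpoint "${endpoint}")"
done
```

A script that only accepts the "ip" case would reject gw.dremio.cloud, which matches the symptom in the log above.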

cindy.la · February 16, 2023, 10:34pm

Hi @chough ,

I was able to successfully launch a new organization with CFT and spin up engines on my end.

Are you able to spin up an EC2 instance (not the one Dremio Cloud is launching) in the same VPC network you are trying to use for Dremio Cloud and run this curl command? This will check for outbound connectivity from your subnets.

curl -v https://gw.dremio.cloud

We also have a prerequisite page here that you can look over if you haven’t yet. A few things to note: you cannot mix public and private subnets, subnets cannot be in the same availability zone, etc.

https://docs.dremio.com/cloud/getting-started/prerequisites/

chough · February 16, 2023, 10:40pm

Thanks for the additional info.

Yes, I am able to hit https://gw.dremio.cloud from an EC2 instance inside of the subnet without any issues. I have also confirmed that the only subnets I have specified are private (none mixed with public) and that the two subnets are in different availability zones.

cindy.la · February 16, 2023, 10:56pm

Thanks for checking! I’d like to look into this further on my end. I’ll get back to you as soon as possible.

cindy.la · February 16, 2023, 11:01pm

@chough ,
I just wanted to confirm: did you create this current project with the CloudFormation template from the Dremio Cloud UI?

And did you have a NAT gateway set up (in a public subnet whose route table points to the Internet Gateway)?

chough · February 17, 2023, 5:54pm

Yes, the project in that organization was created with the CloudFormation template from the Dremio Cloud UI. I had originally set up a different organization and created the resources manually instead of using the CloudFormation template. After hitting this same issue, I tried a clean slate with the CloudFormation template to see where I went wrong, and ended up hitting the same issue anyway.

And yes, there is a NAT gateway set up, which allows the private subnet to communicate with the public internet. I have confirmed that I can reach gw.dremio.cloud from within the private subnet where the Dremio engine nodes are created.

cindy.la · February 17, 2023, 6:46pm

Hi @chough ,

Thanks for being patient as we looked into this and bringing this to our attention! It is indeed causing an issue for AWS PrivateLink connections. We are currently working on a fix and I can update you once that is live!

Cindy

chough · February 17, 2023, 7:03pm

Hi Cindy,

Thanks for confirming! Is there any workaround that we might be able to use in the meantime while waiting for a fix?

cindy.la · February 17, 2023, 7:56pm

Hi @chough ,

If you are OK using public subnets for now, you can use those as a workaround.

chough · February 17, 2023, 9:05pm

I’ll have to keep it in a private subnet for now, but I may attempt configuring it without using PrivateLink. Do you have any documentation on a configuration without PrivateLink? I saw it was noted in the setup wizard as optional, but didn’t see what the alternative configuration would be.

chough · February 21, 2023, 5:14pm

@cindy.la

It looks like my attempt to run it in a private subnet and omit the PrivateLink configuration was unsuccessful. With that, are you able to provide a target timeline for when this PrivateLink bug might be resolved?

cindy.la · February 22, 2023, 8:23pm

Hi @chough ,

Apologies for the late response. We rolled out the PrivateLink fix over the weekend and I was able to launch Dremio Cloud in a private subnet on my end.

I was viewing your organization and noticed that it is in a bad state. Did you modify any IAM permissions or roles recently?

Cindy

chough · February 22, 2023, 8:49pm

Hi @cindy.la

Thanks for providing the update! I did mess with that organization as I was re-deploying and testing different configs in different orgs, so I will re-deploy and see if it is working as expected and report back.

cindy.la · February 22, 2023, 9:57pm

Sounds good. When you spin up the engine and see the instance in your AWS Dashboard, could you also check the instance’s system log for a section called “Dremio Node Diagnostics” and send it over? It should appear toward the end of the system log, and it checks S3 connectivity and Dremio Gateway connectivity.
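One way to pull that section out of a system log from the command line is sketched below. The helper name is hypothetical, and because the exact layout of the diagnostics section is not shown in this thread, the sketch simply prints everything from the first line that mentions “Dremio Node Diagnostics” to the end of the log.

```shell
#!/usr/bin/env bash
# Sketch: print the tail of a system log starting at the diagnostics section.
# The section layout is an assumption; only its title comes from this thread.

extract_diagnostics() {
  # Reads a log on stdin; once the section title is seen, prints that line
  # and every line after it.
  awk '/Dremio Node Diagnostics/ { found = 1 } found { print }'
}
```

Assuming the AWS CLI is installed and configured, the log can be piped in with `aws ec2 get-console-output --instance-id <instance-id> --query Output --output text | extract_diagnostics`, where `<instance-id>` is the engine instance shown in the EC2 console. If the instance shut down before the section was written, the command prints nothing.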
