Dremio Cloud unable to start engine instances

When attempting to run EC2 instances for Engines in Dremio Cloud, the instance starts up and then quickly shuts itself down.

The end of the EC2 log shows the following:
2023/02/16 20:02:06Z: Amazon SSM Agent v3.1.1732.0 is running
2023/02/16 20:02:06Z: OsProductName: Amazon Linux
2023/02/16 20:02:06Z: OsVersion: 2
[ 124.753674] nessie.sh[4747]: /opt/dremio/bin/vendor_specific.sh: line 25: local: `=': not a valid identifier
[ 124.773346] nessie.sh[4747]: Nessie IP (gw.dremio.cloud) is not a valid IP

This exact behavior occurs both with the CloudFormation “path”, in which the CloudFormation stack creates all required resources, and with the “manual path”, in which the required resources are created manually and their ARNs, etc. are then configured in the wizard.

This appears to be a bug in Dremio, as there is no user configuration that would affect the vendor_specific.sh script or the Dremio Cloud IP.

Hi @chough !

Welcome to the Dremio Community. I can look into this further for you.

Could you share your organization ID with me via private message?

Cindy

Hi Cindy, sure I can do that. I don’t see a way to private message you however. Could you send me a message first?

Hi @chough ,

I have sent you a message.

Cindy

I see that your engines are failing to scale up with the error message “Scaling failed. Replica creation timeout exceeded” on the engine events page (Project Settings → Engines → Engine Events). This typically means that your engine instances are not able to make outbound network connections.

https://docs.dremio.com/cloud/appendix/troubleshooting-cloud-resources/#ec2-instances-are-not-launching

I recommend checking the following, depending on how your VPC network is configured:

  • DNS hostnames and DNS resolution are enabled on your VPC
  • Auto-assign public IPv4 address is enabled on your public subnets
  • The route table routes to an internet gateway for public subnets or to a NAT gateway for private subnets
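The checklist above can be written down as a small sketch. This is only an illustration: the dictionary layout, the `Public` and `DefaultRouteTarget` keys, and the subnet IDs are hypothetical, and the values would have to be filled in by hand from the VPC console. `EnableDnsHostnames`, `EnableDnsSupport`, and `MapPublicIpOnLaunch` follow the AWS attribute names, and the `igw-` / `nat-` prefixes are the usual AWS resource ID prefixes.

```python
def vpc_findings(vpc: dict) -> list[str]:
    """Return a list of problems found in a hand-written VPC description."""
    findings = []
    if not vpc.get("EnableDnsHostnames"):
        findings.append("DNS hostnames are disabled on the VPC")
    if not vpc.get("EnableDnsSupport"):
        findings.append("DNS resolution is disabled on the VPC")
    for subnet in vpc.get("Subnets", []):
        sid = subnet["SubnetId"]
        # Hypothetical key: the target of the subnet's 0.0.0.0/0 route.
        target = subnet.get("DefaultRouteTarget", "")
        if subnet.get("Public"):
            if not subnet.get("MapPublicIpOnLaunch"):
                findings.append(f"{sid}: auto-assign public IPv4 is off")
            if not target.startswith("igw-"):
                findings.append(f"{sid}: default route does not target an internet gateway")
        elif not target.startswith("nat-"):
            findings.append(f"{sid}: default route does not target a NAT gateway")
    return findings


# Example: one private subnet routed through a NAT gateway (IDs are made up).
example = {
    "EnableDnsHostnames": True,
    "EnableDnsSupport": True,
    "Subnets": [
        {"SubnetId": "subnet-a", "Public": False, "DefaultRouteTarget": "nat-1"},
    ],
}
print(vpc_findings(example))
```

An empty list means none of the three items above were flagged. It does not prove that the engines can actually reach Dremio Cloud.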

Thank you! I will explore that further and report back.

@cindy.la

I’ve gone through the steps in the document you linked and the issue is not resolved. It appears that the /opt/dremio/bin/vendor_specific.sh script references the Dremio forwarder endpoint “gw.dremio.cloud”, but the script expects an IP address rather than a DNS hostname. Is it possible that the user data, script, or hostname was updated recently and introduced this bug? I see that the AMI being used was created on 02/07/2023, which seems recent enough that the problem may not have been detected yet.
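To illustrate the kind of mismatch described here: a strict IP-address check rejects a DNS hostname. The sketch below is not the logic in vendor_specific.sh or nessie.sh (their contents are not shown in this thread). It only uses Python's standard `ipaddress` module to show why “gw.dremio.cloud” would fail a check that accepts only IP literals. The address `10.0.0.5` is an arbitrary example value.

```python
import ipaddress


def is_ip_literal(value: str) -> bool:
    """Return True only if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


for candidate in ("10.0.0.5", "gw.dremio.cloud"):
    print(candidate, is_ip_literal(candidate))
# prints:
# 10.0.0.5 True
# gw.dremio.cloud False
```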

Hi @chough ,

I was able to successfully launch a new organization with CFT and spin up engines on my end.

Are you able to spin up an EC2 instance (not the one Dremio Cloud is launching) in the same VPC network you are trying to use for Dremio Cloud and run this curl command? This will check for outbound connectivity from your subnets.

curl -v https://gw.dremio.cloud
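If curl is not available on the test instance, a rough equivalent can be sketched with Python's standard `socket` module. The function name `check_outbound` and its result keys are made up for this example. It only resolves the hostname and opens a TCP connection on port 443, and it does not perform the TLS handshake or HTTP request that `curl -v` does, so it is a narrower check.

```python
import socket


def check_outbound(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Resolve host, then try to open a TCP connection to host:port."""
    result = {"host": host, "resolved": [], "connected": False, "error": None}
    try:
        # DNS step: collect the distinct addresses the name resolves to.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
        # TCP step: succeeds only if a route and open port exist.
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:  # includes DNS failures, timeouts, refusals
        result["error"] = str(exc)
    return result
```

Running `print(check_outbound("gw.dremio.cloud"))` on an instance in the subnet in question shows whether the name resolves and whether the connection opens. An empty `resolved` list points at DNS, while a populated list with `connected` set to False points at routing or security group rules.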

We also have a prerequisite page here that you can look over if you haven’t yet. A few things to note: you cannot mix public and private subnets, subnets cannot be in the same availability zone, etc.

https://docs.dremio.com/cloud/getting-started/prerequisites/

Thanks for the additional info.

Yes, I am able to hit https://gw.dremio.cloud from an EC2 instance inside of the subnet without any issues. I have also confirmed that the only subnets I have specified are private (none mixed with public) and that the two subnets are in different availability zones.

Thanks for checking! I’d like to look into this further on my end. I’ll get back to you as soon as possible.

@chough ,
I just wanted to confirm: did you create this current project with the CloudFormation template from the Dremio Cloud UI?

And do you have a NAT gateway set up, placed in a public subnet whose route table points to the Internet Gateway?

Yes, the project in that organization was created with the CloudFormation template from the Dremio Cloud UI. I had originally set up a different organization and created the resources manually (instead of using the CloudFormation template). After hitting this same issue, I decided to start from a clean slate and use the CloudFormation template to see where I went wrong, but I ended up hitting the same issue anyway.

And yes, there is a NAT gateway set up, which allows the private subnet to communicate with the public internet. I have confirmed that I can reach gw.dremio.cloud from within the private subnet where the Dremio engine nodes are created.

Hi @chough ,

Thanks for being patient as we looked into this and bringing this to our attention! It is indeed causing an issue for AWS PrivateLink connections. We are currently working on a fix and I can update you once that is live!

Cindy

Hi Cindy,

Thanks for confirming! Is there any workaround that we might be able to use in the meantime while waiting for a fix?

Hi @chough ,

If you are OK using public subnets for now, you can use that in the meantime as a workaround.

I’ll have to keep it in a private subnet for now, but I may attempt configuring it without using PrivateLink. Do you have any documentation on a configuration without PrivateLink? I saw it was noted in the setup wizard as optional, but didn’t see what the alternative configuration would be.

@cindy.la

It looks like my attempt to run it in a private subnet while omitting the PrivateLink configuration was unsuccessful. With that, are you able to provide a target timeline for when this PrivateLink bug might be resolved?

Hi @chough ,

Apologies for the late response. We rolled out the PrivateLink fix over the weekend and I was able to launch Dremio Cloud in a private subnet on my end.

I was viewing your organization and noticed that it is in a bad state. Did you modify any IAM permissions or roles recently?

Cindy

Hi @cindy.la

Thanks for providing the update! I did mess with that organization as I was re-deploying and testing different configs in different orgs, so I will re-deploy and see if it is working as expected and report back.

Sounds good. When you spin up the engine and see the instance in your AWS Dashboard, could you also check the instance’s system log for a section called “Dremio Node Diagnostics” and send it over? It should appear toward the end of the system log, and it checks for S3 connectivity and Dremio Gateway connectivity.