When I attempt to run EC2 instances for engines in Dremio Cloud, each instance starts up and then quickly shuts itself down.
The end of the EC2 logs shows the following:
2023/02/16 20:02:06Z: Amazon SSM Agent v3.1.1732.0 is running
2023/02/16 20:02:06Z: OsProductName: Amazon Linux
2023/02/16 20:02:06Z: OsVersion: 2
[ 124.753674] nessie.sh[4747]: /opt/dremio/bin/vendor_specific.sh: line 25: local: `=': not a valid identifier
[ 124.773346] nessie.sh[4747]: Nessie IP (gw.dremio.cloud) is not a valid IP
This exact behavior occurs both with the CloudFormation “path”, in which the CloudFormation stack creates all required resources, and with the “manual path”, in which the required resources are created manually and their ARNs, etc. are then configured in the wizard.
This appears to be a bug in Dremio as there is no user-config that would affect the vendor_specific.sh script or the Dremio Cloud IP.
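For context on the first log line: bash reports "not a valid identifier" for `=` when the `local` builtin receives a bare `=` as one of its arguments, which happens, for example, when an assignment is written with spaces around the `=`. The contents of vendor_specific.sh are not shown in this thread, so the sketch below is only a hypothetical reproduction of that class of error. The function and variable names are invented for illustration and are not taken from the Dremio script.

```bash
#!/usr/bin/env bash
# Hypothetical reproduction; NOT the contents of vendor_specific.sh.

broken_assign() {
  # Spaces around '=' make `local` receive separate arguments
  # (nessie_ip, =, and the value) instead of one name=value word.
  # bash rejects the bare '=' as "not a valid identifier", and
  # nessie_ip is declared but left empty.
  local nessie_ip = "$1"
  echo "nessie_ip='${nessie_ip}'"
}

fixed_assign() {
  # No spaces: a single name=value word, so the value is assigned.
  local nessie_ip="$1"
  echo "nessie_ip='${nessie_ip}'"
}
```

In bash, calling `broken_assign 10.0.0.1` prints the identifier error and leaves the variable empty, while `fixed_assign 10.0.0.1` assigns the value as intended. Whether the real script fails for this reason is an assumption.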
I see that your engines are failing to scale up with the error message “Scaling failed. Replica creation timeout exceeded” in the engine events page (Project Settings → Engines → Engine Events). This typically means that your engines are unable to make outbound connections from your subnets.
I’ve gone through the steps in the document you linked and the issue is not resolved. It appears that the /opt/dremio/bin/vendor_specific.sh script references the Dremio forwarder endpoint “gw.dremio.cloud”; however, the script expects an IP address rather than a DNS hostname. Is it possible that the user data, script, or hostname was updated recently and introduced this bug? I see that the AMI being used was created on 02/07/2023, which seems recent enough that this may not have been detected yet.
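To illustrate the distinction being described, here is a minimal sketch of a check that accepts only an IPv4 literal, plus a helper that resolves a hostname before validating it. This is not Dremio's code: `is_ipv4` and `to_ipv4` are hypothetical names, and the helper assumes a glibc system (such as Amazon Linux 2) where `getent ahostsv4` is available.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; not taken from vendor_specific.sh.

# Return 0 if $1 is a dotted-quad IPv4 literal with octets in 0-255.
is_ipv4() {
  case "$1" in
    # Reject empty input, non-digit/dot characters, and misplaced dots.
    "" | *[!0-9.]* | .* | *. | *..*) return 1 ;;
  esac
  local IFS=. octet
  set -- $1                      # split the address on dots
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
}

# Print an IPv4 address for $1: the value itself if it is already a
# literal, otherwise the first address reported by getent (assumption:
# glibc getent with the ahostsv4 database).
to_ipv4() {
  local resolved
  if is_ipv4 "$1"; then
    printf '%s\n' "$1"
  else
    resolved=$(getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }')
    [ -n "$resolved" ] && printf '%s\n' "$resolved"
  fi
}
```

With a check like `is_ipv4`, a value such as gw.dremio.cloud is rejected even though it resolves fine, which would be consistent with the "is not a valid IP" log line above.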
I was able to successfully launch a new organization with CFT and spin up engines on my end.
Are you able to spin up an EC2 instance (not one that Dremio Cloud is launching) in the same VPC you are trying to use for Dremio Cloud and run this curl command? This will check for outbound connectivity from your subnets.
curl -v https://gw.dremio.cloud
We also have a prerequisite page here that you can look over if you haven’t yet. A few things to note: you cannot mix public and private subnets, subnets cannot be in the same availability zone, etc.
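If it helps to repeat the curl check from several subnets, it can be wrapped in a small function. This is only a sketch: `check_outbound` is a made-up name, and the 10-second connect timeout is an arbitrary choice. Any HTTP response, including an error status, counts as reachable here because the point is to test network connectivity rather than the endpoint's behavior.

```bash
#!/usr/bin/env bash
# Hypothetical helper around the curl check suggested in this thread.

check_outbound() {
  local url="${1:-https://gw.dremio.cloud}" code
  # -sS: quiet, but still print errors; --connect-timeout bounds the
  # connection attempt; -o /dev/null discards the body; -w prints only
  # the HTTP status code.
  if code=$(curl -sS --connect-timeout 10 -o /dev/null -w '%{http_code}' "$url"); then
    echo "reachable: ${url} (HTTP ${code})"
  else
    echo "unreachable: ${url}" >&2
    return 1
  fi
}
```

Running `check_outbound` with no arguments on an instance in each subnet tests the gateway hostname mentioned above; a non-zero exit status means curl could not complete the request.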
Yes, I am able to hit https://gw.dremio.cloud from an EC2 instance inside of the subnet without any issues. I have also confirmed that the only subnets I have specified are private (none mixed with public) and that the two subnets are in different availability zones.
Yes, the project in that organization was created with the CloudFormation template from the Dremio Cloud UI. I had originally set up a different organization and created the resources manually (instead of using the CloudFormation template). After hitting this same issue, I decided to start from a clean slate and use the CloudFormation template to see where I went wrong, but I ended up hitting the same issue anyway.
And yes, there is a NAT gateway set up, which allows the private subnet to communicate with the public internet. I have confirmed that I can reach gw.dremio.cloud from within the private subnet where the Dremio engine nodes are created.
Thanks for being patient as we looked into this and bringing this to our attention! It is indeed causing an issue for AWS PrivateLink connections. We are currently working on a fix and I can update you once that is live!
I’ll have to keep it in a private subnet for now, but I may attempt configuring it without using PrivateLink. Do you have any documentation on a configuration without PrivateLink? I saw it was noted in the setup wizard as optional, but didn’t see what the alternative configuration would be.
It looks like my attempt to run it in a private subnet without the PrivateLink configuration was unsuccessful. With that, are you able to provide a target timeline for when this PrivateLink bug might be resolved?
Apologies for the late response. We rolled out the PrivateLink fix over the weekend and I was able to launch Dremio Cloud in a private subnet on my end.
I was viewing your organization and noticed that it is in a bad state. Did you modify any IAM permissions or roles recently?
Thanks for providing the update! I did mess with that organization as I was re-deploying and testing different configs in different orgs, so I will re-deploy and see if it is working as expected and report back.
Sounds good. When you spin up the engine and see the instance in your AWS Dashboard, could you also check the instance’s system log for a section called “Dremio Node Diagnostics” and send it over? That section should appear toward the end of the system log, and it checks for S3 connectivity and Dremio Gateway connectivity.
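One way to pull that section out without scrolling through the console is to filter the system log from the command line. The sketch below assumes only that the section heading contains the text “Dremio Node Diagnostics”, as described above; the layout of the section itself is not shown in this thread, so the filter simply prints everything from the heading to the end of the log. `diagnostics_section` is a made-up helper name and the instance ID is a placeholder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the log from the first line that mentions
# the diagnostics heading through to the end of the input.
diagnostics_section() {
  sed -n '/Dremio Node Diagnostics/,$p'
}

# Example usage (requires AWS CLI credentials; replace the placeholder
# instance ID with the engine instance's ID):
#   aws ec2 get-console-output --instance-id i-0123456789abcdef0 \
#     --query Output --output text | diagnostics_section
```

If the command prints nothing, the heading was not found in the captured output, which may simply mean the instance had not reached that stage when the log was retrieved.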