Problems accessing S3 when running Dremio in a private subnet

When running Dremio 4.1.3-202001022113020736-53142377 in a private subnet, connecting to S3 fails with the following error:

software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Connect to sts.us-east-1.amazonaws.com:443 [sts.us-east-1.amazonaws.com/52.94.241.138] failed: connect timed out

This happens both when using an EC2 instance role (‘EC2 Metadata’ in the Dremio UI) and IAM Access Key/Secret.

It should be possible to resolve this by adding a VPC interface endpoint for STS, but I am running in eu-west-1 so can only add a VPC endpoint for STS in this region, not in us-east-1.

By adding a NAT gateway and proper routing, it is possible to make it work, but it still fails intermittently: the same timeout error occurs from time to time, and then disappears again.

I have been able to reproduce this behavior several times.

My questions:

  1. Why does Dremio contact the STS service in us-east-1? Can this be changed to the region where the Dremio instance is running?
  2. When using IAM Access Key/Secret, is it logical that Dremio contacts the STS service at all?
  3. When adding a NAT gateway, why does the timeout error still occur from time to time?

Please note: the NAT gateway is only a partial workaround. The timeout still occurs from time to time, and the only ways to make it go away are waiting for a long (??) time or rebooting the service, which makes Dremio effectively unusable for day-to-day use.
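For anyone trying to reproduce the timeout outside Dremio, a plain TCP probe run from the Dremio host can show which STS endpoints are reachable from the subnet. This is a minimal sketch using only the Python standard library; the function names `can_connect` and `probe` are made up for illustration, and the 5-second timeout is an arbitrary assumption.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connect timeouts, refused connections and DNS failures.
        return False


def probe(hosts, port=443, timeout=5.0):
    """Map each host name to True/False depending on TCP reachability."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Calling `probe(["sts.us-east-1.amazonaws.com", "sts.eu-west-1.amazonaws.com"])` on the instance and repeating it over time may help tell an intermittent routing problem apart from something inside Dremio.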

Hello @FreeWillaert,

Welcome to the Dremio community!

  1. By default, we contact the STS service in us-east-1 as a fallback when no region is set. To set a specific region, edit the S3 source and open the Advanced Options page from the left-hand side. Under Connection Properties, add a property with the name “dremio.s3.region” and specify a region, e.g. “eu-west-1”, as the value. That will allow Dremio to connect to the proper STS endpoint for your region.
  2. We use the STS service to do additional validation checks for the provided credentials. It’s possible to bypass this check entirely by checking the “Enable compatibility mode (experimental)” checkbox. This compatibility mode is generally meant for other S3-compatible sources such as Minio that don’t support STS, but it would effectively remove this STS call. However, as the option notes, this is experimental, and since you are using AWS infrastructure it would be best to leave it unchecked.
  3. I don’t believe we have seen this timeout on our end. As we are dependent on AWS infrastructure for this request, it may be an intermittent issue with AWS and outside of our control (however, this is speculation as we haven’t seen this issue ourselves.)
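To illustrate the fallback described in point 1, here is a rough sketch of how an STS host name could be chosen from the connection properties. It is not Dremio's actual code: the function `sts_host` and the constant `FALLBACK_REGION` are hypothetical, and only the property name “dremio.s3.region” and the us-east-1 fallback come from the answer above.

```python
FALLBACK_REGION = "us-east-1"  # fallback region mentioned in the reply above


def sts_host(connection_properties):
    """Pick the regional STS host, falling back when no region is set."""
    region = connection_properties.get("dremio.s3.region") or FALLBACK_REGION
    return f"sts.{region}.amazonaws.com"


print(sts_host({}))                                  # sts.us-east-1.amazonaws.com
print(sts_host({"dremio.s3.region": "eu-west-1"}))   # sts.eu-west-1.amazonaws.com
```

The second call shows why setting the property matters in a private subnet: the resulting host name is the one a VPC interface endpoint in eu-west-1 can serve.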

Hope this answers all of your questions! Please feel free to follow up if you need additional help.

Sincerely,
@RyanTse

Hi Ryan,

Thanks for your fast and clear response. Note that I forgot to state explicitly that we’re running on AWS, but this seems to have been obvious :)
Adding the dremio.s3.region property seems to have some effect - at least the intermittent errors have not occurred since.

However, after adding this property (and enabling private DNS names on the interface endpoint!) I now get a new error in the UI:

The source [“xxxx”] is currently unavailable. Info: [com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to s3.amazonaws.com:80 [s3.amazonaws.com/52.216.226.43] failed: connect timed out]

I cannot create a VPC endpoint to s3[dot]amazonaws[dot]com, but only to s3[dot]eu-west-1[dot]amazonaws[dot]com, afaik.
I’m not sure, but could/should the connection not go to s3[dot]eu-west-1[dot]amazonaws[dot]com?

As an aside, strangely enough the new error does not appear in server.log, but there is some kind of exception mapper failure around the same time. This may be a separate issue with error handling?

2020-02-18 08:50:55,682 [qtp1690263751-132] ERROR o.g.j.server.ServerRuntime$Responder - An exception has been thrown from an exception mapper class com.dremio.dac.server.UserExceptionMapper.
java.lang.NullPointerException: at index 0
[…]
2020-02-18 08:50:55,685 [qtp1690263751-132] ERROR o.g.j.server.ServerRuntime$Responder - An exception was not mapped due to exception mapper failure. The HTTP 500 response will be returned.
com.dremio.common.exceptions.UserException: Failure creating/updating source [xxxxxxx].
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:776)

Of course I can provide the full error logs if you wish.

Thanks,
Frederik

PS. the [dot]s are because as a new user, I cannot post more than two links in one message :o)

Hello @FreeWillaert,

You should be able to set the S3 endpoint with the Connection Property key “fs.s3a.endpoint”, using the full domain name as the value (e.g. s3.eu-west-1.amazonaws.com). If that still doesn’t work, please provide the full error logs.
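As a compact summary, the two connection properties named in this thread can be written down as name/value pairs, one pair per row under Connection Properties. The sketch below only builds and prints that list; the helper `s3_connection_properties` is hypothetical, and the name/value shape simply mirrors the two fields in the UI rather than any documented payload.

```python
import json


def s3_connection_properties(region):
    """Build the name/value pairs to enter under Connection Properties."""
    return [
        # Makes Dremio use the regional STS endpoint instead of us-east-1.
        {"name": "dremio.s3.region", "value": region},
        # Points the S3 client at the regional S3 endpoint.
        {"name": "fs.s3a.endpoint", "value": f"s3.{region}.amazonaws.com"},
    ]


print(json.dumps(s3_connection_properties("eu-west-1"), indent=2))
```

Deriving both values from a single region argument keeps the STS and S3 endpoints in the same region, which is the property the private-subnet setup relies on.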

Sincerely,
@RyanTse

Hi Ryan,

Thank you!! This works fine now.
Just for the record - perhaps for other people who stumble upon the same issue, I now have the following settings:

[image: screenshot of the source settings]

with Dremio running in a private subnet without NAT but with

  • a VPC interface endpoint for the STS service, with private DNS names enabled (sts.eu-west-1.amazonaws.com)
  • a VPC gateway endpoint for S3 service (com.amazonaws.eu-west-1.s3)
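For readers who script their infrastructure, the two endpoints listed above could be described as parameter sets for the EC2 `CreateVpcEndpoint` API. This is only a sketch: the helper names are hypothetical, the VPC, subnet, security group and route table IDs are placeholders you would supply yourself, and the STS service name `com.amazonaws.<region>.sts` is assumed by analogy with the S3 service name quoted above.

```python
def sts_interface_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Parameters for an STS interface endpoint with private DNS enabled."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.sts",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Lets sts.<region>.amazonaws.com resolve to the endpoint inside the VPC.
        "PrivateDnsEnabled": True,
    }


def s3_gateway_endpoint_params(region, vpc_id, route_table_ids):
    """Parameters for an S3 gateway endpoint attached to the given route tables."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }
```

With boto3 installed and credentials configured, each dictionary could be passed as `boto3.client("ec2", region_name="eu-west-1").create_vpc_endpoint(**params)`; the helpers themselves make no AWS calls.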

Thanks!