S3 bucket from non-Amazon provider

eellsworth · December 18, 2020, 9:24pm

Hi,
I am trying to set up Dremio to use data in an S3 bucket offered by a non-Amazon provider (i.e., Wasabi), as the costs are much, much lower. However, when I go to connect to an S3 bucket, I only see options to connect to AWS.

Does anyone have advice on configuring a connection to Wasabi?

Thanks,

Eric

ben · December 18, 2020, 10:22pm

Hi @eellsworth,

So long as the storage implements the S3 APIs, you should be able to use the S3 source plugin to configure your connection.

On our docs, we have an example for Minio that may be useful as an analogue to Wasabi: https://docs.dremio.com/data-sources/s3.html#configuring-s3-for-minio

Be sure to enable “compatibility mode”.
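
For reference, the connection properties suggested above might look like the following when pointed at Wasabi. This is a minimal sketch, not a confirmed configuration: `fs.s3a.endpoint` and `fs.s3a.path.style.access` are standard Hadoop S3A property names, the class and method names are hypothetical, and whether Wasabi needs path-style access is an assumption carried over from the Minio example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: connection properties for an S3-compatible (non-AWS) endpoint. */
final class S3CompatibleProperties {

    /** Builds the S3A key/value pairs to enter as connection properties. */
    static Map<String, String> forEndpoint(String endpoint) {
        String host = endpoint.trim();
        if (host.isEmpty()) {
            throw new IllegalArgumentException("endpoint must not be empty");
        }
        Map<String, String> props = new LinkedHashMap<>();
        // Points the S3A client at the provider's host instead of AWS.
        props.put("fs.s3a.endpoint", host);
        // Assumption: path-style access, as in the Minio example.
        props.put("fs.s3a.path.style.access", "true");
        return props;
    }

    public static void main(String[] args) {
        forEndpoint("s3.wasabisys.com")
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Each pair would be added as a connection property on the S3 source, alongside the compatibility mode setting.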

eellsworth · December 19, 2020, 11:30pm

Ben,
Thanks so much for this advice! Unfortunately I can’t seem to get Dremio to connect to the Wasabi host (s3.wasabisys.com).

Here are the parameters I used:

The exceptions sometimes show Dremio attempting to connect to sts.amazonaws.com, and other times don’t list host names. They never show traffic to s3.wasabisys.com:
at com.dremio.plugins.s3.store.STSCredentialProviderV1.getCredentials(STSCredentialProviderV1.java:71)
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:137)
… 47 common frames omitted
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to sts.amazonaws.com:80 [sts.amazonaws.com/54.239.29.25] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)

i.e.:
cat /var/log/dremio/server.log | grep -C3 wasabisys.com
returns nothing

I installed Wasabi’s CA cert and restarted Dremio, but this had no effect.

I can ping s3.wasabisys.com
ping s3.wasabisys.com
PING s3.wasabisys.com (38.27.106.12) 56(84) bytes of data.
64 bytes from 38.27.106.12 (38.27.106.12): icmp_seq=1 ttl=54 time=1.78 ms

and using the IP address makes no difference.

It sure looks to me like Dremio is connecting to AWS instead of Wasabi.

Any suggestions on how to get it to connect to Wasabi instead?

Thanks,

Eric
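
A note on the ping test above: ping uses ICMP, so a successful reply does not show that the TCP port used for HTTPS is reachable from the Dremio host. A minimal sketch of a TCP-level check follows. The class and method names are hypothetical, port 443 is assumed as the standard HTTPS port, and the host names are the two that appear in this thread.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Sketch: TCP reachability check, which ping (ICMP) does not cover. */
final class PortCheck {

    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hosts taken from the thread; 443 is assumed (standard HTTPS port).
        for (String host : new String[] {"s3.wasabisys.com", "sts.amazonaws.com"}) {
            System.out.println(host + ":443 reachable = " + canConnect(host, 443, 3000));
        }
    }
}
```

Running this on the Dremio node would separate a network problem from a configuration problem: if both hosts are reachable, the question becomes why Dremio chooses to talk to sts.amazonaws.com at all.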

eellsworth · December 24, 2020, 6:17am

Hi @Ben and @balaji.ramaswamy,
I just wanted to see if you all had any insights on how I can track down why Dremio does not seem to connect to Wasabi. Is there another parameter I can set that forces the S3 authentication?
I took a look at the GitHub repo, and as near as I can tell the instantiation of the connection happens here:

https://github.com/dremio/dremio-oss/blob/3a843cb175d1b3bc63dd2f3c62038fa632584532/plugins/s3/src/main/java/com/dremio/plugins/s3/store/S3FileSystem.java#L335


}
private static final void invalidateCache(Cache<?, ?> cache) {
  cache.invalidateAll();
  cache.cleanUp();
}
// AwsCredentialsProvider might also implement SdkAutoCloseable
// Make sure to close if using directly (or let client close it for you).
@VisibleForTesting
protected AwsCredentialsProvider getAsync2Provider(Configuration config) {
  switch(config.get(Constants.AWS_CREDENTIALS_PROVIDER)) {
    case ACCESS_KEY_PROVIDER:
      return StaticCredentialsProvider.create(AwsBasicCredentials.create(
        config.get(Constants.ACCESS_KEY), config.get(Constants.SECRET_KEY)));
    case EC2_METADATA_PROVIDER:
      return new SharedInstanceProfileCredentialsProvider();
    case NONE_PROVIDER:
      return AnonymousCredentialsProvider.create();
    case ASSUME_ROLE_PROVIDER:
      return new STSCredentialProviderV2(config);

which seems to be set from an AWS class:

https://github.com/dremio/dremio-oss/blob/3fdcde079f4990d08a90c62a66e05357f5c15099/plugins/s3/src/main/java/com/dremio/plugins/s3/store/S3StoragePlugin.java#L87


/**
 * Controls how many parallel connections HttpClient spawns.
 * Hadoop configuration property {@link org.apache.hadoop.fs.s3a.Constants#MAXIMUM_CONNECTIONS}.
 */
public static final int DEFAULT_MAX_CONNECTIONS = 1000;
public static final String EXTERNAL_BUCKETS = "dremio.s3.external.buckets";
public static final String WHITELISTED_BUCKETS = "dremio.s3.whitelisted.buckets";
// AWS Credential providers
public static final String ACCESS_KEY_PROVIDER = SimpleAWSCredentialsProvider.NAME;
public static final String EC2_METADATA_PROVIDER = "com.amazonaws.auth.InstanceProfileCredentialsProvider";
public static final String NONE_PROVIDER = AnonymousAWSCredentialsProvider.NAME;
public static final String ASSUME_ROLE_PROVIDER = "com.dremio.plugins.s3.store.STSCredentialProviderV1";
public S3StoragePlugin(S3PluginConfig config, SabotContext context, String name, Provider<StoragePluginId> idProvider) {
  super(config, context, name, idProvider);
}
@Override
protected List<Property> getProperties() {
  final S3PluginConfig config = getConfig();

Is there some configuration setting that would not end up using EC2_METADATA_PROVIDER?

Thanks,

Eric

Sarada_Punnana · December 28, 2020, 3:55am

Can you try providing the region name in fs.s3a.endpoint, like s3.us-west-1.wasabisys.com?

eellsworth · December 29, 2020, 4:01am

Thanks!
I tried that and unfortunately that didn’t work either.

Any ideas on how I can proceed?

Thanks,

Eric

eellsworth · December 29, 2020, 9:38pm

Hi all,
I got it working! It turns out I needed to remove the IAM role and things worked fine. I appreciate everyone’s help!

Eric
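
The fix is consistent with the stack trace earlier in the thread: the failing call is in STSCredentialProviderV1, which is the value of ASSUME_ROLE_PROVIDER in the quoted constants, and it tries to reach sts.amazonaws.com. The sketch below is a simplified, hypothetical model of that behaviour and not Dremio's actual selection code. It assumes that a non-empty IAM role selects the assume-role provider and that access key and secret are used otherwise.

```java
/** Simplified, hypothetical model of credential-provider selection; not Dremio's code. */
final class CredentialChoice {

    enum Provider { ACCESS_KEY, ASSUME_ROLE }

    /** Assumption: any non-blank IAM role switches the source to the assume-role provider. */
    static Provider select(String iamRoleToAssume) {
        boolean hasRole = iamRoleToAssume != null && !iamRoleToAssume.trim().isEmpty();
        return hasRole ? Provider.ASSUME_ROLE : Provider.ACCESS_KEY;
    }

    /** Per the stack trace in this thread, the assume-role path calls sts.amazonaws.com. */
    static boolean contactsAwsSts(Provider provider) {
        return provider == Provider.ASSUME_ROLE;
    }

    public static void main(String[] args) {
        // Placeholder role ARN for illustration only.
        Provider withRole = select("arn:aws:iam::123456789012:role/example");
        Provider withoutRole = select("");
        System.out.println(withRole + " contacts AWS STS: " + contactsAwsSts(withRole));
        System.out.println(withoutRole + " contacts AWS STS: " + contactsAwsSts(withoutRole));
    }
}
```

Under that model, leaving the IAM role field empty keeps authentication on the static access key and secret, so no request to AWS STS is made and traffic goes only to the configured endpoint.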

tid · December 30, 2020, 9:44pm

Hi, @eellsworth!
Good to hear that you got Dremio and Wasabi running together. We (Software AG) have run quite extensive tests against Wasabi (in coordination with Dremio’s engineering team). It is fully compatible, BUT: Wasabi applies throttling if the ratio between data you store and your monthly downloads exceeds 1:8 (if I recall that correctly, see “Fair Use Policy”, https://wasabi-support.zendesk.com/hc/en-us/articles/360027447392-How-do-Wasabi-s-fair-use-policies-work-). Depending on your use-case that might not be a problem – for typical analytical use-cases where data is rather static and queried frequently, it might be an issue.
The problem manifests in Dremio errors like ‘Failed to decode column XYZ’. If you then drill down into the detail of the profile, you’ll see that Wasabi responds with HTTP 403 resulting in the decoding problem.

Best, Tim

balaji.ramaswamy · December 30, 2020, 9:59pm

Thanks for sharing @tid

eellsworth · January 3, 2021, 9:08pm

Tim,

Thank you – this is incredibly helpful! Our team is pretty small (<10 people), and our main use case is loading flat files (primarily CSV) into Wasabi then querying them through Dremio. Many of the files represent versions of data that we retain so that we can reconstruct the state of our data pipelines and published results as needed; others are used for limited time-series type analysis. There are likely to be several hundred such file-based data sources.

Since the files almost never change once they are added (i.e. very close to write-once), I was thinking that once Reflections are created for the dataset corresponding to each file there should be relatively little traffic to the files themselves in the Wasabi S3 bucket. Since our Dremio instance runs in Azure, I’m thinking that keeping the Reflections on local disk of the Azure Dremio nodes would result in the best performance, or if the files get too large I might switch to ADLS for storage of Reflections.

Does any of that sound to you like a scenario that would encounter problems?

Thanks again and happy New Year.

Best regards,

Eric

tid · January 14, 2021, 3:42pm

Hi @eellsworth,
sorry – missed the notification for your reply.
I don’t think you’ll run into any problems given that your reflections will be stored on local disks. Even if Dremio sometimes won’t use the reflections for whatever reason, you probably won’t exceed Wasabi’s download quota.

You’ll probably also enable dataset caching (on local SSDs again) further reducing the likelihood of a quota problem. If you need to put your reflections on Object Storage, I’d definitely put them on ADLS2 instead of Wasabi.
We’ve also verified that Reflection Storage on Wasabi works well with Dremio, but there is a non-negligible latency if the data lake isn’t really close to the Dremio executors. (Just to give you an idea of a worst-case scenario: Dremio compute in Europe with S3 buckets in US-East showed a 4-6x increase in query runtime in our experiments. It won’t be as bad if your Dremio executors (Azure) are geographically close to Wasabi, of course.)

Best, Tim

balaji.ramaswamy · January 18, 2021, 3:25am

Thanks once again @tid

eellsworth · January 21, 2021, 5:23pm

Tim,
Thanks - this is excellent feedback. The reduced performance for Reflections on Wasabi is in line with what I expected. I will definitely put that and the dataset cache in ADLS.

Really appreciate the help.

Eric
