S3 bucket from non-Amazon provider

Hi,
I am trying to set up Dremio to use data in an S3 bucket that is offered by a non-Amazon provider (i.e., Wasabi), as the costs are much, much lower. However, when I go to connect to an S3 bucket, I only see options to connect to AWS.

Does anyone have advice on configuring a connection to Wasabi?

Thanks,

Eric

Hi @eellsworth,

So long as the storage implements the S3 APIs, you should be able to use the S3 source plugin to configure your connection.

In our docs, we have an example for Minio that may be useful as an analogue to Wasabi: https://docs.dremio.com/data-sources/s3.html#configuring-s3-for-minio

Be sure to enable “compatibility mode”.
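
For readers following along, the connection properties for an S3-compatible endpoint might be collected as in the sketch below. `fs.s3a.endpoint` and `fs.s3a.path.style.access` are standard Hadoop S3A settings; whether Wasabi needs path-style access is an assumption carried over from the Minio example, and the helper name `s3_compatible_properties` is hypothetical. "Compatibility mode" itself is enabled in the source's settings and is not covered here.

```python
def s3_compatible_properties(endpoint, path_style=True):
    """Assemble S3A connection properties for a non-AWS, S3-compatible endpoint.

    Hypothetical helper: it only builds the key/value pairs that would be
    entered as connection properties on the S3 source; it does not talk
    to Dremio or to the storage provider.
    """
    return {
        # Host name of the S3-compatible service instead of the AWS default.
        "fs.s3a.endpoint": endpoint,
        # Assumption: path-style addressing, as in the Minio example.
        "fs.s3a.path.style.access": str(path_style).lower(),
    }


props = s3_compatible_properties("s3.wasabisys.com")
for key, value in props.items():
    print(f"{key} = {value}")
```

The printed pairs are what one would copy into the source's connection properties by hand.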

Ben,
Thanks so much for this advice! Unfortunately I can’t seem to get Dremio to connect to the Wasabi host (s3.wasabisys.com).

Here are the parameters I used:


The exceptions sometimes show Dremio attempting to connect to sts.amazonaws.com, and other times don’t list host names - they never show traffic to s3.wasabisys.com
at com.dremio.plugins.s3.store.STSCredentialProviderV1.getCredentials(STSCredentialProviderV1.java:71)
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:137)
… 47 common frames omitted
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to sts.amazonaws.com:80 [sts.amazonaws.com/54.239.29.25] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)

i.e.:
cat /var/log/dremio/server.log | grep -C3 wasabisys.com
returns nothing
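
As a complement to the grep above, a short script can list every host that a failed connection was logged against, which makes it easier to see where the traffic is actually going. This is a minimal sketch: it assumes the "Connect to host:port" wording seen in the `HttpHostConnectException` message above, and the function name `connection_targets` is made up for illustration.

```python
import re
from collections import Counter

# Matches the "Connect to host:port" phrase from the exception message above.
CONNECT_RE = re.compile(r"Connect to ([A-Za-z0-9.-]+):(\d+)")


def connection_targets(log_text):
    """Count each host:port that a connection attempt was logged against."""
    return Counter(
        f"{host}:{port}" for host, port in CONNECT_RE.findall(log_text)
    )


# Sample line copied from the stack trace in this post.
sample = (
    "Caused by: org.apache.http.conn.HttpHostConnectException: "
    "Connect to sts.amazonaws.com:80 [sts.amazonaws.com/54.239.29.25] "
    "failed: Connection refused (Connection refused)\n"
)
targets = connection_targets(sample)
print(targets)
```

To run it against the real log, read /var/log/dremio/server.log into a string and pass that to `connection_targets` instead of `sample`.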

I installed Wasabi’s CA cert and restarted Dremio, but this had no effect.

I can ping s3.wasabisys.com
ping s3.wasabisys.com
PING s3.wasabisys.com (38.27.106.12) 56(84) bytes of data.
64 bytes from 38.27.106.12 (38.27.106.12): icmp_seq=1 ttl=54 time=1.78 ms

and using the IP address makes no difference.

It sure looks to me like Dremio is connecting to AWS instead of Wasabi.

Any suggestions on how to get it to connect to Wasabi instead?

Thanks,

Eric

Hi @Ben and @balaji.ramaswamy,
I just wanted to see if you all had any insights on how I can track down why Dremio does not seem to connect to Wasabi. Is there another parameter I can set that forces the S3 authentication?
I took a look at the GitHub repo, and as near as I can tell the instantiation of the connection happens here:

which seems to be a set of AWS classes.

Is there some configuration setting that would not end up using EC2_METADATA_PROVIDER?

Thanks,

Eric

Can you try providing the region name in fs.s3a.endpoint, like s3.us-west-1.wasabisys.com?

Thanks!
I tried that and unfortunately that didn’t work either.

Any ideas on how I can proceed?

Thanks,

Eric

Hi all,
I got it working! It turns out I needed to remove the IAM role and things worked fine. I appreciate everyone’s help!

Eric

Hi, @eellsworth!
Good to hear that you got Dremio and Wasabi running together. We (Software AG) have run quite extensive tests against Wasabi (in coordination with Dremio’s engineering team). It is fully compatible, BUT: Wasabi applies throttling if the ratio between data you store and your monthly downloads exceeds 1:8 (if I recall that correctly, see “Fair Use Policy”, https://wasabi-support.zendesk.com/hc/en-us/articles/360027447392-How-do-Wasabi-s-fair-use-policies-work-). Depending on your use-case that might not be a problem – for typical analytical use-cases where data is rather static and queried frequently, it might be an issue.
The problem manifests in Dremio errors like ‘Failed to decode column XYZ’. If you then drill down into the detail of the profile, you’ll see that Wasabi responds with HTTP 403 resulting in the decoding problem.

Best, Tim
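
To get a rough feel for whether a workload is near the fair-use limit described above, the download-to-storage ratio can be computed directly. This is only a sketch: the function names are hypothetical, the threshold is passed in rather than hard-coded, and the 8.0 used below simply mirrors the 1:8 figure recalled from memory in this post, so check Wasabi's current policy for the real value.

```python
def egress_ratio(stored_tb, monthly_download_tb):
    """Return monthly downloads as a multiple of the stored volume."""
    if stored_tb <= 0:
        raise ValueError("stored_tb must be positive")
    return monthly_download_tb / stored_tb


def may_be_throttled(stored_tb, monthly_download_tb, max_ratio):
    """True if downloads exceed max_ratio times the stored volume."""
    return egress_ratio(stored_tb, monthly_download_tb) > max_ratio


# Illustrative numbers only: 2 TB stored, 20 TB downloaded in a month.
ratio = egress_ratio(2.0, 20.0)
flag = may_be_throttled(2.0, 20.0, max_ratio=8.0)
print(ratio, flag)
```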

Thanks for sharing @tid

Tim,

Thank you – this is incredibly helpful! Our team is pretty small (<10 people), and our main use case is loading flat files (primarily CSV) into Wasabi, then querying them through Dremio. Many of the files represent versions of data that we retain so that we can reconstruct the state of our data pipelines and published results as needed; others are used for limited time-series type analysis. There are likely to be several hundred such file-based data sources.

Since the files almost never change once they are added (i.e. very close to write-once), I was thinking that once Reflections are created for the dataset corresponding to each file, there should be relatively little traffic to the files themselves in the Wasabi S3 bucket. Since our Dremio instance runs in Azure, I’m thinking that keeping the Reflections on the local disks of the Azure Dremio nodes would result in the best performance, or if the files get too large I might switch to ADLS for storage of Reflections.

Does any of that sound to you like a scenario that would encounter problems?

Thanks again and happy New Year.

Best regards,

Eric

Hi @eellsworth,
sorry – missed the notification for your reply.
I don’t think you’ll run into any problems given that your reflections will be stored on local disks. Even if Dremio sometimes won’t use the reflections for whatever reason, you probably won’t exceed Wasabi’s download quota.

You’ll probably also enable dataset caching (on local SSDs again), further reducing the likelihood of a quota problem. If you need to put your reflections on Object Storage, I’d definitely put them on ADLS2 instead of Wasabi.
We’ve also verified that Reflection Storage on Wasabi works well with Dremio, but there is a non-negligible latency if the data lake isn’t really close to the Dremio executors. (Just to give you an idea of a worst-case scenario: Dremio compute in Europe with S3 buckets in US-East showed a 4-6x increase in query runtime in our experiments. It won’t be as bad if your Dremio executors (Azure) are geographically close to Wasabi, of course.)

Best, Tim

Thanks once again @tid

Tim,
Thanks - this is excellent feedback. The reduced performance for Reflections on Wasabi is in line with what I expected. I will definitely put that and the dataset cache in ADLS.

Really appreciate the help.

Eric