Hi,
I am trying to set up Dremio to use data in an S3 bucket offered by a non-Amazon provider (namely Wasabi), as the costs are much, much lower. However, when I go to connect to an S3 bucket, I only see options to connect to AWS.
Does anyone have advice on configuring a connection to Wasabi?
The exceptions sometimes show Dremio attempting to connect to sts.amazonaws.com, and other times don't list host names at all; they never show traffic to s3.wasabisys.com.
at com.dremio.plugins.s3.store.STSCredentialProviderV1.getCredentials(STSCredentialProviderV1.java:71)
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:137)
… 47 common frames omitted
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to sts.amazonaws.com:80 [sts.amazonaws.com/54.239.29.25] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
Hi @Ben and @balaji.ramaswamy,
I just wanted to see if you all had any insights on how I can track down why Dremio does not seem to connect to Wasabi. Is there another parameter I can set that forces authentication against the S3 endpoint directly?
I took a look at the GitHub repo, and as near as I can tell the instantiation of the connection happens here:
which seems to be set to an AWS class.
Is there some configuration setting that would not end up using EC2_METADATA_PROVIDER?
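For comparison, my understanding is that the plain AWS SDK v1 client only skips the instance-metadata/STS lookup when you hand it static credentials and an explicit endpoint. Here's a rough standalone sketch of what I'd expect to happen under the hood (the keys are placeholders and the region value is just an example; this isn't code from Dremio itself):

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class WasabiSmokeTest {
  public static void main(String[] args) {
    // Static credentials, so the provider chain never falls through to EC2 metadata or STS.
    BasicAWSCredentials creds =
        new BasicAWSCredentials("PLACEHOLDER_ACCESS_KEY", "PLACEHOLDER_SECRET_KEY");

    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        // Explicit endpoint override; without it the SDK defaults to *.amazonaws.com hosts.
        .withEndpointConfiguration(
            new AwsClientBuilder.EndpointConfiguration("https://s3.wasabisys.com", "us-east-1"))
        .withCredentials(new AWSStaticCredentialsProvider(creds))
        // Path-style addressing tends to be the safer choice for S3-compatible stores.
        .withPathStyleAccessEnabled(true)
        .build();

    // List buckets as a minimal connectivity check.
    s3.listBuckets().forEach(b -> System.out.println(b.getName()));
  }
}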
Hi, @eellsworth!
Good to hear that you got Dremio and Wasabi running together. We (Software AG) have run quite extensive tests against Wasabi (in coordination with Dremio's engineering team). It is fully compatible, BUT: Wasabi applies throttling if the ratio between the data you store and your monthly downloads exceeds 1:8 (if I recall that correctly; see the "Fair Use Policy", https://wasabi-support.zendesk.com/hc/en-us/articles/360027447392-How-do-Wasabi-s-fair-use-policies-work-). Depending on your use case that might not be a problem, but for typical analytical use cases, where data is rather static yet queried frequently (so monthly downloads can far exceed the stored volume), it can become an issue.
When you hit the quota, the problem manifests in Dremio errors like 'Failed to decode column XYZ'. If you then drill down into the details of the query profile, you'll see that Wasabi responds with HTTP 403, which is what causes the decoding failure.
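For anyone else landing on this thread with the sts.amazonaws.com error above: as far as I remember, the relevant pieces of the S3 source configuration were to use access-key authentication (so Dremio never consults the EC2 metadata/STS chain), tick "Enable compatibility mode" under Advanced Options, and add connection properties along these lines (the endpoint value is an example; adjust it to your Wasabi region):

fs.s3a.endpoint = s3.wasabisys.com
fs.s3a.path.style.access = true

With those set, the S3A client talks to Wasabi directly instead of falling back to the AWS default hosts.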
Thank you, this is incredibly helpful! Our team is pretty small (<10 people), and our main use case is loading flat files (primarily CSV) into Wasabi and then querying them through Dremio. Many of the files represent versions of data that we retain so that we can reconstruct the state of our data pipelines and published results as needed; others are used for limited time-series-style analysis. There are likely to be several hundred such file-based data sources.
Since the files almost never change once they are added (i.e., they are very close to write-once), I was thinking that once Reflections are created for the dataset corresponding to each file, there should be relatively little traffic to the files themselves in the Wasabi S3 bucket. Since our Dremio instance runs in Azure, I'm thinking that keeping the Reflections on the local disks of the Azure Dremio nodes would give the best performance; if the files get too large, I might switch to ADLS for Reflection storage.
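Concretely, by "Reflections on local disk" I mean something like the stock dremio.conf layout, where distributed storage (and therefore Reflections) stays on the executors' disks. A sketch based on my reading of the default config, with an illustrative path:

paths: {
  # local metadata/spill location on each node
  local: "/var/lib/dremio"
  # distributed storage for reflections, uploads, job results, etc.;
  # pdfs:// keeps it on the nodes' local disks, and this could later point at ADLS instead
  dist: "pdfs://"${paths.local}"/pdfs"
}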
Does any of that sound to you like a scenario that would encounter problems?
Hi @eellsworth,
sorry, I missed the notification for your reply.
I don't think you'll run into any problems, given that your reflections will be stored on local disks. Even if Dremio sometimes doesn't use the reflections, for whatever reason, you probably won't exceed Wasabi's download quota.
You'll probably also enable dataset caching (again on local SSDs), further reducing the likelihood of a quota problem. If you need to put your reflections on object storage, I'd definitely put them on ADLS Gen2 instead of Wasabi.
We've also verified that Reflection storage on Wasabi works well with Dremio, but there is non-negligible latency if the data lake isn't really close to the Dremio executors. (Just to give you an idea of a worst-case scenario: Dremio compute in Europe with S3 buckets in US-East showed a 4-6x increase in query runtime in our experiments. It won't be as bad if your Dremio executors in Azure are geographically close to Wasabi, of course.)
Tim,
Thanks - this is excellent feedback. The reduced performance for Reflections on Wasabi is in line with what I expected. I will definitely put that and the dataset cache in ADLS.