I keep running into this error when querying parquet from AWS S3:
Unable to execute HTTP request: Timeout waiting for connection from pool
It’s not against all data sources, but enough to cause pain.
From the research I’ve done, it looks like I need to make sure the
fs.s3a.connection.maximum connection property is set to a high number. We currently have it set to
100000. I could easily raise the value to a million, but before I reset all our S3 data sources, I want to rule out something else going on.
Is anyone else seeing this kind of behavior? Do I really need to raise the connection property to a million? And if I do, is there a better way to apply the change than through the GUI? Rebuilding all the data sources is a pain.
Any help is appreciated. Thank you!
fs.s3a.connection.maximum = 100000 should be sufficient. Can you try increasing
fs.s3a.threads.max too? Just make sure that
fs.s3a.connection.maximum stays at least as large as fs.s3a.threads.max.
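For reference, both properties can be expressed in core-site.xml form like this (the threads value below is only an illustrative placeholder, not a recommendation; keep it below the connection maximum):

```xml
<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100000</value>
  </property>
  <property>
    <name>fs.s3a.threads.max</name>
    <!-- Illustrative value only; keep this below fs.s3a.connection.maximum -->
    <value>512</value>
  </property>
</configuration>
```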
Also, I’m curious why you say “rebuilding all the data sources is a pain”? It should be a quick and easy update.
Are the Dremio executors running on AWS as EC2 instances? If so, you may need to add this parameter to core-site.xml, drop the file under $DREMIO_HOME/conf on all executors, and restart them:
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>100000</value>
</property>
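Copying that file out by hand gets tedious with many executors. Here is a small shell sketch of one way to script it; the host names are placeholders, the restart command depends on how Dremio was installed on your nodes, and DRY_RUN=1 only prints the commands it would run:

```shell
# Sketch: push the updated core-site.xml to every executor and restart Dremio.
# executor1/executor2 are placeholder host names; adjust for your cluster.
DRY_RUN=1
DREMIO_HOME="${DREMIO_HOME:-/opt/dremio}"   # assumed install location

run() {
  # With DRY_RUN set, just echo the command instead of executing it.
  if [ -n "$DRY_RUN" ]; then echo "+ $*"; else "$@"; fi
}

for host in executor1 executor2; do
  run scp core-site.xml "dremio@$host:$DREMIO_HOME/conf/core-site.xml"
  # Restart command is an assumption; use whatever manages Dremio on your hosts.
  run ssh "dremio@$host" "sudo systemctl restart dremio"
done
```

Unset DRY_RUN once the printed commands look right for your environment.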
@anthony, I meant to say “Data Sets”. We have a ton of data locations in S3 Buckets and when I change any setting on the S3 “Data Source” it drops all the defined “Data Sets”. The only way I know how to create “Data Sets” is to navigate to each folder in the interface. So, unless there is a different way to define “Data Sets” I’ll have to do that through the UI when I update the S3 “Data Source”.
@balaji.ramaswamy, does that mean that the S3 Data Source setup on the Dremio GUI doesn’t propagate those settings to all the executors in the cluster?
@balaji.ramaswamy and @anthony I wanted to give you an update on the changes I have made so far while trying to fix the problem.
- I have NOT updated the S3 Source through the UI yet, because I was hoping @anthony could fill me in on a better way to create DataSets so I don’t have to go to each one and recreate them when I update the properties of the S3 Source.
- I updated the core-site.xml files on the master and all the executors with the change @balaji.ramaswamy suggested, and I was able to set up the dataset on the problem parquet location. I had assumed that the properties entered through the UI for the S3 Source propagated to all the executors, but after making the change via core-site.xml, I now assume that is not true. Are the GUI S3 Source properties only used for UI operations? Could you guys shed some light on that?
Thanks for the response @balaji.ramaswamy. I think I’m confused by the terminology.
1.) When we create a Data Set in S3, we use the UI to navigate to the location in an S3 bucket and click the Action button to convert the folder to a Data Set. I’m not sure if that’s a Physical Data Set or a Virtual Data Set. I guess I see them as Purple or Green? In this case, to create Purple Data sets, we have to do the navigation thing in the UI.
When we change metadata in the S3 Source configuration, it clears all the purple Data Sets defined in that source. So, setting them back up is a pain. (I’m sorry guys, that’s the best way I know how to explain it.)
2.) Yes, that makes sense @balaji.ramaswamy!
Thanks for confirming on #2
On #1, I am now on the same page.
Initially, when you promote an S3 folder, it is a PDS (Purple).
You can do transformations on a PDS and save the result as a VDS (Green).
You can promote via the UI, or through the REST API and script it all.
Promote to PDS Via REST API
REST API Reference
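To sketch what scripting the promotion might look like, here is a minimal Python example against the Dremio v3 catalog API. The coordinator URL, token, and folder path are placeholders, and the exact payload shape and auth header should be checked against the REST API reference linked above:

```python
import json
import urllib.parse
import urllib.request

DREMIO_URL = "http://localhost:9047"    # placeholder coordinator URL
TOKEN = "<personal-access-token>"       # placeholder auth token

def promotion_payload(folder_id, path):
    # Body for promoting a folder of Parquet files to a physical dataset (PDS).
    return {
        "entityType": "dataset",
        "id": folder_id,
        "path": path,                   # e.g. ["s3source", "bucket", "folder"]
        "type": "PHYSICAL_DATASET",
        "format": {"type": "Parquet"},
    }

def catalog_url(path):
    # POST target is /api/v3/catalog/{id}, with the folder id URL-encoded.
    folder_id = "dremio:/" + "/".join(path)
    return DREMIO_URL + "/api/v3/catalog/" + urllib.parse.quote(folder_id, safe="")

def promote(path):
    folder_id = "dremio:/" + "/".join(path)
    body = json.dumps(promotion_payload(folder_id, path)).encode()
    req = urllib.request.Request(
        catalog_url(path),
        data=body,
        headers={
            # Dremio's token header scheme; verify against your version's docs.
            "Authorization": "_dremio" + TOKEN,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Looping `promote()` over a list of folder paths would let you re-promote all the PDSs after a source update instead of clicking through each folder in the UI.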
Kindly let me know if you have any other questions
That does make sense @balaji.ramaswamy, thank you so much for all your help!