I keep running into this error when querying parquet from AWS S3:
Unable to execute HTTP request: Timeout waiting for connection from pool
It doesn’t happen against all data sources, but it happens often enough to cause pain.
From the research I’ve done, it looks like I need to make sure the fs.s3a.connection.maximum connection property is set to a high number. We currently have it set to 100000. I could easily bump the value to a million, but before I go and reset all our S3 data sources, I suspect there may be something else going on.
Is anyone else seeing this kind of behavior? Do I need to raise the connection property to a million? And if I do change the value, is there a better way to do it than through the GUI? Rebuilding all the data sources is a pain.
fs.s3a.connection.maximum = 100000 should be sufficient. Can you try increasing fs.s3a.threads.max too? Just make sure that fs.s3a.connection.maximum stays larger than fs.s3a.threads.max.
Also, I’m curious why you say “rebuilding all the data sources is a pain”? It should be an easy, quick update.
Are the Dremio executors running on AWS as EC2 instances? If so, you may have to add this parameter to core-site.xml, drop the file under $DREMIO_HOME/conf on all executors, and restart them.
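For reference, the two S3A properties mentioned above would look something like this in core-site.xml (the values here are just illustrative; pick what fits your cluster, keeping fs.s3a.connection.maximum larger than fs.s3a.threads.max):

```xml
<configuration>
  <!-- Upper bound on the S3A HTTP connection pool -->
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100000</value>
  </property>
  <!-- Max threads for S3A uploads/copies; keep this below connection.maximum -->
  <property>
    <name>fs.s3a.threads.max</name>
    <value>5000</value>
  </property>
</configuration>
```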
@anthony, I meant to say “Data Sets”. We have a ton of data locations in S3 buckets, and when I change any setting on the S3 “Data Source” it drops all the defined “Data Sets”. The only way I know to create “Data Sets” is to navigate to each folder in the interface. So, unless there is a different way to define “Data Sets”, I’ll have to do that through the UI whenever I update the S3 “Data Source”.
@balaji.ramaswamy, does that mean that the S3 Data Source setup on the Dremio GUI doesn’t propagate those settings to all the executors in the cluster?
@balaji.ramaswamy and @anthony, I wanted to give you an update on the changes that have been made and where we stand on fixing the problem.
I have NOT updated the S3 Source through the UI yet, because I was hoping @anthony could fill me in on a better way to create Data Sets so I don’t have to visit each one and recreate it when I update the properties of the S3 Source.
I updated the core-site.xml files on the master and all the executors with the change that @balaji.ramaswamy suggested, and I was able to set up the dataset on the problem Parquet location. I had assumed that the properties entered through the UI for the S3 Source propagated to all the executors, but after making the change in core-site.xml, I now assume that is not true. Are the S3 Source properties set in the GUI only used for UI operations? Could you shed some light on that?
Creating datasets: when you change something on the source and it is a metadata-impacting change, you do not have to recreate the datasets; you might just have to re-promote them as a PDS (physical dataset). I’m not exactly sure what you mean by recreating datasets, though. Are you referring to Virtual Datasets?
The timeout parameter has to go on the source if the error occurs because the Parquet file is in an S3 bucket. It also has to go into your executors’ core-site.xml if your executors are EC2 machines on AWS. Does that make sense?
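One more thought: if you end up with many folders to re-promote, that step can be scripted against Dremio’s REST catalog API instead of clicking through the UI. Here is a rough Python sketch; the exact payload shape, catalog id encoding, and auth header vary by Dremio version, so treat all of them as assumptions to verify against the API docs for your release:

```python
import urllib.parse

def promote_body(path, fmt="Parquet"):
    """Build a catalog-API body to promote a folder/file to a physical dataset.

    `path` is the dataset path as a list, e.g. ["mys3source", "bucket", "folder"].
    The "dremio:/..." id convention and the field names below follow my reading
    of the v3 catalog API; double-check them for your Dremio version.
    """
    return {
        "entityType": "dataset",
        # Unpromoted items are addressed by a URL-encoded "dremio:/<path>" id
        "id": urllib.parse.quote("dremio:/" + "/".join(path), safe=""),
        "path": path,
        "type": "PHYSICAL_DATASET",
        "format": {"type": fmt},
    }

# Sending it (needs the `requests` package and a login token from your server):
# body = promote_body(["mys3source", "bucket", "folder"])
# requests.post(f"{server}/api/v3/catalog/{body['id']}",
#               headers={"Authorization": f"_dremio{token}"},
#               json=body)
```

Looping that over a list of paths would let you re-promote everything after a source change without touching the UI.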
Thanks for the response @balaji.ramaswamy. I think I’m confused by the terminology.
1.) When we create a Data Set in S3, we use the UI to navigate to the location in an S3 bucket and click the Action button to convert the folder to a Data Set. I’m not sure if that’s a Physical Data Set or a Virtual Data Set; I guess I see them as purple or green. In this case, to create the purple Data Sets, we have to do the navigation thing in the UI.
When we change metadata in the S3 Source configuration, it clears all the purple Data Sets defined on that source, so setting them back up is a pain. (I’m sorry guys, that’s the best way I know how to explain it.)