Hive Connection to EMR

Hello,

We have successfully connected Dremio on EC2 to EMR, and the Hive tables are visible in the datasets. When we run a query against one of these datasets, the query runs for about 10 minutes and then comes back with an error stating “cannot connect to dremio server”.

Any help is appreciated.

Thanks
Shashi

Could you please take a look at and/or share your query profile? See: How To Share A Query Profile
Also, do you have any other sources configured, and can you run queries against them successfully?

Yes, we have S3 set up and are able to access it fine.
Here is the query profile:
0cdfd631-1296-480f-86e8-917166c5cabb.zip (2.8 KB)

We were also able to run queries successfully against Postgres.

If your Hive is on EMR, we may not support it out of the box at the moment. Sorry for the inconvenience.
Try adding the following jar(s) to the Dremio classpath: emrfs-hadoop-assembly-2.18.0.jar, jets3t-0.9.0.jar
Not sure whether that will be enough, but you can take a look at the query profile/error to see if anything else is missing.
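A minimal sketch of what that could look like, assuming Dremio is installed under /opt/dremio and picks up extra jars from its jars/3rdparty directory (paths and jar locations are assumptions; adjust for your install):

# Copy the jars from the EMR master node (their location varies by EMR release)
# onto the Dremio node, then drop them into Dremio's third-party jar directory.
cp emrfs-hadoop-assembly-2.18.0.jar jets3t-0.9.0.jar /opt/dremio/jars/3rdparty/
# Restart Dremio so the new jars are picked up.
sudo service dremio restart
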
Your current error is:

"error":"org/jets3t/service/S3ServiceException","verboseError":"VALIDATION ERROR: org/jets3t/service/S3ServiceException
Sql Query SELECT *
FROM \"emr-hive\".breeding.breeding_digitalpipeline
(java.lang.NoClassDefFoundError) org/jets3t/service/S3ServiceException
org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore():119
org.apache.hadoop.fs.s3.S3FileSystem.initialize():109
org.apache.hadoop.fs.FileSystem.createFileSystem():2811
org.apache.hadoop.fs.FileSystem.access$200():100
org.apache.hadoop.fs.FileSystem$Cache.getInternal():2848
org.apache.hadoop.fs.FileSystem$Cache.get():2830
org.apache.hadoop.fs.FileSystem.get():389
org.apache.hadoop.fs.Path.getFileSystem():356
com.dremio.exec.store.dfs.FileSystemWrapper.get():129

To be specific, we are seeing the errors below with external tables whose storage is on S3.

Failure while attempting to read metadata for hive.

Any suggestions are appreciated

Thanks
Shashi

Add your modified core-site.xml to the Dremio classpath.
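A minimal sketch, assuming Dremio is installed under /opt/dremio and its conf directory is on the classpath (adjust paths for your install):

# Copy the core-site.xml that carries your S3/EMR settings into Dremio's conf
# directory, then restart so it is read.
cp /path/to/modified/core-site.xml /opt/dremio/conf/
sudo service dremio restart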

Hi,

We are finally able to query the Hive external table on EMR. The protocol used for the table location should be s3a instead of s3. This made the difference. Thanks again for your help.
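
For example, repointing the table location to the s3a:// scheme could look like this (the bucket path is a hypothetical placeholder):

# Point the external table at the same data via the s3a:// scheme instead of s3://.
# On a partitioned table, existing partitions keep their old locations, so they need
# the same change (or recreate the table and run MSCK REPAIR TABLE).
hive -e "ALTER TABLE breeding.breeding_digitalpipeline SET LOCATION 's3a://your-bucket/path/to/table/'"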

Shashi

Glad it worked. Did you have to add additional jars, or were emrfs-hadoop-assembly-2.18.0.jar and jets3t-0.9.0.jar enough?

We did not use any additional jars. We did have to adjust our bucket policy, though.

Thanks
Shashi

We are still encountering some issues. For a small dataset (“7 Objects - 23.1 MB”) we are able to query an external Hive table that is stored in S3 in Parquet format. This table is partitioned. For a dataset that is “196 Objects - 27.7 MB”, the query fails. This is the profile:

5c80d02a-2cc2-4482-b603-72701a70f231.zip (9.3 KB)

Any ideas?

Here is the cause of your error:
Caused By (com.amazonaws.SdkClientException) Unable to execute HTTP request: Timeout waiting for connection from pool

Looks like a timeout on the connection to the S3 bucket. Do you get it consistently or intermittently? Do you see any pattern, like getting it when you run a large query but not when you query data that resides in a single partition?

We are able to read the data fine with Hive, and also when we use the same external Hive table with Presto. I am guessing Dremio can handle very large data sizes. Our Hive tables are partitioned as well. Does Dremio have any settings for Hive on S3 like Presto's, such as hive.s3.connect-timeout and hive.s3.max-retry-time?

We don’t have the ability to set those properties at the moment, but it would not be difficult to add. I will put it on the list of enhancements.

OK, do you think that is the root cause? Do you know how much data Dremio can handle in an S3 external Hive table?

Thanks!

I take back what I said about the inability to set those properties. I confused the Hive source with the pure S3 source. You should be able to set whatever config properties you need (including timeouts) while adding/modifying the Hive source.

Any idea which settings we should increase?

See here: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
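
For example, the connection-pool and timeout related S3A properties defined there include the ones below. In Dremio they would be entered as name/value pairs on the Hive source; the values shown are only illustrative, not recommendations:

fs.s3a.connection.maximum = 100              # maximum simultaneous connections to S3
fs.s3a.connection.timeout = 200000           # socket timeout, in milliseconds
fs.s3a.connection.establish.timeout = 5000   # connection setup timeout, in milliseconds
fs.s3a.attempts.maximum = 20                 # retries on transient errors
fs.s3a.threads.max = 64                      # maximum worker threads for the S3A filesystem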

We updated both fs.s3a.connection.maximum and fs.s3a.threads.max to 60 with no luck. Any additional insight would be much appreciated.

Quick update: it is working now. I have not pinpointed the actual fix, but I will update once I do.