Hive Connection to EMR

Hello,

We have successfully connected Dremio on EC2 to EMR, and the Hive tables are visible in the datasets. When we run a query against one of these datasets, the query runs for about 10 minutes and then comes back with an error stating “cannot connect to dremio server”.

Any help is appreciated.

Thanks
Shashi

Could you please take a look at and/or share your query profile? See: How To Share A Query Profile
Also, do you have any other sources configured, and can you run queries against them successfully?

Yes, we have S3 set up and are able to access it fine.
Here is the query profile:
0cdfd631-1296-480f-86e8-917166c5cabb.zip (2.8 KB)

We were also able to run queries successfully against Postgres.

If your Hive is on EMR, we may not support it out of the box at the moment. Sorry for the inconvenience.
Try adding the following jar(s) to the Dremio classpath: emrfs-hadoop-assembly-2.18.0.jar, jets3t-0.9.0.jar
Not sure whether that will be enough, but you can take a look at the query profile/error to see if anything else is missing.
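A minimal sketch of what that could look like, assuming Dremio is installed under /opt/dremio and picks up extra jars from its jars/3rdparty directory (paths and jar locations are assumptions; adjust for your install):

# Copy the jars from the EMR master node (their location varies by EMR release)
# onto the Dremio node, then drop them into Dremio's third-party jar directory.
cp emrfs-hadoop-assembly-2.18.0.jar jets3t-0.9.0.jar /opt/dremio/jars/3rdparty/
# Restart Dremio so the new jars are picked up.
sudo service dremio restart
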
Your current error is:

"error":"org/jets3t/service/S3ServiceException","verboseError":"VALIDATION ERROR: org/jets3t/service/S3ServiceException
Sql Query SELECT *
FROM \"emr-hive\".breeding.breeding_digitalpipeline
(java.lang.NoClassDefFoundError) org/jets3t/service/S3ServiceException
org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore():119
org.apache.hadoop.fs.s3.S3FileSystem.initialize():109
org.apache.hadoop.fs.FileSystem.createFileSystem():2811
org.apache.hadoop.fs.FileSystem.access$200():100
org.apache.hadoop.fs.FileSystem$Cache.getInternal():2848
org.apache.hadoop.fs.FileSystem$Cache.get():2830
org.apache.hadoop.fs.FileSystem.get():389
org.apache.hadoop.fs.Path.getFileSystem():356
com.dremio.exec.store.dfs.FileSystemWrapper.get():129

To be specific, we are seeing the errors below with external tables whose storage is on S3.

Failure while attempting to read metadata for hive.

Any suggestions are appreciated

Thanks
Shashi

Add your modified core-site.xml to the Dremio classpath.
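A minimal sketch, assuming Dremio is installed under /opt/dremio and its conf directory is on the classpath (adjust paths for your install):

# Copy the core-site.xml that carries your S3/EMR settings into Dremio's conf
# directory, then restart so it is read.
cp /path/to/modified/core-site.xml /opt/dremio/conf/
sudo service dremio restart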

Hi,

We are finally able to query the Hive external table on EMR. The protocol used for the table location should be s3a instead of s3. This made the difference. Thanks again for your help.
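
For example, repointing the table location to the s3a:// scheme could look like this (the bucket path is a hypothetical placeholder):

# Point the external table at the same data via the s3a:// scheme instead of s3://.
# On a partitioned table, existing partitions keep their old locations, so they need
# the same change (or recreate the table and run MSCK REPAIR TABLE).
hive -e "ALTER TABLE breeding.breeding_digitalpipeline SET LOCATION 's3a://your-bucket/path/to/table/'"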

Shashi

Glad it worked. Did you have to add additional jars, or were emrfs-hadoop-assembly-2.18.0.jar and jets3t-0.9.0.jar enough?

We did not use any additional jars. We did have to adjust our bucket policy, though.

Thanks
Shashi

We are still encountering some issues. For a small dataset (“7 Objects - 23.1 MB”) we are able to query an external Hive table that is stored in S3 in Parquet format. This table is partitioned. For a dataset that is “196 Objects - 27.7 MB”, the query fails. This is the profile:

5c80d02a-2cc2-4482-b603-72701a70f231.zip (9.3 KB)

Any ideas?

Here is the cause of your error:
Caused By (com.amazonaws.SdkClientException) Unable to execute HTTP request: Timeout waiting for connection from pool

Looks like a timeout on the connection to the S3 bucket. Do you get it consistently or intermittently? Do you see any pattern, like getting it when you run a large query but not when you query data that resides in a single partition?

We are able to read the data fine with Hive, and also when we use the same external Hive table with Presto. I am guessing Dremio can handle very large data sizes. Our Hive tables are partitioned as well. Does Dremio have any settings for Hive on S3 like Presto's, such as hive.s3.connect-timeout and hive.s3.max-retry-time?

We don’t have the ability to set those properties at the moment, but it would not be difficult to add. I will put it on the list of enhancements.

OK, do you think that is the root cause? Do you know how much data Dremio can handle in an S3 external Hive table?

Thanks!

I take back what I said about the inability to set those properties. I confused the Hive source with the pure S3 source. You should be able to set whatever config properties you need (including timeouts) while adding/modifying the Hive source.

Any idea which settings we should increase?

See here: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
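
For example, the connection-pool and timeout related S3A properties defined there include the ones below. In Dremio they would be entered as name/value pairs on the Hive source; the values shown are only illustrative, not recommendations:

fs.s3a.connection.maximum = 100              # maximum simultaneous connections to S3
fs.s3a.connection.timeout = 200000           # socket timeout, in milliseconds
fs.s3a.connection.establish.timeout = 5000   # connection setup timeout, in milliseconds
fs.s3a.attempts.maximum = 20                 # retries on transient errors
fs.s3a.threads.max = 64                      # maximum worker threads for the S3A filesystem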

We updated both fs.s3a.connection.maximum and fs.s3a.threads.max to 60 with no luck. Any additional insight would be much appreciated.

Quick update: it is working now. I have not pinpointed the actual fix, but I will update once I do.