Hive connectivity issues


#1

Hi,

I’m using Hive on top of S3 as my physical data source in Dremio for evaluation purposes. I have a few questions about it:

  1. Do all Hive tables need to be created with the s3a filesystem? I get the following error when I try to connect to an external Hive table created on top of the s3 filesystem:
    java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException

Is there a workaround for Dremio to read the s3 or s3n filesystems as well?
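
To illustrate (all database, table and bucket names here are made up), tables like this one, created with an s3a:// LOCATION, work for me:

    -- hypothetical example; all names invented
    CREATE EXTERNAL TABLE sales_events (
      event_id   STRING,
      event_time TIMESTAMP,
      amount     DOUBLE
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/warehouse/sales_events/';

The same DDL with an s3:// LOCATION is what triggers the NoClassDefFoundError above.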

  2. How is access control on Hive tables managed?
    I have a few tables created in Hive with the s3a filesystem, and I was able to connect to them and query them. But I’m getting the following error right now:

2018-12-31 02:40:58,971 [qtp382934088-1539] INFO c.d.e.catalog.ManagedStoragePlugin - User Error Occurred [ErrorId: d3c33d5b-0b9d-4787-ba1a-15b99ca8de30] com.dremio.common.exceptions.UserException: Access denied reading dataset "connstring".schema.tablename.

There is no change in the instance profile that grants the S3 access, and no change in the security group that allows access to the Hive Thrift server. In fact, I even tried removing the source and adding it back: it retrieves the tables when the new connection is added, but still throws the same Access Denied error when querying.


#2

Hi @Sneha_Krishnaswamy

For #1, please add the following under the Hive source > Advanced > Additional Properties:

Name: fs.s3.impl
Value: org.apache.hadoop.fs.s3a.S3AFileSystem
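
If some of your tables use s3n:// paths, the analogous per-scheme key can be mapped the same way (Hadoop resolves filesystems via fs.<scheme>.impl, so this should cover s3n as well, though I have only verified the s3 mapping):

Name: fs.s3n.impl
Value: org.apache.hadoop.fs.s3a.S3AFileSystem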

For #2, can you please send us the query profile?

Thanks
@balaji.ramaswamy


#3

I have the following properties set on my Hive connection:
store.hive.use_stats_in_metastore = true
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.connection.maximum = 5000
fs.s3a.threads.max = 7000

Attached is the query profile 469daa30-a02b-457d-ba45-9fc4e09f49e8.zip (3.5 KB)
Note: The query failed at the planning stage itself.


#4

Bumping this.
@balaji.ramaswamy, any suggestions for me to try out?


#5

Two things, @Sneha_Krishnaswamy:

  1. fs.s3a.connection.maximum should be >= fs.s3a.threads.max, so raise fs.s3a.connection.maximum to at least 7000 as well (see the example after this list).

  2. Is impersonation turned on? If yes, do both the user running the Dremio process and the user who logs in to Dremio have access to the HDFS folders/files?
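
For example, based on the values you posted in #3, point 1 would mean (anything >= 7000 works):

fs.s3a.connection.maximum = 7000
fs.s3a.threads.max = 7000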

Thanks
@balaji.ramaswamy


#6

Made the change; fs.s3a.connection.maximum is now 9000.

Also, I have currently whitelisted the Dremio security group to allow access to the Hive EMR cluster. Do I have to do any specific whitelisting for the core nodes / HDFS?

Adding to the above:

  • The Hive tables are created on top of S3. Are you referring to the HDFS of the core nodes on the Hive EMR cluster?
  • Where is impersonation supposed to be enabled? If it’s not enabled by default, I don’t think I’ve turned it on. From what I understand of the documentation, impersonation is enabled while adding an HDFS source. I’ve added Hive as a source in Dremio and I connect via the Hive Thrift server, so I’m not sure where to do this.

The issue currently looks to me like Dremio is able to connect to the Hive Thrift server, but not to Hive itself.


#7

Hi @Sneha_Krishnaswamy

Are the Dremio executors running on EC2? If yes, the Dremio executors must be able to talk to the EMR nodes, since during query execution they have to read from the HDFS nodes that actually hold the Parquet/ORC data. Makes sense?

Impersonation can be overridden via the Hive source form in the Dremio UI by going to Advanced > Additional Properties:

Name: hive.server2.enable.doAs
Value: False (all queries hit HDFS as the unix user that runs the Dremio process)
Value: True (the deep metadata probe is done by the unix user that runs the Dremio process, while the actual data is queried as the user who logs in to Dremio)

Note: To check the current impersonation setting on your Hive cluster, open hive-site.xml on the EMR node that runs the metastore and check the value of hive.server2.enable.doAs.
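
Alternatively, you can check the effective value from a beeline/Hive session; SET followed by a property name prints its current setting:

    SET hive.server2.enable.doAs;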

Thanks
@balaji.ramaswamy