I’m using Hive on top of S3 as my physical data source in Dremio for evaluation purposes. I have a few questions around it:
Do all Hive tables have to be created with the s3a filesystem? I get the following error when I try to connect to an external Hive table created on top of the s3 filesystem: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
Is there a workaround so that Dremio can read the s3 or s3n filesystems as well?
How is the access control on Hive tables managed?
I have a few tables created in Hive with the s3a filesystem, and I was able to connect to them and query them. But I’m getting the following error right now:
2018-12-31 02:40:58,971 [qtp382934088-1539] INFO c.d.e.catalog.ManagedStoragePlugin - User Error Occurred [ErrorId: d3c33d5b-0b9d-4787-ba1a-15b99ca8de30] com.dremio.common.exceptions.UserException: Access denied reading dataset "connstring".schema.tablename.
There is no change in the instance profile that grants S3 access, and no change in the security group used to access the Hive Thrift server. In fact, I even tried removing the source and adding it back. It retrieves the tables when the new connection is added, but still throws the same Access Denied error when querying.
I have the following properties set on my Hive connection:
store.hive.use_stats_in_metastore = true
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.connection.maximum = 5000
fs.s3a.threads.max = 7000
I have since changed fs.s3a.connection.maximum to 9000 (see the sketch below).
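For reference, here is a minimal sketch of how the S3A-related settings above would look if you set them cluster-wide in core-site.xml instead of (or in addition to) the Dremio source form; the property names are standard Hadoop S3A options, and the values simply mirror the ones listed above:
<!-- snippet inside the <configuration> element of core-site.xml -->
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>9000</value>
</property>
<property>
  <name>fs.s3a.threads.max</name>
  <value>7000</value>
</property>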
Also, I have currently whitelisted the Dremio security group to allow access to the Hive EMR cluster. Do I have to do any specific whitelisting for the core nodes / HDFS?
Adding to the above answer:
The Hive tables are created on top of S3. Are you referring to the HDFS of the core nodes on the Hive EMR cluster?
Where is impersonation supposed to be enabled? If it’s not enabled by default, I don’t think I’ve turned it on. From what I understand from the documentation, impersonation is enabled when adding an HDFS source, but I’ve added Hive as a source in Dremio and connect via the Hive Thrift server, so I’m not sure where to do this.
The issue currently looks to me like Dremio is able to connect to the Hive Thrift server, but not to Hive itself.
Are the Dremio executors running on EC2? If yes, the Dremio executors should be able to talk to the EMR nodes, because during query execution the executors have to read from the HDFS nodes that actually hold the Parquet/ORC data. Makes sense?
Impersonation can be overridden via the Hive source form in the Dremio UI by going to Advanced > Additional Properties:
Name: hive.server2.enable.doAs
Value: False (all queries hit HDFS as the unix user that runs the Dremio process)
Value: True (the deep metadata probe is done by the unix user that runs the Dremio process, while the actual data is queried as the user who logs into Dremio)
Note: To check the current impersonation setting on your Hive cluster, open hive-site.xml on the EMR node that runs the metastore and check the value of hive.server2.enable.doAs.
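For example, a minimal sketch of what that entry typically looks like in hive-site.xml (the value shown is only illustrative; yours may differ):
<!-- hive-site.xml on the EMR node running the Hive metastore / HiveServer2 -->
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>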
Hi @balaji.ramaswamy
How does user impersonation work when Hive is running in a different cluster? I have a dremio user in the cluster where Dremio is installed, but not in the cluster where Hive is.
Check the value of “hive.server2.enable.doAs” in hive-site.xml on the Hive server. You can override the setting via Dremio using the steps suggested above.
Thanks @balaji.ramaswamy for the quick turnaround.
Yes, I set the doAs property to false in the Hive connection’s advanced setup, and in core-site.xml of the cluster where Hive runs I set up the parameters.
Hello guys,
I have been exploring Dremio AWS Edition for the last few days for our business use cases. However, I am still facing a lot of issues integrating it with Hive.
I have tried almost all the steps mentioned in the various community threads.
As of now, here are my major issues with Dremio:
org/jets3t/service/ServiceException
While accessing the Hive tables, I am getting this exception multiple times.
I have tried the things below, but with no luck.
I have added the properties below under the Hive source > Advanced > Additional Properties:
hive.server2.enable.doAs=false
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.connection.maximum = 9000
fs.s3a.threads.max = 7000
I have also updated the properties below on the Hive cluster (see the sketch after this list):
hadoop.proxyuser.dremio.hosts = *
hadoop.proxyuser.dremio.groups = *
hadoop.proxyuser.dremio.users = *
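For reference, a sketch of how those entries typically look in core-site.xml on the Hive/EMR cluster, assuming “dremio” is the unix user the Dremio process runs as and all three values are set to *:
<!-- core-site.xml on the Hive/EMR cluster -->
<property>
  <name>hadoop.proxyuser.dremio.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.dremio.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.dremio.users</name>
  <value>*</value>
</property>
Note that proxyuser changes typically require restarting (or refreshing) the relevant Hadoop services before they take effect.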
But I’m still getting the same error.
The second issue I am getting is: ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.JsonSerde
We are accessing external Hive tables created on S3.
I tried fixing this by adding hive-json-serde.jar (which contains the missing class) to /opt/dremio/plugins/connectors/hive2.d/, but it still didn’t solve my problem.
Thanks @balaji.ramaswamy
I have done the above steps and I’m still seeing the issue.
Also, when I look at the Dremio server.log, I don’t see the jars from <DREMIO_HOME>/plugins/connectors/hive2-ee.d or <DREMIO_HOME>/plugins/connectors/hive2.d being loaded into the classpath.
I also tried adding the jars to <DREMIO_HOME>/jars and checked; Dremio is still not picking up those jars.
Do I need to do any more steps to fix this? Also, can you help me with where I can add our custom jars to Dremio’s classpath?