Hive connectivity issues

Hi,

I’m using Hive on top of S3 as my physical datasource on Dremio for evaluation purposes. I have a few questions about it:

  1. Do all Hive tables need to be created with the s3a filesystem? I get the following error when I try to connect to an external Hive table created on top of the s3 filesystem -
    java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException

Is there a way for Dremio to read the s3 or s3n filesystems as well?

  2. How is access control on Hive tables managed?
    I have a few tables created in Hive with the s3a filesystem, and I was able to connect to them and query them. But I'm getting the following error right now -

2018-12-31 02:40:58,971 [qtp382934088-1539] INFO c.d.e.catalog.ManagedStoragePlugin - User Error Occurred [ErrorId: d3c33d5b-0b9d-4787-ba1a-15b99ca8de30] com.dremio.common.exceptions.UserException: Access denied reading dataset "connstring".schema.tablename.

There is no change in the instance profile that grants the S3 access. There is no change in the security group that allows access to the Hive thrift server. In fact, I even tried removing the source and adding it back. It retrieves the tables when the new connection is added, but still throws the same Access Denied error when querying.

Hi @Sneha_Krishnaswamy

For #1, please add the following under the Hive source > Advanced > Additional Properties:

Name: fs.s3.impl
Value: org.apache.hadoop.fs.s3a.S3AFileSystem
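
If it helps, here is a quick way to confirm which URI scheme a problem table actually uses (a rough sketch; it assumes the hive CLI is available on the EMR master node, and the database/table names are placeholders):

    # Show the storage location of the external table and check the scheme
    # in the "Location:" line (s3://, s3n:// or s3a://)
    hive -e "DESCRIBE FORMATTED mydb.mytable;" | grep -i "location"

Tables whose location uses s3:// are the ones that need the fs.s3.impl override above; an analogous fs.s3n.impl entry would presumably be needed for s3n:// paths.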

For #2, can you please send us the profile?

Thanks
@balaji.ramaswamy

I have the following set of properties in my Hive connection:
store.hive.use_stats_in_metastore = true
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.connection.maximum = 5000
fs.s3a.threads.max = 7000

Attached is the query profile 469daa30-a02b-457d-ba45-9fc4e09f49e8.zip (3.5 KB)
Note: The query failed at the planning stage itself.

Bumping it.
@balaji.ramaswamy any suggestions for me to try out?

2 things @Sneha_Krishnaswamy

  1. fs.s3a.connection.maximum should be >= fs.s3a.threads.max, so set connection.maximum to 7000 as well

  2. Is impersonation turned on? If yes, do both the process that runs Dremio and the Dremio login user have access to the HDFS folder/file?

Thanks
@balaji.ramaswamy

Made the change to connection.maximum; it is set to 9000 now.

Also, I have currently whitelisted the Dremio security group to allow access to the Hive EMR. Do I have to do any specific whitelisting for the core nodes / HDFS?

Adding to the above answer:

  • The Hive tables are created on top of S3. Are you referring to the HDFS of the core nodes on the Hive EMR cluster?
  • Where is impersonation supposed to be enabled? If it's not enabled by default, I don't think I've turned it on. From what I understand from the documentation, impersonation is enabled while adding an HDFS source, but I've added Hive as a source in Dremio and I connect via the Hive thrift server, so I'm not sure where to do this.

The issue currently looks to me like Dremio is able to connect to the Hive thrift server, but not to Hive itself.

Hi @Sneha_Krishnaswamy

Are the Dremio executors running on EC2? If yes, the Dremio executors should be able to talk to the EMR nodes, since during query execution the executors have to read from the HDFS nodes that actually contain the Parquet/ORC data. Makes sense?
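
One way to sanity-check the connectivity (a rough sketch; the hostnames are placeholders and the ports are the common defaults, so adjust for your EMR release and security groups):

    # Run from a Dremio executor EC2 instance
    nc -vz emr-master.internal 9083     # Hive metastore thrift
    nc -vz emr-master.internal 10000    # HiveServer2
    nc -vz emr-master.internal 8020     # HDFS NameNode RPC
    nc -vz emr-core-1.internal 50010    # HDFS DataNode data transfer

If only the metastore/HiveServer2 ports are open, planning can succeed while the actual read from the core nodes fails.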

Impersonation can be overridden via the Hive source form in the Dremio UI by going to Advanced > Additional Properties

Name: hive.server2.enable.doAs
Value: False (all queries will hit HDFS as the unix user that runs the Dremio process)
Value: True (the deep metadata probe is done by the unix user that runs the Dremio process, while the actual data is queried as the user who logs into Dremio)

Note: To check the current impersonation setting on your Hive, open hive-site.xml on the EMR node that runs the metastore and check the value of hive.server2.enable.doAs
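
For example (the hive-site.xml path below is the usual EMR location and may differ on your cluster):

    # On the EMR node running the metastore / HiveServer2
    grep -A1 "hive.server2.enable.doAs" /etc/hive/conf/hive-site.xml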

Thanks
@balaji.ramaswamy

Hi @balaji.ramaswamy
How does user impersonation work when Hive is running in a different cluster? I have a dremio user in the cluster where Dremio is installed, but not in the cluster where Hive is.

@mmohan

Check the value of “hive.server2.enable.doAs” in the hive-site.xml on the Hive server. You can override the setting via Dremio using the steps suggested above

Thanks
@balaji.ramaswamy

Thanks @balaji.ramaswamy for quick turnaround.
Yes, I set the doAs property to false in the Hive connection's advanced setup, and in core-site.xml of the cluster where Hive runs I set up the following parameters:

  <property>
    <name>hadoop.proxyuser.dremio.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.dremio.groups</name>
    <value>*</value>
  </property>
  <property>
      <name>hadoop.proxyuser.dremio.users</name>
      <value>*</value>
  </property>
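
(As far as I understand, these proxyuser settings are only picked up once the daemons reload core-site.xml, so something like the following is probably needed after editing the file; please correct me if this part is wrong:)

    # Reload the proxyuser (superuser group) mappings on the NameNode,
    # or alternatively restart HiveServer2 / the metastore so they re-read core-site.xml
    hdfs dfsadmin -refreshSuperUserGroupsConfiguration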

Do I need to have a dremio user in the Hive cluster as well?

Hello guys,
I have been exploring the Dremio AWS edition for the last few days for our business use cases. However, I am still facing a lot of issues integrating with Hive.
I have tried almost all the steps that were mentioned in different community threads.

As of now, here are my major issues with dremio:

  1. org/jets3t/service/ServiceException
    While accessing the Hive tables, I am getting this exception multiple times.
    I have tried the things below, but with no luck.
    I have added the following properties under Hive source > Advanced > Additional Properties:
    hive.server2.enable.doAs = false
    fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
    fs.s3a.connection.maximum = 9000
    fs.s3a.threads.max = 7000
    I have also updated the following properties in the Hive cluster, all set to *:
    hadoop.proxyuser.dremio.hosts = *
    hadoop.proxyuser.dremio.groups = *
    hadoop.proxyuser.dremio.users = *

But I'm still getting the same error.

  2. The second issue I am getting is: ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.JsonSerde
    We are accessing external Hive tables created on S3.
    I tried fixing this by adding hive-json-serde.jar, which has the missing class, to /opt/dremio/plugins/connectors/hive2.d/, but it still didn't solve my problem.

Could anyone help me with these issues?

Our dremio AWS edition details:
Build: 4.2.1-202004111451200819-0c3ecaea
Edition: AWS Edition
Build Time: 04/11/2020 20:28:17

Thanks,
Jalandhar

@jalandhar

For the JsonSerde issue,

cd <DREMIO_HOME>/plugins/connectors
mkdir hive2-ee.d (if using a Hive 2.x source)
mkdir hive3-ee.d (if using a Hive 3.x source)
cp -p hive-hcatalog-core-.jar <DREMIO_HOME>/plugins/connectors/hive2-ee.d (if 2.x Hive source)
cp -p hive-hcatalog-core-.jar <DREMIO_HOME>/plugins/connectors/hive3-ee.d (if 3.x Hive source)

Restart Dremio

Query the table
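
To confirm the jar actually landed and is being picked up, something like the checks below should help (paths are illustrative; the copied jar name will include a version, and your log directory may be /var/log/dremio instead):

    # Verify the jar is in the plugin directory
    ls -l <DREMIO_HOME>/plugins/connectors/hive2-ee.d/
    # After restarting, look for the jar in the server log
    grep -i "hive-hcatalog-core" <DREMIO_HOME>/log/server.log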

Thanks @balaji.ramaswamy
I have done the above steps and I'm still seeing the issue.
Also, when I looked at the Dremio server.log, I don't see the jars from <DREMIO_HOME>/plugins/connectors/hive2-ee.d or <DREMIO_HOME>/plugins/connectors/hive2.d being loaded into the classpath.
I also tried adding the jars to <DREMIO_HOME>/jars and checked; Dremio is still not picking up those jars.

Do I need to do any more steps to fix this? Also, can you help me with where I can add our custom jars to Dremio's classpath?

Thanks,
Jalandhar

@jalandhar

If this is Dremio 4.2.2 or later, have you also symlinked DREMIO_HOME/conf into DREMIO_HOME/plugins/connectors/hive2-ee.d? See the documentation below:

http://docs.dremio.com/data-sources/hive.html#hive-configuration
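
For reference, that step boils down to something like the command below (a sketch only; please double-check the link direction and directory name against the doc above, and use hive3-ee.d for a Hive 3.x source):

    # Make the Dremio conf directory visible to the Hive 2.x plugin classloader
    ln -s <DREMIO_HOME>/conf <DREMIO_HOME>/plugins/connectors/hive2-ee.d/conf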