Hive Connector - issues with a JSON SerDe external table

Hey guys!

We recently started evaluating Dremio for our project, and a single blocker has popped up (at least for now).

We are using YARN provisioning on an existing HDP 3.1.4 cluster.
The problem appears when we try to access our Hive external tables that use the JSON SerDe (we tried both hive-hcatalog-core and rcongiu’s Hive-JSON-Serde from GitHub).

First we had an error where the executors couldn’t find the SerDe class; we added the jars in dremio.conf under both “provisioning.yarn.classpath” and “provisioning.yarn.app.classpath” and got past this.
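
For reference, what we added looked roughly like this (a sketch only - dremio.conf is HOCON, and the jar path below is illustrative; point it at wherever the SerDe jar actually lives on your nodes):

provisioning.yarn.classpath: ["/opt/serde-jars/hive-hcatalog-core.jar"]
provisioning.yarn.app.classpath: ["/opt/serde-jars/hive-hcatalog-core.jar"]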

Now the error we get is:

(java.lang.RuntimeException) java.lang.ClassCastException: class org.apache.hive.hcatalog.data.JsonSerDe
com.dremio.exec.store.hive.exec.ScanWithHiveReader$1$1.run():193
com.dremio.exec.store.hive.exec.ScanWithHiveReader$1$1.run():184
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():360
org.apache.hadoop.security.UserGroupInformation.doAs():1710
com.dremio.exec.store.hive.exec.ScanWithHiveReader$1.apply():184
com.dremio.exec.store.hive.exec.ScanWithHiveReader$1.apply():180
com.google.common.collect.Iterators$7.transform():750
com.google.common.collect.TransformedIterator.next():47
com.google.common.collect.MultitransformedIterator.next():66
com.dremio.sabot.op.scan.ScanOperator.<init>():159
com.dremio.sabot.op.scan.ScanOperator.<init>():134
com.dremio.exec.store.hive.exec.HiveScanBatchCreator.create():128
com.dremio.exec.store.hive.exec.HiveProxyingScanBatchCreator.create():34
com.dremio.exec.store.hive.exec.HiveProxyingScanBatchCreator.create():27
com.dremio.sabot.driver.OperatorCreatorRegistry.getProducerOperator():94
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitSubScan():210
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitSubScan():115
com.dremio.exec.physical.base.AbstractSubScan.accept():85
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115
com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitLimit():101
com.dremio.exec.physical.config.Limit.accept():59
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115
com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitLimit():101
com.dremio.exec.physical.config.Limit.accept():59
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():247
com.dremio.sabot.driver.PipelineCreator$CreatorVisitor.visitOp():115
com.dremio.exec.physical.base.AbstractPhysicalVisitor.visitProject():81
com.dremio.exec.physical.config.Project.accept():55

I’ve Googled a lot and checked the Dremio source starting with ScanWithHiveReader, but I cannot figure out what the issue is - the ClassCastException would suggest the class is not the right type (it must implement / extend something), and I’m not sure why.

Is Dremio expected to work with the JSON SerDe on Hive external tables - or with external tables in general? Did we do something very wrong?

There’s another tiny question: is it possible to limit the data pulled by the Preview functionality (i.e. have it include some “TOP/LIMIT x” in the JDBC SQL queries)?
It currently seems to pull the entire source table (we have some huge ones) and only then apply the limit, after it has fetched all the data.
We checked all the settings / configs - didn’t find anything for this.
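
For example, for one of our SQL Server sources we’d hope Preview would issue something along these lines (table name illustrative):

SELECT TOP 1000 * FROM dbo.huge_table

rather than a plain SELECT * with the limit applied only after everything has been pulled into Dremio.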

Thank you,
Aris B.

@arisro

Please try the below

Upgrade Dremio to 4.1.8, then:

cd <DREMIO_HOME>/plugins/connectors
mkdir hive2-ee.d (if using a Hive 2.x source)
mkdir hive3-ee.d (if using a Hive 3.x source)
cp -p hive-hcatalog-core-<version>.jar <DREMIO_HOME>/plugins/connectors/hive2-ee.d (if using a Hive 2.x source)
cp -p hive-hcatalog-core-<version>.jar <DREMIO_HOME>/plugins/connectors/hive3-ee.d (if using a Hive 3.x source)

Restart Dremio
Query the table

Thanks
@balaji.ramaswamy

Hey @balaji.ramaswamy,

We are using Dremio 4.1.8.
I did what you suggested - and it’s good to know, since this way we could also remove the “provisioning.yarn.classpath” settings from dremio.conf - and we no longer get the “class not found” exception either.

But with this change, we still get the same “(java.lang.RuntimeException) java.lang.ClassCastException: class org.apache.hive.hcatalog.data.JsonSerDe” error. :confused:
It looks like Dremio cannot use the Hive JSON SerDes for external tables - I don’t know where this error comes from, but it seems to simply reject the SerDe class, and we tried two different libs.

Does the Hive SerDe class need to be somehow specially crafted for Dremio? Does anyone know if there’s a Hive JSON SerDe compatible with Dremio? :confused:
Is anyone else using Hive external tables on JSON structured data in Dremio?

Note: we have some tables using “org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe” and they work fine. So how is that SerDe different from the JSON ones?

Thank you,
Aris

@arisro

I was able to validate this with a managed table. Does a managed table work?

@balaji.ramaswamy Did you try it with a managed table saved as JSON (ROW FORMAT SERDE ‘org.apache.hive.hcatalog.data.JsonSerDe’) and it worked fine?

I tried it, and got the same error:

(java.lang.RuntimeException) java.lang.ClassCastException: class org.apache.hive.hcatalog.data.JsonSerDe
com.dremio.exec.store.hive.exec.ScanWithHiveReader$1$1.run():193
com.dremio.exec.store.hive.exec.ScanWithHiveReader$1$1.run():184
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():360
org.apache.hadoop.security.UserGroupInformation.doAs():1710
com.dremio.exec.store.hive.exec.ScanWithHiveReader$1.apply():184
com.dremio.exec.store.hive.exec.ScanWithHiveReader$1.apply():180
com.google.common.collect.Iterators$7.transform():750

It works fine with Parquet storage, for example - it only has issues with the JSON SerDes, both of the ones we tried. :confused:

Aris

Hi @arisro,

Can you try these modified instructions (note the directory name difference):

cd <DREMIO_HOME>/plugins/connectors
mkdir hive2.d (if using a Hive 2.x source)
mkdir hive3.d (if using a Hive 3.x source)
cp -p hive-hcatalog-core-<version>.jar <DREMIO_HOME>/plugins/connectors/hive2.d (if using a Hive 2.x source)
cp -p hive-hcatalog-core-<version>.jar <DREMIO_HOME>/plugins/connectors/hive3.d (if using a Hive 3.x source)

@jduong

If I move the jar to hive3.d, I get the “ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe” again. So “hive3-ee.d” looks to be the right directory - but for some reason Dremio doesn’t like that class :confused:

The error itself is pretty bare - just “java.lang.ClassCastException: class …” - so I was just wondering if these JSON SerDes are simply not supported for Hive 3. :confused:

As a workaround we will probably migrate some processes to parquet files.
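
The rough idea is a one-off Hive CTAS from each JSON table into Parquet, something like this (table names are illustrative):

-- convert once in Hive; Dremio then reads the Parquet-backed table without the SerDe issue
CREATE TABLE events_parquet
STORED AS PARQUET
AS SELECT * FROM events_json;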

Regarding the second question, about the Preview of a dataset: is it possible not to pull the entire table from the external source when previewing and apply the limit only afterwards? I.e. pull the data with a limit - if it’s a JDBC source, just issue a TOP / LIMIT X query?
Is there any support for this, or some configuration that we missed? We have some SQL Server tables with 500M rows, and the preview attempts to bring everything into Dremio just for previewing…

Thanks,
Aris

Hi @balaji.ramaswamy,
We are using Dremio 4.9.1 now.
May I know whether Dremio supports Hive tables in JSON format?

Or do we need to import the external .jar package below (hive-hcatalog-core-0.13.1-cdh5.3.6.jar) and save the table in TEXTFILE format with the following SQL?
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE

Looking forward to your confirmation. Thanks.

@ThomasW

Upgrade Dremio to 4.1.7 or later

cd <DREMIO_HOME>/plugins/connectors
mkdir hive2.d (if using a Hive 2.x source)
mkdir hive3.d (if using a Hive 3.x source)
cp -p hive-hcatalog-core-1.2.2.jar <DREMIO_HOME>/plugins/connectors/hive2.d (if using a Hive 2.x source)
cp -p hive-hcatalog-core-1.2.2.jar <DREMIO_HOME>/plugins/connectors/hive3.d (if using a Hive 3.x source)

Restart Dremio

Query the table

Hi @balaji.ramaswamy,
Do we need to create the table with
STORED AS TEXTFILE
or
STORED AS JSON
directly?
Thanks for your reply.

@ThomasW Here is an example: “Create tables on JSON datasets” in the Databricks on AWS documentation.
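
For reference, a minimal Hive DDL along those lines (table name, columns and location are illustrative):

-- JSON records live in plain text files, so the JSON SerDe is paired with STORED AS TEXTFILE
CREATE EXTERNAL TABLE events_json (
  id BIGINT,
  name STRING,
  created_at STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/events_json';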

Thank you @balaji.ramaswamy.

Is it a must to use this external JAR to create a Hive table on JSON data in Dremio?

@ThomasW That is correct