We are planning our data infrastructure on AWS and starting to build our data lake. Our plan is to use EMR for ETL and Dremio for user access to the data lake.
We want to use Hudi datasets; is there any plan for Dremio to support this data source?
@balaji.ramaswamy thanks for the reply.
Is there really no workaround for version 4.0.5? It is our production version, and migrating to 4.1 would require us to upgrade our whole Hive environment to 2.x, which is not on our near-term roadmap.
Above you explained where to put the Hudi JARs for Hive 2 and Hive 3. However, we are using AWS Glue (for our AWS EMR cluster running Hive). Is there a similar process/folder for adding the Hudi JARs when using AWS Glue?
Hi @balaji.ramaswamy
Is there a documentation page you can refer us to about configuring Dremio for Hudi?
We have copied the JARs, but we still see duplicates (Dremio is aggregating all S3 files instead of the latest Hudi dataset), whereas Hive works well. We are on the AWS Edition, build 13.2.0.
Sorry, not able to follow. What do you mean by “aggregating all S3 files instead of the latest Hudi dataset”? Are you getting wrong results? Are you able to share a profile?
@balaji.ramaswamy I would like to present a scenario that is causing duplicates.
If a new record with CDC data arrives in an existing partition, Hudi creates a new parquet file on S3 in that partition. That new file contains the incoming CDC record and also the existing records (which came with a bulk insert in the past). Since an existing record is now present in both parquet files (the bulk-insert file and the new CDC file), Dremio simply aggregates all the files in the dataset, so we see duplicates instead of seeing only the data from the latest file.
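Here is a minimal PySpark sketch of what I mean, assuming Spark with the Hudi bundle on the classpath; the bucket, table, and column names are hypothetical:

```python
# Minimal repro sketch, assuming Spark with the Hudi bundle on the classpath.
# Bucket, table, and column names are hypothetical.
from pyspark.sql import Row, SparkSession

spark = (SparkSession.builder
         .appName("hudi-duplicate-repro")
         # Hudi requires Kryo serialization
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

base_path = "s3://my-bucket/hudi/demo_table"  # hypothetical location

hudi_options = {
    "hoodie.table.name": "demo_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.precombine.field": "ts",
}

# 1) Bulk insert writes the first parquet file into partition part=a.
bulk = spark.createDataFrame([Row(id=1, part="a", ts=1, val="original")])
(bulk.write.format("hudi").options(**hudi_options)
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .mode("overwrite").save(base_path))

# 2) A CDC update for the same key: copy-on-write rewrites the file group
#    into a NEW parquet file, while the old file stays on S3 until cleaned.
cdc = spark.createDataFrame([Row(id=1, part="a", ts=2, val="updated")])
(cdc.write.format("hudi").options(**hudi_options)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append").save(base_path))

# A Hudi-aware snapshot read resolves the latest file slice: 1 row.
# (Older Hudi releases may need a glob such as base_path + "/*" here.)
print(spark.read.format("hudi").load(base_path).filter("id = 1").count())

# A raw read of every parquet file in the partition sees both file versions:
# 2 rows for the same key, i.e. the duplicates described above.
print(spark.read.parquet(base_path + "/a/*.parquet").filter("id = 1").count())
```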
As per this reference: Hive · Dremio. It says Dremio supports out-of-the-box Hive SerDes and custom SerDes, but it seems Dremio is not using the Hudi JARs we deployed. We ran a test with Hive SQL using Spark on EMR and do not see duplicates, but the same query in Dremio returns duplicates.
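For reference, here is a hedged diagnostic sketch (table name hypothetical) to confirm what input format the catalog reports for the table. A Hudi-synced Hive table should report org.apache.hudi.hadoop.HoodieParquetInputFormat; an engine that scans the parquet files natively instead of going through that input format will also pick up superseded file versions:

```python
# Diagnostic sketch: my_db.demo_table is a hypothetical Hudi table synced to
# the Hive/Glue catalog. Look for the "InputFormat" row in the output; for a
# Hudi table it should be org.apache.hudi.hadoop.HoodieParquetInputFormat.
spark.sql("DESCRIBE FORMATTED my_db.demo_table").show(100, truncate=False)
```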
Hi @balaji.ramaswamy, I can provide another scenario if that helps. Dremio also aggregates the same record when it gets an update that is written to a new parquet file. So when you query from Dremio, it shows two records for the same key, one with the original data and one with the updated data, whereas Spark SQL on the Hive table pulls up only the latest updated record.
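A short sketch against the same hypothetical table as in the earlier example: a Hudi snapshot read resolves each key to its latest file slice, so only the updated record (with its commit time) comes back, which matches what we see from Spark SQL on the Hive table:

```python
# Snapshot read over the hypothetical table from the earlier sketch; the
# _hoodie_commit_time metadata column shows which commit each row came from.
(spark.read.format("hudi")
      .load(base_path)
      .select("id", "val", "_hoodie_commit_time")
      .show(truncate=False))
```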