Dremio Apache Hudi Support

Hi Guys,

We are planning our data infrastructure on AWS and are starting to build our data lake. Our plan is to use EMR for ETL and Dremio for our users to access the data lake.

We want to use Hudi datasets. Is there any plan for Dremio to support this data source?

Thank you so much!


@rubenssoto

Have you tried this in versions 4.1.7 or higher?

cd <DREMIO_HOME>/plugins/connectors
mkdir hive2-ee.d (if using a Hive 2.x source)
mkdir hive3-ee.d (if using a Hive 3.x source)

cp -p <hudi>.jar <DREMIO_HOME>/plugins/connectors/hive2-ee.d (if using a Hive 2.x source)
cp -p <hudi>.jar <DREMIO_HOME>/plugins/connectors/hive3-ee.d (if using a Hive 3.x source)

Restart Dremio

Query the table
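
Once Dremio is back up, the table should appear under the Hive source like any other dataset. A minimal sanity check (the source, schema, and table names here are placeholders):

SELECT *
FROM hive_source.my_db.my_hudi_table
LIMIT 10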


Hello,

I'm also interested in reading Hudi tables from Dremio.
I'm using Dremio 4.0.5-201911202046080257-19b10938.

I'm not yet able to upgrade to 4.1 because our data lake uses Hive 1.x.

I tried adding the Hudi jar to
DREMIO_HOME/jars/lib/hudi-hadoop-mr-bundle-0.6.0.jar
and
DREMIO_HOME/plugins/connectors/hudi-hadoop-mr-bundle-0.6.0.jar

But it still doesn't work:

java.lang.RuntimeException: class org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat not org.apache.hadoop.mapred.InputFormat

When I use Hive directly, it works perfectly.

@spyzzz

The custom jars feature is only available in 4.1.7 and higher.

@balaji.ramaswamy thanks for the reply.
Is there really no workaround for version 4.0.5? It's our production version, and migrating to 4.1 would require us to upgrade our entire Hive environment to 2.x, which is not on our near-term roadmap.

This functionality seems basic.

Hi @balaji.ramaswamy,

Above you explained where to put the Hudi jars for Hive 2 and Hive 3 sources. However, we are using AWS Glue (for our AWS EMR cluster running Hive). Is there a similar process/folder for adding the Hudi jars for AWS Glue?

Hi @balaji.ramaswamy
Is there a documentation page you can refer us to about configuring Dremio for Hudi?
We have copied the jars, but we still see duplicates (Dremio aggregates all S3 files instead of the latest Hudi dataset), whereas Hive works well. We are on the AWS Edition, build 13.2.0.

@kval

Sorry, not able to follow. What do you mean by “aggregating all s3 files instead latest Hudi Dataset”? Are you getting wrong results? Are you able to share a profile?

Hi @balaji.ramaswamy , I work with @kval

Please find the job profile attached.

Let me know if you need more information.

8b060356-38b0-472b-a372-786f75afc3e9.zip (23.6 KB)

@balaji.ramaswamy I would like to present a scenario that is causing duplicates.
If a new record arrives via CDC for an existing partition, Hudi creates a new file on S3 in that partition that contains both the new CDC record and the existing records (which came from an earlier bulk insert). Since those records are now present in both parquet files (the bulk-insert file and the new CDC file), Dremio simply aggregates the whole dataset, so we see duplicates instead of only the data from the CDC file.
Per the reference Hive · Dremio, Dremio supports out-of-the-box Hive SerDes and custom SerDes, but it seems it is not using the Hudi jars we deployed. We ran a test with Hive SQL using Spark on EMR and do not see duplicates, but the same query on Dremio returns duplicates.
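
Until the Hudi input format is actually picked up, one possible stopgap is to deduplicate in Dremio SQL itself, keeping only the newest commit per record key via the standard Hudi meta columns. A minimal sketch (the table name is a placeholder, and this assumes a copy-on-write table where stale file versions are the only source of duplicates):

SELECT *
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (
           PARTITION BY "_hoodie_record_key"
           ORDER BY "_hoodie_commit_time" DESC
         ) AS rn
  FROM hive_source.my_db.my_hudi_table AS t
) AS dedup
WHERE rn = 1

This hides the duplicates but still scans every file, so it is a workaround rather than a fix.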

Thanks for the explanation @kval , let me discuss this internally and get back to you

@kval

Are you able to provide us with the output of describe formatted <table_name> from the Hive shell?

@balaji.ramaswamy please see below.

col_name data_type comment
_hoodie_commit_time string
_hoodie_commit_seqno string
_hoodie_record_key string
_hoodie_partition_path string
_hoodie_file_name string
je_batch_id decimal(15,0)
last_update_date bigint
last_updated_by decimal(15,0)
set_of_books_id_11i decimal(15,0)
name string
status string
status_verified string
actual_flag string
default_effective_date bigint
average_journal_flag string
budgetary_control_status string
approval_status_code string
creation_date bigint
created_by decimal(15,0)
last_update_login decimal(15,0)
status_reset_flag string
default_period_name string
unique_date string
earliest_postable_date bigint
posted_date bigint
date_created bigint
description string
control_total decimal(38,10)
running_total_dr decimal(38,10)
running_total_cr decimal(38,10)
running_total_accounted_dr decimal(38,10)
running_total_accounted_cr decimal(38,10)
parent_je_batch_id decimal(15,0)
attribute1 string
attribute2 string
attribute3 string
attribute4 string
attribute5 string
attribute6 string
attribute7 string
attribute8 string
attribute9 string
attribute10 string
context string
unreservation_packet_id decimal(15,0)
packet_id decimal(15,0)
ussgl_transaction_code string
context2 string
posting_run_id decimal(15,0)
request_id decimal(15,0)
org_id decimal(15,0)
posted_by decimal(15,0)
chart_of_accounts_id decimal(15,0)
period_set_name string
accounted_period_type string
group_id decimal(38,10)
approver_employee_id decimal(15,0)
global_attribute_category string
global_attribute1 string
global_attribute2 string
global_attribute3 string
global_attribute4 string
global_attribute5 string
global_attribute6 string
global_attribute7 string
global_attribute8 string
global_attribute9 string
global_attribute10 string
global_attribute11 string
global_attribute12 string
global_attribute13 string
global_attribute14 string
global_attribute15 string
global_attribute16 string
global_attribute17 string
global_attribute18 string
global_attribute19 string
global_attribute20 string
default_effective_date_partition string
NULL NULL
# Partition Information NULL NULL
# col_name data_type comment
NULL NULL
default_effective_date_partition string

Hi @balaji.ramaswamy, I can also provide another scenario if that helps. Dremio also aggregates the same record when it receives an update that is written to a new parquet file. So when you query from Dremio, it shows two records for the same key, one with the original data and one with the updated data, whereas Spark SQL on the Hive table pulls up only the latest updated record.

@kval

From the describe formatted output, I am particularly interested in “InputFormat” and “OutputFormat”. Can you please run the command from the Hive shell?

Thanks

@balaji.ramaswamy I just don't have access to the Hive shell right now, but I was able to grab the format from Hue. Let me know if this doesn't help.
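
For context on what that check reveals: when Hudi's Hive sync registers a copy-on-write table, it typically sets the table's input format to org.apache.hudi.hadoop.HoodieParquetInputFormat, the class responsible for filtering out superseded file versions; if the table instead shows the plain MapredParquetInputFormat, an engine that honors the metastore entry will read every parquet file and return duplicates. A sketch of the check from the Hive shell (database and table names are placeholders):

DESCRIBE FORMATTED my_db.my_hudi_table

-- In the Storage Information section, look for a line such as:
-- InputFormat: org.apache.hudi.hadoop.HoodieParquetInputFormat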

@balaji.ramaswamy any update on this, please?

@kval

It seems like you have opened a ticket with support (being an enterprise customer); let me check on the status and get back to you.

Thanks @balaji.ramaswamy, our Dremio platform team opened it just now.

We are experiencing the same duplicate record issue.