Why not Apache Hudi?

Hi,

I watched a bunch of the presentations from Subsurface. It seems a lot of attention is going into Apache Iceberg and Delta Lake.

I wonder why Apache Hudi isn’t mentioned. Is there some limitation in licensing or features? Is there a comparison of how these kinds of layers differ?

Currently we are assessing Dremio on AWS. It looks like the features there are more advanced.

Cheers

I am also curious about this. In fact, Hudi was integrated into EMR long ago, and it is very attractive to use Hudi on AWS.

Any thoughts on this, @LucioDaza?

@balaji.ramaswamy Hi Balaji, any info on this one? Thanks

Our initial focus is Iceberg and Delta Lake given that those two projects seem to have more momentum/demand in the market. But we haven’t ruled out Hudi support.


Thanks for the reply. However, from what I have observed, Hudi is more popular and mature in the market, and more people participate in its community.

@tshiran Assuming work on those is already underway, do you have any information on how Dremio thinks about them as metadata sources compared to other, less reliable/performant sources such as Hive?

Can you elaborate a bit on the question? Are you asking how metadata refresh works in Dremio when dealing with a transactional table format like Iceberg or Delta?
The experience will certainly be better than HMS or Glue because these formats provide snapshotting/transactions, which makes it possible to know what has changed without having to scan unnecessary metadata. I’m happy to discuss in more detail.
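To make that concrete, here is a minimal sketch (not Dremio’s actual refresh code) of why a snapshot log helps. It assumes the Apache Iceberg Java API; the table location and the last-refresh timestamp are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class SnapshotDiff {
    public static void main(String[] args) {
        // Load an Iceberg table by its location (placeholder path).
        Table table = new HadoopTables(new Configuration())
                .load("s3a://my-bucket/warehouse/events");

        // Pretend this is when the engine last refreshed its metadata.
        long lastRefreshMillis = 1_600_000_000_000L;

        // The snapshot log records every commit, so the commits made after the
        // last refresh are exactly the changes to pick up; no file listing or
        // full metadata scan is needed, unlike with HMS or Glue.
        for (Snapshot s : table.snapshots()) {
            if (s.timestampMillis() > lastRefreshMillis) {
                System.out.printf("commit %d (%s) at %d%n",
                        s.snapshotId(), s.operation(), s.timestampMillis());
            }
        }
    }
}

With HMS or Glue there is no equivalent commit log, so a refresh has to re-list partitions and files to discover what changed.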


Some recent activity on Hudi:

  • AWS automatically includes the Hudi jars in the latest EMR setups
  • Hudi has graduated from the Apache Incubator
  • Hudi’s GitHub repo shows roughly double the activity of Iceberg’s
  • Hudi has been in Apache a year longer, and even longer as Hoodie
  • Just yesterday an AWS architect published a blog post on using Hudi in Glue 2.0, which leads me to believe it may be supported in Glue soon
  • In September, AWS announced support for reading Hudi tables from Redshift Spectrum

I’ll need to read up on AWS support for Iceberg, but if AWS is adopting Hudi, then Hudi support in Dremio would make integration easier.

@william.whispell

Are you trying to read Hudi Parquet files backed by a Hive table? Dremio supports this; here is what you need to do:

cd <DREMIO_HOME>/plugins/connectors
mkdir hive2-ee.d    # if using a Hive 2.x source
mkdir hive3-ee.d    # if using a Hive 3.x source

cp -p <hoodie>.jar <DREMIO_HOME>/plugins/connectors/hive2-ee.d    # if using a Hive 2.x source
cp -p <hoodie>.jar <DREMIO_HOME>/plugins/connectors/hive3-ee.d    # if using a Hive 3.x source

Restart Dremio

Query the table
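If you want to verify that last step from code rather than the UI, something like the following should work once the Dremio JDBC driver is on the classpath. The host, credentials, source name, and table name are all examples you would replace with your own.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryHudiTable {
    public static void main(String[] args) throws Exception {
        // Direct connection to the Dremio coordinator (default client port 31010).
        String url = "jdbc:dremio:direct=localhost:31010";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // "hive" is whatever you named the Hive source in Dremio; the
             // schema and table names below are examples.
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM \"hive\".\"default\".\"hudi_table\" LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}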