Why not Apache Hudi?

Hi,

I watched a bunch of the presentations from Subsurface. It seems a lot of attention is going into Apache Iceberg and Delta Lake.

I wonder why Apache Hudi isn’t mentioned. Is there some limitation in licensing or features? Is there a comparison of how these kinds of layers differ?

Currently we are assessing Dremio on AWS. It looks like the features there are more advanced.

Cheers

I am also curious about this. In fact, Hudi was integrated into EMR long ago, and it is very attractive to use Hudi on AWS.

Any thoughts on this, @LucioDaza?

@balaji.ramaswamy Hi Balaji, any info on this one? Thanks

Our initial focus is Iceberg and Delta Lake given that those two projects seem to have more momentum/demand in the market. But we haven’t ruled out Hudi support.


Thanks for the reply. However, from what I have observed, Hudi is more popular and mature in the market, and more people participate in its community.

@tshiran Assuming work on those is already underway, do you have any information on how Dremio thinks about them as metadata sources compared to other, less reliable/performant sources such as Hive?

Can you elaborate a bit on the question? Are you asking how metadata refresh works in Dremio when dealing with a transactional table format like Iceberg or Delta?
The experience will certainly be better than HMS or Glue because these formats provide snapshotting/transactions, which makes it possible to know what has changed without having to scan unnecessary metadata. I’m happy to discuss in more detail.
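To make that concrete, here is a minimal sketch (not Dremio’s actual refresh code) of why a snapshot log helps. It assumes the Apache Iceberg Java API; the table location and the last-refresh timestamp are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class SnapshotDiff {
    public static void main(String[] args) {
        // Load an Iceberg table by its location (placeholder path).
        Table table = new HadoopTables(new Configuration())
                .load("s3a://my-bucket/warehouse/events");

        // Pretend this is when the engine last refreshed its metadata.
        long lastRefreshMillis = 1_600_000_000_000L;

        // The snapshot log records every commit, so the commits made after the
        // last refresh are exactly the changes to pick up; no file listing or
        // full metadata scan is needed, unlike with HMS or Glue.
        for (Snapshot s : table.snapshots()) {
            if (s.timestampMillis() > lastRefreshMillis) {
                System.out.printf("commit %d (%s) at %d%n",
                        s.snapshotId(), s.operation(), s.timestampMillis());
            }
        }
    }
}

With HMS or Glue there is no equivalent commit log, so a refresh has to re-list partitions and files to discover what changed.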


Some recent activity on Hudi:

  • AWS automatically includes the Hudi jars in the latest EMR setups
  • Hudi has graduated from the Apache Incubator
  • Hudi’s GitHub repo shows roughly double the activity of Iceberg’s
  • Hudi has been in Apache a year longer, and even longer as Hoodie
  • Just yesterday an AWS architect published a blog post on using Hudi in Glue 2.0, which leads me to believe it may be supported in Glue soon
  • In September, AWS announced support for reading Hudi tables from Redshift Spectrum

I’ll need to read up on AWS support for Iceberg, but if AWS is adopting Hudi, then Hudi support in Dremio would make integration easier.

@william.whispell

Are you trying to read Hudi Parquet files backed by a Hive table? Dremio supports this; here is what you need to do:

cd <DREMIO_HOME>/plugins/connectors
mkdir hive2-ee.d    # if using a Hive 2.x source
mkdir hive3-ee.d    # if using a Hive 3.x source

cp -p <hoodie>.jar <DREMIO_HOME>/plugins/connectors/hive2-ee.d    # if using a Hive 2.x source
cp -p <hoodie>.jar <DREMIO_HOME>/plugins/connectors/hive3-ee.d    # if using a Hive 3.x source

Restart Dremio

Query the table
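If you want to verify that last step from code rather than the UI, something like the following should work once the Dremio JDBC driver is on the classpath. The host, credentials, source name, and table name are all examples you would replace with your own.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryHudiTable {
    public static void main(String[] args) throws Exception {
        // Direct connection to the Dremio coordinator (default client port 31010).
        String url = "jdbc:dremio:direct=localhost:31010";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // "hive" is whatever you named the Hive source in Dremio; the
             // schema and table names below are examples.
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM \"hive\".\"default\".\"hudi_table\" LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}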