We are planning our data infrastructure on AWS and starting to build our data lake. Our plan is to use EMR for ETL and Dremio for user access to the data lake.
We want to use Hudi datasets; is there any plan for Dremio to support this data source?
@balaji.ramaswamy thanks for the reply.
Is there really no workaround for version 4.0.5? It is our production version, and migrating to 4.1 would require us to upgrade our whole Hive environment to 2.x, which is not on our near-term roadmap.
Above you explained where to put the Hudi JARs for Hive 2 and Hive 3. However, we are using AWS Glue (for our AWS EMR cluster running Hive). Is there a similar process/folder for adding the Hudi JARs when using AWS Glue?
Hi @balaji.ramaswamy
Is there a documentation page you can refer us to about configuring Dremio for Hudi?
We have copied the JARs, but we still see duplicates (Dremio is aggregating all S3 files instead of the latest Hudi dataset), whereas Hive works well. We are on the AWS Edition, build 13.2.0.
Sorry, not able to follow. What do you mean by “aggregating all S3 files instead of the latest Hudi dataset”? Are you getting wrong results? Are you able to share a profile?
@balaji.ramaswamy I would like to present a scenario that is causing duplicates.
If a new record with CDC data arrives in an existing partition, Hudi creates a new parquet file on S3 in that partition. That new file contains the incoming CDC record and also the existing records (which came with a bulk insert in the past). Since an existing record is now present in both parquet files (the bulk-insert file and the new CDC file), Dremio simply aggregates all the files in the dataset, so we see duplicates instead of seeing only the data from the latest file.
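Here is a minimal PySpark sketch of what I mean, assuming Spark with the Hudi bundle on the classpath; the bucket, table, and column names are hypothetical:

```python
# Minimal repro sketch, assuming Spark with the Hudi bundle on the classpath.
# Bucket, table, and column names are hypothetical.
from pyspark.sql import Row, SparkSession

spark = (SparkSession.builder
         .appName("hudi-duplicate-repro")
         # Hudi requires Kryo serialization
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

base_path = "s3://my-bucket/hudi/demo_table"  # hypothetical location

hudi_options = {
    "hoodie.table.name": "demo_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.precombine.field": "ts",
}

# 1) Bulk insert writes the first parquet file into partition part=a.
bulk = spark.createDataFrame([Row(id=1, part="a", ts=1, val="original")])
(bulk.write.format("hudi").options(**hudi_options)
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .mode("overwrite").save(base_path))

# 2) A CDC update for the same key: copy-on-write rewrites the file group
#    into a NEW parquet file, while the old file stays on S3 until cleaned.
cdc = spark.createDataFrame([Row(id=1, part="a", ts=2, val="updated")])
(cdc.write.format("hudi").options(**hudi_options)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append").save(base_path))

# A Hudi-aware snapshot read resolves the latest file slice: 1 row.
# (Older Hudi releases may need a glob such as base_path + "/*" here.)
print(spark.read.format("hudi").load(base_path).filter("id = 1").count())

# A raw read of every parquet file in the partition sees both file versions:
# 2 rows for the same key, i.e. the duplicates described above.
print(spark.read.parquet(base_path + "/a/*.parquet").filter("id = 1").count())
```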
As per this reference: Hive · Dremio. It says Dremio supports out-of-the-box Hive SerDes and custom SerDes, but it seems Dremio is not using the Hudi JARs we deployed. We ran a test with Hive SQL using Spark on EMR and do not see duplicates, but the same query in Dremio returns duplicates.
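For reference, here is a hedged diagnostic sketch (table name hypothetical) to confirm what input format the catalog reports for the table. A Hudi-synced Hive table should report org.apache.hudi.hadoop.HoodieParquetInputFormat; an engine that scans the parquet files natively instead of going through that input format will also pick up superseded file versions:

```python
# Diagnostic sketch: my_db.demo_table is a hypothetical Hudi table synced to
# the Hive/Glue catalog. Look for the "InputFormat" row in the output; for a
# Hudi table it should be org.apache.hudi.hadoop.HoodieParquetInputFormat.
spark.sql("DESCRIBE FORMATTED my_db.demo_table").show(100, truncate=False)
```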
Hi @balaji.ramaswamy, I can provide another scenario if that helps. Dremio also aggregates the same record when it gets an update that is written to a new parquet file. So when you query from Dremio, it shows two records for the same key, one with the original data and one with the updated data, whereas Spark SQL on the Hive table pulls up only the latest updated record.
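A short sketch against the same hypothetical table as in the earlier example: a Hudi snapshot read resolves each key to its latest file slice, so only the updated record (with its commit time) comes back, which matches what we see from Spark SQL on the Hive table:

```python
# Snapshot read over the hypothetical table from the earlier sketch; the
# _hoodie_commit_time metadata column shows which commit each row came from.
(spark.read.format("hudi")
      .load(base_path)
      .select("id", "val", "_hoodie_commit_time")
      .show(truncate=False))
```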