I’m currently learning Iceberg and Dremio (Dremio Software), and I’m deploying these services on our data lakehouse cluster (mostly on-prem, alongside other services like YARN, ZooKeeper, HDFS, Spark, etc., managed by Cloudera).
I’m a bit confused about terms like Table Format, Catalog, Lakehouse Engine, and so on.
As far as I understand, Iceberg is a Table Format, and it needs a Catalog service to manage its tables.
But can Dremio act as a Catalog service? In the Dremio web UI, Dremio can clearly see all my Iceberg tables and databases and manage them.
So can I use Dremio as the catalog instead of Hive for my Spark jobs?
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName(appName)
    .master("yarn")  # run on the YARN cluster manager
    # Enable Iceberg's Spark SQL extensions
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Session catalog backed by the Hive Metastore
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    # Extra catalog named "local" that keeps Iceberg metadata directly on the filesystem
    # (a Hadoop catalog normally also needs spark.sql.catalog.local.warehouse set to a path)
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .getOrCreate())
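For context, my understanding is that these two catalogs are addressed like this from that session (the database and table names below are just examples):

# Tables registered in the Hive-backed session catalog (spark_catalog) are addressed as usual
spark.sql("SELECT * FROM my_db.my_table").show()

# Tables in the path-based "local" catalog are addressed with the catalog name as a prefix
spark.sql("CREATE TABLE IF NOT EXISTS local.demo.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO local.demo.events VALUES (1, current_timestamp())")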
The above code currently works fine for me, but I wonder if I can change
.config("spark.sql.catalog.spark_catalog.type", "hive")
to
.config("spark.sql.catalog.spark_catalog.type", "dremio")
?
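From the Iceberg docs, my understanding is that catalogs are pluggable, so pointing Spark at a different catalog should just be a different set of configs. Below is a minimal sketch of what I imagine this would look like using Iceberg's built-in REST catalog type; the catalog name, endpoint host, and port are placeholders I made up, and whether my Dremio version actually exposes a compatible endpoint is exactly what I'm unsure about:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("iceberg-rest-catalog-sketch")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Hypothetical catalog named "dremio_cat" served over Iceberg's REST catalog protocol
    .config("spark.sql.catalog.dremio_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.dremio_cat.type", "rest")
    .config("spark.sql.catalog.dremio_cat.uri", "http://<catalog-host>:<port>/api/catalog")  # placeholder endpoint
    .getOrCreate())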
I’m still confused about what Dremio (Dremio Software) actually is. Does it act as an execution engine that helps me query data from Iceberg tables, like Spark, or does it also act as a Catalog service like Hive/Nessie?
Could you guys show me some other tools/services similar to Dremio, so that I can understand what Dremio is more easily by comparing it to them?