Write to Dremio (Nessie catalog) through Spark

@gakshat1107 Have you tried this?

Add a Nessie source to Dremio, documentation below. Then create an Iceberg table in that source; it should use the Nessie catalog.

Via Spark, have you tried something like the below?
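A minimal sketch of that flow (the Spark catalog is assumed to be named nessie, and the namespace/table names here are placeholders):

// Sketch: create and write to a Nessie-backed Iceberg table from Spark,
// then read it back. Assumes a catalog named "nessie" is already configured.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.demo.tbl (c1 INT, c2 STRING) USING iceberg")
spark.sql("INSERT INTO nessie.demo.tbl VALUES (1, 'one')")
spark.sql("SELECT * FROM nessie.demo.tbl").show()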

Hi @balaji.ramaswamy
I've followed all the above steps but am still facing the issue with the Nessie catalog.

  1. I created a table through spark.sql in Spark Scala, inserted data into it, and read it back both through the same method and from the Dremio UI - successful
  2. Then I inserted some data into the same table through the Dremio UI and read it back through Dremio - successful
  3. But when I try to read it through spark.sql, it throws the error below - is it because Dremio changes any metadata? (One way to isolate this is sketched after the stack trace.)
    ERROR BaseReader: Error reading file(s): s3://etltest/Silver/region4_adfad9d4-bcec-4c49-a104-20b6940009f1/19465862-bf0b-ac9c-4ce5-2e28bd08ce00/0_0_0.parquet
    java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:275)
    at org.xerial.snappy.Snappy.uncompress(Snappy.java:553)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.SnappyDecompressor.uncompress(SnappyDecompressor.java:30)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.NonBlockedDecompressor.decompress(NonBlockedDecompressor.java:73)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.iceberg.shaded.org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286)
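One quick isolation check is to read the flagged file directly, bypassing the Iceberg metadata, to see whether the Parquet file itself decompresses cleanly. A minimal sketch (the path is taken from the stack trace; the s3a scheme and existing S3 credentials are assumed):

// Sketch: read the failing Parquet file directly, outside Iceberg.
// If this also fails, the file's Snappy-compressed pages are the problem;
// if it succeeds, suspect the table metadata instead.
val df = spark.read.parquet(
  "s3a://etltest/Silver/region4_adfad9d4-bcec-4c49-a104-20b6940009f1/19465862-bf0b-ac9c-4ce5-2e28bd08ce00/0_0_0.parquet")
df.show()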

@gakshat1107 Thanks for the feedback, let me try to reproduce this and get back to you

@balaji.ramaswamy Sure
Here’s a summary of the steps I performed

  1. Installed Dremio & Nessie through Docker
  2. Installed Spark 3.4.3
  3. Connected to spark shell using:

     spark-shell --jars /usr/local/lib/zstd-jni-1.5.2-5.jar,/data/spark/jars/dremio-jdbc-driver-11.0.0-202011171636110752-16ab953d.jar,/data/spark/jars/iceberg-spark-runtime-3.4_2.12-1.5.2.jar,/data/spark/jars/aws-java-sdk-1.11.901.jar,/data/spark/jars/aws-java-sdk-bundle-1.11.901.jar,/data/spark/jars/aws-java-sdk-dynamodb-1.11.901.jar,/data/spark/jars/aws-java-sdk-kms-1.11.901.jar,/data/spark/jars/aws-java-sdk-core-1.11.901.jar,/data/spark/jars/aws-java-sdk-s3-1.11.901.jar,/data/spark/jars/hadoop-aws-3.2.4.jar \
       --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
       --conf "spark.executor.extraJavaOptions=-Djava.library.path=/usr/local/lib" \
       --conf "spark.driver.extraJavaOptions=-Djava.library.path=/usr/local/lib" \
       --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  4. Created a session to connect to Nessie:

     val sparknessie = SparkSession.builder()
       .appName("IcebergNessieExample")
       .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.94.4")
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
       .config("spark.sql.catalog.nessie.uri", "http://10.**.196.***:19120/api/v1")
       .config("spark.sql.catalog.nessie.ref", "main")
       .config("spark.sql.catalog.nessie.authentication.type", "NONE")
       .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
       .config("spark.sql.catalog.nessie.warehouse", "s3a://etltest")
       .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
       .getOrCreate()
  5. After this, I performed the steps above (condensed in the sketch below)
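Condensed, steps 1-3 against that session look roughly like this (a sketch; the namespace and table names are placeholders inferred from the file path in the stack trace):

// Sketch of the failing sequence using the Nessie session above.
sparknessie.sql("CREATE TABLE IF NOT EXISTS nessie.Silver.region4 (c1 int, c2 string) USING iceberg")
sparknessie.sql("INSERT INTO nessie.Silver.region4 VALUES (1, 'one')")  // write via Spark - works
sparknessie.sql("SELECT * FROM nessie.Silver.region4").show()           // read via Spark - works
// ... insert more rows into the same table via the Dremio UI ...
sparknessie.sql("SELECT * FROM nessie.Silver.region4").show()           // read after the Dremio write - fails with the Snappy error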

Hi @balaji.ramaswamy, any update on the above request?

@gakshat1107 Not yet, I still haven't gotten to this this week

Hi @balaji.ramaswamy
Were you able to reproduce the above error?

Hi @balaji.ramaswamy,
Thank you for the support so far. Were you able to reproduce the above error? I'm still facing the issue.

Tried the below using a Hive catalog:

spark-sql> create table test_spark_dremio_iceberg (c1 int, c2 string) using iceberg;
Time taken: 0.683 seconds

spark-sql> insert into test_spark_dremio_iceberg select 1,'Stefan Edberg';
Time taken: 0.617 seconds
spark-sql> insert into test_spark_dremio_iceberg select 2,'Boris Becker';
Time taken: 0.8 seconds
spark-sql> 

Then queried via Dremio - the query works, see screenshot

Then inserted 2 more rows via Dremio

INSERT INTO localhive3."iceberg_partition".test_spark_dremio_iceberg
select 3,'Ivan Lendl';
INSERT INTO localhive3."iceberg_partition".test_spark_dremio_iceberg
select 4,'Pat Cash';

Queried via Dremio, works, see screenshot

Then queried via spark-sql

spark-sql> select * from test_spark_dremio_iceberg;
4	Pat Cash
3	Ivan Lendl
2	Boris Becker
1	Stefan Edberg
Time taken: 0.13 seconds, Fetched 4 row(s)
spark-sql> 

Does the issue only reproduce via the Nessie catalog?

Hi @balaji.ramaswamy,
Hive was also hitting the same issue.
Can you please share the configuration and jars that you are using to connect Spark to Hive?
I'll try similar configurations for Nessie too

@gakshat1107 localhive3 is just my source name in Dremio

This is what I did,

  • Install Spark (spark-3.2.3-bin-hadoop3.2)
  • Start the spark-sql server
  • Invoke the command line using the flags below
bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
  • Create an Iceberg table using spark-sql
  • Insert rows
  • Query using Dremio
  • Insert more rows via Dremio
  • Query via spark-sql (the Nessie equivalent of these flags is sketched below)
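For Nessie, the same invocation style with the catalog settings from earlier in the thread would look roughly like the sketch below. The package coordinates and versions here are assumptions to verify against your Spark and Iceberg versions - note that the session in step 4 above pulls the 3.3_2.12 artifacts while Spark 3.4.3 is installed, so matching those is worth double-checking.

bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.4_2.12:0.94.4 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions \
    --conf spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
    --conf spark.sql.catalog.nessie.uri=http://10.**.196.***:19120/api/v1 \
    --conf spark.sql.catalog.nessie.ref=main \
    --conf spark.sql.catalog.nessie.authentication.type=NONE \
    --conf spark.sql.catalog.nessie.warehouse=s3a://etltest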