Write to Dremio (Nessie catalog) through Spark

@gakshat1107 Have you tried this?

Add a Nessie source to Dremio, documentation below. Then create an Iceberg table in that source; it should use the Nessie catalog.

Via Spark, have you tried something like the below?
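A minimal sketch of that flow (the Spark catalog is assumed to be named nessie, and the namespace/table names here are placeholders):

// Sketch: create and write to a Nessie-backed Iceberg table from Spark,
// then read it back. Assumes a catalog named "nessie" is already configured.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.demo.tbl (c1 INT, c2 STRING) USING iceberg")
spark.sql("INSERT INTO nessie.demo.tbl VALUES (1, 'one')")
spark.sql("SELECT * FROM nessie.demo.tbl").show()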

Hi @balaji.ramaswamy
I've followed all the above steps but am still facing the issue with the Nessie catalog.

  1. I created a table through spark.sql in Spark Scala, inserted data into it, and read it back both through the same method and from the Dremio UI - successful
  2. Then I inserted some data into the same table through the Dremio UI and read it back through Dremio - successful
  3. But when I try to read it through spark.sql, it throws the error below - is it because Dremio changes any metadata? (One way to isolate this is sketched after the stack trace.)
    ERROR BaseReader: Error reading file(s): s3://etltest/Silver/region4_adfad9d4-bcec-4c49-a104-20b6940009f1/19465862-bf0b-ac9c-4ce5-2e28bd08ce00/0_0_0.parquet
    java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:275)
    at org.xerial.snappy.Snappy.uncompress(Snappy.java:553)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.SnappyDecompressor.uncompress(SnappyDecompressor.java:30)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.NonBlockedDecompressor.decompress(NonBlockedDecompressor.java:73)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.iceberg.shaded.org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286)
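One quick isolation check is to read the flagged file directly, bypassing the Iceberg metadata, to see whether the Parquet file itself decompresses cleanly. A minimal sketch (the path is taken from the stack trace; the s3a scheme and existing S3 credentials are assumed):

// Sketch: read the failing Parquet file directly, outside Iceberg.
// If this also fails, the file's Snappy-compressed pages are the problem;
// if it succeeds, suspect the table metadata instead.
val df = spark.read.parquet(
  "s3a://etltest/Silver/region4_adfad9d4-bcec-4c49-a104-20b6940009f1/19465862-bf0b-ac9c-4ce5-2e28bd08ce00/0_0_0.parquet")
df.show()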

@gakshat1107 Thanks for the feedback, let me try to reproduce this and get back to you

@balaji.ramaswamy Sure
Here’s a summary of the steps I performed

  1. Installed Dremio & Nessie through Docker
  2. Installed Spark 3.4.3
  3. Connected to spark shell using:

     spark-shell --jars /usr/local/lib/zstd-jni-1.5.2-5.jar,/data/spark/jars/dremio-jdbc-driver-11.0.0-202011171636110752-16ab953d.jar,/data/spark/jars/iceberg-spark-runtime-3.4_2.12-1.5.2.jar,/data/spark/jars/aws-java-sdk-1.11.901.jar,/data/spark/jars/aws-java-sdk-bundle-1.11.901.jar,/data/spark/jars/aws-java-sdk-dynamodb-1.11.901.jar,/data/spark/jars/aws-java-sdk-kms-1.11.901.jar,/data/spark/jars/aws-java-sdk-core-1.11.901.jar,/data/spark/jars/aws-java-sdk-s3-1.11.901.jar,/data/spark/jars/hadoop-aws-3.2.4.jar \
       --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
       --conf "spark.executor.extraJavaOptions=-Djava.library.path=/usr/local/lib" \
       --conf "spark.driver.extraJavaOptions=-Djava.library.path=/usr/local/lib" \
       --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  4. Created a session to connect to Nessie:

     val sparknessie = SparkSession.builder()
       .appName("IcebergNessieExample")
       .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.94.4")
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
       .config("spark.sql.catalog.nessie.uri", "http://10.**.196.***:19120/api/v1")
       .config("spark.sql.catalog.nessie.ref", "main")
       .config("spark.sql.catalog.nessie.authentication.type", "NONE")
       .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
       .config("spark.sql.catalog.nessie.warehouse", "s3a://etltest")
       .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
       .getOrCreate()
  5. After this, I performed the steps above (condensed in the sketch below)
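Condensed, steps 1-3 against that session look roughly like this (a sketch; the namespace and table names are placeholders inferred from the file path in the stack trace):

// Sketch of the failing sequence using the Nessie session above.
sparknessie.sql("CREATE TABLE IF NOT EXISTS nessie.Silver.region4 (c1 int, c2 string) USING iceberg")
sparknessie.sql("INSERT INTO nessie.Silver.region4 VALUES (1, 'one')")  // write via Spark - works
sparknessie.sql("SELECT * FROM nessie.Silver.region4").show()           // read via Spark - works
// ... insert more rows into the same table via the Dremio UI ...
sparknessie.sql("SELECT * FROM nessie.Silver.region4").show()           // read after the Dremio write - fails with the Snappy error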

Hi @balaji.ramaswamy, any update on the above request?

@gakshat1107 Not yet, I still haven't gotten to this this week

Hi @balaji.ramaswamy
Were you able to reproduce the above error?

Hi @balaji.ramaswamy,
Thank you for the support so far. Were you able to reproduce the above error? I'm still facing the issue.

Tried the below using a Hive catalog:

spark-sql> create table test_spark_dremio_iceberg (c1 int, c2 string) using iceberg;
Time taken: 0.683 seconds

spark-sql> insert into test_spark_dremio_iceberg select 1,'Stefan Edberg';
Time taken: 0.617 seconds
spark-sql> insert into test_spark_dremio_iceberg select 2,'Boris Becker';
Time taken: 0.8 seconds
spark-sql> 

Then queried via Dremio - the query works, see screenshot

Then inserted 2 more rows via Dremio

INSERT INTO localhive3."iceberg_partition".test_spark_dremio_iceberg
select 3,'Ivan Lendl';
INSERT INTO localhive3."iceberg_partition".test_spark_dremio_iceberg
select 4,'Pat Cash';

Queried via Dremio, works, see screenshot

Then queried via spark-sql

spark-sql> select * from test_spark_dremio_iceberg;
4	Pat Cash
3	Ivan Lendl
2	Boris Becker
1	Stefan Edberg
Time taken: 0.13 seconds, Fetched 4 row(s)
spark-sql> 

Does the issue only reproduce via the Nessie catalog?

Hi @balaji.ramaswamy,
Hive was also hitting the same issue.
Can you please share the configuration and jars that you are using to connect Spark to Hive?
I'll try similar configurations for Nessie too

@gakshat1107 localhive3 is just my source name in Dremio

This is what I did,

  • Install Spark (spark-3.2.3-bin-hadoop3.2)
  • Start the spark-sql server
  • Invoke the command line using the flags below
bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
  • Create an Iceberg table using spark-sql
  • Insert rows
  • Query using Dremio
  • Insert more rows via Dremio
  • Query via spark-sql (the Nessie equivalent of these flags is sketched below)
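For Nessie, the same invocation style with the catalog settings from earlier in the thread would look roughly like the sketch below. The package coordinates and versions here are assumptions to verify against your Spark and Iceberg versions - note that the session in step 4 above pulls the 3.3_2.12 artifacts while Spark 3.4.3 is installed, so matching those is worth double-checking.

bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.4_2.12:0.94.4 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions \
    --conf spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
    --conf spark.sql.catalog.nessie.uri=http://10.**.196.***:19120/api/v1 \
    --conf spark.sql.catalog.nessie.ref=main \
    --conf spark.sql.catalog.nessie.authentication.type=NONE \
    --conf spark.sql.catalog.nessie.warehouse=s3a://etltest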