How to insert data from a dataframe into an Iceberg table?

I set up MinIO and connected it via Nessie to Dremio (Community Edition). Now I want to insert rows from a dataframe into an Iceberg table via Dremio / PyArrow.
How would I do this?

I managed to read the data, but I haven't found an example yet of how to insert rows from, let's say, a pandas dataframe. An example would be super helpful here 😉

Thanks in advance,

Chris

A pandas DataFrame is an in-memory structure: pandas reads data from CSV files on disk or from database tables and holds the result in memory as a DataFrame.
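For example, a minimal read into memory (the file name is illustrative):

```python
import pandas as pd

# Read a CSV from disk into an in-memory DataFrame
df = pd.read_csv("customers.csv")
print(df.head())
```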

Dremio can read/write from/to Iceberg tables (it needs a catalog such as Nessie). Iceberg tables store their data as Parquet files on disk (S3/HDFS/local HDD), and besides the data files (Parquet format) they also need metadata files (Avro and JSON formats).

1. Insert into a Dremio Iceberg table from another table (even a different DB/source, e.g. from PostgreSQL to Dremio Nessie):

```sql
INSERT INTO arctic.sales_data
SELECT * FROM postgres.sales_info;
```

2. Insert into a Dremio Iceberg table from CSV files:
```sql
COPY INTO arctic.customer_data
FROM '@s3/bucket-name/customer-data-folder/'
FILES ('customers.csv', 'additional_customers.csv')
FILE_FORMAT 'csv'
(FIELD_DELIMITER ',')
```

3. Insert into a Dremio Iceberg table from a pandas DataFrame:
You have to use a driver library for this, like pyodbc (which uses ODBC, not JDBC). The flow is something like: use pandas to read data from a CSV/table, then write to the Dremio Iceberg table over the ODBC connection; see the sketch below.
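A minimal sketch of that flow, assuming a configured ODBC DSN named `Dremio` and the `arctic.customer_data` table from the example above. Values are inlined as SQL literals, since parameterized queries are not available on Community Edition:

```python
import pandas as pd
import pyodbc

# Connect through a hypothetical ODBC DSN pointing at Dremio;
# adjust DSN name, user, and password for your environment.
conn = pyodbc.connect("DSN=Dremio;UID=user;PWD=password", autocommit=True)
cur = conn.cursor()

df = pd.read_csv("customers.csv")

# Without parameterized queries, values must be inlined into the SQL text.
# This is slow (one statement per row) and needs careful quoting/escaping.
for row in df.itertuples(index=False):
    values = ", ".join(
        "'{}'".format(str(v).replace("'", "''")) if isinstance(v, str) else str(v)
        for v in row
    )
    cur.execute(f"INSERT INTO arctic.customer_data VALUES ({values})")

conn.close()
```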

If you use Spark, you can write directly from a Spark dataframe to Iceberg tables using the Spark-Iceberg extension (see the sketch below). I don't know whether pandas has a comparable extension.
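For reference, a minimal sketch of the Spark route. Catalog name, Nessie URI, and warehouse path are placeholders, and it assumes the Iceberg and Nessie runtime jars are on the classpath:

```python
from pyspark.sql import SparkSession

# Spark session with the Iceberg extension and a Nessie catalog
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://bucket-name/warehouse")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
# DataFrameWriterV2: append directly into the Iceberg table
df.writeTo("nessie.sales_data").append()
```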

I am able to insert a Spark dataframe into an Iceberg table directly, without Dremio. But I want to insert data via Dremio, without materializing the data first. I know that single-row inserts are possible via PyArrow or pyodbc, but I want to do bulk uploads from a dataframe or similar. Surprisingly, Dremio does not support parameterized queries, so I guess executemany is not possible, and single-row inserts are definitely too slow.
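For context, this is roughly what a single-statement INSERT over Dremio's Arrow Flight endpoint looks like with PyArrow. This is a sketch following the pattern of Dremio's Flight client examples; host, port, credentials, and table are placeholders:

```python
from pyarrow import flight

# Connect to Dremio's Arrow Flight endpoint (default port 32010)
client = flight.FlightClient("grpc+tcp://dremio-host:32010")
token = client.authenticate_basic_token("user", "password")
options = flight.FlightCallOptions(headers=[token])

# Each row is a separate SQL statement -- this is what makes it slow in bulk
sql = "INSERT INTO arctic.sales_data VALUES (1, 'widget', 9.99)"
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
result = client.do_get(info.endpoints[0].ticket, options).read_all()
print(result)  # the result set reports the number of affected rows
```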

So is there a way to do bulk uploads via PyArrow or pyodbc directly through Dremio?

Parameterized queries are supported in the Cloud version, but there is still no update for the Software (Community) edition.
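One possible workaround, combining the pieces above: stage the dataframe as a file in MinIO with pandas, then bulk-load it with COPY INTO. Bucket, credentials, and endpoint below are placeholders, and writing to `s3://` paths from pandas requires the s3fs package:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

# Stage the DataFrame as a CSV in MinIO
# (client_kwargs/endpoint_url is how s3fs targets a MinIO endpoint)
df.to_csv(
    "s3://bucket-name/customer-data-folder/customers.csv",
    index=False,
    storage_options={
        "key": "minio-access-key",
        "secret": "minio-secret-key",
        "client_kwargs": {"endpoint_url": "http://minio:9000"},
    },
)

# Then bulk-load it in Dremio, as in the COPY INTO example above:
#   COPY INTO arctic.customer_data
#   FROM '@s3/bucket-name/customer-data-folder/'
#   FILES ('customers.csv')
#   FILE_FORMAT 'csv'
```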