I set up MinIO and connected it to Dremio (community edition) via Nessie. Now I want to insert rows from a dataframe into an Iceberg table via Dremio / pyarrow.
How would I do this?
I managed to read the data, but I haven't found an example yet of how to insert rows from, let's say, a pandas dataframe. An example would be super helpful here.
A pandas DataFrame is an in-memory structure. Pandas reads data from CSV files on disk or from database tables and holds it in memory as a DataFrame.
Dremio can read from and write to Iceberg tables (this needs a catalog like Nessie). Iceberg tables save data as Parquet files, which are stored on disk (S3/HDFS/local disk). Besides the data files (Parquet format), Iceberg tables also need metadata files (Avro and JSON format).
1. Insert into Dremio Iceberg table from another table (even a different DB/source, like from PostgreSQL to Dremio Nessie)
INSERT INTO arctic.sales_data
SELECT * FROM postgres.sales_info;
2. Insert into Dremio Iceberg table from CSV files
COPY INTO arctic.customer_data
FROM '@s3/bucket-name/customer-data-folder/'
FILES ('customers.csv', 'additional_customers.csv')
FILE_FORMAT 'csv'
(FIELD_DELIMITER ',')
3. Insert into Dremio Iceberg table from Pandas df
You have to use a client library with an ODBC or JDBC driver to do this, like pyodbc (which uses ODBC). The flow is something like: use Pandas to read data from a CSV/table, then write to the Dremio Iceberg table over that connection.
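A rough sketch of that flow, not an official Dremio method: instead of one INSERT per row, you can pack many rows into a single multi-row INSERT ... VALUES statement and send those in batches. The helper names (sql_literal, batched_inserts), the batch size, and the literal formatting are my own assumptions, and the connection itself is left out. It only uses the standard library, so rows can come from anything that yields tuples, for example df.itertuples(index=False, name=None) in pandas.

```python
from datetime import date, datetime
from itertools import islice


def sql_literal(value):
    """Render one Python value as a SQL literal (simplified, hypothetical helper)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, float) and value != value:
        return "NULL"  # NaN (pandas' missing-value marker) becomes NULL
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, datetime):  # checked before date: datetime subclasses date
        return "TIMESTAMP '" + value.strftime("%Y-%m-%d %H:%M:%S") + "'"
    if isinstance(value, date):
        return "DATE '" + value.isoformat() + "'"
    # Everything else is treated as a string; single quotes are doubled.
    return "'" + str(value).replace("'", "''") + "'"


def batched_inserts(table, columns, rows, batch_size=1000):
    """Yield multi-row INSERT statements with at most batch_size rows each.

    table and columns are inserted as-is (no identifier quoting), so they
    must come from trusted code, not from user input.
    """
    col_list = ", ".join(columns)
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            break
        values = ", ".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in chunk
        )
        yield f"INSERT INTO {table} ({col_list}) VALUES {values}"


# Usage sketch (assumes an already opened pyodbc cursor and a pandas df):
# for stmt in batched_inserts("arctic.sales_data", list(df.columns),
#                             df.itertuples(index=False, name=None)):
#     cursor.execute(stmt)
```

Each statement becomes one Iceberg commit, so larger batches mean fewer snapshots and small files. For really large loads, writing files to object storage and using COPY INTO (option 2) is probably still the better fit.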
If you use Spark, you can write directly from Spark df to Iceberg Tables using the Spark-Iceberg extension. I don’t know if Pandas has this kind of extension
I am able to insert a Spark dataframe into an Iceberg table directly, without Dremio. But I want to insert data via Dremio, without materializing the data first. I know that single-row inserts are possible via pyarrow or pyodbc. But I want to do bulk uploads from a dataframe or similar. Surprisingly, Dremio does not support parameterized inserts, so I guess executemany is not possible, and single-row inserts are for sure too slow.
So is there a way to do bulk uploads via pyarrow or pyodbc directly through Dremio?