Is it possible to write to an Iceberg table from Python via a dremio arrow flight client?
I’ve set up a flight client in python which works for reading existing data through the dremio flight endpoint. I’m now trying enable writing from the client but have not been able to so successfully. There doesn’t seem to be much documentation on this, is it possible?
I’ve already confirmed I can insert data to an iceberg table / create an iceberg table via the dremio UI, so it’s just a question of if it’s possible from the flight client
I also have the same question.
@gakshat1107 Totally possible,
Follow instructions in below link to install the Dremio Python Arrow flight library
Example config.yaml to insert into iceberg will be something like below
% cat config.yaml
hostname: <host_name>
#port needs to be changed if not default
port: 32010
username: <username>
password: <password>
#token:
query: INSERT INTO localhdfs.data.dremio.pdfs.scratch.store_returns_ctas SELECT * FROM Samples."samples.dremio.com"."tpcds_sf1000"."store_returns"
#tls:
#disable_certificate_verification:
#path_to_certs:
#session_properties:
# - schema: Samples."samples.dremio.com"
# - <keep adding more session properties>
#engine:
Hi @balaji.ramaswamy , is it possible to write a dataframe or a list in the Dremio table through Arrow Flight?
@gakshat1107 What do you mean by list in the Dremio table?
@balaji.ramaswamy , sorry I mean array of values or a dataframe, any one of them
@gakshat1107 Yes Array(list), struct and map data types are supported
@balaji.ramaswamy, is there any reference links from where I can get the help from?
Hi @balaji.ramaswamy , I tried connecting to Dremio Flight Client but getting the following error:
ERROR:root:There was an error trying to connect to the Dremio Flight Endpoint
Traceback (most recent call last):
File “/usr/local/lib/python3.9/site-packages/dremio/flight/connection.py”, line 55, in connect
return self._connect_to_software(
File “/usr/local/lib/python3.9/site-packages/dremio/flight/connection.py”, line 104, in _connect_to_software
bearer_token = client.authenticate_basic_token(
File “pyarrow/_flight.pyx”, line 1492, in pyarrow._flight.FlightClient.authenticate_basic_token
File “pyarrow/_flight.pyx”, line 55, in pyarrow._flight.check_flight_status
pyarrow._flight.FlightInternalError: Could not finish writing before closing
below is my config file:
hostname: 10.61.196.–
port: 32010
username: admin
password: Admin@123
query: select * from NessieCatalog.Bronze.drembronzetable
disable_certificate_verification: True
below is my python file:
from dremio.arguments.parse import get_config
from dremio.flight.endpoint import DremioFlightEndpoint
if name == “main”:
Parse the config file.
args = get_config()
dremio_flight_endpoint = DremioFlightEndpoint(args)
flight_client = dremio_flight_endpoint.connect()
reader = dremio_flight_endpoint.get_reader(flight_client)
print(reader.read_pandas())
@gakshat1107 is your flight port encrypted?
Try the following simple Python code using pyarrow and report back any errors:
import pyarrow
from pyarrow import flight
import pandas
host = 'your_host'
port = '32010'
uid = 'your_uid'
pwd = 'your_pwd'
#Connect
client = flight.FlightClient('grpc+tcp://' + host + ':' + port)
#Authenticate
bearer_token = client.authenticate_basic_token(uid, pwd)
options = flight.FlightCallOptions(headers=[bearer_token])
#Query
sql= "SELECT * FROM sys.memory"
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql),options)
reader = client.do_get(info.endpoints[0].ticket, options)
#Print
df=reader.read_all()
print(df)
Hi @lenoyjacob,
Port is not encrypted.
After trying the above code I’m getting the following error:
Traceback (most recent call last):
File “/data/dremioflightsetup/dremiocommcode.py”, line 14, in
bearer_token = client.authenticate_basic_token(uid, pwd)
File “pyarrow/_flight.pyx”, line 1492, in pyarrow._flight.FlightClient.authenticate_basic_token
File “pyarrow/_flight.pyx”, line 55, in pyarrow._flight.check_flight_status
pyarrow._flight.FlightInternalError: Could not finish writing before closing
Make sure your username and password are correct. Test them with something else like the JDBC driver via dbeaver.
@lenoyjacob
I’ve already tried with Dremio JDBC and Dremio API, I’m able to connect through both of them.
I’m getting error only through Arrow Flight, I’ve followed all the steps mentioned iny Python | Dremio Documentation
1 Like
Please provide the following:
- Pyarrow version
- Dremio version
- Sanitized core-site.xml
- Sanitized dremio.conf
Hi @lenoyjacob
Dremio Version
Build
25.0.0-202404051521110861-ed9515a8
Edition
Community Edition
Pyarrow Version
Version: 17.0.0
Dremio.conf
paths: {
dist: “dremioS3:///etltest/–/”
}
services: {
coordinator.enabled: true,
coordinator.master.enabled: true,
executor.enabled: true,
flight.use_session_service: true
}
registration.publish-host: “–.–.–.–”
Core-site.xml
Hi @lenoyjacob ,
Any update on the above request please?
The code I posted above seems to work fine for me using what you’re using (Dremio CE 25.0.0 and pyarrow 17):
However, if I change the port to something else other than the flight port (say 32011) I get the error you’re getting:
So my best guess is for you see what’s up with the flight port (check the load balancer if any, if on docker/k8s - is the port exposed etc.)
Hi @lenoyjacob ,
The port is fine.
After seeing your code, I changed my hostname to localhost and it’s working now.
Also, are there any resource to take as reference to write python Array or Pandas dataframe through arrow flight?
Thank you so much for the support.
Hi @lenoyjacob ,
I was trying to write the data in the Dremio catalog table but it throwing error.
Below is the code
import pyarrow as pa
from pyarrow import flight
import pandas as pd
host = ‘localhost’
port = ‘32010’
uid = ‘–’
pwd = ‘–’
client = flight.FlightClient(‘grpc+tcp://’ + host + ‘:’ + port)
bearer_token = client.authenticate_basic_token(uid, pwd)
options = flight.FlightCallOptions(headers=[bearer_token])
descriptor = pa.flight.FlightDescriptor.for_path(“NessieCatalog.Bronze.drembronzetable”)
data = {
“department_id”: [1],
“department_name”: [“a”],
“location”: [“b”]
}
df = pd.DataFrame(data)
arrow_table = pa.Table.from_pandas(df)
writer, _ = client.do_put(descriptor, arrow_table.schema, options=options)
writer.write_table(arrow_table)
writer.close()
Below is the error
File “dremiocommcode.py”, line 49, in
writer.close()
File “pyarrow/_flight.pyx”, line 1163, in pyarrow._flight.MetadataRecordBatchWriter.close
File “pyarrow/_flight.pyx”, line 55, in pyarrow._flight.check_flight_status
pyarrow._flight.FlightInternalError: Flight returned internal error, with message: There was an error servicing your request… gRPC client debug context: UNKNOWN:Error received from peer ipv6:%5B::1%5D:32010 {grpc_message:“There was an error servicing your request.”, grpc_status:13, created_time:“2024-10-09T16:19:30.797671287+04:00”}. Client context: OK
Hi @lenoyjacob
Is there anything on the above issue?