Hi,
I’m hitting a “java.io.IOException: Not a file…” error when querying an AWS Glue table (parquet in S3) from Dremio. The same Glue table works fine in Athena and Presto/Trino.
It looks like Dremio is getting hung up on the fact that the parquet files are nested in date folders (i.e., s3://my-bucket/master-db/expansion_serps/parq/2021/09/30/14/files.parq).
Is it possible to configure Dremio to recurse through the “folders” in the S3 bucket?
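To make the layout concrete, here is a minimal Python sketch of how the keys are organized. It assumes the four folder levels are year/month/day/hour, as the example path suggests. The helper names are made up for illustration and are not part of Dremio, Glue, or pyarrow.

```python
from datetime import datetime

# Table root taken from the example path above.
BASE = "s3://my-bucket/master-db/expansion_serps/parq"


def hour_prefix(ts: datetime, base: str = BASE) -> str:
    """Hypothetical helper: folder assumed to hold the parquet files for ts's hour."""
    return f"{base}/{ts:%Y/%m/%d/%H}/"


def depth_below_table(path: str, base: str = BASE) -> int:
    """Hypothetical helper: number of folder levels between the table root and a file."""
    relative = path[len(base):].strip("/")
    return len(relative.split("/")) - 1


if __name__ == "__main__":
    print(hour_prefix(datetime(2021, 9, 30, 14)))
    print(depth_below_table(f"{BASE}/2021/09/30/14/files.parq"))
```

So the files sit four levels below the table location registered in Glue, and the error is raised at the first of those levels (.../parq/2021).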
Here is the output from the Arrow Flight Python example client (example.py):
python example.py -user **************** -pass "********************************" -query "SELECT * FROM glue.dev.serps limit 10"
[INFO] Authentication was successful
[INFO] Query: SELECT * FROM glue.dev.serps limit 10
[INFO] GetSchema was successful
[INFO] Schema: <pyarrow._flight.SchemaResult object at 0x106896070>
[INFO] GetFlightInfo was successful
[INFO] Ticket: <Ticket b'\n-SELECT * FROM glue.dev.serps limit 10\x12G\nE\n-SELECT * FROM glue.dev.serps limit 10\x10\x03\x1a\x12\tE\xb4\xa6\xb2U\xe8\xa9\x1e\x11\x00\xff\x9ah\x84\x956\x84'>
[ERROR] Exception: ArrowInvalid('gRPC returned invalid argument error, with message: java.io.IOException: Not a file: s3://my-bucket/master-db/expansion_serps/parq/2021. Client context: IOError: Server never sent a data message. Detail: Internal. gRPC client debug context: {"created":"@1633032107.899306000","description":"Error received from peer ipv6:[::1]:32010","file":"../src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"java.io.IOException: Not a file: s3://my-bucket/master-db/expansion_serps/parq/2021","grpc_status":3}')
Traceback (most recent call last):
  File "/Users/erik/Dropbox/home/git/arrow-flight-client-examples/python/example.py", line 171, in <module>
    connect_to_dremio_flight_server_endpoint(args.hostname, args.flightport, args.username,
  File "/Users/erik/Dropbox/home/git/arrow-flight-client-examples/python/example.py", line 158, in connect_to_dremio_flight_server_endpoint
    reader = client.do_get(flight_info.endpoints[0].ticket, options)
  File "pyarrow/_flight.pyx", line 1319, in pyarrow._flight.FlightClient.do_get
  File "pyarrow/_flight.pyx", line 80, in pyarrow._flight.check_flight_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: gRPC returned invalid argument error, with message: java.io.IOException: Not a file: s3://my-bucket/master-db/expansion_serps/parq/2021. Client context: IOError: Server never sent a data message. Detail: Internal. gRPC client debug context: {"created":"@1633032107.899306000","description":"Error received from peer ipv6:[::1]:32010","file":"../src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"java.io.IOException: Not a file: s3://my-bucket/master-db/expansion_serps/parq/2021","grpc_status":3}
Thanks!
Erik