Error querying parquet files from AWS Glue/S3

Hi,

I’m running into the error “java.io.IOException: Not a file…” when trying to query an AWS Glue table (Parquet in S3). The same Glue table works fine in Athena and in Presto/Trino.

It looks like Dremio is getting hung up on the fact that the Parquet files are nested in date folders (i.e., s3://my-bucket/master-db/expansion_serps/parq/2021/09/30/14/files.parq).
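
To make the layout concrete, here is a minimal sketch of the difference between a listing that only looks directly under the table location and one that recurses into the date folders. It uses a local temp directory as a stand-in for the S3 prefix; the helper names (list_direct_files, list_files_recursive, build_demo_layout) are made up for illustration, and this is not meant to show how Dremio lists files internally.

```python
import os
import tempfile


def list_direct_files(root):
    """Non-recursive: return only regular files sitting directly under root."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isfile(os.path.join(root, name))
    )


def list_files_recursive(root):
    """Recursive: return every file under root, as '/'-joined relative paths."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def build_demo_layout(root):
    """Create root/2021/09/30/14/files.parq (an empty placeholder file)."""
    leaf = os.path.join(root, "2021", "09", "30", "14")
    os.makedirs(leaf)
    with open(os.path.join(leaf, "files.parq"), "wb") as fh:
        fh.write(b"")


with tempfile.TemporaryDirectory() as table_root:
    build_demo_layout(table_root)
    direct = list_direct_files(table_root)
    nested = list_files_recursive(table_root)
    print("direct:", direct)     # direct: []
    print("recursive:", nested)  # recursive: ['2021/09/30/14/files.parq']
```

A reader that stops at the top level only sees the "2021" entry, which is a folder rather than a file. That matches the path named in the error message.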

Is it possible to configure Dremio to recurse through the “folders” in the s3 bucket?

python example.py -user **************** -pass "********************************" -query "SELECT * FROM glue.dev.serps limit 10"
[INFO] Authentication was successful
[INFO] Query:  SELECT * FROM glue.dev.serps limit 10
[INFO] GetSchema was successful
[INFO] Schema:  <pyarrow._flight.SchemaResult object at 0x106896070>
[INFO] GetFlightInfo was successful
[INFO] Ticket:  <Ticket b'\n-SELECT * FROM glue.dev.serps limit 10\x12G\nE\n-SELECT * FROM glue.dev.serps limit 10\x10\x03\x1a\x12\tE\xb4\xa6\xb2U\xe8\xa9\x1e\x11\x00\xff\x9ah\x84\x956\x84'>
[ERROR] Exception: ArrowInvalid('gRPC returned invalid argument error, with message: java.io.IOException: Not a file: s3://my-bucket/master-db/expansion_serps/parq/2021. Client context: IOError: Server never sent a data message. Detail: Internal. gRPC client debug context: {"created":"@1633032107.899306000","description":"Error received from peer ipv6:[::1]:32010","file":"../src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"java.io.IOException: Not a file: s3://my-bucket/master-db/expansion_serps/parq/2021","grpc_status":3}')
Traceback (most recent call last):
  File "/Users/erik/Dropbox/home/git/arrow-flight-client-examples/python/example.py", line 171, in <module>
    connect_to_dremio_flight_server_endpoint(args.hostname, args.flightport, args.username,
  File "/Users/erik/Dropbox/home/git/arrow-flight-client-examples/python/example.py", line 158, in connect_to_dremio_flight_server_endpoint
    reader = client.do_get(flight_info.endpoints[0].ticket, options)
  File "pyarrow/_flight.pyx", line 1319, in pyarrow._flight.FlightClient.do_get
  File "pyarrow/_flight.pyx", line 80, in pyarrow._flight.check_flight_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: gRPC returned invalid argument error, with message: java.io.IOException: Not a file: s3://my-bucket/master-db/expansion_serps/parq/2021. Client context: IOError: Server never sent a data message. Detail: Internal. gRPC client debug context: {"created":"@1633032107.899306000","description":"Error received from peer ipv6:[::1]:32010","file":"../src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"java.io.IOException: Not a file: s3://my-bucket/master-db/expansion_serps/parq/2021","grpc_status":3}

Thanks!
Erik

@erikcw I know of two parameters that can be added for Hive; for Glue, this is something we need to check. Is there a parameter on Glue that enables recursion? If so, we can add the same one on the Glue source, under Advanced Options, using the add property option.

In Trino/Presto, you add the option hive.recursive-directories = true to the catalog config file. Glue is effectively a managed Hive catalog, so that works well there.
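
For reference, a rough sketch of what such a Trino catalog file can look like. The file name (etc/catalog/glue.properties) is just an example, and the connector name depends on the Trino/Presto version, so treat everything except the last line as an assumption about the setup:

```
# etc/catalog/glue.properties (example file name)
# Connector name varies by version (e.g. hive or hive-hadoop2)
connector.name=hive
# Use AWS Glue as the metastore
hive.metastore=glue
# Read data files from nested subdirectories of the table location
hive.recursive-directories=true
```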

I went through the Dremio docs and the Dremio Helm chart trying to find the appropriate place to configure this, but so far haven’t found anything.

@erikcw Try adding it to the Glue source: under Advanced Options, add it as a property.

@erikcw If that does not work, try the two properties below:

mapred.input.dir.recursive=true
hive.mapred.supports.subdirectories=true

@balaji.ramaswamy That did the trick! Thank you!