‘Unable to find bucket named’ error when querying S3


When I create a physical dataset on an S3 data source (specifically, a Hive-partitioned dataset), Dremio shows me a sample of the data and successfully creates the dataset.

However, when I re-open the dataset in the SQL Editor, I encounter an ‘Unable to find bucket named xxx’ error.

I am able to create a dataset on another S3 bucket fine.

Both buckets have hyphens in the name.
The problem bucket has data files under 2 levels of folders, while the working dataset has 1 level.
The problem bucket’s data files are Hive-partitioned 3 levels deep (YYYY=year/MM=month/DD=day), while the working dataset is not partitioned.

Any clues?


Hello @BJ_Choi,

Forgive me if you already checked, but is bucket ‘xxxx’ actually present in S3?
It’s possible that Dremio has stale metadata.
On the S3 source, what are the top-level dataset discovery settings?

Hi, Ben

Yes, the bucket ‘xxxx’ is present in S3. I can query it fine via AWS Athena.

My ‘Edit Source : Metadata’ screen looks identical to yours.

Via ‘Settings’ on the dataset, Dremio is able to show data in the partitioned file (see below). I get the error when I try to query it from the SQL Editor.

Even though the query fails, there may still be a job associated with it; can you supply the profile if one is present?

I’d like to see the complete error stack.

Thanks, Ben. How do I grab the profile?

If the error can’t be found there, it may be in the logs, in particular in server.log.

The downloaded profile is attached.

In the server’s queries.json log I see this entry:

{"queryId":"237ea235-d876-2350-f452-26731c2e3c00","schema":"[s3, ttam-datalake-dev-d]","queryText":"SELECT * FROM mailgun","start":1551982025451,"finish":1551982025487,"outcome":"FAILED","username":"dremio_admin"}
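For anyone hitting the same thing: queries.json is newline-delimited JSON, so failed queries can be pulled out with a few lines of Python. This is a minimal sketch; the field names match the entry above, but the log path varies by install, so adjust for your deployment:

```python
import json

def failed_queries(lines):
    """Yield queries.json entries whose outcome is FAILED."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("outcome") == "FAILED":
            yield entry

# Demo on the entry quoted above; in practice you would iterate
# over the open log file, e.g. open(".../log/queries.json").
sample = (
    '{"queryId":"237ea235-d876-2350-f452-26731c2e3c00",'
    '"schema":"[s3, ttam-datalake-dev-d]",'
    '"queryText":"SELECT * FROM mailgun",'
    '"start":1551982025451,"finish":1551982025487,'
    '"outcome":"FAILED","username":"dremio_admin"}'
)
for entry in failed_queries([sample]):
    print(entry["queryId"], entry["queryText"])
```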

d2093d0e-627a-47d9-85a1-04acbf6daea5.zip (18.0 KB)

It looks like the error reports a single file within one of the partitions:


The file could be absent, or there might be a permissions issue.

First let’s try running a refresh on that source in Dremio:
ALTER PDS s3."ttam-datalake-dev-d".mailgun REFRESH METADATA

Hi, Ben

I was able to execute the refresh command successfully, but the error didn’t go away.

So, having only 1 file per partition is a problem?



I don’t know whether you got your issue squared away, but I had a similar issue and this is what worked for me: remove the S3 source (the connection, not the PDS) and re-add it. I know this can be a pain if you already have a lot of PDSes defined on that connection, but that did it for me. The fact that this is what cleared it up makes me suspect the metadata is being cached on the executor while the refresh only runs on the master. Anyway, hope this helps.
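If you have many PDSes and want to script the remove/re-add rather than click through the UI, Dremio exposes a v3 catalog REST API that can delete and re-create sources. The sketch below only builds the requests; the endpoint paths, the `entityType`/`config` payload shape, and the credential field names are assumptions from my reading of the API docs, so verify them against your Dremio version before using:

```python
import json

def delete_source_request(source_id, tag):
    """Build the DELETE request for an existing source.

    Assumption: the v3 catalog endpoint takes the entity id in the
    path and a ?tag=... query parameter to guard against concurrent
    modification.
    """
    return {
        "method": "DELETE",
        "path": f"/api/v3/catalog/{source_id}?tag={tag}",
    }

def create_source_request(name, access_key, access_secret):
    """Build the POST body that re-creates an S3 source.

    Assumption: S3 source config uses accessKey/accessSecret fields;
    keep the secret out of version control.
    """
    return {
        "method": "POST",
        "path": "/api/v3/catalog",
        "body": {
            "entityType": "source",
            "type": "S3",
            "name": name,
            "config": {"accessKey": access_key, "accessSecret": access_secret},
        },
    }

# Hypothetical values for illustration only.
req = create_source_request("s3", "AKIA-EXAMPLE", "<secret>")
print(json.dumps(req["body"], indent=2))
```

Once the source is re-added, the PDSes have to be promoted again, which is exactly the pain mentioned above; weigh that against how often the stale-metadata error recurs.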