Error accessing parquet table in MinIO

I get an error when running a query against any table pointing to Parquet files in MinIO. The error is “IllegalStateException: Invalid AWSCredentialsProvider provided:”.

I have other tables pointing to CSV files in the same bucket in MinIO and I have no issues with those. I can even preview the data in the Parquet file when creating the dataset. The error only occurs when running the SELECT. Any advice?


Even though the error seems to be related to AWS credentials, I wonder why I have no issues accessing other CSV-based datasets in the same MinIO instance and can even preview the data. This happens only with my Parquet-based datasets in MinIO.


@moy8011 Welcome to Dremio Community.

My 2 cents → How are you configuring the MinIO credentials and settings in Dremio? The reason I’m asking is that Preview and even CSV scans are sometimes done from a single node, while S3 scans are typically distributed in nature, so this may have something to do with configuration mismatches between the executors in your cluster. The same core-site.xml needs to be present on all nodes.

We have deployed Dremio with Helm in a Kubernetes cluster, so core-site.xml is replicated to all the nodes as a ConfigMap.
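For reference, the ConfigMap carrying the file looks roughly like this (a sketch with abbreviated names; the actual object generated by the Helm chart may be named differently):

apiVersion: v1
kind: ConfigMap
metadata:
  name: dremio-config   # illustrative name; the Helm chart generates its own
data:
  core-site.xml: |
    <configuration>
      <!-- identical contents mounted into every coordinator and executor pod -->
      ...
    </configuration>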

We put a trace on the MinIO bucket and noticed that when querying Parquet files the search path is incorrect; it looks like Dremio is repeating the path:

s3.HeadObject pool-0-1.xxxxx.xxxxx.xxx.xxxx.xxxx:0000/my-bucket/test/other/minio1/my-bucket/test/other/joined.parquet

When the same dataset is being created, the trace during the preview shows the path correctly:

s3.HeadObject pool-0-1.xxxxx.xxxxx.xxx.xxxx.xxxx:0000/my-bucket/test/other/joined.parquet

I tried changing the context and different variations of the SELECT statement, but no luck.
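For example, I tried variants along these lines (source name minio1 as it appears in the trace above; illustrative, not an exhaustive list):

-- fully qualified, quoting segments that contain special characters
SELECT * FROM minio1."my-bucket".test.other."joined.parquet";

-- relying on the query context being set to minio1."my-bucket".test.other
SELECT * FROM "joined.parquet";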

Any idea why it’s repeating the path?

Can we get a sense of what you are putting into core-site.xml (and why), and what all the non-default values in the S3 source config are (besides the compatibility flag)?

Looks like this:

core-site.xml

<configuration>
  <!-- If you are editing any content in this file, please remove lines with double curly braces around them -->
  <!-- S3 Configuration Section -->
  <property>
    <name>fs.dremioS3.impl</name>
    <description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
    <value>com.dremio.plugins.s3.store.S3FileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <description>AWS access key ID.</description>
    <value>XXXXXXXXXXXXXXXX</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <description>AWS secret key.</description>
    <value>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</value>
  </property>

  <property>
    <name>fs.s3a.endpoint</name>
    <value>minio.tenant.xxx.xxxxxx.local</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>dremio.s3.compat</name>
    <value>true</value>
  </property>
</configuration>

In the S3 source, I’m setting these:

fs.s3a.path.style.access
fs.s3a.endpoint
fs.s3a.endpoint.region
Also, enabling compatibility mode and choosing PARQUET as the Default CTAS Format.

I followed this documentation: https://docs.dremio.com/cloud/sonar/data-sources/amazon-s3/configuring-s3-for-minio/

What are the values (in each)?

It’s our MinIO endpoint, where we have the CSVs and the Parquet files:

fs.s3a.path.style.access true
fs.s3a.endpoint minios3.xxxxxxxxxxxxx.xxxxx.com
fs.s3a.endpoint.region minio

Are you using slashes in table names during table creation or in the query?

Nope. I even tried removing the file extension to avoid quotes, and I also tried creating the dataset on the folder rather than on the file; the behavior is the same.
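For example (again using the names from the trace; illustrative):

-- dataset renamed without the extension, so the last segment needs no quotes
SELECT * FROM minio1."my-bucket".test.other.joined;

-- dataset promoted on the folder itself
SELECT * FROM minio1."my-bucket".test.other;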

Spitballing…

  1. Looking at the traces you provided, could you go through all your configurations to find out where the term minio1 is coming from? It’s odd that this specific term gets inserted into the HeadObject call.
  2. Remove fs.s3a.endpoint.region from the source settings. It shouldn’t be needed.
  3. What happens when you explicitly set the credentials provider in the source settings, i.e. fs.s3a.aws.credentials.provider to org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider, along with using the “AWS Access Key” mode of authentication? (Sketch below.)
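If you’d rather pin the provider in core-site.xml than in the source’s connection properties, the entry would look like this (a sketch; SimpleAWSCredentialsProvider is the stock hadoop-aws provider that reads fs.s3a.access.key and fs.s3a.secret.key):

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <!-- stock hadoop-aws provider backed by fs.s3a.access.key / fs.s3a.secret.key -->
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>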

A full stack trace of the error would also help. You can get it from the logs, or from the Job Profile of the failed query → Raw Profile → Error tab.