Dremio w/ MinIO distributed store failures

Using Dremio 11.0 (from Dremio’s download site):
When I start up Dremio, I receive errors like the following on the executor nodes:

2020-12-02 21:53:56,469 [start-__jobResultsStore] WARN  c.d.e.catalog.ManagedStoragePlugin - Error starting new source: __jobResultsStore
com.google.common.util.concurrent.UncheckedExecutionException: com.amazonaws.SdkClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListAllMyBucketsHandler
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3976)
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4960)
        at com.dremio.exec.store.dfs.FileSystemPlugin.newFileSystem(FileSystemPlugin.java:318)
        at com.dremio.exec.store.dfs.FileSystemPlugin.createFS(FileSystemPlugin.java:306)
        at com.dremio.exec.store.dfs.FileSystemPlugin.createFS(FileSystemPlugin.java:302)
        at com.dremio.exec.store.dfs.FileSystemPlugin.createFS(FileSystemPlugin.java:298)
        at com.dremio.exec.store.dfs.FileSystemPlugin.start(FileSystemPlugin.java:615)
        at com.dremio.exec.catalog.ManagedStoragePlugin.lambda$newStartSupplier$1(ManagedStoragePlugin.java:523)
        at com.dremio.exec.catalog.ManagedStoragePlugin.lambda$nameSupplier$3(ManagedStoragePlugin.java:591)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.SdkClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListAllMyBucketsHandler
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:166)
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListMyBucketsResponse(XmlResponsesSaxParser.java:372)
        at com.amazonaws.services.s3.model.transform.Unmarshallers$ListBucketsUnmarshaller.unmarshall(Unmarshallers.java:80)
        at com.amazonaws.services.s3.model.transform.Unmarshallers$ListBucketsUnmarshaller.unmarshall(Unmarshallers.java:76)
        at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
        at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
        at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1714)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1434)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1356)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5054)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5000)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4994)
        at com.amazonaws.services.s3.AmazonS3Client.listBuckets(AmazonS3Client.java:993)
        at com.amazonaws.services.s3.AmazonS3Client.listBuckets(AmazonS3Client.java:999)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.dremio.plugins.s3.store.S3FileSystem.lambda$registerReference$0(S3FileSystem.java:199)
        at com.sun.proxy.$Proxy56.listBuckets(Unknown Source)
        at com.dremio.plugins.s3.store.S3FileSystem.getContainerCreators(S3FileSystem.java:293)
        at com.dremio.plugins.util.ContainerFileSystem.refreshFileSystems(ContainerFileSystem.java:91)
        at com.dremio.plugins.util.ContainerFileSystem.initialize(ContainerFileSystem.java:159)
        at com.dremio.exec.store.dfs.FileSystemPlugin$1.lambda$load$0(FileSystemPlugin.java:204)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at com.dremio.exec.store.dfs.FileSystemPlugin$1.load(FileSystemPlugin.java:209)
        at com.dremio.exec.store.dfs.FileSystemPlugin$1.load(FileSystemPlugin.java:186)
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
        ... 14 common frames omitted
Caused by: org.xml.sax.SAXParseException: The element type "link" must be terminated by the matching end-tag "</link>".
        at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:152)
        ... 55 common frames omitted

Several ManagedStoragePlugin sources fail this way: __accelerator, __home, $scratch.
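
The innermost SAXParseException is telling: an unterminated <link> element is typical of an HTML page, not of the ListAllMyBuckets XML the AWS SDK expects, so whatever answered that request wasn't speaking the S3 API. As a rough sketch (not Dremio's actual startup code), this reproduces the same listBuckets call with the AWS SDK v1 that appears in the trace; the endpoint and credentials are placeholders standing in for the values from my core-site.xml below:

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.Bucket;

public class ListBucketsCheck {
    public static void main(String[] args) {
        // Placeholder endpoint and credentials -- substitute the values from core-site.xml.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new AwsClientBuilder.EndpointConfiguration("http://localhost:8080", "us-east-1"))
                .withPathStyleAccessEnabled(true) // MinIO needs path-style access
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("xxx", "xxx")))
                .build();
        // If this throws the same "Failed to parse XML document" SdkClientException,
        // the endpoint is not returning S3 XML and the problem is outside Dremio.
        for (Bucket b : s3.listBuckets()) {
            System.out.println(b.getName());
        }
    }
}

If this small program fails identically, the problem sits between the SDK and the endpoint rather than in Dremio's plugin layer.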

dremio.conf:

paths: {

  # the local path for Dremio to store data.
  local: "/var/lib/dremio",
  # the distributed path for Dremio data, including job results, downloads, uploads, etc.
  dist: "dremioS3:///dremio/",
  # location for catalog database (if master node)
  db: "/var/lib/dremio/db",
  spilling: ["/var/lib/dremio/spill"],
  # storage area for the accelerator cache.
  accelerator: "dremioS3:///dremio/accelerator",
  # staging area for json and csv ui downloads
  downloads: "dremioS3:///dremio/downloads",
  # stores uploaded data associated with user home directories
  uploads: "dremioS3:///dremio/uploads",
  # stores data associated with the job results cache.
  results: "dremioS3:///dremio/results",
  # shared scratch space for creation of tables.
  scratch: "dremioS3:///dremio/scratch"
}

services: {
  coordinator: {
    enabled: false
  }

  executor: {
    enabled: true
  },

  fabric: {
    port: 45678,

    memory: {
      reservation: 100M
    }
  },

  conduit: {
    # If set to 0, a port is automatically allocated (typically in ephemeral range). Otherwise, the configured value
    # is used.
    port: 8889,
    ssl: {
      # If SSL for communication path between Dremio instances should be enabled.
      enabled: ${services.fabric.ssl.enabled},
      # Allow for auto-generated certificates if keyStore option is not set
      # Auto-generated self-signed certificates are considered insecure, and this
      # option should be set to false in production environment
      auto-certificate.enabled: false,
      # KeyStore and TrustStore settings default to Java keystore and truststore JVM arguments.
      # If needed to be overridden, then change the below properties
      # KeyStore type
      keyStoreType: ${services.fabric.ssl.keyStoreType},
      # Path to KeyStore file
      keyStore: ${services.fabric.ssl.keyStore},
      # Password to access the keystore file
      keyStorePassword: ${services.fabric.ssl.keyStorePassword},
      # Password to access the key
      keyPassword: ${services.fabric.ssl.keyPassword},
      # TrustStore type
      trustStoreType: ${services.fabric.ssl.trustStoreType},
      # Path to TrustStore file
      trustStore: ${services.fabric.ssl.trustStore},
      # Password to access the truststore file
      trustStorePassword: ${services.fabric.ssl.trustStorePassword}
    }
  }

  # Set up kerberos credentials in server (applicable for both coordinator and executor)
  kerberos: {
    principal: "",
    keytab.file.path: ""
  }

  web-admin: {
    enabled: true,
    # Port, bound to loopback interface, on which the daemon responds to liveness HTTP requests (0 == auto-allocated)
    port: 8888
  }
}

# the zookeeper quorum for the cluster
zookeeper: "<>"

zk.client.session.timeout: 90000

debug: {
  enabled: false,
  autoPort: false,
  prepopulate: false,
  singleNode: false,
  verboseAccessLog: false,
  allowTestApis: false,
  forceRemote: false,
  useMemoryStorage: false,
  addDefaultUser: false,
  allowNewerKVStore: true,
  # to enable remote debugging of the DremioDaemon running in YARN container
  yarnremote.enabled: false
  # UI Red Screen Of Death
  rsod.enabled: false
  # UI File A Bug option
  bug.filing.enabled: false
  # enable on-idle load shedding
  task.on_idle_load_shed: true
  # enable rescheduling task on unblock
  task.reschedule_on_unblock: true
  # Use election service to elect between multiple master candidates
  # has to be set to false if multiple master candidates
  master.election.disabled: false,
}

# These system properties are listed here to allow substitution of system property values for DAC Web SSL properties
# listed in services.web.ssl section. Currently we consider only the system properties listed in this file for
# substitution.
javax.net.ssl {
  keyStoreType: "",
  keyStore: "",
  keyStorePassword: "",
  keyPassword: "",
  trustStoreType: "",
  trustStore:"",
  trustStorePassword: ""
}

registration.publish-host: ""

core-site.xml:

<?xml version="1.0"?>
<configuration>
   <property>
       <name>fs.dremioS3.impl</name>
       <description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
       <value>com.dremio.plugins.s3.store.S3FileSystem</value>
   </property>
   <property>
       <name>fs.s3a.access.key</name>
       <description>AWS access key ID.</description>
       <value>xxx</value>
   </property>
   <property>
       <name>fs.s3a.secret.key</name>
       <description>AWS secret key.</description>
       <value>xxx</value>
   </property>
   <property>
       <name>fs.s3a.aws.credentials.provider</name>
       <description>The credential provider type.</description>
       <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <description>Endpoint can either be an IP or a hostname where the MinIO server is running. However, the endpoint value cannot contain the http(s) prefix; e.g. 175.1.2.3:9000 is a valid endpoint.</description>
       <value>localhost:8080</value>
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <description>Value has to be set to true.</description>
        <value>true</value>
    </property>
    <property>
        <name>dremio.s3.compat</name>
        <description>Value has to be set to true.</description>
        <value>true</value>
    </property>

    <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <description>Value can either be true or false, set to true to use SSL with a secure Minio server.</description>
        <value>false</value>
    </property>
</configuration>
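
One way to see what that fs.s3a.endpoint is actually serving is to fetch the service root by hand. A minimal sketch, assuming the localhost:8080 endpoint above: an unsigned request will be rejected, but a healthy S3/MinIO API still rejects it with an XML <Error> document, whereas an HTML response with <link> tags would line up with the parse failure in the trace.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class EndpointCheck {
    public static void main(String[] args) throws Exception {
        // fs.s3a.endpoint from core-site.xml above (assumed value)
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:8080/").openConnection();
        System.out.println("HTTP " + conn.getResponseCode()
                + " Content-Type: " + conn.getContentType());
        InputStream body = conn.getResponseCode() < 400
                ? conn.getInputStream() : conn.getErrorStream();
        // Print the first few lines: XML (e.g. an <Error> document) means an S3 API
        // answered; HTML means this port is serving something else entirely.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(body))) {
            String line;
            for (int i = 0; i < 5 && (line = r.readLine()) != null; i++) {
                System.out.println(line);
            }
        }
    }
}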

Has anyone seen something like this? It is…basically impossible to use the system with these failures.

@johnseekins, that is a warning rather than an error. Is there an error associated with some kind of failure?

@johnseekins

I see you only have a root bucket. Can you add a folder and create it as well? For example:

dist: "dremioS3:///<bucket_name>/<folder1>/<folder2>"

When I move forward with this warning/error (although it’s always weird to me when a stack trace is associated with a warning), I can’t see the results of creating datasets. I just get com.dremio.common.exceptions.UserException: The source ["__jobResultsStore"] is currently unavailable. Info: [] in the query UI.

I’ll give that a shot, too, although further on in that documentation, their example doesn’t have a folder.

Two more notes:

  1. Adding a path under the root bucket doesn’t help.
  2. When I remove debug logging, the WARN becomes an ERROR:
2020-12-03 15:02:22,301 [main] ERROR c.dremio.exec.catalog.PluginsManager - Failure while starting plugin "__datasetDownload" after 1500ms.
java.util.concurrent.ExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: com.amazonaws.SdkClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListAllMyBucketsHandler

@johnseekins

Can you check if you have “ListBucket” and “ListAllMyBuckets” permissions on the bucket?
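
A rough way to exercise both permissions with the same credentials, as a sketch (same placeholder endpoint and credentials as the earlier snippet; "dremio" is the bucket from the dist path):

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class BucketPermissionCheck {
    public static void main(String[] args) {
        // Placeholder endpoint/credentials -- substitute the real core-site.xml values.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new AwsClientBuilder.EndpointConfiguration("http://localhost:8080", "us-east-1"))
                .withPathStyleAccessEnabled(true)
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("xxx", "xxx")))
                .build();
        // s3:ListAllMyBuckets -- the exact call Dremio's stack trace fails on.
        s3.listBuckets().forEach(b -> System.out.println("bucket: " + b.getName()));
        // s3:ListBucket on the bucket itself ("dremio" is the bucket in the dist path).
        s3.listObjectsV2("dremio").getObjectSummaries()
                .forEach(o -> System.out.println("key: " + o.getKey()));
    }
}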

For testing, I’m currently using the admin user defined in MinIO for all access (to avoid permissions problems).

@johnseekins

Are you able to access the buckets from the AWS S3 command line?

Thanks
Bali

Yes:

# mc ls local
[2020-12-02 16:05:51 UTC]     0B dremio/
[2020-12-02 21:13:43 UTC]     0B hivemetastore/
[2020-11-24 16:33:18 UTC]     0B nifi/
[2020-11-24 21:03:38 UTC]     0B temp/

(mc is using the same credentials dremio is using)

This patch:


once conflicts are resolved, absolutely fixes my problem.

I hope this patch can be added soon. This seems like a bit of a regression.