Connect to multiple HDFS clusters with different Kerberos setups

Hi Team,

I need to connect Dremio to 2 HDFS clusters with different Kerberos setups. Can you please confirm whether it is possible to do so?

To do this, I passed 2 services.kerberos configurations in dremio.conf:

services.kerberos: {
  principal: "principal1@REALM1.COM",
  keytab.file.path: "/etc/dremio/keytab/cluster1.keytab"
}
services.kerberos: {
  principal: "principal2@REALM2.COM",
  keytab.file.path: "/etc/dremio/keytab/cluster2.keytab"
}

I was able to start Dremio, but while creating a source in Dremio only the 2nd principal is visible, i.e. I am unable to create a source for cluster 1.
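If dremio.conf is parsed as HOCON (which I believe Dremio's config format is), duplicate keys are merged with the last occurrence winning, so the file above would effectively reduce to:

```hocon
services.kerberos: {
  principal: "principal2@REALM2.COM",
  keytab.file.path: "/etc/dremio/keytab/cluster2.keytab"
}
```

That would explain why only the second principal shows up.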

Please suggest how to meet this requirement.

Thanks in advance.

Hi @Monika_Goel

To be able to connect to 2 secured HDFS clusters, we have to pass the cluster-specific parameters individually through the HDFS source's Add Property button.

Do you have a local copy of one cluster's hdfs-site.xml and core-site.xml in your Dremio conf folder (or symlinked)? That method only works for connecting to a single Kerberized HDFS cluster.

This is what you have to do:

rename or unlink core-site.xml and yarn-site.xml, then create a core-site.xml with just one parameter, “hadoop.security.authentication”, with the value “kerberos”. The rest of the parameters need to be passed through the source form.
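As a sketch, the stripped-down core-site.xml described above would look something like this (standard Hadoop configuration format):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- The only property kept locally; everything else goes in the source form -->
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>
```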

I will try and get back to you with the exact steps

Thanks
@balaji.ramaswamy

Hi @balaji.ramaswamy

Thanks for your response.

For now I have a local copy of hdfs-site.xml and core-site.xml in the Dremio conf folder.
Do you mean to pass all of the hdfs-site.xml and core-site.xml parameters through the HDFS source's Add Property button?

If yes, then it’s going to be a very tedious task, as each XML file has around 50-60 parameters. Can you please tell me which parameters Dremio refers to, or do we need to pass all of them?

Hi Monika,

You need to pass only the security-related and HA-related parameters. Would you mind sharing your hdfs-site.xml (maybe mask the values that are sensitive)? I can then get back to you with a crisp set of parameters you need to pass through the HDFS source additional properties. If you are unable to share it, let me try and send something back.
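In the meantime, for a typical HA, Kerberized HDFS cluster the security and HA parameters usually boil down to something like the following (the nameservice name `ns1` and the hostnames are placeholders; substitute your cluster's values):

```
fs.defaultFS                           = hdfs://ns1
dfs.nameservices                       = ns1
dfs.ha.namenodes.ns1                   = nn1,nn2
dfs.namenode.rpc-address.ns1.nn1       = namenode1.example.com:8020
dfs.namenode.rpc-address.ns1.nn2       = namenode2.example.com:8020
dfs.client.failover.proxy.provider.ns1 = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.namenode.kerberos.principal        = hdfs/_HOST@REALM1.COM
hadoop.security.authentication         = kerberos
```

The exact set depends on the cluster, so verify each value against your own hdfs-site.xml and core-site.xml.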

Thanks
@balaji.ramaswamy

Hi @balaji.ramaswamy ,

Attached are the parameter sets from hdfs-site.xml & core-site.xml. Please let me know which parameters I need to set in the HDFS source additional properties.

Cluster Config Param.zip (2.1 KB)

Thanks.

Hi @balaji.ramaswamy

I have tried the above approach but it doesn’t work. I am still unable to connect to both Kerberized clusters (with different REALMs) at the same time.

I found the code below in DACDaemon.java, and it seems we can only connect to one Kerberized cluster at a time by specifying a principal & keytab in dremio.conf.

 /**
   * Set up the current user in {@link UserGroupInformation} using the kerberos principal and keytab file path if
   * present in config. If not present, this method call is a no-op. When communicating with the kerberos enabled
   * Hadoop based filesystem credentials in {@link UserGroupInformation} will be used.
   * @param config
   * @throws IOException
   */
  private void setupHadoopUserUsingKerberosKeytab(final DremioConfig config) throws IOException {
    final String kerberosPrincipal = config.getString(DremioConfig.KERBEROS_PRINCIPAL);
    final String kerberosKeytab = config.getString(DremioConfig.KERBEROS_KEYTAB_PATH);

    if (Strings.isNullOrEmpty(kerberosPrincipal) || Strings.isNullOrEmpty(kerberosKeytab)) {
      return;
    }

    UserGroupInformation.loginUserFromKeytab(kerberosPrincipal, kerberosKeytab);

    logger.info("Setup Hadoop user info using kerberos principal {} and keytab file {} successful.",
        kerberosPrincipal, kerberosKeytab);
  }

Can you please take a look at this on an urgent basis?

Thanks for your support. Appreciate it.

@Monika_Goel - just to let you know, your findings in the code are a bit irrelevant, as Dremio needs HDFS for its own purposes, as distributed storage and/or a YARN deployment.
In the case of connecting to different HDFS clusters as sources, all the information, including Kerberos principals and keytabs, has to be specified per source.
Did you try to access the two different HDFS clusters with this kind of configuration from the command line, for example by at least switching config files/directories?
I think that would be a good place to start.

@yufeldman

Thanks for your reply. Can you please elaborate on “access two different hdfs clusters with this kind of configuration from command line for example by at least switching config files/directories”?

I am able to connect to both clusters individually, but my requirement is to analyze data from both clusters in Dremio at the same time.

I understand you want to access both clusters at the same time from Dremio. I am asking if you can access both of those clusters from Dremio coordinator node command line.
Like:
hadoop fs -ls /cluster1/...
and
hadoop fs -ls /cluster2/...
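One way to sketch that check, assuming the client configs for each cluster are staged in separate directories and the keytabs are available locally (all paths here are hypothetical):

```shell
# Obtain a ticket as cluster 1's principal, then list using cluster 1's client configs
kinit -kt /etc/dremio/keytab/cluster1.keytab principal1@REALM1.COM
HADOOP_CONF_DIR=/etc/hadoop/conf.cluster1 hadoop fs -ls /

# Repeat for cluster 2 with its own principal and config directory
kinit -kt /etc/dremio/keytab/cluster2.keytab principal2@REALM2.COM
HADOOP_CONF_DIR=/etc/hadoop/conf.cluster2 hadoop fs -ls /
```

If either step fails, the problem is in the Kerberos/HDFS client setup itself, independent of Dremio.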

When I try to connect to HDFS by specifying the principal & keytab in the HDFS source, it fails with: "Failure while configuring source [Dev]. Info: Unavailable: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details"

I did a single-node installation of Dremio in a Unix environment, and the node running Dremio is not part of either HDFS cluster, so I can’t access the clusters from the Dremio node. But I can access the clusters with the specified principals from the cluster edge nodes.

@yufeldman
I am using the Dremio Community edition. Can you please confirm whether this edition supports specifying Kerberos details in the source properties?

This Dremio property is not useful when trying to connect to an HDFS cluster.
Essentially, in order to connect to multiple HDFS clusters you need to make sure that the principal you start Dremio with is accepted by both clusters. You can’t use one principal to start Dremio while trying to use a different one (or ones) to connect to your clusters.

Did you use the same principal to access both?

No, the principals are different for the two clusters, as the REALM is different for each. The node running Dremio is not Kerberized yet, but the clusters are, and they have different principals.

OK, so I would imagine that you need to set up your principal to be accessible from both clusters.
You may want to take a look here: https://community.hortonworks.com/articles/18686/kerberos-cross-realm-trust-for-distcp.html
to see what you need for cross-realm trust between the clusters. It may not be as complex in your case, but you definitely need a principal that is trusted on both clusters.
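For reference, cross-realm trust is typically set up by creating matching krbtgt principals (krbtgt/REALM2.COM@REALM1.COM and krbtgt/REALM1.COM@REALM2.COM, with identical keys) in both KDCs, and mapping the realms in krb5.conf. A minimal sketch, with placeholder KDC hostnames:

```ini
# /etc/krb5.conf on the client and on both clusters
[realms]
  REALM1.COM = { kdc = kdc1.realm1.com }
  REALM2.COM = { kdc = kdc2.realm2.com }

[domain_realm]
  .realm1.com = REALM1.COM
  .realm2.com = REALM2.COM

[capaths]
  REALM1.COM = { REALM2.COM = . }
  REALM2.COM = { REALM1.COM = . }
```

The linked article covers the KDC-side steps in detail; treat the above as an outline rather than a drop-in config.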

Thank you @yufeldman & @balaji.ramaswamy for looking into it.

Hi,

@Monika_Goel were you able to access multiple HDFS sources with different Kerberos setups?

If yes, could you please tell me how you achieved that? I want to add multiple HDFS sources in Dremio.