Multi-Dremio-Instance Federation

Hey everyone, I’ve been using Dremio-OSS inside my organization for a couple of years. We use it at what you might call a “boutique-scale” where we are managing recordsets topping out at a few hundred million rows. We implement RKE2 Kubernetes on a custom deployment platform we developed in-house to run on bare metal, virtual machines, or cloud instances with the same standards & architecture.

Having this superpower, then, we’re able to spawn instances to suit any use case. Recently, requirements demanded that we isolate a portion of a client’s data on VMs in a HIPAA-compliant, isolated cloud provider environment. However, their overall data volume is such that we didn’t want to spend vast sums managing ALL their data on the cloud host when we have cheap, plentiful CPU and storage back at home that we don’t have to rent by the hour.

So, our solution is to run our K8s with Dremio on-premise for the larger non-sensitive data, and run our K8s with Dremio in the cloud for the more specific sensitive data. We can then consume both sources in the Tableau instance ALSO in the cloud, and provide the requested decision-support.

What I would LIKE to do, in addition to this, is allow the secure, cloud-based Dremio instance to consume records from the on-premise Dremio instance, giving us the ability to cache pre-aggregated result sets from the big remote data lake, and thereupon, provide a synthesis of sensitive-data results through a single ODBC connection to Tableau that doesn’t cost us big money to host.

I don’t see a way to accomplish this simply. I actually assumed there was a happy little narwhal in the external sources or data lakes connection options, and was sad to discover there’s none to be found.

I can mount the on-prem instance’s MinIO storage in the cloud instance and sort of get what I want, but that doesn’t give me access to any of the Dremio virtual table goodness. Instead, I have to materialize virtual table results there by running CTAS queries or dbt jobs. What I really want is direct ODBC or Arrow Flight access from one instance to the other, but I’m not sure how to implement that.
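For anyone following along, the materialization workaround described above can be scripted. Here is a minimal sketch that only builds the CTAS statements, one per virtual table. The space, source, and table names ("Reporting", "minio_lake", "exports", "claims_summary") are made up for illustration, and the statements would still need to be submitted to the on-prem instance through whatever client you already use.

```python
from typing import Sequence


def quote_ident(part: str) -> str:
    """Double-quote one identifier part, doubling any embedded double quotes."""
    return '"' + part.replace('"', '""') + '"'


def quote_path(parts: Sequence[str]) -> str:
    """Join path parts (space/source, folders, dataset) into a dotted, quoted path."""
    return ".".join(quote_ident(p) for p in parts)


def build_ctas(source_vds: Sequence[str], target_table: Sequence[str]) -> str:
    """Build a CTAS statement that materializes a virtual table as a new table."""
    return (
        f"CREATE TABLE {quote_path(target_table)} "
        f"AS SELECT * FROM {quote_path(source_vds)}"
    )


if __name__ == "__main__":
    # Hypothetical names: a VDS in a space, and a target path in a MinIO-backed source.
    print(build_ctas(["Reporting", "claims_summary"],
                     ["minio_lake", "exports", "claims_summary"]))
```

The target has to live in a source that allows writes, and the cloud instance then reads the resulting Parquet files from the mounted bucket. It still leaves the spaces and virtual tables themselves behind, which is the whole complaint.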

My only other idea was to look for some kind of ODBC proxy that could consume any source with a driver and represent it as something pedestrian like Postgres or MySQL. I considered trying AWS Glue, but that looks complicated and expensive.

Any other options or suggestions? Is this a Dremio Enterprise feature? Are there some tools or services I should be considering?

@AWaschick

You mentioned the following. Dremio does not store any information other than C3 files or reflection files, so which records are you referring to here?

What I would LIKE to do, in addition to this, is allow the secure, cloud-based Dremio instance to consume records from the on-premise Dremio instance

I would like to query Dremio records in the same way I can when I connect with an ODBC or Arrow Flight connection, query the API, or work in the Dremio web interface. Simply put, I would like my one Dremio to show up as a remote data source in the other Dremio.

The file access I’m talking about would only allow me to retrieve recordsets I materialized OUT of Dremio by writing new Parquet files. But I want to see the same spaces, and the virtual tables therein, that I’d see with a normal connection.

@AWaschick

Spaces and VDSs are stored inside the key-value store, and there is currently no way for them to show up on another Dremio cluster.

How interesting, considering the spaces are fully visible if I connect to the Dremio cluster with ODBC, Arrow Flight, or the API. I shall have to engineer a workaround for this suspicious omission of capability.
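One possible shape for such a workaround, sketched under assumptions rather than tested against a real cluster: a small job on the cloud side that queries the on-prem instance over Arrow Flight with pyarrow and pulls back a pre-aggregated result set. The host name, credentials, and query below are placeholders, and port 32010 is assumed to be the Arrow Flight port on the on-prem instance.

```python
def flight_uri(host: str, port: int = 32010, tls: bool = False) -> str:
    """Build the Flight endpoint URI; 32010 is assumed to be the Flight port."""
    scheme = "grpc+tls" if tls else "grpc+tcp"
    return f"{scheme}://{host}:{port}"


def fetch_table(host: str, user: str, password: str, sql: str,
                port: int = 32010, tls: bool = False):
    """Run `sql` on a remote Dremio over Arrow Flight and return a pyarrow Table."""
    # Imported lazily so flight_uri() can be used without pyarrow installed.
    import pyarrow as pa
    from pyarrow import flight

    client = flight.FlightClient(flight_uri(host, port, tls))
    # Basic auth handshake; returns a (header, value) pair carrying a bearer token.
    token = client.authenticate_basic_token(user, password)
    options = flight.FlightCallOptions(headers=[token])

    # Ask the server to plan the query, then read every endpoint it returns.
    descriptor = flight.FlightDescriptor.for_command(sql)
    info = client.get_flight_info(descriptor, options)
    tables = [client.do_get(ep.ticket, options).read_all() for ep in info.endpoints]
    return pa.concat_tables(tables)


if __name__ == "__main__":
    # Placeholder host, credentials, and query; replace with real values.
    table = fetch_table("dremio.onprem.example", "svc_user", "secret",
                        'SELECT * FROM "Reporting"."claims_summary"')
    print(table.num_rows)
```

The returned table could then be written to storage the cloud instance already reads, for example with pyarrow.parquet.write_table. That is still a copy rather than a live federated source, but the query runs against the spaces and virtual tables directly instead of against files exported ahead of time.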