Process data only inside Dremio, and save resources from Legacy databases


I have a dremio application running on a container that has 2 data sources configured:

  1. A Greenplum database
  2. A MySQL database

I noticed that when I query something like this:
SELECT * FROM greenplum.TABLE1
ON …

Dremio selects only data from both databases and performs the join internally.

I would like to know if there is a way to reproduce the same behavior when using only one database:
ON …

The purpose is to avoid performing JOINS and other heavy operations on this legacy mysql server.
Is there a configuration, or a way to do this? A co-worker tried to perform this using virtual datasets but without success.

Thanks in advance.


There is no additional configuration to perform joins on the same database. Here at the company, we have the same thing, and we moved some MySQL views to virtual datasets to avoid massive operations on the database.

Some problem that you might be facing is when dremio tries to parse the query.

Try adding an alias to the tables.

SELECT * FROM mysql.TABLE1 tab1
INNER JOIN mysql.TABLE2 tab2
ON tab1.key = tab2.key

Hi Vinicius,

Thanks for replying!

I actually found out a way to “trick” dremio into running the JOIN internally.
I created 2 data sources, but both of them pointing to the same database, and used both data sources on the query, one for each table.