Process data only inside Dremio, and save resources from Legacy databases

usergim · November 28, 2019, 5:13pm

Hi,

I have a Dremio instance running in a container with two data sources configured:

A Greenplum database
A MySQL database

I noticed that when I query something like this:
SELECT * FROM greenplum.TABLE1
INNER JOIN mysql.TABLE2
ON …

Dremio only fetches the data from each database and performs the join internally.

I would like to know if there is a way to get the same behavior when both tables live in the same database:
SELECT * FROM mysql.TABLE1
INNER JOIN mysql.TABLE2
ON …

The goal is to avoid running JOINs and other heavy operations on this legacy MySQL server.
Is there a configuration option, or some other way, to do this? A co-worker tried to achieve this using virtual datasets, but without success.

Thanks in advance.

vinicius.mello · November 28, 2019, 7:34pm

Hi,

There is no additional configuration for performing joins on the same database. We have the same situation at my company, and we moved some MySQL views to virtual datasets to avoid massive operations on the database.

One problem you might be running into is how Dremio parses the query.

Try adding aliases to the tables:

SELECT * FROM mysql.TABLE1 tab1
INNER JOIN mysql.TABLE2 tab2
ON tab1.key = tab2.key

usergim · January 8, 2020, 5:46pm

Hi Vinicius,

Thanks for replying!

I actually found a way to “trick” Dremio into running the JOIN internally.
I created two data sources that both point to the same database, and used both in the query, one for each table.
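
For anyone trying the same thing, here is a rough sketch of what such a query could look like. The source names mysql_a and mysql_b and the join column id are hypothetical placeholders, not the names from my setup; both sources are assumed to be configured against the same MySQL database.

```sql
-- mysql_a and mysql_b are hypothetical Dremio source names that both
-- point to the same MySQL database; id is a placeholder join column.
SELECT *
FROM mysql_a.TABLE1 tab1
INNER JOIN mysql_b.TABLE2 tab2
  ON tab1.id = tab2.id
```

Because the two tables are referenced through different sources, Dremio should treat this like the cross-source case and run the join itself. The query profile is the place to confirm that the JOIN was not pushed down to MySQL.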
