Join between MySQL and ElasticSearch taking too much time

I’m trying to JOIN one tiny index on ES with one not so small table on MySQL with this query:

SELECT mysql_table.idsearches, mysql_table.idtransactions 
FROM "Live MM".mm.transactions as mysql_table
INNER JOIN "elastic-prd"."api-search-2017.08".doc as es_index
ON mysql_table.idsearches = CONVERT_TO_INTEGER(es_index."searchIntention"['legacyId'], 1, 1, 0)
WHERE mysql_table.idsearches = 38827627

But is taking more than 10 minutes to execute. When I perform a query with the WHERE clause on each of my datasets, it returns in seconds - as it should be. I’d already try creating one temporary table with the field legacyId parsed but the performance was the same. My MySQL is on cloud and the transaction table has 490098 rows. My ES is local and this index has only 2 documents.

Anyone can explain to me why does it taking so long?

The profiles files:

Hi Allan,

It looks as though the WHERE clause isn’t getting pushed to the MySQL database when joining heterogeneous data sources.

You may be able to ‘force’ the filter to appear if you restructure your query such that you join a subquery contain the filter to the ES table:

SELECT mysql_table.idsearches, mysql_table.idtransactions 
FROM (SELECT idsearches, idtransactions FROM "Live MM".mm.transactions WHERE idsearches = 38827627) mysql_table
INNER JOIN "elastic-prd"."api-search-2017.08".doc as es_index
ON mysql_table.idsearches = CONVERT_TO_INTEGER(es_index."searchIntention"['legacyId'], 1, 1, 0)

Try this query out and see if the performance is better – we explicitly do the filter on the MySQL table before the join now.

That said, we’ll investigate why this optimization doesn’t happen automatically.

Thanks for helping, @jduong!

Actually, the WHERE clause was some attempt to optimize the query. What I really want to do is test the performance joining MySQL data with ElasticSearch`s.

I just tried this same query after creating Raw Reflections of my transaction table. The result was instantaneous! Really great, but I think that something on the execution plan is not quite right… Without the Reflection, both queries performance continue the same :frowning:

Hi Allan,

What kind of runtime do you get with the MySQL query if you use other tools to query it?

@jduong in a matter of seconds, joining with other tables inside MySQL.

I’ve just installed in my production environment to test in a real world case and discover something: the slowest phases are those which Dremio performs thousands of sub_scans. My ElasticSearch indices have 80 million documents distributed in 5 shards. To build the JOIN, Dremio submits scroll scan queries to ES, parallelizing only with 5 threads (number of shards) and returning only 4000 docs by query. Maybe it should return more docs by query… something about 10000 docs performed well on my 5 nodes cluster.

Those tests before, on my local machine, the slowest phases was also subs_scans, but with JDBC Connector (my local index had only 2 docs).

I suppose that behavior is not so easy to change… JOINs operations are well known to be expensive. I think that just with Raw Reflections and with Partitions these queries can be performative.

Here goes a print of the plan:

1 Like

Hi,

Changing the scroll size for Elastic Search sources is currently scheduled for our next release. Will update here once we release it.

1 Like