I’m trying to JOIN one tiny index on ES with one not so small table on MySQL with this query:
SELECT mysql_table.idsearches, mysql_table.idtransactions
FROM "Live MM".mm.transactions as mysql_table
INNER JOIN "elastic-prd"."api-search-2017.08".doc as es_index
ON mysql_table.idsearches = CONVERT_TO_INTEGER(es_index."searchIntention"['legacyId'], 1, 1, 0)
WHERE mysql_table.idsearches = 38827627
But is taking more than 10 minutes to execute. When I perform a query with the WHERE clause on each of my datasets, it returns in seconds - as it should be. I’d already try creating one temporary table with the field legacyId parsed but the performance was the same. My MySQL is on cloud and the transaction table has 490098 rows. My ES is local and this index has only 2 documents.
Anyone can explain to me why does it taking so long?
Actually, the WHERE clause was some attempt to optimize the query. What I really want to do is test the performance joining MySQL data with ElasticSearch`s.
I just tried this same query after creating Raw Reflections of my transaction table. The result was instantaneous! Really great, but I think that something on the execution plan is not quite right… Without the Reflection, both queries performance continue the same
@jduong in a matter of seconds, joining with other tables inside MySQL.
I’ve just installed in my production environment to test in a real world case and discover something: the slowest phases are those which Dremio performs thousands of sub_scans. My ElasticSearch indices have 80 million documents distributed in 5 shards. To build the JOIN, Dremio submits scroll scan queries to ES, parallelizing only with 5 threads (number of shards) and returning only 4000 docs by query. Maybe it should return more docs by query… something about 10000 docs performed well on my 5 nodes cluster.
Those tests before, on my local machine, the slowest phases was also subs_scans, but with JDBC Connector (my local index had only 2 docs).
I suppose that behavior is not so easy to change… JOINs operations are well known to be expensive. I think that just with Raw Reflections and with Partitions these queries can be performative.