Incorrect execution of SQL function in Dremio (executed by Gandiva)

There is a test query:

SELECT upper(text1) a FROM dwh.public.test_upper WHERE trunc(months_between(CURRENT_DATE(), CURRENT_DATE()) / 12) <= 59 LIMIT 10

We are using Dremio version 15 (an upgrade to a newer version is planned), with UTF-16 character support enabled.

Earlier, in our fork, we changed the lower/upper functions at the sabot kernel StringFunctions level so that they work correctly with UTF-8 characters (upstream uses a character code shift for this, which only works for Latin letters).

When we run the test query on a test table, we observe the following behavior: for Latin characters the lower/upper functions work, but for non-ASCII UTF-8 characters (in our case, Russian) they do not.
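To make the symptom concrete, here is a minimal Python sketch of the difference between a byte-level character code shift and a Unicode-aware case mapping. It is only an illustration of the technique: the function names are made up for this example and it is not the Gandiva or Dremio code.

```python
def ascii_shift_upper(data: bytes) -> bytes:
    """Uppercase by subtracting 32 from bytes in the range a..z only.

    Bytes of multi-byte UTF-8 sequences are all >= 0x80, so they pass
    through unchanged.
    """
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


def unicode_upper(data: bytes) -> bytes:
    """Decode UTF-8, apply the Unicode case mapping, and re-encode."""
    return data.decode("utf-8").upper().encode("utf-8")


sample = "abc привет".encode("utf-8")
shifted = ascii_shift_upper(sample).decode("utf-8")
mapped = unicode_upper(sample).decode("utf-8")

print(shifted)  # Latin letters are uppercased, Cyrillic is left as-is
print(mapped)   # both Latin and Cyrillic letters are uppercased
```

The first result matches what we see from the query: the English letters change case and the Russian ones do not.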

The following line is observed in the log:

[pool-18-thread-1 - 1e98e714-7da4-0c62-6ca7-4748b90c3e00:frag:0:0] DEBUG c.d.exec.expr.ExpressionSplitter - Expression executed entirely in Gandiva FunctionHolderExpression [args=[ValueVectorReadExpression [fieldId=TypedFieldId [fieldIds=[0], remainder=null]]], name=lower, returnType=varchar, isRandom=false]

As I understand it, this means the function was executed by Gandiva. This is how I picture the query processing logic:

  1. The query is checked for pushdown capability. In this case, the ARP file for the connector does not declare the months_between function, so the pushdown does not happen.
  2. Dremio determines whether the function will be executed by the Java implementation (apparently, for this case, the one declared in com/dremio/exec/expr/fn/impl/) or by the Gandiva function.
  3. Judging by the log message, the lower function from Gandiva is applied. I looked at the Gandiva source code: it also uses a character code shift, i.e. it works only for Latin letters.
  4. As a result, the English letters are converted to the required case, but the Russian ones are not.
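The steps above can be sketched as a small decision function. This is a simplified model of my understanding, not Dremio's planner: the function name `choose_engine` and the three sets of function names are hypothetical and exist only for illustration.

```python
def choose_engine(function, query_functions, arp_functions, gandiva_functions):
    """Return where `function` would run under the simplified model above.

    All sets contain lowercase function names and are illustrative only.
    """
    # Step 1: pushdown is possible only if the ARP file declares every function.
    if query_functions <= arp_functions:
        return "pushdown"
    # Steps 2-3: otherwise prefer Gandiva when it has an implementation.
    if function in gandiva_functions:
        return "gandiva"
    # Fall back to the Java implementation.
    return "java"


# Hypothetical contents, chosen to mirror the situation described in this post.
arp = {"upper", "lower", "trunc", "current_date", "timestampdiff"}
gandiva = {"upper", "lower"}
query = {"upper", "trunc", "months_between", "current_date"}

print(choose_engine("upper", query, arp, gandiva))
```

With these sets the call returns "gandiva", because months_between is missing from the ARP set. If months_between is replaced by a declared function, the same call returns "pushdown"; if upper is absent from the Gandiva set, it returns "java".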

We found a workaround: changing the original query or extending the ARP file so that the query is always pushed down to the database. But for us this is a temporary solution, because some of the queries are generated automatically.

So the question is: is it possible to disable execution of specific functions (lower, upper) in Gandiva, so that our implementation of the function in the StringFunctions class is used instead?
For example, with a flag when launching Dremio, a setting inside Dremio, or a change to the source code?

Or can this only be solved by a fix in Gandiva itself?

@Arol Would you have the profile? We can review what the push down was and get back to you

@balaji.ramaswamy (10.0 KB)

The profile from the test environment is not available right now, but this one should be about the same.

I think it’s worth noting that Dremio is launched with the following parameters:

-Dsaffron.default.charset=UTF-16LE
-Dsaffron.default.nationalcharset=UTF-16LE
-Dsaffron.default.collation.name=UTF-16LE$en_US

As far as I understand, the function is not pushed down to the database because months_between is not declared in the ARP file of the connector. If we replace it with timestampdiff, which is declared, the pushdown works, but that is not the case we are interested in here.

As a solution, I found the exec.disabled.gandiva-functions key, which can be used to disable execution of the lower and upper functions in Gandiva.

23:33:11.377 [pool-18-thread-1 - 1e8f82fa-73c2-b71e-2bca-7aea57217f00:frag:0:0] DEBUG c.d.s.o.l.expr.GandivaPushdownSieve - function [lower] has been disabled to be executed in gandiva
23:33:11.377 [pool-18-thread-1 - 1e8f82fa-73c2-b71e-2bca-7aea57217f00:frag:0:0] DEBUG c.d.exec.expr.ExpressionSplitter - Splitting expression FunctionHolderExpression [args=[ValueVectorReadExpression [fieldId=TypedFieldId [fieldIds=[0], remainder=null]]], name=lower, returnType=varchar, isRandom=false]
23:33:11.377 [pool-18-thread-1 - 1e8f82fa-73c2-b71e-2bca-7aea57217f00:frag:0:0] DEBUG c.d.exec.expr.ExpressionSplitter - Expression executed entirely in Java FunctionHolderExpression [args=[ValueVectorReadExpression [fieldId=TypedFieldId [fieldIds=[0], remainder=null]]], name=lower, returnType=varchar, isRandom=false]

In this case, the debug log shows that execution in Gandiva is disabled and the Java function is used instead of the LLVM-based Gandiva one. This Java function is the counterpart of what is in StringFunctions in sabot.kernel. We changed this function in our code so that it works correctly with UTF-8, and it appears to work for us.
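For anyone who wants to check the same thing on their own logs, here is a small Python sketch that extracts which engine executed which function from ExpressionSplitter debug lines. The regular expression is based only on the two message shapes quoted in this thread ("Expression executed entirely in Gandiva" and "... in Java"); other Dremio versions may log differently, and the helper name is made up.

```python
import re

# Greedy ".*" makes the match land on the last "name=..." in the line,
# which is the outermost function of the logged expression.
PATTERN = re.compile(
    r"Expression executed entirely in (Gandiva|Java)\b.*\bname\s*=\s*(\w+)"
)


def executed_functions(lines):
    """Return (function, engine) pairs for every matching log line."""
    pairs = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            pairs.append((match.group(2), match.group(1)))
    return pairs


log_lines = [
    "DEBUG c.d.exec.expr.ExpressionSplitter - Expression executed entirely in "
    "Gandiva FunctionHolderExpression [args=[ValueVectorReadExpression "
    "[fieldId=TypedFieldId [fieldIds=[0], remainder=null]]], name=lower, "
    "returnType=varchar, isRandom=false]",
    "DEBUG c.d.exec.expr.ExpressionSplitter - Expression executed entirely in "
    "Java FunctionHolderExpression [args=[ValueVectorReadExpression "
    "[fieldId=TypedFieldId [fieldIds=[0], remainder=null]]], name=lower, "
    "returnType=varchar, isRandom=false]",
]

for function, engine in executed_functions(log_lines):
    print(f"{function}: {engine}")
```

Run against the log before and after setting the key, lower should move from Gandiva to Java.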