REGEXP_SPLIT bug with diacritics

I'm trying to split the Portuguese tweets in the Twitter stream sample, using:

SELECT text, regexp_split(text, '\Q \E', 'ALL', 10) AS text_1
FROM (
  SELECT text
  FROM "twitter stream"."twitter-stream.json"
  WHERE lang = 'pt'
) nested_0

But the result is wrong when a diacritic is present:

"bom diaaaaaa" [ "bom", "diaaaaaa" ] OK

"Esse frio tá ótimo
Não disse pra que" [ "Esse", "frio", "t�", " óti", "o \nN", "�o di", "se ", "ra " ] NOK

@balaji.ramaswamy hello, we have the same issue in the latest Dremio version.

Lomásle,Másle changes to ["Lomásl",",Más"]
using REGEXP_SPLIT(tags, ',', 'ALL', 10000)

I use Dremio extensively and have reported more than 20 bugs over the last 4 years. I would really like to contribute to the code. Is there a contribution guide? @dch @Benny_Chow

@balaji.ramaswamy @dch @Benny_Chow hello, I debugged this and found the error in the code of com.dremio.dac.explore.udfs.SplitPattern, in the method splitRegex(Matcher matcher, String matchee).
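
For anyone following along: the outputs reported above look like what you get when the character indices returned by a Java Matcher are used as offsets into the UTF-8 bytes of the string, so every multi-byte character before a match shifts the cut by one byte. The sketch below only illustrates that suspected cause. It is not the actual code of SplitPattern.splitRegex, and the class and method names (SplitOffsetDemo, splitByByteSlice, splitByCharIndex) are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Dremio.
public class SplitOffsetDemo {

  // Suspected buggy pattern: Matcher.start()/end() are char indices,
  // but here they are applied to the UTF-8 byte array of the input.
  public static List<String> splitByByteSlice(String input, String regex) {
    byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
    Matcher m = Pattern.compile(regex).matcher(input);
    List<String> parts = new ArrayList<>();
    int start = 0;
    while (m.find()) {
      parts.add(new String(utf8, start, m.start() - start, StandardCharsets.UTF_8));
      start = m.end();
    }
    // input.length() counts chars, so the tail is cut short as well.
    parts.add(new String(utf8, start, input.length() - start, StandardCharsets.UTF_8));
    return parts;
  }

  // Consistent variant: char indices are applied to the String itself.
  public static List<String> splitByCharIndex(String input, String regex) {
    Matcher m = Pattern.compile(regex).matcher(input);
    List<String> parts = new ArrayList<>();
    int start = 0;
    while (m.find()) {
      parts.add(input.substring(start, m.start()));
      start = m.end();
    }
    parts.add(input.substring(start));
    return parts;
  }

  public static void main(String[] args) {
    // "Lomasle,Masle" with a-acute (U+00E1), written as escapes to stay
    // independent of the source file encoding.
    String tags = "Lom\u00e1sle,M\u00e1sle";
    System.out.println(splitByByteSlice(tags, ","));  // truncated pieces, as in the report
    System.out.println(splitByCharIndex(tags, ","));  // both tags intact
  }
}
```

With the input Lomásle,Másle the first method returns the same two truncated pieces reported above, and the second returns the two tags intact, so a fix would presumably need to keep the indices and the data in the same unit (either both chars or both bytes).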

I patched it, but I can't test the patch because the build fails with these errors:

  • Cannot resolve plugin com.dremio.tools:dremio-fmpp-maven-plugin:24.3.2-202401241821100032-d2d8a497
  • Cannot resolve plugin org.apache.maven.plugins:maven-checkstyle-plugin:3.3.1