REGEXP_SPLIT bug with diacritics

jrussi · October 3, 2018, 3:55am

I am trying to split the Portuguese tweets in the Twitter stream sample on spaces, using:

SELECT text, regexp_split(text, '\Q \E', 'ALL', 10) AS text_1
FROM (
  SELECT text
  FROM "twitter stream"."twitter-stream.json"
  WHERE lang = 'pt'
) nested_0

But the result is wrong when a diacritic is present:

“bom diaaaaaa” [ “bom”, “diaaaaaa” ] OK

“Esse frio tá ótimo
Não disse pra que” [ “Esse”, “frio”, “t�”, " óti", “o \nN”, “�o di”, "se ", "ra " ] NOK

dacopan · April 3, 2024, 3:01am

@balaji.ramaswamy Hello, we have the same issue in the latest Dremio version.

The value Lomásle,Másle is split into ["Lomásl",",Más"]
when using REGEXP_SPLIT(tags, ',', 'ALL', 10000)
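
For illustration only, here is a minimal Java sketch of one possible cause. It assumes that character indices from java.util.regex.Matcher end up being used as offsets into the UTF-8 bytes of the value; this is a guess that happens to reproduce the output above, not the actual Dremio code. The class and method names (SplitOffsetsSketch, splitWithCharIndicesOnBytes, splitOnChars) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitOffsetsSketch {

    // Assumed bug pattern (hypothetical, not taken from Dremio): the Matcher
    // reports positions in UTF-16 chars, but they are applied to UTF-8 bytes.
    // Every multi-byte character before a match shifts the cut one byte left.
    static List<String> splitWithCharIndicesOnBytes(String input, String regex) {
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (m.find()) {
            parts.add(slice(utf8, start, m.start())); // char index used as byte offset
            start = m.end();
        }
        parts.add(slice(utf8, start, input.length())); // char length used as byte length
        return parts;
    }

    // Decodes bytes[from, to) as UTF-8; a cut inside a multi-byte character
    // decodes to the replacement character U+FFFD.
    static String slice(byte[] bytes, int from, int to) {
        return new String(bytes, from, to - from, StandardCharsets.UTF_8);
    }

    // Reference behaviour: split on the String itself, so all indices are chars.
    static List<String> splitOnChars(String input, String regex) {
        return Arrays.asList(Pattern.compile(regex).split(input, -1));
    }

    public static void main(String[] args) {
        String tags = "Lom\u00e1sle,M\u00e1sle"; // "Lomásle,Másle" with precomposed á
        System.out.println(splitWithCharIndicesOnBytes(tags, ","));
        System.out.println(splitOnChars(tags, ","));
    }
}
```

With the input from this post, splitWithCharIndicesOnBytes returns "Lomásl" and ",Más", the same as the REGEXP_SPLIT result, while splitOnChars returns "Lomásle" and "Másle". ASCII-only input comes out the same either way, which would explain why "bom diaaaaaa" in the first post splits correctly.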

I use Dremio extensively and have reported more than 20 bugs over the last 4 years. I would really like to contribute to the code. Is there a contribution guide? @dch @Benny_Chow

dacopan · April 8, 2024, 12:05am

@balaji.ramaswamy @dch @Benny_Chow Hello, I debugged this and found the error in the code of com.dremio.dac.explore.udfs.SplitPattern, in the method splitRegex(Matcher matcher, String matchee).

I patched it, but I cannot test the change because the build fails with these errors:

Cannot resolve plugin com.dremio.tools:dremio-fmpp-maven-plugin:24.3.2-202401241821100032-d2d8a497
Cannot resolve plugin com.dremio.tools:dremio-fmpp-maven-plugin:24.3.2-202401241821100032-d2d8a497
Cannot resolve plugin org.apache.maven.plugins:maven-checkstyle-plugin:3.3.1
