I am trying to split the text of the Twitter stream sample (Portuguese tweets), using:
SELECT text, regexp_split(text, '\Q \E', 'ALL', 10) AS text_1
FROM (
  SELECT text
  FROM "twitter stream"."twitter-stream.json"
  WHERE lang = 'pt'
) nested_0
But the result is wrong when the text contains a diacritic:
"bom diaaaaaa" [ "bom", "diaaaaaa" ] OK
"Esse frio tá ótimo
Não disse pra que" [ "Esse", "frio", "t�", " óti", "o \nN", "�o di", "se ", "ra " ] NOK
I see the same problem: Lomásle,Másle becomes ["Lomásl",",Más"]
when using REGEXP_SPLIT(tags, ',', 'ALL', 10000)
I use Dremio extensively and have reported more than 20 bugs over the last 4 years. I would really like to contribute to the code. Is there a contribution guide? @dch @Benny_Chow
@balaji.ramaswamy @dch @Benny_Chow Hello, I debugged this and found the error in the code of com.dremio.dac.explore.udfs.SplitPattern, in the method splitRegex(Matcher matcher, String matchee).
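
For anyone who wants to reproduce the symptom outside Dremio: one failure mode that yields exactly the Lomásle,Másle output reported above is applying the character offsets returned by java.util.regex.Matcher to the UTF-8 bytes of the string. This is only an assumption that fits the observed output, not a confirmed description of what SplitPattern.splitRegex does. The class and method names below (SplitOffsetsDemo, splitUsingCharOffsetsOnBytes, splitUsingCharOffsets) are hypothetical and exist only for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it is NOT Dremio code.
final class SplitOffsetsDemo {

    // Assumed buggy variant: Matcher.start()/end() are offsets into the
    // char sequence, but here they are used to slice the UTF-8 byte array.
    // Every multi-byte character before a match shifts the slice to the left.
    static List<String> splitUsingCharOffsetsOnBytes(String input, String regex) {
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (m.find()) {
            parts.add(new String(utf8, start, m.start() - start, StandardCharsets.UTF_8));
            start = m.end();
        }
        // The tail is also cut short: input.length() counts chars, not bytes.
        parts.add(new String(utf8, start, input.length() - start, StandardCharsets.UTF_8));
        return parts;
    }

    // Consistent variant: the same offsets are applied to the String itself.
    static List<String> splitUsingCharOffsets(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (m.find()) {
            parts.add(input.substring(start, m.start()));
            start = m.end();
        }
        parts.add(input.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        // "Lomásle,Másle", written with Unicode escapes so the source file
        // encoding does not matter.
        String tags = "Lom\u00e1sle,M\u00e1sle";
        System.out.println(splitUsingCharOffsetsOnBytes(tags, ","));
        System.out.println(splitUsingCharOffsets(tags, ","));
    }
}
```

With the input Lomásle,Másle, the first method returns the pieces "Lomásl" and ",Más", which is the output reported above, and the second returns "Lomásle" and "Másle". When a slice ends in the middle of a two-byte character, decoding produces the replacement character, which would also explain pieces such as "t�" in the tweet example.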