Error when parsing multibyte UTF-8 characters from S3

I got this error when saving a delimited text format on an S3 data source:

DATA_READ ERROR: Dremio failed to read your text file.  Dremio supports up to 65536 columns in a text file.  Your file appears to have more than that.

Failure while reading file dremio[S3URL]. Happened at or shortly before byte position 182905.
SqlOperatorImpl TEXT_SUB_SCAN
Location 1:10:4
Fragment 1:10

[Error Id: c623feb5-fa6e-4810-a644-9dc696cb4354 on [EXECUTORHOST]:-1]

  (java.lang.ArrayIndexOutOfBoundsException) null

The file contains some Chinese characters encoded as UTF-8. Sure enough, there are Chinese characters near position 117369 (182905 - 65536). It seems that newlines and delimiters are not detected properly once the parser has passed over multibyte UTF-8 characters. This is a deal-breaker for us. Is there any workaround for this issue?
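For anyone who wants to repeat the check, here is a minimal Python sketch of how the bytes around a reported position can be inspected. It assumes the byte position in the error refers to the uncompressed file contents and that the delimiter is a comma; both are assumptions on my part, and `inspect_near_offset` is just a hypothetical helper name, not anything from Dremio.

```python
def inspect_near_offset(data: bytes, offset: int, window: int = 200,
                        delimiter: bytes = b",") -> dict:
    """Summarise the bytes around `offset` in uncompressed file contents.

    Assumption: `offset` is a position in the uncompressed data.
    """
    start = max(0, offset - window)
    end = min(len(data), offset + window)
    chunk = data[start:end]
    return {
        "start": start,
        "end": end,
        # Bytes >= 0xC0 start a multibyte UTF-8 sequence
        # (continuation bytes are 0x80..0xBF and are not counted).
        "multibyte_lead_bytes": sum(1 for b in chunk if b >= 0xC0),
        "newlines": chunk.count(b"\n"),
        "delimiters": chunk.count(delimiter),
        # errors="replace" keeps this usable even if the window
        # cuts a multibyte character in half.
        "text": chunk.decode("utf-8", errors="replace"),
    }
```

Called with the file contents (for example from `gzip.open(path, "rb").read()`) and the position from the error, it shows whether multibyte characters, newlines and delimiters are present in that region.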

Below is the stack trace from server.log:

2018-06-27 06:44:19,920 [e3 - 24cccdbb-c567-0668-b720-61d5e5ed7100:frag:1:10] INFO  c.d.e.s.e.text.compliant.TextReader - User Error Occurred [ErrorId: c623feb5-fa6e-4810-a644-9dc696cb4354]
com.dremio.common.exceptions.UserException: Dremio failed to read your text file.  Dremio supports up to 65536 columns in a text file.  Your file appears to have more than that.
        at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:746) ~[dremio-common-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.exec.store.easy.text.compliant.TextReader.handleException(TextReader.java:431) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.exec.store.easy.text.compliant.TextReader.parseNext(TextReader.java:385) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.exec.store.easy.text.compliant.CompliantTextRecordReader.next(CompliantTextRecordReader.java:265) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.op.scan.ScanOperator.outputData(ScanOperator.java:208) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.driver.SmartOp$SmartProducer.outputData(SmartOp.java:518) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.driver.StraightPipe.pump(StraightPipe.java:56) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.driver.Pipeline.doPump(Pipeline.java:82) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.driver.Pipeline.pumpOnce(Pipeline.java:72) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run(FragmentExecutor.java:291) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run(FragmentExecutor.java:287) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at java.security.AccessController.doPrivileged(Native Method) [na:1.8.0_171]
        at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_171]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807) [hadoop-common-2.8.0.jar:na]
        at com.dremio.sabot.exec.fragment.FragmentExecutor.run(FragmentExecutor.java:244) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.exec.fragment.FragmentExecutor.access$800(FragmentExecutor.java:84) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run(FragmentExecutor.java:580) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.task.AsyncTaskWrapper.run(AsyncTaskWrapper.java:107) [dremio-sabot-kernel-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
        at com.dremio.sabot.task.slicing.SlicingThread.run(SlicingThread.java:71) [dremio-extra-sabot-scheduler-2.0.5-201806021755080191-767cfb5.jar:2.0.5-201806021755080191-767cfb5]
Caused by: java.lang.ArrayIndexOutOfBoundsException: null

Can you share a sample that triggers this error?

Hi Kelly,

Thanks for your response. I tried to create a small sample file, but none of the samples reproduced the error. Is there a way you would recommend for sharing a 300 MB .csv.gz file privately and securely?

Thanks,
Ricky
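
One possible way to cut the large file down to something shareable is to copy only the lines around the reported byte position into a new .csv.gz. This is a rough sketch, not a confirmed reproducer: it assumes the reported position counts uncompressed bytes, that the first line is a header worth keeping, and `extract_lines_around` is a hypothetical helper name.

```python
import gzip


def extract_lines_around(src_path: str, dst_path: str, offset: int,
                         margin: int = 1_000_000,
                         keep_header: bool = True) -> int:
    """Copy whole lines overlapping [offset - margin, offset + margin).

    Positions are counted in the uncompressed stream (an assumption).
    Returns the number of lines written to `dst_path`.
    """
    lo = max(0, offset - margin)
    hi = offset + margin
    pos = 0          # uncompressed position of the current line start
    written = 0
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        for line in src:  # binary mode: lines are split on b"\n"
            line_end = pos + len(line)
            is_header = keep_header and pos == 0
            if is_header or (line_end > lo and pos < hi):
                dst.write(line)
                written += 1
            pos = line_end
            if pos >= hi:
                break  # past the window, no need to read further
    return written
```

For example, `extract_lines_around("big.csv.gz", "sample.csv.gz", 182905)` would keep the header plus roughly 1 MB of lines on either side of the position from the error message. Whether the smaller file still triggers the failure would need to be checked, since the earlier small samples did not.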