Read Parquet fails with custom (OSS) build

Hi,

I get an error reading Parquet files with a custom build.

I managed to build Dremio OSS from https://github.com/dremio/dremio-oss/ (with the flag -Ddremio.oss-only=true).
To get it to build, I used the hints from two threads: "BUILD FAILURE with -Ddremio.oss-only" (removed dremio-tpch-sample-data from the POM files) and "Dremio-oss build error" (manually add the jar to the local repo, or use a free repository).

After that I used the official Dockerfile from Docker Hub (https://github.com/dremio/dremio-cloud-tools/blob/master/images/dremio-oss/Dockerfile) and changed the download step to copy the local archive. I started the Docker container and tried to read a single Parquet file and an S3 storage with multiple files.

All of the 3 variants above give me an error reading the files, and in the queried tables most of the columns are missing data.

The log shows this error:

SqlOperatorImpl PARQUET_ROW_GROUP_SCAN
Location 0:0:8
SqlOperatorImpl PARQUET_ROW_GROUP_SCAN
Location 0:0:8
Fragment 0:0

[Error Id: 2b04a6d4-01b4-41c4-a88c-b252a9be7a75 on 0076d817e848:0]

  (org.apache.arrow.vector.util.SchemaChangeRuntimeException) Schema change error
    com.dremio.common.exceptions.UserException.schemaChangeError():91
    com.dremio.sabot.op.scan.ScanOperator.checkAndLearnSchema():394
    com.dremio.sabot.op.scan.ScanOperator.setupReader():265
    com.dremio.sabot.op.scan.ScanOperator.setup():249
    com.dremio.sabot.driver.SmartOp$SmartProducer.setup():563
    com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer():79
    com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer():63
    com.dremio.sabot.driver.SmartOp$SmartProducer.accept():533
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.Pipeline.setup():68
    com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution():391
    com.dremio.sabot.exec.fragment.FragmentExecutor.run():273
    com.dremio.sabot.exec.fragment.FragmentExecutor.access$1400():94
    com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():711
    com.dremio.sabot.task.AsyncTaskWrapper.run():112
    com.dremio.sabot.task.single.DedicatedFragmentRunnable.run():47
    java.util.concurrent.Executors$RunnableAdapter.call():511
    java.util.concurrent.FutureTask.run():266
    java.util.concurrent.ThreadPoolExecutor.runWorker():1149
    java.util.concurrent.ThreadPoolExecutor$Worker.run():624
    java.lang.Thread.run():748

If I pull the official Docker image from Docker Hub, the container can read the files without issues.
Why is the default oss build failing without modifications?
What causes the custom build to fail reading the same parquet files?

Hello @Chrischhan

This error is not specific to OSS; it can happen in any Dremio edition. It means that as Dremio scans the Parquet file, it is detecting schema changes. Dremio tries to learn the schema 10 times and then gives up, although running the query again picks up from where it left off (the 10 attempts are not wasted). Are you able to send the job profile? It would give us a hint about what new schema Dremio learned.

Hi,

See the attached file (failing instance):
20dfbe38-85bb-4f26-9e83-d48bfb759b09.zip (38.3 KB)

I could also provide a job profile from the community instance from Docker Hub (the same file works in that container).

Additional note: if I change the “.” in the column names to “__” (a double underscore, to distinguish it from a single underscore), I can load the file without issues. But that is not an acceptable solution in my case because I need the “.” in the column names.
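
For anyone who can live with the workaround, here is a minimal Python sketch of the renaming step described above. The helper name `replace_dots` and the sample column names are hypothetical and not taken from the attached files; the sketch only rewrites a list of names and does not touch Parquet data itself.

```python
def replace_dots(names, replacement="__"):
    """Return the column names with every '.' replaced by `replacement`."""
    renamed = [name.replace(".", replacement) for name in names]
    # Refuse to continue if two distinct columns would collapse into one name.
    if len(set(renamed)) != len(set(names)):
        raise ValueError("renaming would produce duplicate column names")
    return renamed


# Hypothetical column names, for illustration only.
columns = ["brake.name", "brake.value", "brake_unit"]
print(replace_dots(columns))
```

If the file is loaded with pyarrow, the resulting list could presumably be passed to `Table.rename_columns` before writing the file back out, but that step is not shown or tested here.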

@Chrischhan

I see some differences in schema between files; for example, we found the new columns below in a file. Is this CSV? What happens if you just use the community edition of Dremio?

integer::int32, string::varchar, unit::varchar, negTol::double, posTol::double, name::varchar, value::union<double, varchar>)

@balaji.ramaswamy

I do not understand this. I am importing a single Parquet file, and the uploaded profile is from that import. How can the schema of a single file change, and why does this file work with the container from Docker Hub but not with the self-compiled Dremio?

@Chrischhan

Sorry, my bad. It looks like some part of your OSS build is causing this issue, since you have confirmed that the out-of-the-box version reads the file. I see you have 2 issues:

#1 Why is the default oss build failing without modifications? I will get an answer for this.
#2 What causes the custom build to fail reading the same parquet files? Same as #1.

Thanks
Bali

Hey,

Is there any news on this topic?

Thanks

Hello,

I have some more news.
I have rebuilt and tested version 11 from the GitHub repository.
With the flag “oss-only=false”: Parquet file brake_dot -> works.
With the flag “oss-only=true”: Parquet file brake_dot -> does not work.
The other two files work in both versions.
The difference between the files is the separator in the column names, which are used by the application: double ‘_’ in “brake_us.parquet”, ‘-’ in “brake_minus.parquet” and ‘.’ in “brake_dot.parquet”.
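
To check which separator a given file uses before loading it into Dremio, something like the following Python sketch could help. The function name `separators_used` and the sample names are made up for illustration; only the three separators come from the files listed above.

```python
# Separators taken from the three test files described above.
SEPARATORS = (".", "__", "-")


def separators_used(names):
    """Map each separator to the column names that contain it."""
    found = {}
    for sep in SEPARATORS:
        hits = [name for name in names if sep in name]
        if hits:
            found[sep] = hits
    return found


# Hypothetical column names, for illustration only.
sample = ["brake.value", "brake__value", "brake-value", "id"]
for sep, cols in separators_used(sample).items():
    print(repr(sep), cols)
```

With pyarrow installed, the real column names could be read via `pyarrow.parquet.read_schema(path).names` and passed in; note that nested fields are not flattened by that call, so this is only a first check.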

The “oss-only=true” version could only be built with the changes described in the first post.

demo_parquetFiles.zip (294.4 KB)