Schema learning problems with split parquet files

I’m still having a ton of problems with schema learning, and after some additional debugging it looks like the problem is related to split parquet files.

I have the following parquet files created using Apache Drill.

=> hadoop fs -du -h "/test/year=2017/month=09/*"
10.6 M /test/year=2017/month=09/USA/0_0_0.parquet
18.2 M /test/year=2017/month=09/EUR/0_0_0.parquet
13.4 M /test/year=2017/month=09/EUR/0_0_1.parquet

Choosing the /test/year=2017/month=09/USA directory and clicking on the folder icon takes 34 seconds to apply a format:

Query Type: Internal
Duration: 34s
Start Time: 12/18/2017 13:21:03
End Time: 12/18/2017 13:21:37

Input
Input Bytes: 56.70 MB
Input Records: 2,052

Output
Output Bytes: 56.70 MB
Output Records: 2,052

However, choosing the /test/year=2017/month=09/EUR directory and clicking on the folder icon takes 10 schema learning passes and times out after 10 minutes 37 seconds:

Query Type: Internal
Duration: 10m:37s
Start Time: 12/18/2017 13:07:45
End Time: 12/18/2017 13:18:23

SCHEMA_CHANGE ERROR: Schema change detected but unable to learn schema, query failed. A full table scan may be necessary to fully learn the schema.

Input
Input Bytes: 108.94 MB
Input Records: 4,096

Output
Output Bytes: 108.94 MB
Output Records: 4,096

Now it gets a bit weird. If I click on the individual parquet file /test/year=2017/month=09/EUR/0_0_0.parquet, it takes 1 min 15 seconds:

Query Type: Internal
Duration: 1m:15s
Start Time: 12/18/2017 13:38:13
End Time: 12/18/2017 13:39:28

Input
Input Bytes: 108.94 MB
Input Records: 4,096

Output
Output Bytes: 108.94 MB
Output Records: 4,096

The /test/year=2017/month=09/EUR/0_0_1.parquet file took 1 min 11 seconds:

Query Type: Internal
Duration: 1m:11s
Start Time: 12/18/2017 13:52:58
End Time: 12/18/2017 13:54:10

Input
Input Bytes: 74.30 MB
Input Records: 2,615

Output
Output Bytes: 74.30 MB
Output Records: 2,615

Does schema learning only kick in when a parquet directory contains many files? Something seems off if I can learn the schema of the two files individually in under 3 minutes total, but it takes multiple attempts (5x?) at 10 minutes each to apply a format to the directory those same two files are sitting in…
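
In case it helps narrow things down, one way to check whether the two EUR files actually disagree is to dump each file’s schema and diff the results, for example with parquet-tools (assuming it is available on the cluster; the jar version is whatever is installed locally):

hadoop jar parquet-tools-<version>.jar schema /test/year=2017/month=09/EUR/0_0_0.parquet > schema_0_0_0.txt
hadoop jar parquet-tools-<version>.jar schema /test/year=2017/month=09/EUR/0_0_1.parquet > schema_0_0_1.txt
diff schema_0_0_0.txt schema_0_0_1.txt

An empty diff would mean both files report the same schema, which would point at the directory-level schema learning pass rather than the data itself.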

My goal is to apply a format to the /test directory and add new sets of files each month…

Currently there is an issue where the newly learned schema is not stored when applying a format to a folder, because the dataset has not been created yet and there is no context to store it in. So when the query is reattempted it starts with the same old schema and ends up with the schema change exception again. This loop continues until the number of retries is exhausted. We have an internal ticket to improve this.

Workaround: after the error, continue to save the dataset. Then open the dataset and try “run”; this should store the newly learned schema in the dataset context.

Unfortunately, trying to hit SAVE after it times out crashes the server and gives me…

java.nio.file.FileSystemException: …/dremio/dremio-community-1.3.1/data/db/search/jobs/core/_gti_Lucene54_0.dvd: Too many open files

I haven’t seen that error before. Does the crash reproduce consistently? A couple of things to try:

  1. Check the number of files opened by the dremio process: “lsof -p <pid of dremio>”
  2. Check whether the number of files a process is allowed to open is too low on your system (see the commands sketched below).
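
For reference, a rough sketch of those checks; the <pid of dremio> placeholder is whatever ps reports for the Dremio server process, and the limits.conf entries are only an example of the usual way to raise the limit, not a Dremio-specific recommendation:

# count open file descriptors for the running dremio process
lsof -p <pid of dremio> | wc -l

# limit actually applied to that process
grep "open files" /proc/<pid of dremio>/limits

# soft limit for the current shell / user
ulimit -n

# raising it usually means adding lines like these (example values) to
# /etc/security/limits.conf and restarting the service:
#   dremio  soft  nofile  65536
#   dremio  hard  nofile  65536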