I’m still having a ton of problems with schema learning, and after some additional debugging it looks like the problem is related to split parquet files.
I have the following parquet files created using Apache Drill.
=> hadoop fs -du -h "/year=2017/month=09/*"
10.6 M /test/year=2017/month=09/USA/0_0_0.parquet
18.2 M /test/year=2017/month=09/EUR/0_0_0.parquet
13.4 M /test/year=2017/month=09/EUR/0_0_1.parquet
Choosing the /test/year=2017/month=09/USA directory and clicking on the folder icon takes 34 seconds to apply a format:
Query Type:
Internal
Duration:
34s
Start Time:
12/18/2017 13:21:03
End Time:
12/18/2017 13:21:37
Input
Input Bytes: 56.70 MB
Input Records: 2,052
Output
Output Bytes: 56.70 MB
Output Records: 2,052
However, choosing the /test/year=2017/month=09/EUR directory and clicking on the folder icon takes 10 schema passes and times out after 10 minutes 37 seconds:
Query Type:
Internal
Duration:
10m:37s
Start Time:
12/18/2017 13:07:45
End Time:
12/18/2017 13:18:23
SCHEMA_CHANGE ERROR: Schema change detected but unable to learn schema, query failed. A full table scan may be necessary to fully learn the schema.
Input
Input Bytes: 108.94 MB
Input Records: 4,096
Output
Output Bytes: 108.94 MB
Output Records: 4,096
Now it gets a bit weird. If I click on the individual parquet file /test/year=2017/month=09/EUR/0_0_0.parquet, it takes 1 min 15 seconds:
Query Type:
Internal
Duration:
1m:15s
Start Time:
12/18/2017 13:38:13
End Time:
12/18/2017 13:39:28
Input
Input Bytes: 108.94 MB
Input Records: 4,096
Output
Output Bytes: 108.94 MB
Output Records: 4,096
The /test/year=2017/month=09/EUR/0_0_1.parquet file took 1 min 11 seconds:
Query Type:
Internal
Duration:
1m:11s
Start Time:
12/18/2017 13:52:58
End Time:
12/18/2017 13:54:10
Input
Input Bytes: 74.30 MB
Input Records: 2,615
Output
Output Bytes: 74.30 MB
Output Records: 2,615
Does schema learning kick in only when a parquet directory contains multiple files? Something seems off if I can learn the schema of each of the two files individually in under 3 minutes total, but it takes multiple attempts (5x?) and more than 10 minutes to format the directory the same two files are sitting in…
My goal is to apply a format to the /test directory and add new sets of files each month…
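One way to check whether the two EUR files actually disagree on their schema is to compare the Parquet footers directly. The following is only a minimal sketch, not something taken from the product: it assumes pyarrow is installed and that the files have been copied to a local path first (for example with hadoop fs -get). The helper names schema_as_dict and diff_schemas are made up for illustration.

```python
from typing import Dict, List, Tuple


def schema_as_dict(path: str) -> Dict[str, str]:
    """Read only the Parquet footer and return {column name: type string}.

    Assumption: pyarrow is installed and `path` is a local file.
    """
    import pyarrow.parquet as pq  # imported lazily so diff_schemas works without it

    schema = pq.read_schema(path)
    return {name: str(typ) for name, typ in zip(schema.names, schema.types)}


def diff_schemas(
    a: Dict[str, str], b: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, Tuple[str, str]]]:
    """Return (columns only in a, columns only in b, columns whose type differs)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {
        name: (a[name], b[name])
        for name in sorted(set(a) & set(b))
        if a[name] != b[name]
    }
    return only_a, only_b, changed


if __name__ == "__main__":
    # Hypothetical local copies of the two split files.
    first = schema_as_dict("EUR/0_0_0.parquet")
    second = schema_as_dict("EUR/0_0_1.parquet")
    only_first, only_second, changed = diff_schemas(first, second)
    print("only in 0_0_0:", only_first)
    print("only in 0_0_1:", only_second)
    print("type differs:", changed)
```

If all three results come back empty, the two files agree on column names and types, and the repeated schema passes on the directory would have to be caused by something other than a footer-level mismatch.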