Parquet Reader plugin fails to read metadata with ArrayIndexOutOfBoundsException

The executor logs have a high number of WARN messages containing: Exception from createPageListFromColumnAndOffsetIndex java.lang.ArrayIndexOutOfBoundsException

I traced the error to the closed-source plugin in dremio-ce-parquet-plugin-13.0.0-202101272034330307-20fb9275.jar. (The code hasn’t changed in dremio-ce-parquet-plugin-17.0.0-202107060524010627-31b5222b.jar.)

This error occurs when the UnifiedParquetReader gathers statistics on the parquet file by calling RowGroups.getRowGroupMetadataFor(). The exception is thrown from this line:

statistics.setMinMaxFromBytes(minValue, maxValue);

It happens when a column has no min/max statistics where they are expected, which is the case when the first page contains only nulls. The line ++columnArrayIndex; is inside the if block, so columnArrayIndex doesn’t advance when page-0 is all nulls.
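To make the suspected pattern concrete, here is a minimal sketch of a counter that only advances inside an if block. It is not the plugin's code, which is closed source: the class name, the method names, and the array layout are all hypothetical, and only the ++columnArrayIndex placement mirrors what I saw.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from the Dremio plugin.
class ColumnIndexDrift {

    // Suspected shape of the bug: the counter advances only when min/max exist.
    // A null entry stands for a column whose first page holds only nulls.
    static String[] collectBuggy(double[][] firstPageMinMax) {
        String[] perColumn = new String[firstPageMinMax.length];
        int columnArrayIndex = 0;
        for (double[] minMax : firstPageMinMax) {
            if (minMax != null) {
                perColumn[columnArrayIndex] = minMax[0] + ".." + minMax[1];
                ++columnArrayIndex; // skipped for an all-null first page
            }
        }
        return perColumn;
    }

    // Same loop with the increment moved out of the if block.
    static String[] collectFixed(double[][] firstPageMinMax) {
        String[] perColumn = new String[firstPageMinMax.length];
        int columnArrayIndex = 0;
        for (double[] minMax : firstPageMinMax) {
            if (minMax != null) {
                perColumn[columnArrayIndex] = minMax[0] + ".." + minMax[1];
            }
            ++columnArrayIndex; // always advances, so slots stay aligned with columns
        }
        return perColumn;
    }

    public static void main(String[] args) {
        // Column 0 has an all-null first page; column 1 has min 1.0 and max 200.0.
        double[][] stats = { null, { 1.0, 200.0 } };
        System.out.println(Arrays.toString(collectBuggy(stats))); // [1.0..200.0, null]
        System.out.println(Arrays.toString(collectFixed(stats))); // [null, 1.0..200.0]
    }
}
```

In the buggy variant the statistics of the second column land in the slot of the first. In the real reader a drift like this could plausibly end in an out-of-range access or in statistics attached to the wrong column, but that part is my assumption.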

This is the output of parquet-tools column-index someParquetFile.parquet for the column that throws the error.

column index for column someColumn:
Boudary order: ASCENDING
          null count    min    max
page-0    20000
page-1    19589         1.0    200.0
page-2    8916          1.0    200.0

@nnikitas Does this parquet file contain sensitive data? Would you be able to share it? If not, is there any chance you could generate one with dummy data in which the first page is all nulls with no min/max values? That would be useful for debugging the issue.

Our parquet file does have sensitive data, so I can’t share it.

Earlier today, we determined that the bug leads to a correctness issue. We were able to work around it by disabling “store.parquet.read_column_indexes”. If it is left enabled, a query against the column with “AND someColumnName = 'someValueThatShouldBeFound'” returns no rows, while the equivalent “AND someColumnName LIKE 'someValueThatShouldBeFound'” returns rows.
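One plausible explanation for the difference, and this is an assumption rather than something I confirmed in the plugin, is that the equality filter is checked against page min/max values from the column index while the LIKE filter is not. The sketch below shows generic min/max page pruning with hypothetical names and values; it is not Dremio's implementation.

```java
// Hypothetical illustration of min/max page pruning; not code from the Dremio plugin.
class PagePruningSketch {

    // An equality predicate can only match inside a page whose [min, max] range
    // covers the value, so a reader may skip every other page.
    static boolean mightContain(String min, String max, String value) {
        return min.compareTo(value) <= 0 && value.compareTo(max) <= 0;
    }

    public static void main(String[] args) {
        String wanted = "someValueThatShouldBeFound";

        // Statistics that really describe the page holding the value (made-up range).
        System.out.println(mightContain("a", "z", wanted)); // true: the page is read

        // Statistics taken from the wrong page or column (made-up range).
        System.out.println(mightContain("1.0", "200.0", wanted)); // false: the page is skipped
    }
}
```

If the statistics attached to a page are wrong, the equality check skips a page that actually holds the value, whereas a filter that is not evaluated against min/max still scans it.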