Converting JSON to Parquet

jacques · November 16, 2017, 4:29pm

I would start by verifying that we are using the same options. Are both systems using dictionary encoding and Snappy? Are both tools using the same row group size and page size? For example, I believe the Dremio defaults are:

100k page size
256 MB row group size
Snappy enabled
Dictionary encoding disabled

These are the most likely reasons for a difference.
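As a rough way to check these settings on both files, here is a minimal sketch that reads each file's footer with pyarrow. The choice of pyarrow is my assumption (any footer-dumping tool would do), and `summarize` and `diff_settings` are hypothetical helper names. Note that the footer reports the codec and encodings actually used per column chunk, not the writer's configured page size.

```python
def summarize(path):
    """Collect writer settings recorded in a Parquet file's footer.

    Assumes pyarrow is installed; it is imported lazily so that
    diff_settings below can be used without it.
    """
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    codecs, encodings = set(), set()
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            codecs.add(chunk.compression)      # e.g. "SNAPPY"
            encodings.update(chunk.encodings)  # e.g. ("PLAIN", "RLE")
    return {
        "created_by": meta.created_by,
        "num_row_groups": meta.num_row_groups,
        "compression": sorted(codecs),
        "encodings": sorted(encodings),
    }


def diff_settings(a, b):
    """Return {key: (a_value, b_value)} for every setting that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Calling `diff_settings(summarize("dremio.parquet"), summarize("other.parquet"))` (both file names are placeholders) should then show only the settings that differ between the two writers.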

Another possibility could be that the JSON-derived schema is somehow different. For example, Dremio supports a union schema approach and may produce a different schema given its ability to do schema learning.
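To see whether the source JSON could lead to a union-typed column at all, a quick check is to look for fields whose values change type between records. This is only an illustrative sketch: it looks at top-level fields of newline-delimited JSON, the function names are hypothetical, and it does not reproduce Dremio's actual schema learning.

```python
import json
from collections import defaultdict


def infer_field_types(json_lines):
    """Map each top-level field to the sorted Python type names seen for it."""
    types = defaultdict(set)
    for line in json_lines:
        for key, value in json.loads(line).items():
            types[key].add(type(value).__name__)
    return {key: sorted(names) for key, names in types.items()}


def mixed_fields(json_lines):
    """Keep only the fields that were seen with more than one type."""
    inferred = infer_field_types(json_lines)
    return {key: names for key, names in inferred.items() if len(names) > 1}


records = ['{"id": 1, "v": 2}', '{"id": 2, "v": "two"}']
print(mixed_fields(records))
```

Any field reported here is a candidate for the two tools choosing different column types, which would also change the size of the resulting file.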

If you’ve confirmed that everything else is the same, the other possibility could be related to one additional Dremio Parquet optimization. Dremio stores all the page headers in the Parquet footer. This increases the footer size slightly but allows needle-in-a-haystack queries to be substantially faster, especially when working with sorted data. Normally a Parquet reader has to seek to each page header within a column to see whether that page should be considered for reading. In Dremio, this information is duplicated in the footer so we can quickly determine whether any pages are valid for a particular predicate (while not impacting readers that don’t understand this data). This type of behavior (substantially modified) has now formally become a Parquet format feature through this JIRA: https://issues.apache.org/jira/browse/PARQUET-922.
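To make the idea concrete, here is a toy sketch of pruning pages from footer-level statistics. The `PageStats` layout and `candidate_pages` function are hypothetical and are not Dremio's footer format or the Parquet column index structure; they only show why having per-page min/max values in one place avoids a seek per page.

```python
from typing import List, NamedTuple


class PageStats(NamedTuple):
    """Hypothetical per-page entry kept alongside the footer."""
    offset: int     # byte offset of the page within the file
    min_value: int  # smallest value stored in the page
    max_value: int  # largest value stored in the page


def candidate_pages(index: List[PageStats], target: int) -> List[int]:
    """Return offsets of pages whose [min, max] range could contain target."""
    return [p.offset for p in index if p.min_value <= target <= p.max_value]


# With sorted data the ranges do not overlap, so an equality predicate
# usually narrows the read to a single page.
index = [PageStats(0, 1, 100), PageStats(4096, 101, 200), PageStats(8192, 201, 300)]
print(candidate_pages(index, 150))
```

Without such an index, a reader would have to visit each page header in turn to learn the same ranges.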
