I would start by verifying that both systems are using the same options. Are both using dictionary encoding and Snappy compression? Are both using the same row group size and page size? For example, I believe the Dremio defaults are:
- 100k page size
- 256 MB row group size
- Snappy enabled
- Dictionary encoding disabled
A mismatch in any of these settings is the most likely reason for a difference.
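One way to check these settings is to read them back out of each file's footer. Here is a minimal sketch, assuming you have pyarrow installed. The helper names `summarize_metadata` and `summarize_file` are my own, not part of any library. Page size is not recorded directly in the footer, so this only covers the codec, dictionary encoding, and row group sizes.

```python
from collections import Counter


def summarize_metadata(meta):
    """Summarize writer settings visible in a Parquet footer.

    `meta` is expected to behave like pyarrow.parquet.FileMetaData.
    """
    codecs = Counter()
    dictionary_chunks = 0
    total_chunks = 0
    rows_per_group = []
    uncompressed_bytes_per_group = []

    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        rows_per_group.append(rg.num_rows)
        # total_byte_size is the uncompressed size of the row group's column data
        uncompressed_bytes_per_group.append(rg.total_byte_size)
        for j in range(rg.num_columns):
            col = rg.column(j)
            total_chunks += 1
            codecs[col.compression] += 1
            # Dictionary-encoded chunks list PLAIN_DICTIONARY or RLE_DICTIONARY
            if any("DICTIONARY" in enc for enc in col.encodings):
                dictionary_chunks += 1

    return {
        "created_by": meta.created_by,
        "codecs": dict(codecs),
        "dictionary_chunks": dictionary_chunks,
        "total_chunks": total_chunks,
        "rows_per_group": rows_per_group,
        "uncompressed_bytes_per_group": uncompressed_bytes_per_group,
    }


def summarize_file(path):
    """Convenience wrapper; imports pyarrow only when actually called."""
    import pyarrow.parquet as pq

    return summarize_metadata(pq.ParquetFile(path).metadata)


# Usage (paths are placeholders):
#   print(summarize_file("from_dremio.parquet"))
#   print(summarize_file("from_other_tool.parquet"))
```

If the two summaries differ in codec, dictionary usage, or row group sizes, that alone may explain the difference before looking any further.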
Another possibility could be that the JSON-derived schema is somehow different. For example, Dremio supports a union schema approach and may be producing a different schema given its ability to do schema learning.
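To rule out a schema difference, you could compare the top-level fields of the two files. This is a rough sketch: `diff_schemas` is a hypothetical helper of mine, and it only assumes each field exposes `name` and `type`, which is what pyarrow's `Schema` yields when iterated. Nested types are compared through their string form.

```python
def diff_schemas(left, right):
    """Return {field_name: (left_type, right_type)} for fields that differ.

    A type of None means the field is missing on that side.
    """
    left_types = {f.name: str(f.type) for f in left}
    right_types = {f.name: str(f.type) for f in right}

    diffs = {}
    for name in sorted(left_types.keys() | right_types.keys()):
        l_type = left_types.get(name)
        r_type = right_types.get(name)
        if l_type != r_type:
            diffs[name] = (l_type, r_type)
    return diffs


# Usage with pyarrow (paths are placeholders):
#   import pyarrow.parquet as pq
#   print(diff_schemas(pq.read_schema("from_dremio.parquet"),
#                      pq.read_schema("from_other_tool.parquet")))
```

An empty result means the top-level field names and types match.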
If you’ve confirmed that everything else is the same, the remaining possibility is one additional Dremio Parquet optimization. Dremio stores all the page headers in the Parquet footer. This increases the footer size slightly but allows needle-in-a-haystack queries to be substantially faster, especially when working with sorted data. Normally a Parquet reader has to seek to each page header within a column to see whether that page should be considered for reading. In Dremio, this information is duplicated in the footer so we can quickly determine whether any pages are relevant to a particular predicate (without affecting readers that don’t understand this data). This type of behavior (substantially modified) has now formally become a Parquet format feature through this Jira issue: https://issues.apache.org/jira/browse/PARQUET-922.
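If you want to see how much of the difference comes from the footer, you can read the footer length directly. A Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The sketch below uses only the standard library; `footer_length` is a name I made up, and it does not handle files with an encrypted footer (those end in `PARE`).

```python
import os
import struct

MAGIC = b"PAR1"


def footer_length(path):
    """Return the size in bytes of the footer metadata of a Parquet file."""
    if os.path.getsize(path) < 8:
        raise ValueError(f"{path} is too small to be a Parquet file")

    with open(path, "rb") as f:
        # Last 8 bytes: 4-byte little-endian footer length, then the magic
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)

    if tail[4:] != MAGIC:
        raise ValueError(f"{path} does not end with the {MAGIC!r} magic bytes")

    (length,) = struct.unpack("<I", tail[:4])
    return length


# Usage (paths are placeholders):
#   for p in ("from_dremio.parquet", "from_other_tool.parquet"):
#       print(p, os.path.getsize(p), footer_length(p))
```

Comparing the footer length against the total file size for each file should show whether the extra bytes sit in the footer or in the data pages.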