You should be able to use our $SCRATCH feature. Try the steps below.
Step 1: In the Dremio UI, click New Query and run the following query:
CREATE TABLE "$SCRATCH".<table_name> AS <SELECT ...>
For example:
CREATE TABLE "$SCRATCH".emp AS SELECT * FROM EMP
The output will include a path column containing the full path to the Parquet files on your executor nodes. You can also go to Jobs, click on the job above, and choose “Open Results” in the top right corner.
Step 2: cd into that path on each executor node and you should see the Parquet files.
Dremio errors out / quits after 10 schema learning attempts. Restarting the query picks up from what it learned in the previous attempts. The JSON file has multiple levels of nesting, with keys that are not present in every record.
Queries work fine. The only odd item is that a Parquet file created with Dremio is 10% larger than the same JSON-converted Parquet file created with Drill.
I would start by verifying that both tools are using the same options. Are both systems using dictionary encoding and Snappy? Are both using the same row group and page sizes? For example, I believe the Dremio defaults are:
100 KB page size
256 MB row group size
Snappy compression enabled
Dictionary encoding disabled
These are the most likely reasons for a difference.
Another possibility could be that the JSON-derived schema is somehow different. For example, Dremio supports a union schema approach and may be producing a different schema given its ability to do schema learning.
If you’ve confirmed that everything else is the same, the other possibility could be related to one additional Dremio Parquet optimization. Dremio stores all the page headers in the Parquet footer. This increases the footer size slightly but allows needle-in-a-haystack queries to be substantially faster, especially when working with sorted data. Normally a Parquet reader has to do a seek to each page header within a column to see whether that page should be considered for reading. In Dremio, this information is duplicated in the footer so we can quickly determine whether any pages are valid for a particular predicate (while not impacting readers that don’t understand this data). This type of behavior (substantially modified) has now formally become a Parquet Format feature through this jira: https://issues.apache.org/jira/browse/PARQUET-922.
The difference is probably the union approach. Drill has its union type option set to false by default, and turning it on in Drill completely breaks the JSON-to-Parquet generation for my examples.
It could also be sorted data within row groups. In the past I’ve noticed a 10% to 15% storage improvement when data is presorted, which results in a smaller number of initial branches.
I do not see the $SCRATCH feature mentioned in the documentation. Where can I learn more about this feature?
Also, is there a way to set your own $SCRATCH folder path? Thanks
@RWong we’ll be documenting this shortly. However, it’s good to note that this is an experimental feature. Currently $SCRATCH always points to a scratch folder under your distributed store path (paths.dist in dremio.conf – this can be HDFS, NAS, local FS, S3, etc.), so there is no separate setting for it.
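For reference, a sketch of the relevant dremio.conf fragment ($SCRATCH resolves to a scratch folder under whatever paths.dist points at; the values below are examples, not recommendations):

```
# dremio.conf (example values; adjust to your environment)
paths: {
  # Distributed store; $SCRATCH lives under this location.
  # Default is a pdfs:// path under paths.local; it can instead be
  # an HDFS, NAS, local FS, or S3 location, e.g.:
  #   dist: "hdfs://namenode:8020/dremio/pdfs"
  #   dist: "s3a://my-bucket/dremio"
  dist: "pdfs://"${paths.local}"/pdfs"
}
```

So to move the scratch location, you would change paths.dist as a whole rather than $SCRATCH itself.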