Converting JSON to Parquet

Is there any way to call Dremio SQL to create a permanent Parquet file?

In Apache Drill I’m running:

create table hdfs.tmp.`my_file` as select * from dfs.`/tmp/my_file.json.gz`

Then I delete the JSON permanently after moving the Parquet file to a permanent HDFS location.

Increasingly I’m running into index-out-of-bounds exceptions in Drill, so I’m looking for alternatives… (with dictionary encoding)

Hi @david.lee

You should be able to use our $SCRATCH feature. Try the steps below:

Step 1: In the Dremio UI, click New Query and run a query like the below

CREATE TABLE "$SCRATCH".<table_name> AS <SELECT ...>
CREATE TABLE "$SCRATCH".emp AS SELECT * from EMP

The output results will include a path column containing the full path to the Parquet files on your executor nodes. You can also go to Jobs, click on the above job, and use “Open results” in the top right corner.

cd into that path on each executor node and you should see the Parquet files.

Hope this helps

Thanks,
@balaji.ramaswamy

Hi @david.lee, see How-to-Copy-Dremio-Results

Attaching a screenshot.

In your case the pdfs path might be HDFS.

Thanks,
@balaji.ramaswamy

Well the good news is that this worked…

CREATE TABLE "$SCRATCH"."my_parquet_file" AS SELECT * FROM "Linux Home".my_home_dir."not_processed"."my_json_file.json.gz"

Bad news is that it took 32 schema learning passes / 3 restarts to complete.

Attempts (first run)
Attempt 1, then Attempts 2–10 each marked “(schema learning)”

Attempts (second run)
Attempt 1, then Attempts 2–10 each marked “(schema learning)”

Attempts (third run)
Attempt 1, then Attempt 2 marked “(schema learning)”

Hi @david.lee

Glad it worked. When you say “3 restarts to complete”, can you please elaborate?

Thanks
@balaji.ramaswamy

Dremio errors out / quits after 10 schema learning attempts. Restarting the query picks up from what it learned on the previous attempts. The JSON file has multiple levels of nesting with keys which are not present in every record.

Thanks @david.lee

Can we consider your initial question answered? Are you able to see the parquet files from your query output?

Queries work fine. The only odd item is that a Parquet file created with Dremio is 10% larger than the Parquet file Drill creates from the same JSON.

I would start by verifying that we are using the same options. Are both systems using dictionary encoding and snappy? Are both tools using the same size row group and page size? For example, I believe the Dremio defaults are:

  • 100 KB page size
  • 256 MB row group size
  • Snappy enabled
  • Dictionary encoding disabled

These are the most likely reasons for a difference.

Another possibility could be that the JSON-derived schema is somehow different. For example, Dremio supports a union schema approach and may be producing a different schema given its ability to do schema learning.

If you’ve confirmed that everything else is the same, the other possibility could be related to one additional Dremio Parquet optimization. Dremio stores all the page headers in the Parquet footer. This increases the footer size slightly but allows needle-in-a-haystack queries to be substantially faster, especially when working with sorted data. Normally a Parquet reader has to do a seek to each page header within a column to see whether that page should be considered for reading. In Dremio, this information is duplicated in the footer so we can quickly determine whether any pages are valid for a particular predicate (while not impacting readers that don’t understand this data). This type of behavior (substantially modified) has now formally become a Parquet Format feature through this jira: https://issues.apache.org/jira/browse/PARQUET-922.

These are the defaults in Dremio.

store.parquet.enable_dictionary_encoding = true
store.parquet.enable_dictionary_encoding_binary_type = false

The difference is probably the union approach. Drill has union types disabled by default (exec.enable_union_type = false), and turning them on in Drill completely breaks the JSON-to-Parquet generation for my examples.

It could also be sorted data within row groups. In the past I’ve noticed a 10% to 15% storage improvement when data is presorted, which results in a smaller number of initial branches.

Bob
John
Mary
Bob

is four branches, while

Bob
Bob
John
Mary

is three branches.

Hi, @balaji.ramaswamy

I do not see the $SCRATCH feature mentioned in the documentation. Where can I learn more about it?
Also, is there a way to set your own $SCRATCH folder path? Thanks

@RWong we’ll be documenting this shortly. However, it’s good to note that this is an experimental feature. Currently $scratch will always point to a scratch folder under your distributed store path (paths.dist in dremio.conf – this can be HDFS/NAS/local FS/S3, etc.)
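In other words, there is no separate knob for the scratch location itself; it follows the distributed store. Relocating it would mean changing paths.dist in dremio.conf, along these lines (the HDFS URL below is just a placeholder):

```
paths: {
  dist: "hdfs://namenode:8020/data/dremio"
}
```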
