Converting JSON to Parquet

Is there any way to call Dremio SQL to create a permanent Parquet file?

In Apache Drill I’m running:

create table hdfs.tmp.`my_file` as select * from dfs.`/tmp/my_file.json.gz`

Then I delete the JSON permanently after moving the Parquet file to a permanent HDFS location.

Increasingly I’m running into index-out-of-bounds exceptions in Drill, so I’m looking for alternatives… (with dictionary encoding)

Hi @david.lee

You should be able to use our $SCRATCH feature. Try the steps below:

Step 1: In the Dremio UI, click New Query and run a query like the below

CREATE TABLE "$SCRATCH".<table_name> AS <SELECT ...>
CREATE TABLE "$SCRATCH".emp AS SELECT * from EMP

The output results will include a path column containing the full path to the Parquet files on your executor nodes. You can also go to Jobs, click on the above job, and use “Open results” in the top right corner.

cd into that path on each executor node and you should see the Parquet files.

Hope this helps

Thanks,
@balaji.ramaswamy

Hi @david.lee, see How-to-Copy-Dremio-Results

Attaching a screenshot.

In your case the pdfs path might be HDFS.

Thanks,
@balaji.ramaswamy

Well the good news is that this worked…

CREATE TABLE "$SCRATCH"."my_parquet_file" AS SELECT * FROM "Linux Home".my_home_dir."not_processed"."my_json_file.json.gz"

Bad news is that it took 32 schema learning passes / 3 restarts to complete.

Attempts (first run)
Attempt 1, then Attempts 2–10 each marked “(schema learning)”

Attempts (second run)
Attempt 1, then Attempts 2–10 each marked “(schema learning)”

Attempts (third run)
Attempt 1, then Attempt 2 marked “(schema learning)”

Hi @david.lee

Glad it worked. When you say “3 restarts to complete”, can you please elaborate?

Thanks
@balaji.ramaswamy

Dremio errors out / quits after 10 schema learning attempts. Restarting the query picks up from what it learned on the previous attempts. The JSON file has multiple levels of nesting with keys which are not present in every record.

Thanks @david.lee

Can we consider your initial question answered? Are you able to see the parquet files from your query output?

Queries work fine. The only odd item is that a Parquet file created with Dremio is 10% larger than the Parquet file Drill creates from the same JSON.

I would start by verifying that we are using the same options. Are both systems using dictionary encoding and snappy? Are both tools using the same size row group and page size? For example, I believe the Dremio defaults are:

  • 100 KB page size
  • 256 MB row group size
  • Snappy enabled
  • Dictionary encoding disabled

These are the most likely reasons for a difference.

Another possibility could be that the JSON-derived schema is somehow different. For example, Dremio supports a union schema approach and may be producing a different schema given its ability to do schema learning.

If you’ve confirmed that everything else is the same, the other possibility could be related to one additional Dremio Parquet optimization. Dremio stores all the page headers in the Parquet footer. This increases the footer size slightly but allows needle-in-a-haystack queries to be substantially faster, especially when working with sorted data. Normally a Parquet reader has to do a seek to each page header within a column to see whether that page should be considered for reading. In Dremio, this information is duplicated in the footer so we can quickly determine whether any pages are valid for a particular predicate (while not impacting readers that don’t understand this data). This type of behavior (substantially modified) has now formally become a Parquet Format feature through this jira: https://issues.apache.org/jira/browse/PARQUET-922.

These are the defaults in Dremio.

store.parquet.enable_dictionary_encoding = true
store.parquet.enable_dictionary_encoding_binary_type = false

The difference is probably the union approach. Drill has union types disabled by default (exec.enable_union_type = false), and turning them on in Drill completely breaks the JSON-to-Parquet generation for my examples.

It could also be sorted data within row groups. In the past I’ve noticed a 10% to 15% storage improvement when data is presorted, which results in a smaller number of initial branches.

Bob
John
Mary
Bob

is four branches, while

Bob
Bob
John
Mary

is three branches.

Hi, @balaji.ramaswamy

I do not see the $SCRATCH feature mentioned in the documentation. Where can I learn more about it?
Also, is there a way to set your own $SCRATCH folder path? Thanks

@RWong we’ll be documenting this shortly. However, it’s good to note that this is an experimental feature. Currently $scratch will always point to a scratch folder under your distributed store path (paths.dist in dremio.conf – this can be HDFS/NAS/local FS/S3, etc.)
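In other words, there is no separate knob for the scratch location itself; it follows the distributed store. Relocating it would mean changing paths.dist in dremio.conf, along these lines (the HDFS URL below is just a placeholder):

```
paths: {
  dist: "hdfs://namenode:8020/data/dremio"
}
```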
