Consuming arrow formatted files

jhaynie · December 13, 2018, 9:36pm

This is a bit of a novice question, but our pipeline is written in Go. We process a lot of JSON files into Dremio and we’d like to experiment with pre-processed Parquet files instead to see if we can get better performance out of Dremio. Go seems to have crappy Parquet support from what I’ve tested so far (if someone knows of a good native Go reader AND writer for Parquet, please LMK), but Apache Arrow has official Go support now. Can I just write Arrow-formatted files that Dremio can consume? Is this a dumb question?

kelly · January 7, 2019, 11:17pm

This isn’t a dumb question at all.

Data Reflections will do this work for you. Have you tried this approach?

The Arrow columnar format is optimized for in-memory use. I think you would find that it is far from ideal for on-disk storage due to space overhead. In addition, there isn’t an official on-disk format (see Feather), and we don’t provide a way for end users to specify this as a data source file format.

Are you deployed within a Hadoop environment? If so, perhaps you should consider ORC (I have no idea of the quality of this project, so caveat emptor): https://github.com/scritchley/orc

Alternatively, compressed JSON might be the next best bet (Dremio can read the compressed format).

jhaynie · January 9, 2019, 12:54am

OK, we’re sticking with compressed JSON for now.

abbotware · December 27, 2019, 4:02pm

Regarding arrow files:

There is an official on-disk format, specified via FlatBuffers; see “File Format” here: https://arrow.apache.org/docs/ipc.html

In my case, disk space is not a concern, as the Arrow format is more efficient to read and write for very large data sets.

Looking into the GitHub repo, it seems as though there is a plugin for Arrow…

How can this plugin be enabled?

(see below)

github.com

dremio/dremio-oss/blob/8e85901e7222c81ccad3436ba9b63485503472ac/sabot/kernel/src/main/java/com/dremio/exec/store/easy/arrow/ArrowFormatPlugin.java


jrevels · May 3, 2021, 1:32am

Reviving this thread, as I’d be really excited about this feature as well.

In our org, we write out a lot of our data to Arrow files in S3 (often with compression enabled, which seems to result in sizes comparable to Parquet for our data). We’d love to use Dremio to query these objects directly without needing to set up an intermediary auto-convert-all-incoming-Arrow-to-Parquet job and/or force data producers to switch to writing Parquet.

Any idea what would be required to enable this?

balaji.ramaswamy · May 3, 2021, 5:49am

@jrevels Currently, Dremio does not support reading Arrow files directly.

fetanchaud · May 12, 2021, 2:43pm

Hello @jrevels, as a possible workaround: Dremio has its own Arrow format to dump and restore in-memory datasets, and you could use Dremio’s SQL extensions to read and write such files. I don’t know how portable it is, though.
You can find the related syntax here:

github.com/fabrice-etanchaud/dbt-dremio

"file" materialization : add more export formats

opened 02:40PM - 09 Nov 20 UTC

fabrice-etanchaud

enhancement

Currently, dremio's documentation only mentions parquet tables, but as read in …the sources, CTAS allows for undocumented options:

`CREATE TABLE xxx STORE AS (format options) [WITH SINGLE WRITER] AS SELECT yyy`

where format options can be:

```
- type => 'json', prettyPrint => false
- type => 'text', fieldDelimiter => ',', lineDelimiter => '\r\n'
- type => 'parquet', outputExtension => 'myparquet'
```

to investigate: are these memory tables? how long do they persist?

`- type => 'arrow'`

to investigate:

`SELECT * FROM TABLE(pds/vds path, (type => 'excel', extractHeader => true, hasMergedCells => true, xls => true))`

can we export as excel?
