This is a bit of a novice question … but our pipeline is written in Go. We process a lot of JSON files into Dremio, and we’d like to experiment with using pre-processed Parquet files instead to see if we can get better performance out of Dremio. Go seems to have crappy Parquet support from what I’ve tested so far (if someone knows of a good reader AND writer for Parquet that is native Go, please LMK) … but Apache Arrow has official Go support now. Can I just write Arrow-formatted files that Dremio can consume? Is this a dumb question?
This isn’t a dumb question at all.
Data Reflections will do this work for you. Have you tried this approach?
The Arrow columnar format is optimized for in-memory use. I think you would find that it is far from ideal for on-disk storage due to space overhead. In addition, there isn’t an official on-disk format (see Feather), and we don’t provide a way for end users to specify it as a data source file format.
Are you deployed within a Hadoop environment? If so, perhaps you should consider ORC (I have no idea of the quality of this project, so caveat emptor): https://github.com/scritchley/orc
Alternatively, compressed JSON might be the next best bet (Dremio can read the compressed format).
OK, we’re sticking with compressed JSON for now.
Regarding Arrow files:
There is an official on-disk format specified via FlatBuffers; see “File Format” here: https://arrow.apache.org/docs/ipc.html
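For reference, that file format brackets the stream with magic bytes: the file starts with `ARROW1` followed by two padding bytes, and ends with `ARROW1`. A small stdlib-only sniffer (the function name is mine) can tell such files apart from, say, Parquet before handing them off:

```go
package main

import (
	"bytes"
	"fmt"
)

// isArrowFile reports whether data looks like the Arrow IPC *file*
// (random-access) format: "ARROW1\x00\x00" header and "ARROW1" trailer,
// per the Arrow IPC specification.
func isArrowFile(data []byte) bool {
	header := []byte("ARROW1\x00\x00")
	trailer := []byte("ARROW1")
	return len(data) >= len(header)+len(trailer) &&
		bytes.HasPrefix(data, header) &&
		bytes.HasSuffix(data, trailer)
}

func main() {
	sample := []byte("ARROW1\x00\x00...payload...ARROW1")
	fmt.Println(isArrowFile(sample))         // true
	fmt.Println(isArrowFile([]byte("PAR1"))) // false (Parquet magic)
}
```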
In my case, disk space is not a concern, as the Arrow format is more efficient to read and write for very large data sets.
Looking into the GitHub repo, it seems as though there is a plugin for Arrow…
How can this plugin be enabled?
Reviving this thread as I’d be really excited about this feature as well
In our org, we write out a lot of our data to Arrow files in S3 (often with compression enabled, which seems to result in sizes comparable to Parquet for our data). We’d love to use Dremio to query these objects directly, without needing to set up an intermediary auto-convert-all-incoming-Arrow-to-Parquet job and/or force data producers to switch to writing Parquet.
Any idea what would be required to enable this?
@jrevels Currently Dremio does not support reading Arrow files directly.
Hello @jrevels, Dremio does actually have its own Arrow format for dumping/restoring in-memory datasets, and you could use Dremio SQL extensions to read and write such files. But I don’t know how portable it is.
You can find the related syntax here: