Consuming Arrow-formatted files


#1

This is a bit of a novice question … but our pipeline is written in Go. We process a lot of JSON files into Dremio, and we’d like to experiment with pre-processed Parquet files instead to see if we can get better performance out of Dremio. Go seems to have crappy Parquet support from what I’ve tested so far (if someone knows of a good native Go reader AND writer for Parquet, please LMK) … but Apache Arrow has official Go support now. Can I just write Arrow-formatted files that Dremio can consume? Is this a dumb question?


#2

This isn’t a dumb question at all. :)

Data Reflections will do this work for you. Have you tried this approach?

The Arrow columnar format is optimized for in-memory use. I think you would find that it is far from ideal for on-disk storage because of its space overhead. In addition, there isn’t an official on-disk format (see Feather), and we don’t provide a way for end users to specify Arrow as a data source file format.

Are you deployed within a Hadoop environment? If so, perhaps you should consider ORC (I have no idea of the quality of this project, so caveat emptor): https://github.com/scritchley/orc

Alternatively, compressed JSON might be the next best bet (Dremio can read the compressed format).


#3

OK, we’re sticking with compressed JSON for now.