Loading ORC files

rstel · January 4, 2018, 11:33am

Hi,

I would like to load ORC files but so far haven’t succeeded.
For my test, I used the following version:
Build 1.3.1-201712020438070881-a7af5c8, Community Edition
(running on my Windows 7 laptop for testing purposes)

I am able to add the S3 bucket and see the ORC file I’d like to add.
When I click on “Browse content” I get a “Preparing Results…” popup, followed by an empty preview with the error message “Table [x.orc] not found”.
(The same happens when I click the icon showing a file with an arrow pointing to another file with a grid.)

I’m wondering if it might be timing out. (With the bandwidth currently available it would take me ~5 minutes to download the raw file, and I don’t think the “Preparing Results…” screen stays up for 5 minutes.)

I then downloaded the file to my PC. I can read its structure and content with a local Spark shell.
Then I tried to add it to Dremio as a dataset, but I can only choose from the following formats:
Unknown/Text/JSON/Parquet/Excel/XLS
I tried “Unknown” but got a “Something went wrong” error message.

Is there any way to load ORC files?

Thanks much for your support
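
For anyone who hits the same “Unknown” format error with a downloaded file, one quick sanity check is to confirm the file really is ORC and was not truncated at the start during download. This is a minimal sketch, not anything Dremio does internally: it relies on ORC files beginning with the three bytes `ORC` and Parquet files beginning with `PAR1`. The helper name `sniff_format` is made up for illustration.

```python
# Hypothetical helper for illustration only; not part of Dremio or any ORC library.
ORC_MAGIC = b"ORC"       # ORC files begin with these three bytes
PARQUET_MAGIC = b"PAR1"  # Parquet files begin with these four bytes


def sniff_format(path):
    """Return 'orc', 'parquet', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(ORC_MAGIC):
        return "orc"
    if head == PARQUET_MAGIC:
        return "parquet"
    return "unknown"
```

Calling `sniff_format("x.orc")` on the local copy only inspects the header, so it says nothing about whether the rest of the file is intact; reading it with Spark, as described above, is the stronger check.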

kelly · January 5, 2018, 3:15pm

You can access ORC files via Hive. Let me see if there are any other options.

rstel · January 5, 2018, 5:07pm

Thanks for your feedback.
It would be great if there were another option.
Otherwise (correct me if I’m wrong), we would need to register the file as a table in Hive and keep the Hive server running 24/7.

If there is currently no other option, is this feature somewhere on the roadmap?

Thanks much

kelly · January 5, 2018, 8:15pm

You’re right - for ORC and Avro, Dremio runs the query through Hive and uses Hive’s file readers.

For Parquet, JSON, and other formats, Dremio has its own high-performance readers.

ORC is at or near the top of the list on our roadmap for file formats we will support. We will report back here once that’s available.

Thanks for sharing this, and sorry that you had to try as hard as you did. We will update the documentation to explain how to work with ORC and Avro and a few other formats.

akhanolkar · May 2, 2018, 6:24pm

Does Dremio now have its own high-performance readers for Avro and ORC (versus via Hive)?

Thanks

anthony · May 2, 2018, 6:27pm

@akhanolkar Dremio currently works best with Parquet. Avro/ORC performance is constantly being improved, and we expect a release that further optimizes them during summer/fall.

kelly · May 2, 2018, 9:45pm

Also, if you create a data reflection on the ORC data, the format of the source data is irrelevant in terms of query speed - Dremio will automatically rewrite queries to use the Data Reflection instead of reading the source data.

dewan.lisan · May 29, 2018, 1:39pm

@kelly You mentioned: “Dremio will automatically rewrite queries to use the Data Reflection instead of reading the source data”

Does this mean data will be copied from Hive to the local “Dremio Data Store”? If so, is the data movement a seamless process, or do we need to schedule a job to copy the data? Does this also mean the data is no longer in Hadoop?

kelly · May 29, 2018, 1:46pm

Presumably you’re running Dremio as a YARN app and storing data reflections in HDFS? If so, then the Parquet files will live there. Dremio will manage these files and their updates according to the refresh policy you set on the source.

Does that answer your question?

dewan.lisan · May 29, 2018, 2:23pm

We have not installed Dremio in any environment yet; we are still reading the documentation. We are basically looking for a tool that lets our business analysts run both ad-hoc queries and Tableau dashboards on data in our data lake. It should be fast and interactive!

We have Hive tables in ORC format, and ORC is still not supported natively by Dremio. How can we use Dremio as an accelerator (assuming reflections need to be enabled)?

kelly · May 29, 2018, 2:41pm

You can create reflections on ORC data, and with these reflections your Tableau queries should be interactive in speed, as should queries from your other tools.

To access this data, create a connection to your Hive tables. Dremio will then access your ORC files through Hive when creating the reflection. Subsequent queries will be rewritten to use the reflection instead of using the Hive readers.

I would suggest you run Dremio in your cluster as a YARN application so the reflections will be stored in HDFS as well: https://docs.dremio.com/deployment/yarn-hadoop.html

sajjan.jindal · November 25, 2019, 7:25pm

Hi Kelly,
Is there any update on the direct ORC reader, instead of going through Hive?

Regards,
Sajjan
