I would like to try to load ORC files but so far didn’t succeed.
For my test, I used the following version:
Build 1.3.1-201712020438070881-a7af5c8, Community Edition
(running on my Windows 7 Laptop for testing purpose)
I am able to add the S3 bucket and see the ORC file I’d like to add.
When I click on “Browse content” I get a “Preparing Results…” popup and then an empty preview with the error message “Table [x.orc] not found”.
(The same happens when I click on the icon showing a file with an arrow pointing to another file with a grid.)
I’m wondering if it might time out. (With the bandwidth currently available it would take me ~5m to download the raw file and I don’t think the “Preparing Results” screen remains there for 5 minutes)
I then downloaded the file to my pc. I can read its structure & content with a local spark shell.
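For anyone who wants a quicker sanity check than starting a Spark shell: ORC files begin with the three ASCII bytes "ORC", so a few lines of plain Python can confirm that a downloaded file at least looks like ORC. This is only a minimal sketch and not what was done above; the function name looks_like_orc is made up for illustration, and the check says nothing about whether the rest of the file is intact.

```python
import sys

# ORC files start with the 3-byte magic string "ORC".
ORC_MAGIC = b"ORC"


def looks_like_orc(path):
    """Return True if the file at `path` starts with the ORC magic bytes.

    Hypothetical helper for illustration only. It checks the header and
    nothing else, so a truncated download would still pass.
    """
    with open(path, "rb") as f:
        return f.read(len(ORC_MAGIC)) == ORC_MAGIC


if __name__ == "__main__":
    # Usage: python check_orc.py file1.orc file2.orc ...
    for path in sys.argv[1:]:
        print(path, "looks like ORC" if looks_like_orc(path) else "is not ORC")
```

If the check fails, the download itself is the first thing to look at, before any Dremio settings.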
Then I tried to add it to dremio as a dataset but again I can only choose the following formats:
I then tried with “Unknown” but get a “Something went wrong” error message.
Is there any way to load ORC files?
Thanks much for your support
You can access ORC files via Hive. Let me see if there are any other options.
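As a rough sketch of what going through Hive could involve: the ORC files would be registered as an external table in Hive, and Dremio would then connect to Hive as a source. The Python below only builds the HiveQL text; the helper name hive_orc_table_ddl, the table name, the columns, and the bucket path are all made-up placeholders, and the columns would have to match the actual schema of the file.

```python
def hive_orc_table_ddl(table, columns, location):
    """Build a HiveQL statement that registers existing ORC files
    as an external table.

    Hypothetical helper for illustration only.
    `columns` is a list of (name, hive_type) pairs.
    `location` should be the directory holding the ORC files, not a single file.
    """
    if not columns:
        raise ValueError("at least one column is required")
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        f"{cols}\n"
        f")\n"
        f"STORED AS ORC\n"
        f"LOCATION '{location}'"
    )


# Placeholder values; replace with the real schema and path.
ddl = hive_orc_table_ddl(
    "events",
    [("id", "BIGINT"), ("name", "STRING")],
    "s3a://my-bucket/path/to/orc/",
)
print(ddl)
```

The printed statement would be run in Hive (for example through beeline). Because the table is external, Hive only records metadata and the ORC files stay where they are.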
Thanks for your feedback.
It would be great if there’d be another option.
Otherwise, (correct me if I’m wrong) we would need to register the file as a table in Hive as well and have the Hive server running 24/7.
If there’s currently no other option, is this feature on the roadmap?
You’re right - for ORC and Avro, Dremio runs the query through Hive and uses Hive’s file readers.
For Parquet, JSON, and other formats, Dremio has its own high-performance readers.
ORC is at or near the top of the list on our roadmap for file formats we will support. We will report back here once that’s available.
Thanks for sharing this, and sorry that you had to try as hard as you did. We will update the documentation to explain how to work with ORC and Avro and a few other formats.
Does Dremio now have its own high-performance readers for Avro and ORC (versus via Hive)?
@akhanolkar Dremio currently works best with Parquet. Avro/ORC performance is constantly being improved, and we expect a release to further optimize it during summer/fall.
Also, if you create a data reflection on the ORC data, the format of the source data is irrelevant in terms of query speed - Dremio will automatically rewrite queries to use the Data Reflection instead of reading the source data.
@kelly You mentioned: “Dremio will automatically rewrite queries to use the Data Reflection instead of reading the source data”
Does this mean, data will be copied from hive to local “Dremio Data Store”? If so, is the data movement a seamless process? Or do we need to schedule any job to copy the data? Does this also mean, data is not in hadoop anymore?
Presumably you’re running Dremio as a YARN app and storing data reflections in HDFS? If so, then the Parquet files will live there. Dremio will manage these files and their updates according to the refresh policy you set on the source.
Does that answer your question?
We have not installed Dremio in any environment yet; we are still reading the documentation. We are basically looking for a tool that will let our business analysts run both ad-hoc queries and Tableau dashboards on data in our data lake. It should be fast and interactive!
We have hive (orc format) tables. And ORC is still not supported by Dremio. How can we use Dremio as an accelerator (assuming Reflection needs to be enabled)?
You can create reflections on ORC data, and with these reflections your Tableau queries should be interactive in speed. Your other tools as well.
To access this data, create a connection to your Hive tables. Dremio will then access your ORC files through Hive when creating the reflection. Subsequent queries will be rewritten to use the reflection instead of using the Hive readers.
I would suggest you run Dremio in your cluster as a YARN application so the reflections will be stored in HDFS as well: https://docs.dremio.com/deployment/yarn-hadoop.html
Is there any update on a direct ORC reader, instead of going through Hive?