Hi
Any plan to support Impala and Kudu storage as Source?
Many thanks
Nelson
Impala is not currently supported, but you should be able to use Dremio to query files in HDFS and S3, including Parquet, Avro, ORC, and HBase-encoded data, as well as CSV and JSON.
I’m not seeing Avro as an option when trying to apply a format to the file.
Avro is accessed via Hive tables. You need to set those up first, and we use the Hive readers. You should expect performance on Avro-encoded data to be significantly slower than on Parquet and ORC.
Thanks @kelly … Perhaps a feature request to have Avro as a direct plugin … to be used for creating formats.
We can keep an eye on this, but frankly there seem to be many other ways to convert Avro to columnar formats. If you create a reflection on an Avro source, that’s effectively what you’re doing.
We wouldn’t recommend using Dremio to do this for large data volumes but for small to medium datasets it is probably ok.
Kelly
Thanks @kelly. Definitely, planning on using Dremio only for read-only operations for now.
Hi Kelly
We have an Impala data source in Parquet format. I tried the workaround of accessing the files via an HDFS connection - we don’t use Hive. The problem is that our tables are partitioned by date, so there are many files for a single table, and we have many tables.
Dremio’s HDFS source lets me access each individual file, but how do I manage the many files of a table as a whole? Any suggestion, or did I miss anything here?
Thank you
Forget this, I figured out how to do it.
Apply the formatting at the folder level, in this example the tmp folder:
That will treat all the files in the directory as one large table. Subdirectories work as well, provided the file formats are all the same.
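For example (this is a sketch with assumed names - an HDFS source called `hdfs` and date-named subdirectories under `tmp/orders` - not your actual layout), once the format is applied to the folder, Dremio exposes each directory level as a `dir0`, `dir1`, … pseudo-column, so the date partitions can be queried as part of one table:

```sql
-- Assumed layout: tmp/orders/2018-01-01/*.parquet, tmp/orders/2018-01-02/*.parquet, ...
-- After formatting the "orders" folder, each subdirectory name appears as dir0.
SELECT dir0 AS order_date,
       COUNT(*) AS row_cnt
FROM hdfs.tmp.orders
GROUP BY dir0
ORDER BY dir0;
```

The same idea lets you prune partitions with a `WHERE dir0 = '2018-01-01'` predicate instead of pointing queries at individual files.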
Thanks for the suggestion, I did exactly the same thing and I can now work with the whole table rather than on each file.
But there is a problem here: all the string fields in the Parquet files have the type VARBINARY(65536). I can’t change the type of these fields, and they cause an error when they appear in the WHERE clause of a query. For example, the query
SELECT * FROM implal.orders WHERE order_id = 'abcd'
fails with the error:
Cannot apply '=' to arguments of type '<VARBINARY(65536)> = <VARCHAR(4)>'. Supported form(s): '<COMPARABLE_TYPE> = <COMPARABLE_TYPE>'
I can use btrim(order_id) in that particular query, but I would have to apply this transformation to every string field in every query.
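Another workaround I could try (an assumption on my part - I believe Dremio’s CONVERT_FROM can decode binary columns as UTF-8 strings, but I haven’t verified it against these files):

```sql
-- Decode the VARBINARY column to a UTF-8 string before comparing.
SELECT *
FROM implal.orders
WHERE CONVERT_FROM(order_id, 'UTF8') = 'abcd';
```

If that works, the conversion could be applied once per table rather than repeated in every query.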
Another solution is to create a full-size virtual dataset, with the fields cast to the right types, for every table in the database.
Is there any better solution?
Thank you