Impala and Kudu as Source


Any plan to support Impala and Kudu storage as Source?

Many thanks


Impala is not currently supported, but you should be able to use Dremio to query files in HDFS and S3, including Parquet, Avro, ORC, and HBase-encoded data, as well as CSV and JSON.

I’m not seeing Avro as a source while trying to add format to the file.

Avro is accessed via Hive tables. You need to set those up first, and we use the Hive readers. You should expect performance on Avro-encoded data to be significantly slower than on Parquet and ORC.


Thanks @kelly … Perhaps a feature request to have Avro as a direct plugin … to be used for creating formats. :sunglasses:

We can keep an eye on this, but frankly there seem to be many other ways to convert to columnar formats from Avro. If you create a reflection on an Avro source that’s effectively what you’re doing.

We wouldn’t recommend using Dremio to do this for large data volumes but for small to medium datasets it is probably ok.


Thanks @kelly. Definitely, planning on using Dremio only for read-only operations for now.

Hi Kelly

We have an Impala data source in the Parquet format. I tried the workaround of accessing the files via an HDFS connection, since we don’t use Hive. The problem is that our tables are partitioned by date, so there are many files for a single table, and we have many tables.

Dremio HDFS lets me access each single file, but how do I manage the many files of a table as a whole? Any suggestions, or did I miss anything here?

Thank you

Forget this, I figured out how to do it.

Apply the formatting at the folder level (in this example, the tmp folder).

That will treat all the files in the directory as one large table. Subdirectories work as well, provided the file formats are all the same.

Thanks for the suggestion, I did exactly the same thing and I can now work with the whole table rather than on each file.

But there is a problem here: all the string fields in the Parquet file have the type VARBINARY(65536). I can’t change the type of these fields, and they cause an error when they appear in the WHERE clause of a query. For example, the query

SELECT * FROM implal.orders where order_id = 'abcd'

fails with this error:

Cannot apply ‘=’ to arguments of type ‘<VARBINARY(65536)> = <VARCHAR(4)>’. Supported form(s): ‘<COMPARABLE_TYPE> = <COMPARABLE_TYPE>’ ’ = ’ ’ = ’

I can use btrim(order_id) in that particular query, but then I have to apply this transformation to every string field in every query.

Another solution is to create a full size virtual dataset for every table in the database.
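For what it's worth, writing one wrapper view per table by hand gets tedious, so the statements could be generated. Below is a minimal sketch in Python that only builds the SQL text; it does not connect to Dremio. It assumes that CONVERT_FROM(col, 'UTF8') is an acceptable way to decode the binary columns and that CREATE VDS is available in your Dremio version, so check both against your install. The view name staging.orders and the amount column are hypothetical; only implal.orders and order_id come from the query above.

```python
from typing import Iterable, Tuple


def quote_ident(name: str) -> str:
    """Double-quote a column name, escaping any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_view_sql(view: str, table: str,
                   columns: Iterable[Tuple[str, str]]) -> str:
    """Build a CREATE VDS statement that decodes VARBINARY columns to text.

    `columns` is a sequence of (column_name, column_type) pairs, for example
    copied from the dataset's schema. Non-binary columns pass through as-is.
    `view` and `table` are inserted verbatim, so pass trusted names only.
    """
    select_items = []
    for name, col_type in columns:
        ident = quote_ident(name)
        if col_type.upper().startswith("VARBINARY"):
            # Assumption: CONVERT_FROM(<binary>, 'UTF8') yields a VARCHAR.
            select_items.append(f"CONVERT_FROM({ident}, 'UTF8') AS {ident}")
        else:
            select_items.append(ident)
    body = ",\n  ".join(select_items)
    return f"CREATE VDS {view} AS\nSELECT\n  {body}\nFROM {table}"


if __name__ == "__main__":
    # Hypothetical schema: only order_id appears in the thread.
    schema = [("order_id", "VARBINARY(65536)"), ("amount", "DOUBLE")]
    print(build_view_sql("staging.orders", "implal.orders", schema))
```

Queries would then go against the generated view, where order_id compares directly with a string literal, instead of repeating the conversion in every query.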

Is there a better solution?

Thank you