I have a few questions that came up while I was evaluating the product:
I can’t find a way to connect to data sources not listed in Dremio using a JDBC driver. I am not looking for the JDBC driver for Dremio, but for a way to use a JDBC driver to connect to any source.
I did not understand the point made on one of the forums about Impala being replaced by Apache Arrow.
I would also like to understand how row-level security is maintained in Dremio. From what I have read, Dremio uses Ranger, but in my use case I have Sentry implemented on the Cloudera stack, and it doesn’t support Ranger. How would this model work?
Can we restrict an administrator from running queries on, or seeing, highly sensitive data?
When you say Dremio provides data as a service, can I publish my jobs as web services? If not, what do you mean by data as a service, and how do I achieve it?
Which technology does Dremio use to execute its jobs under the YARN resource manager? Is it MapReduce or Spark?
Finally, if I want to replace my existing ETL tool, i.e. Informatica, how easy would it be to write a complex ETL replacement in Dremio?
I can’t find a way to connect to data sources not listed in Dremio using a JDBC driver. I am not looking for the JDBC driver for Dremio, but for a way to use a JDBC driver to connect to any source.
This does not exist. We are working on making an SDK available that will allow you to build your own connectors for sources that Dremio does not yet support.
I did not understand the point made on one of the forums about Impala being replaced by Apache Arrow.
The two projects are not comparable. Apache Arrow is a project for columnar in-memory data and efficient processing of this columnar format. Impala is a SQL engine. Read more about Arrow here: https://www.dremio.com/apache-arrow-explained/
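To make the "columnar in-memory data" point a bit more concrete, here is a minimal sketch in plain Python. It does not use Arrow's API or its actual memory format; the field names and values are made up for illustration. It only contrasts a row-oriented layout with a column-oriented one, which is the kind of layout Arrow standardizes.

```python
from array import array

# Row-oriented layout: one record per entry (hypothetical sample data).
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 20.0},
    {"id": 3, "price": 0.5},
]

# Column-oriented layout: one contiguous, typed buffer per field.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}


def total_price_rows(records):
    # Has to visit every record and pick out one field from each.
    return sum(r["price"] for r in records)


def total_price_columns(cols):
    # Scans a single contiguous buffer; the other fields are never touched.
    return sum(cols["price"])
```

Both functions return the same total. The difference is in how the data is laid out in memory, which is a storage and processing concern rather than a SQL engine like Impala.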
I would also like to understand how row-level security is maintained in Dremio. From what I have read, Dremio uses Ranger, but in my use case I have Sentry implemented on the Cloudera stack, and it doesn’t support Ranger. How would this model work?
Dremio Enterprise Edition provides row- and column-level access control as well as masking, integrated with LDAP/AD group membership. This functionality is independent of Ranger or Sentry. If you are using Ranger, we also provide an integration with it. We do not support Sentry.
Can we restrict an administrator from running queries on, or seeing, highly sensitive data?
See previous answer.
When you say Dremio provides data as a service, can I publish my jobs as web services? If not, what do you mean by data as a service, and how do I achieve it?
Which technology does Dremio use to execute its jobs under the YARN resource manager? Is it MapReduce or Spark?
Dremio provides its own engine, based on Apache Arrow. When deployed in a Hadoop cluster as a YARN application, Dremio is a long-running process that is allocated resources based on the YARN queue. Most of our cloud customers deploy Dremio independently of a Hadoop cluster, using Kubernetes. In short, MapReduce and Spark do not provide the speed or resource management features necessary, so we developed our own SQL engine.
Finally, if I want to replace my existing ETL tool, i.e. Informatica, how easy would it be to write a complex ETL replacement in Dremio?
Long-running ETL should be run in an ETL tool or via Hive or Spark. Interactive ETL, or what we call “last mile” ETL, is a good fit for Dremio’s virtual datasets (https://docs.dremio.com/working-with-datasets/virtual-datasets.html). Instead of making copies of the data, Dremio manages these transformations using standard SQL and applies them at query time each time the data is accessed. This makes it easy to let every user have the exact version of the data they need, without creating copies that add risk for security and governance. In addition, Dremio automatically tracks the lineage and provenance of the data using our Data Graph: https://www.dremio.com/solutions/data-lineage
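As a rough analogy for "transformations applied at query time instead of copies", the sketch below uses an ordinary SQL view in SQLite through Python's standard library. This is not Dremio and not its API; the table, view, and column names are hypothetical, and SQLite only stands in for whatever source holds the raw data.

```python
import sqlite3

# In-memory SQLite database standing in for the source system (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "shipped"), (2, 400, "cancelled"), (3, 9900, "shipped")],
)

# The "last mile" transformation is stored as SQL, not as a copy of the data.
conn.execute(
    """
    CREATE VIEW shipped_orders AS
    SELECT id, amount_cents / 100.0 AS amount
    FROM raw_orders
    WHERE status = 'shipped'
    """
)


def read_shipped_orders(connection):
    # The view's SQL is evaluated against the raw table on every call.
    query = "SELECT id, amount FROM shipped_orders ORDER BY id"
    return connection.execute(query).fetchall()


before = read_shipped_orders(conn)

# A new source row appears in the view on the next query, with no reload step.
conn.execute("INSERT INTO raw_orders VALUES (4, 250, 'shipped')")
after = read_shipped_orders(conn)
```

Because the view holds only the transformation, there is no second copy of the rows to secure or keep in sync, which is the property the answer above attributes to virtual datasets.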