Dremio Architecture query

I have a couple of questions:

  1. When our source is just HDFS and we want to improve query performance, can I choose Drill over Dremio? What additional benefits can Dremio give us over Drill?
  2. Dremio does query push-down and doesn't store any data. In that case, how do columnar and in-memory computation help improve query performance when my query is spread across different databases? Once a query is pushed down, it is all up to how the source DB processes it; if the source is a traditional RDBMS, that is row-based, disk-based processing, right? Dremio's columnar, in-memory role only comes into play after the data is fetched from the source DB into Dremio, so if the underlying DB is slow, my query performance will be slow.

Hi Dinesh,

There are many differences compared to Drill:

  1. Data Reflections. When a source is too slow, or you want to offload queries from the source, Dremio supports Data Reflections. These can improve performance by orders of magnitude, especially over JSON and row-oriented sources (see the SQL sketch at the end of this reply). You can read more about Data Reflections here: https://docs.dremio.com/acceleration/reflections.html

  2. Performance. Even without Data Reflections, Dremio is significantly faster than Drill running queries in a Hadoop/MapR environment. Many factors are at play, including the use of Apache Arrow, vectorized readers for file formats like Parquet, LLVM-based query compilation, and an asynchronous multi-threaded execution model. Typically users see on the order of a 5x speed-up simply by moving from Drill to Dremio.

  3. Virtual Datasets and Data Catalog. Dremio allows users to develop a logical data model that is independent of the physical data (also shown in the sketch below). Users can organize and describe their virtual datasets in spaces, with true enterprise security features like AD/LDAP authentication and role-based access control (Enterprise Edition). In addition, Dremio automatically tracks the provenance and lineage of virtual datasets (VDS) as they relate to physical datasets (PDS) and other VDS.

  4. Better support for more sources. Dremio has more sophisticated push-down capabilities for relational databases, and much better support for sources like Elasticsearch.

  5. Data curation. Dremio provides a GUI for designing data curation actions, including recommended joins.

There are many other differences, but these are a few that come to mind.
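
To make points 1 and 3 concrete, here is a rough SQL sketch. The space, dataset, and column names are all hypothetical, the exact reflection syntax varies across Dremio versions, and most users set this up through the UI rather than SQL:

```sql
-- Point 3: define a virtual dataset (VDS) over physical sources.
-- No data is copied; this is purely a logical definition.
-- (Newer Dremio versions use CREATE VIEW instead of CREATE VDS.)
CREATE VDS "Sales"."orders_enriched" AS
SELECT o.order_id, o.order_date, o.amount, c.region
FROM hive."warehouse"."orders" o
JOIN hive."warehouse"."customers" c
  ON o.customer_id = c.customer_id;

-- Point 1: attach a raw reflection so queries on the VDS can be served
-- from a Dremio-managed materialization (stored as Parquet) instead of
-- hitting the source every time.
ALTER DATASET "Sales"."orders_enriched"
CREATE RAW REFLECTION orders_raw
USING DISPLAY (order_id, order_date, amount, region);
```

On point 4, you can check what Dremio pushes down to a relational source by running EXPLAIN PLAN FOR on your query and looking at the scan against that source.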

Hi Kelly,
Thanks for the response. I agree with all your points except the 2nd one regarding performance. If we are not using Reflections, how do Parquet files get generated? We have Hive tables, which are just ORC files, and let's assume we have created a VDS on top of them. I assume the VDS doesn't persist data, so it accesses the ORC files when someone queries it. In that case, how do vectorized readers for file formats like Parquet, LLVM-based query compilation, an asynchronous multi-threaded execution model, and all the other factors come into the picture?
Thanks
Dinesh

Regarding point 2, the performance advantages apply independently of the data source: Parquet, ORC, JSON, and CSV should all be faster.

In addition, even if your data is already in Parquet/ORC, Data Reflections can provide major advantages in performance by pre-aggregating the data, pre-computing joins, pre-computing calculated fields, ordering the data, partitioning the data, etc.
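
For example (hypothetical names again, and the exact syntax varies by Dremio version), an aggregate reflection can capture the pre-aggregation, partitioning, and sort order in a single definition:

```sql
-- Pre-aggregate amount by region and order_date. Dremio's optimizer
-- transparently substitutes this reflection for matching GROUP BY
-- queries instead of re-scanning the ORC/Parquet source files.
ALTER DATASET "Sales"."orders_enriched"
CREATE AGGREGATE REFLECTION orders_agg
USING
  DIMENSIONS (region, order_date)
  MEASURES (amount)
  PARTITION BY (order_date)
  LOCALSORT BY (region);
```

A query like `SELECT region, SUM(amount) FROM Sales.orders_enriched GROUP BY region` can then be answered from the reflection rather than from the source files.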

Hi Kelly,
Thanks for your response again, I really appreciate it…
I am not fully convinced :slight_smile: by your explanation. Columnar data storage (the Parquet format) plays a huge role in vectorized reading and asynchronous multi-threaded execution, but if the data is in row format (ORC files), I am not sure how that is feasible. Let me educate myself more on this topic and come back to you with a more precise question.

Thanks
Dinesh

ORC is columnar, no?

Anyway, just try it for yourself. :slight_smile:

Is ORC truly columnar? I am not sure. Let me try it personally and educate myself… Thanks again, Kelly.