Will Dremio kill ETL/ELT?

Hello,

New user here from Indonesia.

I found Dremio in a data engineering group and find it really awesome. However, several questions came to mind after I went through most of the docs, the YouTube videos, and Dremio University.

  1. When to use Dremio vs ETL/ELT (Spark)? It seems Dremio is used as an interactive query engine, like Impala or Presto. Is there still a place for Spark or another batch processing engine for ETL/ELT?
  2. Are Data Reflections available in Dremio Community?
  3. How should we design, and what preparation is needed in, the source system's store so it won't get hit hard when the Dremio engine queries the source database? Are there any best practices for this?

That's what I've been struggling to find answers to… but overall, Dremio is really impressive.

Cheers

Hello Welly, welcome to the Dremio community. I’m glad you are enjoying Dremio.
To answer your questions:

  1. When to use Dremio vs ETL/ELT: ELT will always be required to get data into the data lake. You will use Dremio's data lake engine to read the data once it has landed in your data lake (storage layer). With Dremio, your data pipeline gets simpler because we remove the ELT step that moves data back out of the lake and into the enterprise data warehouse.

  2. Are Data Reflections available in Dremio Community? Yes, you can use Data Reflections in Dremio Community, Enterprise, and also the AWS Edition. Learn more here.

  3. How should we design, and what preparation is needed in, the source system's store…? The best practice is to organize your data into one folder per table, with the Parquet files that make up that table inside it. Here is the documentation on writing Parquet files and the best practices for doing so.
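To make the "one folder per table" layout in point 3 concrete, here is a minimal Python sketch. The table names, part counts, and file names below are hypothetical; the files are empty placeholders, since in practice you would write real Parquet part files (e.g. with `pyarrow.parquet.write_table`).

```python
# Sketch of the "one folder per table" data lake layout described above.
# File contents are placeholders; real pipelines would write actual
# Parquet part files (e.g. via pyarrow.parquet.write_table).
import tempfile
from pathlib import Path


def lay_out_tables(lake_root: Path, tables: dict) -> list:
    """Create <lake_root>/<table>/part-NNN.parquet for each table."""
    created = []
    for table, num_parts in tables.items():
        table_dir = lake_root / table       # one folder per table
        table_dir.mkdir(parents=True, exist_ok=True)
        for i in range(num_parts):
            part = table_dir / f"part-{i:03d}.parquet"
            part.touch()                    # placeholder for a Parquet part
            created.append(part)
    return created


lake = Path(tempfile.mkdtemp())
files = lay_out_tables(lake, {"orders": 2, "customers": 1})
for f in files:
    print(f.relative_to(lake))
```

A query engine like Dremio can then treat each folder as one dataset and read all of its part files in parallel.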

I hope that helps.
Thanks!


I don't see the point of a data lake anymore if we can just query directly from the source, given that Dremio makes interactive queries so efficient.

Basically, we need a data warehouse because the data will be structured properly from the reader's perspective, and queries won't impact the core system.

So if I were to start again, would I still need a data lake or a data warehouse? I'm always trying to find a more efficient way of doing things.

Thoughts?

Cheers

We already have a lot of data sources, and we use Apache Kudu + MinIO in our data lake.

So basically, if Dremio can directly query the existing data sources in the core systems, that would be really great. Fewer things to maintain.

I wonder whether there are any cases where a data lake is more appropriate compared to Dremio/data virtualization.

Hi @wellytambunan

  • Data lakes like S3/Azure/Hadoop: Dremio can query these directly, and scans are highly parallelized.
  • OLTP databases like Oracle/MySQL/Postgres/SQL Server: Dremio can query these directly, but scans are single-threaded, so creating reflections usually helps.
  • NoSQL sources like Mongo/Elastic: Dremio can query them directly, and scans are parallelized.
  • Data warehouses like Teradata: Dremio can query these directly, but scans are single-threaded, so creating reflections usually helps.
  • We have many use cases where Dremio is used for ETL via our CTAS command, writing data out as Parquet to the lake.
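The CTAS pattern in the last bullet can be sketched like this. The source and target dataset paths below are hypothetical, and the statement would be submitted through whatever client you connect with (e.g. ODBC/JDBC); only the `CREATE TABLE … AS SELECT …` shape comes from the post.

```python
# Sketch: materializing a slow, single-threaded source into Parquet on
# the lake with a CTAS statement. All dataset paths are hypothetical.

def build_ctas(target: str, source_query: str) -> str:
    """Build a CREATE TABLE AS (CTAS) statement for a lake target."""
    return f"CREATE TABLE {target} AS {source_query}"


sql = build_ctas(
    '"lake".analytics.orders_snapshot',       # Parquet output on the lake (hypothetical)
    'SELECT * FROM oracle_src.sales.orders',  # single-threaded OLTP source (hypothetical)
)
print(sql)
```

Downstream queries would then read the parallel-scan-friendly Parquet copy instead of hitting the OLTP source repeatedly.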

As Tomer answered in the other posts, we are looking into support for Iceberg/Delta Lake soon.

Kindly let us know if you have any other questions

Thanks
Bali
