Will Dremio kill ETL/ELT?

Hello,

New user here from Indonesia.

I found Dremio in a data engineering group and find it really awesome. However, several questions came to mind after I went through most of the docs, the YouTube videos, and Dremio University.

  1. When to use Dremio vs ETL/ELT (Spark)? It seems Dremio is used as an interactive query engine, like Impala or Presto. Is there still a place for Spark or another batch processing engine for ETL/ELT?
  2. Are Data Reflections available in Dremio Community?
  3. How should we design, and what preparation is needed in, the source system's store so it won't get hit hard when the Dremio engine queries the source database? Are there any best practices for this?

That's what I've been struggling to find answers to… but overall, Dremio is really impressive.

Cheers

Hello Welly, welcome to the Dremio community. I’m glad you are enjoying Dremio.
To answer your questions:

  1. When to use Dremio vs ETL/ELT: ELT will always be required to get data into the data lake. You will use Dremio's data lake engine to read the data once it has landed in your data lake (storage layer). With Dremio, your data pipeline gets simpler because we remove the ELT step that moves data back out of the lake and into the enterprise data warehouse.

  2. Are Data Reflections available in Dremio Community? Yes, you can use Data Reflections in Dremio Community, Enterprise, and also the AWS Edition. Learn more here.

  3. How should we design, and what preparation is needed in, the source system's store…? The best practice is to organize your data into one folder per table, with the Parquet files that make up that table inside it. Here is the documentation on writing Parquet files and the best practices for doing so.
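To make the "one folder per table" layout in point 3 concrete, here is a minimal Python sketch. The table names, part counts, and file names below are hypothetical; the files are empty placeholders, since in practice you would write real Parquet part files (e.g. with `pyarrow.parquet.write_table`).

```python
# Sketch of the "one folder per table" data lake layout described above.
# File contents are placeholders; real pipelines would write actual
# Parquet part files (e.g. via pyarrow.parquet.write_table).
import tempfile
from pathlib import Path


def lay_out_tables(lake_root: Path, tables: dict) -> list:
    """Create <lake_root>/<table>/part-NNN.parquet for each table."""
    created = []
    for table, num_parts in tables.items():
        table_dir = lake_root / table       # one folder per table
        table_dir.mkdir(parents=True, exist_ok=True)
        for i in range(num_parts):
            part = table_dir / f"part-{i:03d}.parquet"
            part.touch()                    # placeholder for a Parquet part
            created.append(part)
    return created


lake = Path(tempfile.mkdtemp())
files = lay_out_tables(lake, {"orders": 2, "customers": 1})
for f in files:
    print(f.relative_to(lake))
```

A query engine like Dremio can then treat each folder as one dataset and read all of its part files in parallel.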

I hope that helps.
Thanks!


I don't see the point of a data lake anymore if we can just query directly from the source, given that Dremio makes interactive queries so efficient.

Basically, we need a data warehouse because the data will be structured properly from the reader's perspective, and queries won't impact the core system.

So if I were to start again, would I still need a data lake or a data warehouse? I'm always trying to find a more efficient way of doing things.

Thoughts?

Cheers

We already have a lot of data sources, and we use Apache Kudu + MinIO in our data lake.

So basically, if Dremio can directly query the existing data sources in the core systems, that would be really great. Fewer things to maintain.

I wonder whether there are any cases where a data lake is more appropriate compared to Dremio/data virtualization.

Hi @wellytambunan

  • Data lakes like S3/Azure/Hadoop: Dremio can query these directly, and scans are highly parallelized.
  • OLTP databases like Oracle/MySQL/Postgres/SQL Server: Dremio can query these directly, but scans are single-threaded, so creating reflections usually helps.
  • NoSQL sources like Mongo/Elastic: Dremio can query them directly, and scans are parallelized.
  • Data warehouses like Teradata: Dremio can query these directly, but scans are single-threaded, so creating reflections usually helps.
  • We have many use cases where Dremio is used for ETL via our CTAS command, writing data out as Parquet to the lake.
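The CTAS pattern in the last bullet can be sketched like this. The source and target dataset paths below are hypothetical, and the statement would be submitted through whatever client you connect with (e.g. ODBC/JDBC); only the `CREATE TABLE … AS SELECT …` shape comes from the post.

```python
# Sketch: materializing a slow, single-threaded source into Parquet on
# the lake with a CTAS statement. All dataset paths are hypothetical.

def build_ctas(target: str, source_query: str) -> str:
    """Build a CREATE TABLE AS (CTAS) statement for a lake target."""
    return f"CREATE TABLE {target} AS {source_query}"


sql = build_ctas(
    '"lake".analytics.orders_snapshot',       # Parquet output on the lake (hypothetical)
    'SELECT * FROM oracle_src.sales.orders',  # single-threaded OLTP source (hypothetical)
)
print(sql)
```

Downstream queries would then read the parallel-scan-friendly Parquet copy instead of hitting the OLTP source repeatedly.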

As Tomer answered in the other posts, we are looking into support for Iceberg/Delta Lake soon.

Kindly let us know if you have any other questions

Thanks
Bali
