How do I build a Datalake on AWS

princewilleghosa2017 · January 10, 2023, 8:24am

My name is Princewill and I am a Data Analyst transitioning to Data Engineering. I need help building a lakehouse on AWS for streaming and batch data.

I would be happy to get guidance on this. I have watched many videos on YouTube, but none of them broke it down in a way I could understand.

balaji.ramaswamy · January 15, 2023, 12:53am

@princewilleghosa2017 Where is your source data, and what is the file format? Is it already on S3 in a columnar format such as Parquet or ORC (or in row-based Avro)? Or is it in an OLTP database like Oracle/Postgres, and you are asking about methods to move it to S3 in a columnar format?

princewilleghosa2017 · January 16, 2023, 1:43am

Hi @balaji.ramaswamy, thanks for the reply.

To answer your question, I guess the project team has opted for external streaming data sources. We are using the following sources of data:

1. E-commerce activity data
2. Autonomous vehicle data (API)
3. Cryptocurrency data
4. Log-generator data from MS Azure

The file format will be JSON, since it is going to be unstructured data.

Attached herewith

Can it be set up on AWS Lake, and also by creating pods, like containers?

Looking forward to your response.

Princewill

princewilleghosa2017 · January 16, 2023, 1:44am

balaji.ramaswamy · January 20, 2023, 7:31am

@princewilleghosa2017 JSON is not a columnar format, and columnar formats are what suit data lakes best. You can still query the JSON files via Dremio without moving any data; Dremio will read directly from your S3 bucket. You can then create a semantic layer in Dremio and create aggregation/raw reflections on the VDS (virtual dataset) so that queries read Parquet instead of JSON.

Check the Dremio white papers on how to build your semantic layer and on best practices for reflections.
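
To illustrate the row-versus-columnar point above, here is a minimal sketch in plain Python. It is not Dremio code and does not write Parquet; it only shows the layout difference that makes columnar formats attractive for analytics. The sample events, field names, and helper functions (`parse_rows`, `to_columns`) are hypothetical and invented for illustration.

```python
import json

# Hypothetical newline-delimited JSON events (invented sample data).
RAW_EVENTS = """\
{"event_id": 1, "source": "ecommerce", "amount": 19.99}
{"event_id": 2, "source": "crypto", "amount": 250.0}
{"event_id": 3, "source": "ecommerce", "amount": 5.01}
"""


def parse_rows(ndjson_text):
    """Parse newline-delimited JSON into a list of row dictionaries."""
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]


def to_columns(rows):
    """Pivot row-oriented records into one list per field.

    Fields missing from a row are filled with None so that all
    columns have the same length.
    """
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    return {name: [row.get(name) for row in rows] for name in fields}


rows = parse_rows(RAW_EVENTS)
columns = to_columns(rows)

# Row layout: summing one field still walks every full record.
total_from_rows = sum(row["amount"] for row in rows)

# Columnar layout: the same aggregate only touches the "amount" column.
total_from_columns = sum(columns["amount"])
```

A columnar file format stores each field contiguously, much like the `columns` dictionary here, so a query that needs only one field can skip the rest. That is roughly the benefit of having queries read Parquet rather than raw JSON.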

princewilleghosa2017 · January 25, 2023, 4:11am

Hi @balaji.ramaswamy, thanks for the reply. I will look for the white paper and read about it. Also, if you have the link to the white paper, please do not hesitate to share it. Have a great rest of the week.

balaji.ramaswamy · January 26, 2023, 4:49am

@princewilleghosa2017

Please use the link below and filter on white papers.
