How do I build a data lake on AWS?

My name is Princewill and I am a Data Analyst transitioning to Data Engineering. I need help with building a lakehouse on AWS for streaming and batch data.

I would be happy to get guidance on this. I have watched many videos on YouTube, but none of them really break it down in a way I can understand.

@princewilleghosa2017 Where is your source data, and what is the file format? Is it already on S3 in a columnar format like Parquet/ORC (or a row-based one like Avro)? Or is it in an OLTP database like Oracle/Postgres, and you are asking for ways to move it to S3 in a columnar format?

Hi @balaji.ramaswamy, thanks for the reply.

To answer your question, I believe the project team has opted for external streaming data sources. We are using the following sources of data:

  1. E-commerce activity data
  2. Autonomous vehicle data (API)
  3. Cryptocurrency data
  4. Log gen data from MS Azure

The files will be in JSON format, since the data is going to be unstructured.

Attached herewith

Can this be set up on an AWS data lake, and also by creating pods (containers)?

Looking forward to your response.

Princewill

@princewilleghosa2017 JSON is not a columnar format, and columnar formats are what is best suited for data lakes. You can still query the JSON files via Dremio without moving the data: Dremio will read directly off your S3 bucket. You can then create a semantic layer in Dremio and create aggregation/raw reflections on the VDS (virtual datasets), so queries can read Parquet instead of JSON.
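
To illustrate the point about JSON versus columnar formats, here is a minimal Python sketch using only the standard library. It is not how Dremio or Parquet work internally; it only shows the difference in layout. The event records and field names (`event_id`, `source`, `amount`) are made up for the example.

```python
import json

# Hypothetical newline-delimited JSON events (field names are assumptions).
RAW_EVENTS = """\
{"event_id": 1, "source": "ecommerce", "amount": 19.99}
{"event_id": 2, "source": "crypto", "amount": 250.0}
{"event_id": 3, "source": "ecommerce", "amount": 5.5}
"""


def parse_rows(ndjson_text):
    """Parse newline-delimited JSON into a list of dicts (row-oriented)."""
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]


def to_columns(rows):
    """Pivot row-oriented records into one list per field (column-oriented).

    Fields missing from a record are filled with None so that every
    column has the same length.
    """
    field_names = []
    for row in rows:
        for name in row:
            if name not in field_names:
                field_names.append(name)
    return {name: [row.get(name) for row in rows] for name in field_names}


rows = parse_rows(RAW_EVENTS)
columns = to_columns(rows)

# Row layout: summing one field still means parsing every full record.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: the same aggregate only touches the "amount" column.
total_from_columns = sum(columns["amount"])
```

With the row layout, every aggregate has to walk through whole records, while the column layout lets a query read only the fields it needs. That is roughly why an analytical query benefits from reading a Parquet-backed reflection instead of the raw JSON.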

Check the Dremio white papers on how to build your semantic layer and on best practices for reflections.

Hi @balaji.ramaswamy, thanks for the reply. I will look for the white paper and read about it. Also, if you have the link to the white paper, please do not hesitate to share it. Have a great rest of the week.

@princewilleghosa2017

Please use the link below and filter on white papers