Here is the scenario we are dealing with. Could you please point us to any documentation, a tutorial, or steps on how to do this in Dremio?
- Input data: we receive our website logging data as 100 MB JSON files. Every 3 minutes we add a batch of these 100 MB JSON files to an ADLS Gen2 storage location.
- Dremio needs to pull these logs on a schedule and convert them to datasets (or append them to an existing dataset).
- Apply edits to the data (converting strings to int or float, etc.). This has to happen for every new batch of data that arrives in Dremio.
- Dremio needs to expose the merged/appended dataset to Superset.
- Connect Superset to Dremio.
- Our data scientists will use the Superset web UI to query on different dimensions, etc.
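To make the "edits on data" step concrete, here is a minimal sketch in plain Python of the kind of conversion we mean. It is not Dremio syntax. The field names (`status`, `bytes_sent`, `response_time`) are hypothetical, and it assumes the files are newline-delimited JSON, which may not match our actual layout.

```python
import json

# Hypothetical field names and target types; the real log schema is not shown here.
FIELD_TYPES = {"status": int, "bytes_sent": int, "response_time": float}


def cast_record(record, field_types=FIELD_TYPES):
    """Return a copy of record with the listed string fields converted.

    Values that cannot be parsed become None; missing fields stay absent.
    """
    out = dict(record)
    for field, caster in field_types.items():
        value = out.get(field)
        if value is None:
            continue
        try:
            out[field] = caster(value)
        except (TypeError, ValueError):
            out[field] = None
    return out


def cast_json_lines(lines):
    """Parse newline-delimited JSON and apply cast_record to every row."""
    return [cast_record(json.loads(line)) for line in lines if line.strip()]


sample = [
    '{"url": "/home", "status": "200", "bytes_sent": "5120", "response_time": "0.35"}',
    '{"url": "/cart", "status": "500", "bytes_sent": "n/a", "response_time": "1.2"}',
]
rows = cast_json_lines(sample)
```

The same function has to run on every batch, so that rows from new files end up with the same types as rows already in the dataset.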
Here is what I have been able to do so far by reading the Dremio documentation:
- Set up Dremio clusters.
- Connect Dremio to ADLS Gen2, read the data, and generate a dataset.
- Connect Superset and query the dataset.
Things I need help with. I need documentation or some other kind of guidance for the items below:
- How do I set up a 3-minute schedule? Dremio needs to pull ONLY the new data that has been added to the ADLS Gen2 storage location.
- How do I merge this data into the existing dataset?
- All the transformations from step 3 of the problem statement (the data edits) need to be applied to new data as well.
- Superset should be able to see the data for at least the last 24 hours.
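To illustrate what "only new data" and "last 24 hours" mean to us, here is a rough sketch in plain Python. It is not a Dremio feature or API. The file paths, the `event_time` field, and the idea of tracking a last-modified watermark are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def select_new_files(files, watermark):
    """Return paths modified after the watermark (oldest first) and the new watermark.

    files is an iterable of (path, last_modified) pairs with timezone-aware datetimes.
    """
    new_files = sorted((f for f in files if f[1] > watermark), key=lambda f: f[1])
    new_watermark = new_files[-1][1] if new_files else watermark
    return [path for path, _ in new_files], new_watermark


def within_retention(rows, now, hours=24):
    """Keep rows whose hypothetical 'event_time' field is within the last `hours` hours."""
    cutoff = now - timedelta(hours=hours)
    return [row for row in rows if row["event_time"] >= cutoff]


# Hypothetical listing of the storage location at one scheduled run.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
listing = [
    ("logs/batch-001.json", now - timedelta(minutes=6)),
    ("logs/batch-002.json", now - timedelta(minutes=3)),
    ("logs/batch-003.json", now),
]
# Watermark left behind by the previous run, a little over one interval ago.
previous_watermark = now - timedelta(minutes=4)
new_paths, next_watermark = select_new_files(listing, previous_watermark)

events = [
    {"url": "/old", "event_time": now - timedelta(hours=25)},
    {"url": "/recent", "event_time": now - timedelta(hours=1)},
]
visible = within_retention(events, now)
```

A watermark based on last-modified times can skip a file whose timestamp equals the watermark or that lands late, so this only shows the behaviour we want, not a robust design.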