New Onboarding Questions

Here is the scenario we are dealing with. Can you please provide any documentation, a tutorial, or steps on how to do this in Dremio?

  1. Input data: We get our website logging data as 100 MB JSON files. We batch a bunch of these 100 MB JSON files and add them to an ADLS Gen2 storage location every 3 minutes.
  2. Dremio needs to pull these logs on a schedule and convert them to datasets (or append them to an existing dataset).
  3. Edit the data (converting strings to int or float, etc.). This has to happen for all new data that arrives in Dremio.
  4. Dremio needs to provide the merged/appended dataset to Superset.
  5. Connect Superset to Dremio.
  6. Our data scientists will use the Superset web UI to query on different dimensions, etc.

Here are the things I have been able to do by reading the Dremio documentation:

  1. Set up Dremio clusters.
  2. Connect Dremio to ADLS Gen2, read the data, and generate a dataset.
  3. Connect Superset and query the dataset.

Things I need help with: I need documentation or some kind of help for the items below.

  1. How do I set up a 3-minute schedule? Dremio needs to pull ONLY new data that is added to the ADLS Gen2 storage location.
  2. How do I merge this data into the existing dataset?
  3. All the transformations that I have done in step 3 (in the problem statement) need to be applied to new data as well.
  4. Superset should be able to see the data for at least the last 24 hours.

@Kalyan

  1. How do I set up a 3-minute schedule? Dremio needs to pull ONLY new data that is added to the ADLS Gen2 storage location.
    Answer: Metadata caching. Set the metadata refresh interval on the ADLS source so that Dremio picks up newly added files.

  2. How do I merge this data into the existing dataset?
    Answer: This is done automatically.

  3. All the transformations that I have done in step 3 (in the problem statement) need to be applied to new data as well.
    Answer: All the transformations you did will have been saved as a VDS (virtual dataset), so they are applied to new data at query time as well.

  4. Superset should be able to see the data for at least the last 24 hours.
    Answer: Can you not query the VDS using a FILTER?
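
To make answers 3 and 4 a bit more concrete, here is a rough sketch of the kind of query that could sit behind the VDS or be run from Superset: it casts string columns to numeric types and keeps only the last 24 hours. The dataset path `logs.weblogs`, the column names, and the helper `build_recent_query` are all made up for illustration; adjust them to your own schema.

```python
from datetime import datetime, timedelta


def build_recent_query(vds_path, casts, ts_column, now, hours=24):
    """Return a SELECT that casts string columns and keeps only recent rows.

    vds_path  -- dotted dataset path, e.g. 'logs.weblogs' (hypothetical)
    casts     -- mapping of column name -> target SQL type
    ts_column -- timestamp column to filter on (hypothetical name)
    now       -- reference time; rows older than now - hours are excluded
    """
    cutoff = now - timedelta(hours=hours)
    # One CAST expression per string column that should become numeric.
    select_list = ", ".join(
        f'CAST("{col}" AS {sql_type}) AS "{col}"' for col, sql_type in casts.items()
    )
    # Double-quote every path component and the timestamp column.
    quoted_path = ".".join(f'"{part}"' for part in vds_path.split("."))
    ts = f'"{ts_column}"'
    cutoff_literal = f"TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}'"
    return (
        f"SELECT {select_list}, {ts} "
        f"FROM {quoted_path} "
        f"WHERE {ts} >= {cutoff_literal}"
    )


if __name__ == "__main__":
    print(build_recent_query(
        "logs.weblogs",
        {"status": "INTEGER", "latency": "DOUBLE"},
        "event_time",
        datetime.now(),
    ))
```

The cutoff is computed on the client side here only to keep the sketch simple; a filter written directly in the VDS or in Superset would work just as well.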

Dremio is now available in the master branch of Superset. Time grain is supported too. It will likely be released in 0.35.2.

Here are the steps:

  1. pip install sqlalchemy_dremio
  2. Install Dremio’s ODBC driver
  3. Clone the master branch of Superset and follow the required steps to get it running

Use a URI like: dremio://dremio:dremio123@localhost:31010/dremio
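
If you want to check the connection outside Superset first, something like the sketch below should work once sqlalchemy_dremio and the ODBC driver are installed. The helper names `dremio_uri` and `run_query` are made up for this example, and the defaults simply mirror the URI above; I have not run this against every Dremio version.

```python
from urllib.parse import quote


def dremio_uri(user, password, host="localhost", port=31010, database="dremio"):
    """Build a dremio:// SQLAlchemy URI, percent-encoding the credentials."""
    return (
        f"dremio://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


def run_query(uri, sql):
    """Execute sql through SQLAlchemy and return all rows.

    Needs sqlalchemy, sqlalchemy_dremio and the Dremio ODBC driver installed.
    """
    # Imported lazily so that building the URI works without SQLAlchemy.
    from sqlalchemy import create_engine, text

    engine = create_engine(uri)
    with engine.connect() as conn:
        return conn.execute(text(sql)).fetchall()


if __name__ == "__main__":
    uri = dremio_uri("dremio", "dremio123")
    print(uri)
    # Uncomment once a Dremio instance is reachable at the URI above:
    # print(run_query(uri, "SELECT 1"))
```

Percent-encoding the user name and password matters if either contains characters such as `@` or `:`, which would otherwise break URI parsing.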