Hello! I have a question about data modeling. I've been trying to figure out the best approach to model data using Dremio. Before using Dremio I was building a data lakehouse following the Medallion architecture, using Delta Lake and Spark on top of S3. The data model in the silver layer (second zone) was 3NF, with a star schema / OBT approach in the gold layer. After reading the instructions on the Dremio site about building a semantic zone, I concluded that I could keep three layers of data (bronze, silver, gold or raw, trusted, curated; call them what you wish). The first layer contains just replicas of the physical datasets. In the second layer I have been using views that clean and standardize data, and in the third layer I have been using OBT.

My question is: would you recommend normalizing data in the second layer using Dremio? Since everything is virtual, it doesn't make sense to me, but I'd like to hear some thoughts about it. Thanks in advance!
I would definitely recommend normalizing data on the Silver/Business layer in Dremio so that you have well-defined business entities… except the key difference between the medallion architecture and Dremio's semantic layer is that everything is virtual in Dremio. The layers are basically just views organized in folders, with appropriate role-based access control set. That's quite different from the medallion architecture, where you are expected to materialize the data at each layer. With Dremio, you first define the semantic layer and then materialize as needed based on performance/cost considerations (this can be done automatically with reflections or explicitly with Iceberg DML).
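To make the "everything is virtual" point concrete, here is a minimal sketch (all folder, view, and column names are invented for illustration) of the three layers as plain CREATE VIEW statements, held in a Python script that could submit them to Dremio:

```python
# Sketch only: in Dremio the layers are just views, so "building" the
# semantic layer is a matter of issuing CREATE VIEW statements.
# Folder/view/column names below are hypothetical.

# Layer 1 (raw): a plain replica exposed as a view over the physical dataset.
raw_view = """
CREATE OR REPLACE VIEW raw.events AS
SELECT * FROM s3_source."events"
"""

# Layer 2 (trusted): cleaning and standardizing, still fully virtual.
trusted_view = """
CREATE OR REPLACE VIEW trusted.events AS
SELECT CAST(event_ts AS TIMESTAMP) AS event_ts,
       LOWER(TRIM(campaign_id)) AS campaign_id,
       device_type
FROM raw.events
WHERE event_ts IS NOT NULL
"""

# Layer 3 (curated): the OBT view consumers query; materialize it later
# (reflection or Iceberg DML) only if performance/cost demands it.
curated_view = """
CREATE OR REPLACE VIEW curated.events_obt AS
SELECT t.*, c.campaign_name
FROM trusted.events t
LEFT JOIN trusted.campaigns c ON t.campaign_id = c.campaign_id
"""

for stmt in (raw_view, trusted_view, curated_view):
    print(stmt.strip().splitlines()[0])
```

Nothing is copied or stored until you decide to materialize, which is where reflections or Iceberg DML come in.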
Thank you for the answer, @Benny_Chow!
I was quite interested in the Iceberg DML approach, as demonstrated here: Announcing DML Support for Apache Iceberg | Dremio.
But it's not possible to schedule MERGE/INSERT operations, right?
For example, suppose that after creating an Iceberg table from a raw Delta table through Dremio, I'd like to incrementally load the table with data from the last day. The only way to achieve this, from what I've seen, is saving a script with the DML statement and running it externally through the API with Python. Since I'm looking to use dbt, I don't think a standalone script is a good idea. Is there another way of doing this? Also, the view approach worries me because of long-running queries and the amount of data transferred when materializing a reflection every day.
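For what it's worth, the "run it externally" approach can be quite small. Below is a sketch assuming Dremio's REST API (POST `/api/v3/sql` accepts `{"sql": "..."}` and returns a job id); the host, token, and all table/column names are hypothetical:

```python
import json
import urllib.request

# Hypothetical coordinator URL and auth token (the token would come from
# Dremio's login endpoint; Dremio expects "Authorization: _dremio<token>").
DREMIO_URL = "http://dremio-coordinator:9047"
TOKEN = "my-auth-token"

def build_incremental_merge(target: str, source: str, day: str) -> str:
    # Upsert only one day's worth of rows; column names are made up.
    return (
        f"MERGE INTO {target} t "
        f"USING (SELECT * FROM {source} WHERE event_date = DATE '{day}') s "
        f"ON t.event_id = s.event_id "
        f"WHEN MATCHED THEN UPDATE SET device = s.device, campaign = s.campaign "
        f"WHEN NOT MATCHED THEN INSERT (event_id, event_date, device, campaign) "
        f"VALUES (s.event_id, s.event_date, s.device, s.campaign)"
    )

def build_request(sql: str) -> urllib.request.Request:
    # Build (but don't send) the HTTP request; a scheduler such as cron
    # or Airflow would call urllib.request.urlopen(req) on it daily.
    return urllib.request.Request(
        f"{DREMIO_URL}/api/v3/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={"Authorization": f"_dremio{TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )

sql = build_incremental_merge("curated.events", "raw.events_delta", "2024-01-01")
req = build_request(sql)
print(req.full_url)
```

If you do adopt dbt, there is a community-maintained dbt-dremio adapter, which may let the MERGE live in a dbt model instead of a standalone script; worth verifying compatibility against your Dremio version.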
There's a lot to think about here, and it's a big step away from an ODS (3NF) → star schema / DWH combination.
For the middle layer, it may be better to think about conforming rather than normalising. As @Benny_Chow mentioned, you want to tidy up key business entities, particularly where they are effectively going to be dimensions. (It may then make sense to materialise these, as they'll get repeated use.)
One key question is how many different datasets / sources are coming together. Really think about why you want that middle layer and what purpose it is serving.
It might be that you want that second layer in an Iceberg table format to make use of features like versioning. That's a valid use case, but it doesn't automatically follow that the layer must be normalised.
It’s a brave new world in modelling land. Start with your use case(s) and go from there…!
Thanks for the answer, @Dave0m! I agree with you. About the Iceberg option, I think it might be interesting because query execution time could become too long on the second layer if I end up normalizing the data. But since the goal of normalization is to avoid data redundancy and data anomalies, and because the data in Dremio views is virtual, I think normalizing all the way to 3NF would be overkill. I totally agree that 1NF may be worth targeting, because it can be tricky for an analyst to deal with nested data (even though Dremio has functions that help with it).

Because of the virtual nature of the data, sometimes I think it makes sense to treat the second layer as a mix of a staging layer and a source of truth. In my company, for example, we currently work with event data that we process in daily batches, and it is very denormalized because we carry a lot of information about the device and the marketing campaign. We could normalize it, but then we would need to denormalize it again in the third layer, so this is a tricky trade-off to deal with.
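The normalize-then-denormalize round trip can be shown with a toy Python example (all fields invented): the "silver" step splits the repeated campaign attributes into their own entity, and the "gold" OBT step joins them straight back to the shape you started with. When the layers are physical tables, this round trip costs storage and pipeline runs; when they are Dremio views, it costs only query planning, which is why full 3NF in the middle buys you less.

```python
# Toy illustration (made-up fields): a denormalized daily event batch.
events = [
    {"event_id": 1, "device": "ios",     "campaign": "summer", "budget": 100},
    {"event_id": 2, "device": "android", "campaign": "summer", "budget": 100},
    {"event_id": 3, "device": "ios",     "campaign": "winter", "budget": 50},
]

# "Silver" normalization: pull the repeated campaign attributes out,
# removing the redundancy (one row per campaign instead of per event).
campaigns = {e["campaign"]: {"budget": e["budget"]} for e in events}
facts = [{"event_id": e["event_id"], "device": e["device"],
          "campaign": e["campaign"]} for e in events]

# "Gold" OBT: join the campaign attributes straight back in.
obt = [{**f, "budget": campaigns[f["campaign"]]["budget"]} for f in facts]

# The OBT reproduces the original denormalized rows exactly.
assert obt == events
print(len(campaigns), "campaigns,", len(obt), "OBT rows")
```

The redundancy the middle layer removed (the repeated `budget`) is reintroduced one layer later, so the benefit of normalizing has to come from somewhere else, such as conformed entities that are reused across many gold views.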