Dremio Reflection Choosing Algorithm

Given the following case:

  • Assume: There are two virtual datasets, call them vd1 and vd2. Each of them has raw reflection on all their entire fields.
  • Create a new virtual dataset vd3 from SELECT * FROM vd1 INNER JOIN vd2 ON vd1.field1=vd2.field3
  • Then create raw reflection on vd3 by selecting all fields as display
  • Then create aggregation reflection on vd3

My expectation was to choose the raw reflection on vd1 and vd2 while creating vd3 raw reflection, and use vd3 raw raw reflection wile creating vd3 aggregation reflection.

The actual is choosing the raw reflection on vd1 and vd2 while creating both raw reflection, and aggregation reflection reflection on vd3

Why?!

@elshafey

We would need to see if the raw reflection on vd3 was matched. Kindly provide us with the job profile of the REFRESH REFLECTION, aggregate reflection job on vd3. Please turn “planner.verbose_profile” to on under support-support key

Share Dremio Query Profile

Here is an example of what I am talking about.
Be noted:

“Package Daily Activity” dataset is skipped and its parents (“Company Package With Retain Info”,“dates”) are used instead

c280bfe4-e1d0-433e-934c-2675342da73c.zip (82.8 KB)

@elshafey

Are you not refreshing agg reflection on “Forasna Views.Employer Activities”? I do not even see the raw reflection on “Forasna Views.Employer Activities”

Also the reflections (raw and agg) on “Package Daily Activity” are not skipped but do not match, which means those 2 reflections do not have all the information they need to accelerate the query

  • Package Daily Activity (“Forasna Views”.Packages)
    • Raw Reflection (raw): considered, not matched.
    • Aggregation Reflection (1) (agg): considered, not matched.
    • Aggregation Reflection (agg): considered, not matched.

Here is another example what I am facing “always” with dremio:

I am trying to run a simple count query as following SELECT COUNT(id) FROM "Package Daily Activity" and here is the hierarchy of my datasets

  • “Package Daily Activity” is a virtual dataset which was built as
    SELECT * FROM "Forasna Views.Packages.Company Package With Retain Info" i INNER JOIN Forasna DB.data_analysis.dates d ON i.f1=d.f1
  • “Forasna Views.Packages.Company Package With Retain Info” is another virtual dataset and has its own raw reflection on the all its fields as a display.
  • “Forasna DB”.data_analysis.dates is another physical dataset with its own raw reflection on all its fields
    • “Package Daily Activity” has raw reflection on all it fields as a display
  • All these reflection were built and refreshed within last 30 mins
  • Given these inputs: my expectation is to select “Package Daily Activity” as the acceleration reflection as it cover such simple query, but that didn’t happen.
    Why to accelerate by “Forasna Views.Packages.Company Package With Retain Info” and “Forasna DB”.data_analysis.dates even in a very such simple query? Really this a very confused and frustrating behavior of dremio

9c709749-66b8-457a-8e56-bfb78ebe0558.zip (15.9 KB)

@elshafey

Sorry about the frustration, can you please try to create an aggregate reflection on with count as a measure? Also if you see all 4 reflections matched which means all 4 have reflections can cover the query but for some reason the planner costed going to a different reflection. To understand that please turn on verbose profile as per this support setting and run the query again. Also can you try to create an aggregation reaction on “Package Daily Activity” with count as a measure. I also see you are creating many reflections on the way, here is a white paper on Best Practices

Thanks
@balaji.ramaswamy

Also

1 Like