Clarifications on Sorting and Partitioning

Hi Team,

I have a few clarifications on sorting and partitioning in reflections.
Sorting:

  1. Is sorting only helpful for filter and range queries?
  2. Does Sorting help in joins?
  3. If we enable sorting in the reflections, would it help to accelerate the "Order by" clause?
  4. If we sort on multiple columns and then join that dataset with another dataset, will that improve the join?
Partitioning:
  5. Is the partitioning concept the same as that of an RDBMS?
  6. Does partitioning help speed up joins?
  7. Can we create multiple partitions in one single reflection? Is it advisable, and how does that behave on the backend?

Have you had a chance to review this best practices guide?

Yes, I had a chance to review this document, and I do understand the best practices. But I still need some insight into the questions asked. I am asking these questions so I can follow a clean protocol while developing the VDS.

Hi @raghu_raman_ravishan, these are really good questions. Let us get back to you shortly with an itemized response.

@raghu_raman_ravishan:

  1. Is sorting only helpful for filter and range queries?
    Yes, sorting is primarily helpful for filtering and range queries.

  2. Does sorting help in joins?
    Sorting does not directly help join performance.

  3. If we enable sorting in the reflections, would it help to accelerate the "Order by" clause?
    Yes, if you are ordering by the sorted column.

  4. If we sort on multiple columns and then join that dataset with another dataset, will that improve the join?
    Again, this will not directly improve join performance. If you are joining VDSs that have filters on them, sorting will speed up that filtering and reduce the size of the datasets being joined.

  5. Is the partitioning concept the same as that of an RDBMS?
    Reflection partitioning is like horizontal partitioning in an RDBMS. A dataset is sorted on the partition key(s) and then written as a collection of Parquet files distributed across a directory structure that reflects the value of the partition. For instance, if you’ve partitioned your reflection on a column called partition_date, then all the records of transactions from 10-31-2018 will end up in the same directory, called 2018_10_31T00_00_00_000.
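The directory layout described above can be sketched as a toy Python model. This is only an illustration of the idea (the record fields and the `partition_dir` helper are hypothetical, not Dremio internals): records sharing a partition-key value map to the same directory name.

```python
from datetime import date

# Hypothetical records with a partition_date column.
records = [
    {"id": 1, "partition_date": date(2018, 10, 31)},
    {"id": 2, "partition_date": date(2018, 10, 31)},
    {"id": 3, "partition_date": date(2018, 11, 1)},
]

def partition_dir(d):
    # Directory name in the style shown above, e.g. 2018_10_31T00_00_00_000
    return d.strftime("%Y_%m_%dT00_00_00_000")

# Group record ids by the directory their partition value maps to.
layout = {}
for rec in records:
    layout.setdefault(partition_dir(rec["partition_date"]), []).append(rec["id"])

print(layout)
# All 2018-10-31 records land in the same directory.
```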

  6. Does partitioning help speed up joins?
    Any filtering on the partitioned column will reduce the size of the datasets considered in the join, so in this sense the join will be sped up (because less data is being processed).
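The pruning effect can be shown with a toy model: a filter on the partition column lets whole partitions be skipped before the join even starts. The partition names and row tuples here are illustrative, not Dremio internals.

```python
# Two partitions of a dataset, keyed by partition-directory name.
partitions = {
    "2018_10_30": [("t1", 100), ("t2", 200)],
    "2018_10_31": [("t3", 300), ("t4", 400)],
}

# A filter like WHERE partition_date = '2018-10-31' means only one
# partition's files are ever opened and scanned.
pruned = {k: v for k, v in partitions.items() if k == "2018_10_31"}

rows_scanned = sum(len(v) for v in pruned.values())
print(rows_scanned)  # 2 rows instead of 4 feed into the join
```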

  7. Can we create multiple partitions in one single reflection? Is it advisable, and how does that behave on the backend?
    You can create multiple partitions. The reflection files will be distributed across a directory tree based on the partitioned columns. The more columns and distinct values, the more ramified this directory structure becomes and the smaller the reflection files in those directories will be. If the files become too small, the overhead of opening them during query execution can become costly.
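The small-files risk comes from simple combinatorics: the number of leaf directories grows as the product of the distinct-value counts of the partition columns. A quick back-of-the-envelope check (the column names and cardinalities below are made up for illustration):

```python
# Hypothetical cardinalities for three candidate partition columns.
cardinality = {"region": 10, "company_code": 50, "partition_date": 365}

total_rows = 1_000_000

# Leaf directories = product of distinct-value counts across partition columns.
leaf_dirs = 1
for n in cardinality.values():
    leaf_dirs *= n

print(leaf_dirs)               # 182500 leaf directories
print(total_rows / leaf_dirs)  # roughly 5.5 rows per directory: far too small
```

With averages that low, each file carries almost no data, so query time is dominated by file-open overhead rather than scanning, which is why a single well-chosen partition column is often preferable.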


You are probably better off using multiple reflections, each with a single partitioning column, rather than one reflection with multiple partitioning columns. But an example query would help explore the best option.

Thanks, Ben. This really helped me gain more insight. I have one simple question to ask: I have a calculated field (month and year) derived from a date field, and a company code. Both fields are used as filters. How would you advise creating a reflection on them?

I have one more clarification as well. I have a dataset created by aggregating some of the columns. What kind of reflection do I need to apply here: raw or aggregation? This dataset can be used in a join with another dataset, and further aggregation is also performed on the result. So what kind of reflection would be more useful in this scenario?

Hi @raghu_raman_ravishan,

I have a calculated field (month and year) derived from a date field, and a company code. Both fields are used as filters. How would you advise creating a reflection on them?

For the above case, you could build an aggregate reflection with the month and company code as dimensions.

I have one more clarification as well. I have a dataset created by aggregating some of the columns. What kind of reflection do I need to apply here: raw or aggregation? This dataset can be used in a join with another dataset, and further aggregation is also performed on the result. So what kind of reflection would be more useful in this scenario?

If I understand you correctly, you have a dataset that includes aggregations on a number of columns, and you want to join this dataset with others. In this case, a raw reflection on that dataset may be appropriate.

Hi @ben,

I have one more clarification on reflections. I have a reflection created from different datasets using joins, and the system says the reflection has been created. But when I run a query (select * from VDS) on the resultant dataset and look into the Jobs page, I notice that this SQL doesn’t utilize the reflection. It says “This query was spilled”. I also want to know how to avoid “This query was spilled”.

I am also getting “Query was cancelled because planning time exceeded 60 seconds”. How do I rectify this error?

Hello @raghu_raman_ravishan,

Can you provide query profiles for both?