Wanted to understand how the VDS work on a PDS having billion rows. Does the dremio engine get all the billion rows into memory each time the VDS is queried or it gets only first set of rows based on the limit we set? If it is only first set of rows then how about in scenario where the VDS is based on two PDS datasets each existing on different lake. In this case does it get all the rows into memory separately and join in the executor and show the results?
Hey @kumar.paloji, Welcome to Dremio community.
See if the following Dremio Architecture Guide helps lay the foundation (relevant section titled “The life of a Query” onward) → https://www.dremio.com/downloads/DremioArchitectureGuide.pdf
Joins in Dremio are typically via a HashJoin operator. A hash table gets built on the rows of the inner table. The outer table’s rows are used to probe the hash table and find matches. These operations typically happen in memory.
Dremio can also do runtime filtering for joins (i.e. dynamically apply filters from the inner table to the outer table to speed up filtering on larger tables).
Hope that helps you get started…