Creating VDS on PDS having billion rows

kumar.paloji · September 27, 2022, 4:03pm

I want to understand how a VDS works on a PDS with a billion rows. Does the Dremio engine load all billion rows into memory each time the VDS is queried, or does it fetch only the first set of rows based on the limit we set? If it fetches only the first set of rows, what happens when the VDS is based on two PDS datasets that live on different lakes? In that case, does it load all the rows from each dataset into memory separately, join them in the executor, and then show the results?

lenoyjacob · September 28, 2022, 12:59pm

Hey @kumar.paloji, welcome to the Dremio community.

See if the following Dremio Architecture Guide helps lay the foundation (relevant section titled “The life of a Query” onward) → https://www.dremio.com/downloads/DremioArchitectureGuide.pdf

Joins in Dremio are typically executed by a HashJoin operator. A hash table gets built on the rows of the inner table. The outer table’s rows are then used to probe the hash table and find matches. These operations typically happen in memory.
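To make the build and probe phases concrete, here is a minimal Python sketch of a generic in-memory hash join. It is only an illustration of the technique, not Dremio's implementation; the function name `hash_join` and the sample `customers` and `orders` rows are made up for the example.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Inner-join two lists of dict rows on `key` (illustrative only)."""
    # Build phase: index the build side (ideally the smaller input) by join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    # Probe phase: walk the other side and look up matches one row at a time.
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data: a small build side and a larger probe side.
customers = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Lin"}]
orders = [
    {"cust_id": 1, "amount": 30},
    {"cust_id": 3, "amount": 15},
    {"cust_id": 1, "amount": 5},
]
result = hash_join(customers, orders, "cust_id")
```

Only the build side has to be held in the hash table here; the probe side is consumed row by row, which is why the choice of the smaller input as the build side matters for memory.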

Dremio can also do runtime filtering for joins (i.e. dynamically apply filters from the inner table to the outer table to speed up filtering on larger tables).
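The idea behind runtime filtering can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Dremio's code: a plain `set` of join keys stands in for whatever compact filter the engine actually builds, and the names `runtime_filter`, `small_dim`, and `big_fact` are hypothetical.

```python
def runtime_filter(build_rows, probe_rows, key):
    """Yield only the probe rows whose join key exists on the build side."""
    # Collect the distinct join keys seen on the small (build) side.
    build_keys = {row[key] for row in build_rows}
    # Drop non-matching rows from the large side before they reach the join.
    for row in probe_rows:
        if row[key] in build_keys:
            yield row


# Hypothetical data: one matching region on the small side, 1000 rows on the large side.
small_dim = [{"region_id": 7, "region": "EMEA"}]
big_fact = [{"region_id": r % 10, "sale": r} for r in range(1000)]
survivors = list(runtime_filter(small_dim, big_fact, "region_id"))
```

In this toy case only the rows with `region_id` 7 survive, so the join that follows sees a fraction of the large table.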

Hope that helps you get started…
