Understanding "HASH PARTITION BY"

ddmbr · April 10, 2018, 3:11am

Suppose I create a reflection with HASH PARTITION BY (A, B, C). Does this mean that each hash bucket is identified by a combination of (A, B, C)? If so, will queries that filter only on A and B be slower than queries that filter on A, B, and C? I imagined that a WHERE clause on all three columns would make it fast to locate a bucket, while giving only two columns could result in a full scan of the reflection.

I probably need to read the source code to understand better…

ddmbr · April 12, 2018, 2:49am

I noticed that the created acceleration consists of partitioned parquet files. It seems to me this is the key, allowing me to create some hierarchy.

ben · May 1, 2019, 5:55pm

Hi @ddmbr,

Filtering with a WHERE clause on any of those columns will have essentially the same cost, as Dremio ignores the partition hierarchy at planning time. Instead, it stores information about the partition structure and builds an index on the partition keys. During planning, it uses this index to “prune” the files that cannot match the filter, so that only the remaining files are scanned during execution.
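
To make the idea concrete, here is a rough Python sketch of index-based pruning. It is only an illustration of the concept, not Dremio's actual implementation: the file list, the `build_index` and `prune` helpers, and the per-file metadata layout are all hypothetical, and it handles only equality filters.

```python
from collections import defaultdict

# Hypothetical metadata: each file records the partition values it holds.
FILES = [
    {"path": "f0.parquet", "A": 1, "B": "x", "C": 10},
    {"path": "f1.parquet", "A": 1, "B": "x", "C": 20},
    {"path": "f2.parquet", "A": 1, "B": "y", "C": 10},
    {"path": "f3.parquet", "A": 2, "B": "x", "C": 10},
]


def build_index(files, keys):
    """Map each (partition key, value) pair to the set of files holding it."""
    index = defaultdict(set)
    for f in files:
        for key in keys:
            index[(key, f[key])].add(f["path"])
    return index


def prune(files, index, filters):
    """Return the files that can still match every equality filter.

    Each filtered column is looked up independently, so a filter on any
    subset of the partition keys narrows the scan; no column has to be
    the "leading" one.
    """
    remaining = {f["path"] for f in files}
    for key, value in filters.items():
        remaining &= index.get((key, value), set())
    return sorted(remaining)


INDEX = build_index(FILES, keys=("A", "B", "C"))
two_columns = prune(FILES, INDEX, {"A": 1, "B": "x"})
three_columns = prune(FILES, INDEX, {"A": 1, "B": "x", "C": 10})
only_c = prune(FILES, INDEX, {"C": 10})
```

In this sketch, filtering on A and B alone still narrows the scan to two files, and filtering only on C narrows it to three, so neither case falls back to scanning everything.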
