Partition column reflections and Spark Catalyst

I set a low-cardinality column as a partition column in a reflection and then queried that dataset using Spark SQL. Normally I would expect Spark Catalyst to discover those partitions and partition the data accordingly (https://spark.apache.org/docs/2.2.2/sql-programming-guide.html#partition-discovery), but this did not happen: I got a single partition for the whole dataset. Is this expected?

Hey @kprifogle

Are you using the JDBC driver to talk to Dremio from Spark? Spark's partition discovery doesn't work over JDBC connections; it only applies to file-based sources where Spark can read the directory layout. You can instead use the partitioning options described in Section 5 of this article: https://medium.com/@radek.strnad/tips-for-using-jdbc-in-apache-spark-sql-396ea7b2e3d3
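To make the JDBC options concrete: Spark's JDBC source accepts `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`, and splits the read into that many parallel queries over ranges of the partition column. Below is a simplified, hypothetical sketch of the range-splitting logic (not Spark's exact implementation; Spark's stride computation and handling of `numPartitions = 1` differ slightly). The column name `id` and the bounds are illustrative:

```python
# Simplified illustration of how Spark's JDBC source turns
# partitionColumn / lowerBound / upperBound / numPartitions
# into one WHERE-clause predicate per partition.

def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one range predicate per partition for a JDBC-style split."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    bound = lower_bound + stride
    for i in range(num_partitions):
        if i == 0:
            # First partition also collects NULLs and values below lowerBound
            predicates.append(f"{column} < {bound} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended, catching values above upperBound
            predicates.append(f"{column} >= {bound - stride}")
        else:
            predicates.append(f"{column} >= {bound - stride} AND {column} < {bound}")
        bound += stride
    return predicates

# Four parallel range queries over an id column spanning 0..100:
for p in jdbc_partition_predicates("id", 0, 100, 4):
    print(p)
```

In PySpark you would pass these settings directly on the reader, e.g. `spark.read.format("jdbc").option("partitionColumn", "id").option("lowerBound", 0).option("upperBound", 100).option("numPartitions", 4)`, and Spark issues one query per range in parallel.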

There is a prototype custom connector for Spark that talks to Dremio natively; it should be dramatically faster than JDBC and will respect partitions. I will update you when it's released.
