Partition column reflections and Spark Catalyst

I set a low-cardinality column as a partition column in a reflection and then queried that dataset using Spark SQL. Normally I would expect Spark Catalyst to discover those partitions and partition the data accordingly (https://spark.apache.org/docs/2.2.2/sql-programming-guide.html#partition-discovery), but this did not happen: I got a single partition for the whole dataset. Is this expected?

Hey @kprifogle

Are you using the JDBC driver to talk to Dremio from Spark? Spark's partition discovery doesn't work over JDBC connections; it only applies to file-based sources where Spark can read the directory layout. You can instead use the partitioning options described in Section 5 of this article: https://medium.com/@radek.strnad/tips-for-using-jdbc-in-apache-spark-sql-396ea7b2e3d3
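To make the JDBC options concrete: Spark's JDBC source accepts `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`, and splits the read into that many parallel queries over ranges of the partition column. Below is a simplified, hypothetical sketch of the range-splitting logic (not Spark's exact implementation; Spark's stride computation and handling of `numPartitions = 1` differ slightly). The column name `id` and the bounds are illustrative:

```python
# Simplified illustration of how Spark's JDBC source turns
# partitionColumn / lowerBound / upperBound / numPartitions
# into one WHERE-clause predicate per partition.

def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one range predicate per partition for a JDBC-style split."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    bound = lower_bound + stride
    for i in range(num_partitions):
        if i == 0:
            # First partition also collects NULLs and values below lowerBound
            predicates.append(f"{column} < {bound} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended, catching values above upperBound
            predicates.append(f"{column} >= {bound - stride}")
        else:
            predicates.append(f"{column} >= {bound - stride} AND {column} < {bound}")
        bound += stride
    return predicates

# Four parallel range queries over an id column spanning 0..100:
for p in jdbc_partition_predicates("id", 0, 100, 4):
    print(p)
```

In PySpark you would pass these settings directly on the reader, e.g. `spark.read.format("jdbc").option("partitionColumn", "id").option("lowerBound", 0).option("upperBound", 100).option("numPartitions", 4)`, and Spark issues one query per range in parallel.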

There is a prototype custom connector for Spark that talks to Dremio natively; it should be dramatically faster than JDBC and will respect partitions. I will update you when it's released.
