Bigger picture: I think Dremio may be too powerful for the infrastructure we have. How do I make sure I don't crash Dremio or flood the database server with connections, threads, etc.?
Medium picture: I need a better understanding of how the Dremio JDBC connector queues reflections and other high-cost queries. The Queue section of the admin portal provides some control over the number of nodes, but not over the number of threads per query.
- In the Dremio settings, can I control the parallelism of the individual queries in a queue, as opposed to the number of concurrent jobs?
- Where is any record of the "queue" itself kept, i.e. the order in which staged queries will be dispatched? Do I just have to sort jobs by run time and reason it out?
For production I plan to limit the number of connections from Dremio using the database's listener.ora settings.
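As a complement (or alternative) to listener.ora limits, Oracle can cap concurrent sessions per account with a profile. A minimal sketch, assuming the Dremio source connects as a dedicated service account named DREMIO (hypothetical name) and assuming a 12-session cap matches the "8 or 12 threads" budget below:

```sql
-- Session limits in profiles are only enforced when RESOURCE_LIMIT is TRUE
-- (it defaults to TRUE on recent Oracle versions, but verify on 19c).
ALTER SYSTEM SET RESOURCE_LIMIT = TRUE;

-- Cap the number of concurrent sessions the Dremio account can open.
CREATE PROFILE dremio_profile LIMIT
  SESSIONS_PER_USER 12;

ALTER USER dremio PROFILE dremio_profile;
```

With this in place, Dremio connection attempts beyond the cap fail at the database rather than piling up as parallel query load, which is easier to observe and tune than a listener-level limit.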
- Is limiting Dremio to 8 or 12 threads out of 16 cores x 2 threads (32 total) too few?
- How should I time and size the queries and compute instances for different loads? My largest queries (about 1.5 GB) run in around 30 minutes.
Details on the limitation I have observed
Under Dremio Jobs > High Cost Reflection, my Oracle source database (Oracle 19c EE, 16-core Solaris) is throwing an error indicating there are too many parallel queries for the server to handle: ORA-12801 error signaled in parallel query server P00L.
Using the development Oracle database, I can monitor sessions in SQL Developer and see 16 active aggregation-reflection queries that Dremio is sending in parallel, plus several more Dremio sessions marked inactive that appear to be recent query runs or attempts. I cannot kill sessions, and I need to prepare for session logging and other OLTP overhead once we move to production. Right now I am relying on Dremio to know not to overload the server, or to let me control its queries somehow.
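For reference, the session picture described above can also be pulled directly from V$SESSION instead of the SQL Developer UI. A sketch, assuming the Dremio connection user is named DREMIO (hypothetical name):

```sql
-- List Dremio's sessions, active vs. inactive, newest activity first.
SELECT sid, serial#, status, sql_id, last_call_et
  FROM v$session
 WHERE username = 'DREMIO'
 ORDER BY status, last_call_et;

-- Given ALTER SYSTEM privilege, a stuck session can be terminated with:
-- ALTER SYSTEM KILL SESSION 'sid,serial#' IMMEDIATE;
```

Scheduling the SELECT periodically would give the production session logging mentioned above without any extra tooling.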
What to do??
- I must assume that any queries that fail over the JDBC driver can leave interrupted or inactive pending connections behind.
- There appear to be several background refreshes of metadata and freshness checks on the reflections.
- I can control which AWS cluster (engine) each queue uses; I just don't have control over the amount of parallelism within the queries themselves, or an understanding of how the ordering works.
- As a solution outside of Dremio, would I just need to schedule reflections to run overnight to exploit more parallelism, and use multiple database logins scoped with different parallelism settings?
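The multiple-login idea above can be enforced on the Oracle side with a logon trigger that pins each account to a fixed degree of parallelism. A sketch, assuming two hypothetical service accounts, DREMIO_ADHOC for daytime queries and DREMIO_BATCH for the overnight reflection refreshes:

```sql
-- Force a per-account degree of parallelism at connect time, so Dremio
-- cannot fan a single query out across all 16 cores during the day.
CREATE OR REPLACE TRIGGER limit_dremio_px
AFTER LOGON ON DATABASE
BEGIN
  IF USER = 'DREMIO_ADHOC' THEN
    -- Daytime account: run serially, leave capacity for OLTP.
    EXECUTE IMMEDIATE 'ALTER SESSION DISABLE PARALLEL QUERY';
  ELSIF USER = 'DREMIO_BATCH' THEN
    -- Overnight account: allow a bounded degree of parallelism.
    EXECUTE IMMEDIATE 'ALTER SESSION FORCE PARALLEL QUERY PARALLEL 4';
  END IF;
END;
/
```

Each Dremio source would then point at the login whose parallelism budget matches its queue, without depending on Dremio itself to throttle the queries.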