Data Source: Record Fetch Size, does this work?

Was having some issues loading large data sets on the initial load (before incremental updates kick in). I thought the Advanced option in the Data Source settings would help limit the amount of data extracted from the source, so I set the Record Fetch Size to 200,000 records and then rebuilt the reflections. The reflection job is still attempting to pull in ALL of the millions of records, not just 200,000 at a time.
How is this feature supposed to behave?

As a side note: be careful, changing this setting deletes all of the reflections set up on the direct source tables (though not any of the virtual datasets, thank goodness).

What source type are you connecting to?

In Dremio 2.0, when editing a source you should get a dialog telling you that your changes will cause all reflections etc. to be cleared out, depending on which settings you are changing.

MySQL

Yes. I got the prompt. Which was fine. I was prepared to reset everything.

However, setting the Record Fetch Size did not appear to change anything.

If you want to limit the amount of data a reflection consists of, I would recommend adding a filter (WHERE clause) in your VDS before creating the reflection.
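A minimal sketch of that approach (the space, source, table, and column names here are made up for illustration, and it assumes a Dremio version that supports CREATE VDS):

```sql
-- Hypothetical VDS that filters the physical table before acceleration;
-- MySource.sales.orders and order_date are assumed names.
CREATE VDS MySpace.recent_orders AS
SELECT *
FROM MySource.sales.orders
WHERE order_date >= DATE '2019-01-01';

-- A raw reflection created on MySpace.recent_orders (e.g. via the UI)
-- would then only materialize the filtered rows, not the whole table.
```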

I do not want to limit the data a reflection consists of.

I want to limit the amount of data that Dremio fetches when building a reflection.

More to the point, I would like to understand what the Record Fetch Size configuration option actually does, and what it is for.

I’m testing with Dremio 3.0.6 against Oracle 11g, creating a reflection on a table that has 17M rows. The reflection takes approximately 1.5 hours to create, compared to roughly 5 minutes for a similar-sized table on a PostgreSQL source.

I’m trying to figure out where the bottleneck is for the Oracle data source. I’ve tried setting the Record Fetch Size on the Oracle source to both high and low values (100,000; 500; 100), but the time to create the reflection does not change.

  1. Does the “Record Fetch Size” parameter have any effect?
  2. Is there a way to get more throughput from the Oracle connector (e.g. parallelize it)?

You might want to check how long Oracle itself takes to run the same query Dremio is running and return the results. If Oracle returns all the results much faster, then the bottleneck might be between Dremio and Oracle, or in Dremio’s processing code (including the time spent writing data to your distributed storage).
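One way to measure the Oracle side in isolation is a quick SQL*Plus check (a sketch; the schema and table names are assumed, and AUTOTRACE may require extra privileges on some databases):

```sql
-- In sqlplus: time the full fetch without rendering the rows on screen.
SET TIMING ON
SET AUTOTRACE TRACEONLY
SELECT * FROM myschema.big_table;
```

If this completes much faster than the reflection build, the bottleneck is likely on the Dremio side of the connection.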

Also, a note about fetch size: it is a hint to the RDBMS driver about how many rows to fetch per round trip, instead of fetching them one by one (which would cause a lot of latency and slow transfer speeds) or all at once (which would cause memory issues). It is NOT a cap on how many total records are fetched for a given query. For example, fetching a 17M-row result 500 rows at a time means roughly 34,000 driver round trips, while 100,000 at a time means about 170; either way, all 17M rows come back.
By default, Dremio requests as many rows as its own batch size. Also, some drivers do not honor the setting under some circumstances.

Thanks @laurent
I’m trying to find the most efficient way to load the data from Oracle into a reflection. To parallelize it, I’m thinking I’ll create separate virtual datasets, each selecting a subset of the data from Oracle (split by date), and create separate reflections on them, then have a master virtual dataset that unions the others together, along the lines of the sketch below. The idea is that Dremio could then distribute the load.
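A sketch of what that might look like (all space, table, and column names are assumed for illustration):

```sql
-- Hypothetical per-range VDSs, each covering one date slice of the source table.
CREATE VDS MySpace.events_2018 AS
SELECT * FROM Oracle.sales.events
WHERE event_date >= DATE '2018-01-01' AND event_date < DATE '2019-01-01';

CREATE VDS MySpace.events_2019 AS
SELECT * FROM Oracle.sales.events
WHERE event_date >= DATE '2019-01-01' AND event_date < DATE '2020-01-01';

-- Master VDS that stitches the slices back together.
CREATE VDS MySpace.events_all AS
SELECT * FROM MySpace.events_2018
UNION ALL
SELECT * FROM MySpace.events_2019;

-- A raw reflection would then be created on each slice VDS (e.g. via the UI).
```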

Does that sound like a good approach or is there a better way?

Creating a single VDS that is the union of your Oracle subsets and accelerating that one VDS might achieve the same thing (and require less work).
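In other words (again with assumed names), the reflection would go on the union itself rather than on each slice:

```sql
-- Hypothetical: one VDS unioning the subsets, with a single reflection on it.
CREATE VDS MySpace.events_all AS
SELECT * FROM Oracle.sales.events WHERE event_date <  DATE '2019-01-01'
UNION ALL
SELECT * FROM Oracle.sales.events WHERE event_date >= DATE '2019-01-01';

-- Accelerate MySpace.events_all with one raw reflection instead of one per slice.
```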

That said, I would be interested in understanding your Oracle performance issue, and whether Dremio is significantly slower at fetching data from Oracle than fetching it directly.

Hi @laurent
The table I’m trying to create a reflection on has hundreds of millions of records. For 17 million records it takes 1.5 hours to complete the reflection. As a comparison, I extracted the 17 million records from Oracle to a CSV file on an NFS server, then created a NAS source in Dremio and built a reflection on it, which took 10 minutes.

I’m wondering if it would be possible to get similar performance creating the reflection directly from Oracle, to avoid copying data.

In theory yes, although in the CSV case Dremio might be taking advantage of its ability to read the file(s) in parallel. Out of curiosity, how long did it take to generate the CSV file from Oracle?

Hi @laurent It takes approximately 53 minutes to generate the CSV from Oracle using sqlplus running serially. With a parallel sqlplus export (5 parallel sessions) it takes approximately 10 minutes. Is Dremio loading the data serially?

Thanks for sharing some numbers, we will look into it.

As for Dremio, loading data from an RDBMS is indeed done serially for now.

Hello! @laurent, did you ever find out more about this?
I’m asking because I have the same performance issue with a large Oracle table (50M+ rows) and the fetch parameters.
From what I can see, when the reflection update starts, the JDBC_SUB_SCAN section of the profile shows Dremio fetching roughly 1,000 rows per batch:
Max Batches = 2,648
Max Records = 2,708,904
That works out to 1,023 records per batch, even though Record Fetch Size = 20,000 in the data source settings.
Are there settings for memory use or the JDBC fetch size that can be modified?

I ran into a similar problem. Any update on this thread?