@Ken Where is the data in the table Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp stored? From the name it looks like MinIO. All the time is spent waiting in TABLE_FUNCTION 03-xx-02. Expand Operator Metrics and you will see that NUM_CACHE_MISSES matches NUM_CACHE_HITS, which in turn matches the total number of readers; that looks like a reporting issue I will follow up on. The total number of data files to be read is 30,162. Can you run select * on this table a few times (from JDBC, as the UI will truncate the results) and then try the CTAS?
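A minimal sketch of that warm-up, assuming the table name from this thread and that the goal is simply to pull every data file through the C3 cache before retrying the CTAS; the target table name used in the CTAS below is hypothetical:

```sql
-- Run a few times from a JDBC client so the full result set is actually read
-- (the UI truncates results, so not every data file would be touched and cached).
SELECT *
FROM Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp;

-- Then retry the CTAS once the cache is warm.
-- The target table name below is hypothetical, for illustration only.
CREATE TABLE Minio.finance.l1.rollup_daily_hk_stock_shareholding_change AS
SELECT *
FROM Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp;
```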
Yes, the problem was over-partitioning of the source table. You had 30K Parquet files with about 15 records per file, and there is too much overhead in opening and closing that many files for such a small dataset.
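If rewriting the source layout is an option, one way to cut the file count is to compact the small files. A minimal sketch, assuming a Dremio version that supports Iceberg table maintenance via OPTIMIZE TABLE and using a hypothetical source table name (check the exact syntax against your Dremio version's documentation):

```sql
-- Bin-pack many small Parquet files into fewer, larger ones.
-- "Minio.finance.l1.hk_stock_shareholding_source" is a hypothetical table name.
OPTIMIZE TABLE Minio.finance.l1.hk_stock_shareholding_source;
```

Reducing the partition granularity at write time, so each partition holds far more than ~15 records, addresses the same problem at the source.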
Were you able to get the incremental model to work successfully? I.e. was there an incremental filter present in the query that was used to materialize that dbt_tmp table?
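For reference, a minimal sketch of what such an incremental filter typically looks like in a dbt model; the model file name, upstream ref, and column names are hypothetical, but `config()`, `is_incremental()`, and `{{ this }}` are standard dbt:

```sql
-- models/rollup_daily_hk_stock_shareholding_change.sql (hypothetical file name)
{{ config(materialized='incremental') }}

select *
from {{ ref('stg_hk_stock_shareholding') }}  -- hypothetical upstream model

{% if is_incremental() %}
  -- Only select rows newer than what is already in the target table,
  -- keeping the __dbt_tmp table (and the files it has to scan) small.
  where trade_date > (select max(trade_date) from {{ this }})
{% endif %}
```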
Hey there,
I understand your issue: the insert from the __dbt_tmp table into the target table is slower than expected. Analyzing the job profile you provided can help identify bottlenecks or inefficiencies. Optimizing the SQL queries and checking the Dremio environment configuration may also improve performance. If needed, ask the dbt-dremio community or the support team for assistance.
You can use either the Visual Profile or the Query Profile’s Operator metrics to get partitioning and row count information. Basically, every Iceberg table scan is going to have a pattern that looks like this:
ICEBERG_SUB_SCAN → TABLE_FUNCTION → OTHER OPERATORS → TABLE_FUNCTION
ICEBERG_SUB_SCAN is a scan against the Iceberg manifest list and outputs the number of manifest files (after partition field pruning).
The first TABLE_FUNCTION is a scan against the manifest files and outputs the number of data files (after partition and non-partition field pruning).
The second TABLE_FUNCTION is the distributed scan against the Parquet data files, which can also include row-group pruning (see NUM_ROW_GROUPS_PRUNED). Its output is the total number of records read from the Iceberg table. NUM_CACHE_HITS and NUM_CACHE_MISSES are particularly important here for I/O performance because they tell you about the state of the C3 cache.
OTHER OPERATORS is a special case that handles data file pruning for filter expressions that cannot be pushed down into the Iceberg manifest file scan.
The Visual Profile will sum the output rows across threads for each operator, whereas the Operator Metrics will break down the max records per thread; this just sums back to the same total shown in the Visual Profile.
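As a rough worked example using the numbers from earlier in this thread: with 30,162 data files at roughly 15 records each, the second TABLE_FUNCTION outputs on the order of 30,162 × 15 ≈ 450,000 records, but it has to open and close more than 30,000 Parquet files to produce them, which is why the wait time on that operator dominates the profile.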