Slow insert into table

I am using dbt-dremio to run an incremental model.

In the first phase, it creates a __dbt_tmp table.

It runs pretty quickly.


(Screenshot: Fast creation of __dbt_tmp)

After that, in the second phase, it selects * from __dbt_tmp and then inserts into the target table.

The second phase, however, runs pretty slowly.


(Screenshot: Very slow insert from __dbt_tmp into target table)
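For context, the two phases correspond roughly to SQL like the sketch below. The temp table name is the one from my job profile; the target table name and the source of the SELECT are simplified placeholders, not the exact statements dbt-dremio generates.

```sql
-- Phase 1 (fast): materialize the model's SELECT into a temp table.
CREATE TABLE Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp AS
SELECT *
FROM Minio.finance.l1.upstream_model;  -- placeholder for the model's actual SELECT

-- Phase 2 (slow in my case): append the temp table into the target table.
INSERT INTO Minio.finance.l1.rollup_daily_hk_stock_shareholding_change
SELECT *
FROM Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp;
```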

Attached below is the job profile of the long-running insert.

e0ae6a1c-8dc2-4565-9881-e460d2e5c7ca.zip (20.0 KB)

Not too sure what the cause is.

@Ken Where is the data for the table Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp stored? From the name it looks like MinIO.

All of the time is spent as wait time on table_function 03-xx-02. If you expand Operator Metrics, you will see that NUM_CACHE_MISSES matches NUM_CACHE_HITS, which also matches the total number of readers; that looks like a reporting issue I will follow up on. The total number of data files to be read is 30,162.

Can you run select * on this table a few times (from JDBC, as the UI will truncate the results) and then try the CTAS?
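Something like the following, run over JDBC (the CTAS target name here is just a placeholder for a scratch table):

```sql
-- Scan the temp table a few times to warm the cache for those data files.
SELECT * FROM Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp;
SELECT * FROM Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp;
SELECT * FROM Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp;

-- Then retry the CTAS and compare the elapsed time against the original insert.
CREATE TABLE Minio.finance.l1.rollup_daily_hk_stock_shareholding_change_ctas_test AS
SELECT * FROM Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp;
```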

Hi @balaji.ramaswamy

Thanks for following up.

I have this job running once every day at night.

Attached below is the log file from yesterday's run.

c1e8234f-2eb5-4344-8a7f-a182f1705074.zip (22.9 KB)

This simple insert still took 33 minutes.

=========

In fact, I see that the __dbt_tmp table (i.e. the source table) is created with a partition on sehk_code.

Actually, I only want the partition on the destination table.

I suspect the insert is slow because there are too many partitions in the source table.

Let me try to change/remove the partition and see how long it runs tonight.

I changed the partition_by to another field called as_of_date, and the INSERT INTO xxxx SELECT * FROM yyy is now much faster.

I believe the problem was the partition key I chose.
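For anyone hitting the same thing, the change amounts to swapping the partition key in the model config. A rough sketch only: the upstream ref, the incremental filter, and the exact partition_by format below are illustrative, not my actual model.

```sql
-- models/l1/rollup_daily_hk_stock_shareholding_change.sql (sketch)
-- The old config partitioned by sehk_code (high cardinality, many tiny files);
-- the new config partitions by as_of_date instead.
{{
  config(
    materialized='incremental',
    partition_by=['as_of_date']
  )
}}

select *
from {{ ref('upstream_model') }}  -- placeholder for the model's actual logic
{% if is_incremental() %}
  -- only pick up rows newer than what is already in the target table
  where as_of_date > (select max(as_of_date) from {{ this }})
{% endif %}
```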

Yes, the problem was over-partitioning of the source table. You had 30K Parquet files with about 15 records per file; there is too much overhead in opening and closing that many files for such a small dataset.
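If you want to confirm this on your own tables, Dremio's table_files metadata function gives a quick view of the file layout. A sketch along these lines (the exact columns returned may vary by Dremio version):

```sql
-- Count the Iceberg data files and the average records per file.
-- A huge file count with only a handful of records per file indicates over-partitioning.
SELECT COUNT(*)          AS num_data_files,
       AVG(record_count) AS avg_records_per_file
FROM TABLE(table_files('Minio.finance.l1.rollup_daily_hk_stock_shareholding_change__dbt_tmp'));
```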

Were you able to get the incremental model to work successfully? I.e. was there an incremental filter present in the query that was used to materialize that dbt_tmp table?

Hi Benny,

Thanks for following up.

The issue was not with materializing __dbt_tmp itself.

__dbt_tmp was created with lots of small partitions, which caused problems in the second stage of selecting from __dbt_tmp into the final table.

But the issue was already resolved after I changed to another partition key for the incremental model.

I used the date as the key, which is more appropriate, since I get new rows every day (and they all fall into the new day's partition).

Thank you!