Catalog query timing out

It takes a long time for Tableau to refresh metadata,
and sometimes the refresh fails because it exceeds 60 seconds of planning time.
The cluster is sized with 10 GB of RAM for the JVM,
and the data set is small.
Sometimes it also ran out of memory while connecting/refreshing the connection from Tableau.
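For reference, an executor's JVM heap and direct memory are set in conf/dremio-env; a minimal sketch of the two knobs (variable names taken from Dremio's shipped dremio-env template, so verify against your install; the values are illustrative only, not a sizing recommendation):

```shell
# conf/dremio-env — per-node memory settings, in MB (illustrative values)
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=8192      # JVM heap: planning, metadata, coordination
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=32768   # direct memory: query execution buffers
```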

cc24a193-b890-46d3-96c6-b782d15d0807.zip (1.6 KB)

We are aware that in some situations metadata queries from Tableau take a very long time, and we are working on optimizing this. However, people usually don’t observe memory issues with these requests.

Would you mind sharing the content of server.log, server.out and server.gc (if present)? Also, if you can give some details about your HDFS and Hive sources (how many datasets, average size, average number of columns, and average number of partitions), it would help us better diagnose the issue.

Hi Laurent,
The environment is quite simple:
HDP 2.6 running on AWS, 16 cores, 64 GB of RAM.
Data set: < 100 MB of total data size,
4 datasets (Customer, Product, Date and Clicks).
I created a mix of 3 HDFS sources and 1 Hive source (fact -> Clickstream), single file.
Dremio runs on a 3-node cluster:
1 master/coordinator + 2 executors (40 GB each)

dremio.tar.zip (233.7 KB)

The file is in tar+gzip format.

Looking at the server.log/queries.json files, the metadata queries seem to complete normally.

What I can observe, however, is some cases of very long (>60 s) planning for queries involving the Workspace.Clickstream_Analytic dataset. As the query doesn’t seem complex, could you share the dataset definition so we can try to reproduce the issue in-house and possibly come up with a fix?

Certainly, I have the data files too if you want.
It’s about 20 MB total; I’m not sure if I can upload that size, but here is the DDL: omniture.tar.zip (1.4 KB)

Thanks for the DDL! As for Workspace.Clickstream_Analytic, can you confirm whether it is a virtual dataset in the Workspace space (and if so, what the associated SQL query is), or just one of the datasets whose DDL you sent?

That is correct.
Clickstream_Analytics is a join between all those fact + dim tables.

We would need the SQL query defining this dataset, if possible, as it is most likely what is causing the issue when planning the final query.

SELECT ipaddress, client_lang, CASE WHEN length(substr(client_lang, 1, 5)) > 0 THEN substr(client_lang, 1, 5) ELSE NULL END AS "Language", isp, user_agent, http_response, city, country, state, local_radio, local_tv, Product_Line, Category, Customer_ID, Birth_Date, Gender, FullDateAlternateKey, daynumberofmonth, nested_0."year" AS "year", Count_cookieid, Sum_pageview, Sum_timespent, Count_txn_id
FROM (
SELECT nested_2.ipaddress AS ipaddress, nested_2.client_lang AS client_lang, nested_2.isp AS isp, nested_2.user_agent AS user_agent, nested_2.http_response AS http_response, nested_2.city AS city, nested_2.country AS country, nested_2.state AS state, nested_2.local_radio AS local_radio, nested_2.local_tv AS local_tv, nested_2.Product_Line AS Product_Line, nested_2.Category AS Category, nested_2.Customer_ID AS Customer_ID, nested_2.Birth_Date AS Birth_Date, nested_2.Gender AS Gender, join_CalendarDate.FullDateAlternateKey AS FullDateAlternateKey, join_CalendarDate.daynumberofmonth AS daynumberofmonth, join_CalendarDate."year" AS "year", COUNT(DISTINCT nested_2.cookieid) AS Count_cookieid, SUM(nested_2.pageview) AS Sum_pageview, SUM(nested_2.timespent) AS Sum_timespent, COUNT(nested_2.txn_id) AS Count_txn_id
FROM (
SELECT nested_1.txn_id AS txn_id, nested_1.txn_time AS txn_time, CASE WHEN length(substr(nested_1.txn_time, 1, 10)) > 0 THEN substr(nested_1.txn_time, 1, 10) ELSE NULL END AS DateKey, nested_1.ipaddress AS ipaddress, nested_1.pageview AS pageview, nested_1.cookieid AS cookieid, nested_1.client_lang AS client_lang, nested_1.res_vert AS res_vert, nested_1.res_horz AS res_horz, nested_1.http_response AS http_response, nested_1.isp AS isp, nested_1.user_agent AS user_agent, nested_1.timespent AS timespent, nested_1.city AS city, nested_1.country AS country, nested_1.state AS state, nested_1.local_radio AS local_radio, nested_1.local_tv AS local_tv, nested_1.url AS url, nested_1.Product_Line AS Product_Line, nested_1.Category AS Category, nested_1.Customer_ID AS Customer_ID, join_Customer.B AS Birth_Date, join_Customer.C AS Gender
FROM (
SELECT nested_0.txn_id AS txn_id, nested_0.txn_time AS txn_time, nested_0.ipaddress AS ipaddress, nested_0.pageview AS pageview, nested_0.cookieid AS cookieid, nested_0.Customer_ID AS Customer_ID, nested_0.client_lang AS client_lang, nested_0.res_vert AS res_vert, nested_0.res_horz AS res_horz, nested_0.http_response AS http_response, nested_0.isp AS isp, nested_0.user_agent AS user_agent, nested_0.timespent AS timespent, nested_0.city AS city, nested_0.country AS country, nested_0.state AS state, nested_0.local_radio AS local_radio, nested_0.local_tv AS local_tv, nested_0.url AS url, extract_pattern(nested_0.url, '\w+', 4, 'INDEX') AS Product_Line, join_Product.B AS Category
FROM (
SELECT txn_id, txn_time, ipaddress, pageview, url, cookieid, CASE WHEN length(substr(cookieid, 2, 36)) > 0 THEN substr(cookieid, 2, 36) ELSE NULL END AS Customer_ID, client_lang, res_vert, res_horz, http_response, isp, user_agent, timespent, city, country, state, local_radio, local_tv
FROM Workspace.Clickstream
) nested_0
INNER JOIN Workspace.Product AS join_Product ON nested_0.url = join_Product.A
) nested_1
INNER JOIN Workspace.Customer AS join_Customer ON nested_1.Customer_ID = join_Customer.A
) nested_2
INNER JOIN Workspace.CalendarDate AS join_CalendarDate ON nested_2.DateKey = join_CalendarDate.FullDateAlternateKey
GROUP BY nested_2.ipaddress, nested_2.client_lang, nested_2.isp, nested_2.user_agent, nested_2.http_response, nested_2.city, nested_2.country, nested_2.state, nested_2.local_radio, nested_2.local_tv, nested_2.Product_Line, nested_2.Category, nested_2.Customer_ID, nested_2.Birth_Date, nested_2.Gender, join_CalendarDate.FullDateAlternateKey, join_CalendarDate.daynumberofmonth, join_CalendarDate."year"
) nested_0
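As a side note on the query text: the repeated CASE WHEN length(substr(x, i, n)) > 0 THEN substr(x, i, n) ELSE NULL END guard is equivalent to NULLIF(substr(x, i, n), ''), which reads more simply. A quick sanity check of that equivalence from Python, using SQLite purely as a stand-in for standard SQL semantics (not Dremio itself):

```python
import sqlite3

# The repeated pattern
#   CASE WHEN length(substr(x, 1, 5)) > 0 THEN substr(x, 1, 5) ELSE NULL END
# nulls out empty results of substr; NULLIF(substr(x, 1, 5), '') does the same.
con = sqlite3.connect(":memory:")
for val in ("en-US,en;q=0.9", "", "fr"):
    case_form, nullif_form = con.execute(
        "SELECT CASE WHEN length(substr(?, 1, 5)) > 0 "
        "THEN substr(?, 1, 5) ELSE NULL END, "
        "NULLIF(substr(?, 1, 5), '')",
        (val, val, val),
    ).fetchone()
    assert case_form == nullif_form  # both NULL for '', both 'fr'/'en-US' otherwise
print("CASE and NULLIF forms agree")
```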

Thank you very much. We will work on reproducing the issue and finding a fix. I’ll keep you posted with the progress.

Sure. To give you more context:
CalendarDate
SELECT A AS DateKey, B AS FullDateAlternateKey, C AS daynumberofweek, D AS daynameofweek, E AS daynumberofmonth, F AS daynumberofyear, G AS monthname, H AS numberofyear, I AS quarter, J AS "year"
FROM HDFS."user".dremio.sample.data.omniture.date_omniture."date_omniture.tsv"
Customer
SELECT *
FROM HDFS."user".dremio.sample.data.omniture.users."users.tsv"
Product
SELECT *
FROM HDFS."user".dremio.sample.data.omniture.products."products.tsv"
Clickstream
SELECT *
FROM Hive.dremio.clickstream

date_omniture.zip (14.3 KB)

products.zip (652 Bytes)

users.zip (931.4 KB)

And for the fact table:
https://drive.google.com/file/d/1q8dhlhL7EUQ97X7vRGj449vYTJ3lhHcw/view?usp=sharing