Catalog query timing out

It takes a long time for Tableau to refresh metadata,
and sometimes the refresh fails because it exceeds 60 seconds of planning time.
The cluster is sized with 10 GB of RAM for the JVM,
and the data set is small.
Sometimes it also ran out of memory while connecting/refreshing the connection from Tableau.
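For reference, an executor's JVM heap and direct memory are set in conf/dremio-env; a minimal sketch of the two knobs (variable names taken from Dremio's shipped dremio-env template, so verify against your install; the values are illustrative only, not a sizing recommendation):

```shell
# conf/dremio-env — per-node memory settings, in MB (illustrative values)
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=8192      # JVM heap: planning, metadata, coordination
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=32768   # direct memory: query execution buffers
```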

cc24a193-b890-46d3-96c6-b782d15d0807.zip (1.6 KB)

We are aware that in some situations metadata queries from Tableau take a very long time, and we are working on optimizing this. However, people usually don’t observe memory issues with these requests.

Would you mind sharing the content of server.log, server.out and server.gc (if present)? Also, if you can give some details about your HDFS and Hive sources (how many datasets, average size, average number of columns, and average number of partitions), it would help us better diagnose the issue.

Hi Laurent,
The environment is quite simple:
HDP 2.6 running on AWS, 16 cores, 64 GB of RAM.
Data set: < 100 MB of total data size,
4 datasets (Customer, Product, Date and Clicks).
I created a mix of 3 HDFS sources and 1 Hive source (fact -> Clickstream), single file.
Dremio runs on a 3-node cluster:
1 master/coordinator + 2 executors (40 GB each)

dremio.tar.zip (233.7 KB)

The file is in tar+gzip format.

Looking at the server.log/queries.json files, the metadata queries seem to complete normally.

What I can observe, however, is some cases of very long (>60 s) planning for queries involving the Workspace.Clickstream_Analytic dataset. As the query doesn’t seem complex, could you share the dataset definition so we can try to reproduce the issue in-house and possibly come up with a fix?

Certainly, I have the data files too if you want.
It’s about 20 MB total; I’m not sure if I can upload that size, but here is the DDL: omniture.tar.zip (1.4 KB)

Thanks for the DDL! As for Workspace.Clickstream_Analytic, can you confirm whether it is a virtual dataset in the Workspace space (and if so, what the associated SQL query is), or just one of the datasets whose DDL you sent?

That is correct.
Clickstream_Analytics is a join between all those fact + dim tables.

We would need the SQL query defining this dataset, if possible, as it is most likely what is causing the issue when planning the final query.

SELECT ipaddress, client_lang, CASE WHEN length(substr(client_lang, 1, 5)) > 0 THEN substr(client_lang, 1, 5) ELSE NULL END AS "Language", isp, user_agent, http_response, city, country, state, local_radio, local_tv, Product_Line, Category, Customer_ID, Birth_Date, Gender, FullDateAlternateKey, daynumberofmonth, nested_0."year" AS "year", Count_cookieid, Sum_pageview, Sum_timespent, Count_txn_id
FROM (
SELECT nested_2.ipaddress AS ipaddress, nested_2.client_lang AS client_lang, nested_2.isp AS isp, nested_2.user_agent AS user_agent, nested_2.http_response AS http_response, nested_2.city AS city, nested_2.country AS country, nested_2.state AS state, nested_2.local_radio AS local_radio, nested_2.local_tv AS local_tv, nested_2.Product_Line AS Product_Line, nested_2.Category AS Category, nested_2.Customer_ID AS Customer_ID, nested_2.Birth_Date AS Birth_Date, nested_2.Gender AS Gender, join_CalendarDate.FullDateAlternateKey AS FullDateAlternateKey, join_CalendarDate.daynumberofmonth AS daynumberofmonth, join_CalendarDate."year" AS "year", COUNT(DISTINCT nested_2.cookieid) AS Count_cookieid, SUM(nested_2.pageview) AS Sum_pageview, SUM(nested_2.timespent) AS Sum_timespent, COUNT(nested_2.txn_id) AS Count_txn_id
FROM (
SELECT nested_1.txn_id AS txn_id, nested_1.txn_time AS txn_time, CASE WHEN length(substr(nested_1.txn_time, 1, 10)) > 0 THEN substr(nested_1.txn_time, 1, 10) ELSE NULL END AS DateKey, nested_1.ipaddress AS ipaddress, nested_1.pageview AS pageview, nested_1.cookieid AS cookieid, nested_1.client_lang AS client_lang, nested_1.res_vert AS res_vert, nested_1.res_horz AS res_horz, nested_1.http_response AS http_response, nested_1.isp AS isp, nested_1.user_agent AS user_agent, nested_1.timespent AS timespent, nested_1.city AS city, nested_1.country AS country, nested_1.state AS state, nested_1.local_radio AS local_radio, nested_1.local_tv AS local_tv, nested_1.url AS url, nested_1.Product_Line AS Product_Line, nested_1.Category AS Category, nested_1.Customer_ID AS Customer_ID, join_Customer.B AS Birth_Date, join_Customer.C AS Gender
FROM (
SELECT nested_0.txn_id AS txn_id, nested_0.txn_time AS txn_time, nested_0.ipaddress AS ipaddress, nested_0.pageview AS pageview, nested_0.cookieid AS cookieid, nested_0.Customer_ID AS Customer_ID, nested_0.client_lang AS client_lang, nested_0.res_vert AS res_vert, nested_0.res_horz AS res_horz, nested_0.http_response AS http_response, nested_0.isp AS isp, nested_0.user_agent AS user_agent, nested_0.timespent AS timespent, nested_0.city AS city, nested_0.country AS country, nested_0.state AS state, nested_0.local_radio AS local_radio, nested_0.local_tv AS local_tv, nested_0.url AS url, extract_pattern(nested_0.url, '\w+', 4, 'INDEX') AS Product_Line, join_Product.B AS Category
FROM (
SELECT txn_id, txn_time, ipaddress, pageview, url, cookieid, CASE WHEN length(substr(cookieid, 2, 36)) > 0 THEN substr(cookieid, 2, 36) ELSE NULL END AS Customer_ID, client_lang, res_vert, res_horz, http_response, isp, user_agent, timespent, city, country, state, local_radio, local_tv
FROM Workspace.Clickstream
) nested_0
INNER JOIN Workspace.Product AS join_Product ON nested_0.url = join_Product.A
) nested_1
INNER JOIN Workspace.Customer AS join_Customer ON nested_1.Customer_ID = join_Customer.A
) nested_2
INNER JOIN Workspace.CalendarDate AS join_CalendarDate ON nested_2.DateKey = join_CalendarDate.FullDateAlternateKey
GROUP BY nested_2.ipaddress, nested_2.client_lang, nested_2.isp, nested_2.user_agent, nested_2.http_response, nested_2.city, nested_2.country, nested_2.state, nested_2.local_radio, nested_2.local_tv, nested_2.Product_Line, nested_2.Category, nested_2.Customer_ID, nested_2.Birth_Date, nested_2.Gender, join_CalendarDate.FullDateAlternateKey, join_CalendarDate.daynumberofmonth, join_CalendarDate."year"
) nested_0
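As a side note on the query text: the repeated CASE WHEN length(substr(x, i, n)) > 0 THEN substr(x, i, n) ELSE NULL END guard is equivalent to NULLIF(substr(x, i, n), ''), which reads more simply. A quick sanity check of that equivalence from Python, using SQLite purely as a stand-in for standard SQL semantics (not Dremio itself):

```python
import sqlite3

# The repeated pattern
#   CASE WHEN length(substr(x, 1, 5)) > 0 THEN substr(x, 1, 5) ELSE NULL END
# nulls out empty results of substr; NULLIF(substr(x, 1, 5), '') does the same.
con = sqlite3.connect(":memory:")
for val in ("en-US,en;q=0.9", "", "fr"):
    case_form, nullif_form = con.execute(
        "SELECT CASE WHEN length(substr(?, 1, 5)) > 0 "
        "THEN substr(?, 1, 5) ELSE NULL END, "
        "NULLIF(substr(?, 1, 5), '')",
        (val, val, val),
    ).fetchone()
    assert case_form == nullif_form  # both NULL for '', both 'fr'/'en-US' otherwise
print("CASE and NULLIF forms agree")
```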

Thank you very much. We will work on reproducing the issue and finding a fix. I’ll keep you posted with the progress.

Sure. To give you more context:
CalendarDate
SELECT A AS DateKey, B AS FullDateAlternateKey, C AS daynumberofweek, D AS daynameofweek, E AS daynumberofmonth, F AS daynumberofyear, G AS monthname, H AS numberofyear, I AS quarter, J AS "year"
FROM HDFS."user".dremio.sample.data.omniture.date_omniture."date_omniture.tsv"
Customer
SELECT *
FROM HDFS."user".dremio.sample.data.omniture.users."users.tsv"
Product
SELECT *
FROM HDFS."user".dremio.sample.data.omniture.products."products.tsv"
Clickstream
SELECT *
FROM Hive.dremio.clickstream

date_omniture.zip (14.3 KB)

products.zip (652 Bytes)

users.zip (931.4 KB)

And for the fact table:
https://drive.google.com/file/d/1q8dhlhL7EUQ97X7vRGj449vYTJ3lhHcw/view?usp=sharing