Simple Query with in clauses running extremely slow

meabhishek80 · May 20, 2020, 3:07am

I am completely new to Big data and so to dremio. Apologies if my queries are basic.

I pushed my data at a certain location in hadoop. This has almost 900 million records distributed automatically by Spark API in almost 1500 file. Each records has appx 470 columns mix of String and integer and date fields

Now I created a VDS in dremio pointing to this HDFS location. Now, If I run a simple query like
select * from tableName where isin=‘isin1’, it runs very fine and comes back in 1-2 minutes but as soon as I modify this query to lookup multiple isin, it takes almost 50 minutes.

If i further modify the query to include date fields in filter, it further worsens.
select * from tableName where isin=‘isin1’ and ‘2020-05-02’ between from_dt and to_dt

Noe:- The VDS doe snot have any reflection or indexes till now

how can we solve it? Should we create indexes/refelction over these columns.

Please advise.

desi · May 20, 2020, 4:21am

Your query needs to be modified as below. The existing query will scan all partitions record by record evaluating the between clause. Instead you want your query to range scan partitions even if it does not prune paritions (due to your partitioning scheme). See below:

select * from tableName where isin=‘isin1’
and ‘2020-05-02’ >= from_dt and ‘2020-05-02’ <= to_dt

meabhishek80 · May 20, 2020, 5:26am

Can you also suggest why query with in clause also takes lot of time

Query lile

select * from tablebame where isin in (‘isin1’, isin2’,…). This also takes lot of time

meabhishek80 · May 20, 2020, 5:35am

Do we need to create reflection/index on these column for such queries to work faster. Infact I created one Dimension on the column. After this, whenever I query with column isin as filter, it started giving me “Error setting up remote fragment execution”

Also uploading the profile when i got above error

e5511542-2c5d-4464-99aa-7019fe294f71.zip (150.8 KB)

Topic		Replies	Views
Error when querying Redshift with IN clause	2	841	March 4, 2022
VDS using native SQL	11	3052	November 27, 2018
Dremio become slow when using where clause	1	1549	February 27, 2019
Speed of execution of a query for oracle dataset	5	1769	November 11, 2019
VDS sending Full Table Scan to Oracle	3	918	July 2, 2021

Simple Query with in clauses running extremely slow

Related topics