Simple Query with in clauses running extremely slow

I am completely new to Big data and so to dremio. Apologies if my queries are basic.

I pushed my data at a certain location in hadoop. This has almost 900 million records distributed automatically by Spark API in almost 1500 file. Each records has appx 470 columns mix of String and integer and date fields

Now I created a VDS in dremio pointing to this HDFS location. Now, If I run a simple query like
select * from tableName where isin=‘isin1’, it runs very fine and comes back in 1-2 minutes but as soon as I modify this query to lookup multiple isin, it takes almost 50 minutes.

If i further modify the query to include date fields in filter, it further worsens.
select * from tableName where isin=‘isin1’ and ‘2020-05-02’ between from_dt and to_dt

Noe:- The VDS doe snot have any reflection or indexes till now

how can we solve it? Should we create indexes/refelction over these columns.

Please advise.

1 Like

Your query needs to be modified as below. The existing query will scan all partitions record by record evaluating the between clause. Instead you want your query to range scan partitions even if it does not prune paritions (due to your partitioning scheme). See below:

select * from tableName where isin=‘isin1’
and ‘2020-05-02’ >= from_dt and ‘2020-05-02’ <= to_dt

Can you also suggest why query with in clause also takes lot of time

Query lile

select * from tablebame where isin in (‘isin1’, isin2’,…). This also takes lot of time

1 Like

Do we need to create reflection/index on these column for such queries to work faster. Infact I created one Dimension on the column. After this, whenever I query with column isin as filter, it started giving me “Error setting up remote fragment execution”

Also uploading the profile when i got above error

e5511542-2c5d-4464-99aa-7019fe294f71.zip (150.8 KB)

1 Like