I am completely new to Big data and so to dremio. Apologies if my queries are basic.
I pushed my data at a certain location in hadoop. This has almost 900 million records distributed automatically by Spark API in almost 1500 file. Each records has appx 470 columns mix of String and integer and date fields
Now I created a VDS in dremio pointing to this HDFS location. Now, If I run a simple query like
select * from tableName where isin=‘isin1’, it runs very fine and comes back in 1-2 minutes but as soon as I modify this query to lookup multiple isin, it takes almost 50 minutes.
If i further modify the query to include date fields in filter, it further worsens.
select * from tableName where isin=‘isin1’ and ‘2020-05-02’ between from_dt and to_dt
Noe:- The VDS doe snot have any reflection or indexes till now
how can we solve it? Should we create indexes/refelction over these columns.
Your query needs to be modified as below. The existing query will scan all partitions record by record evaluating the between clause. Instead you want your query to range scan partitions even if it does not prune paritions (due to your partitioning scheme). See below:
select * from tableName where isin=‘isin1’
and ‘2020-05-02’ >= from_dt and ‘2020-05-02’ <= to_dt
Do we need to create reflection/index on these column for such queries to work faster. Infact I created one Dimension on the column. After this, whenever I query with column isin as filter, it started giving me “Error setting up remote fragment execution”