I am pulling in data from ES. An issue I have been having for the past few months (and it has made Dremio unusable) is that I am not getting updated data in these different datasets. When I look at the same data in Kibana (pulling from the exact same ES location), I see all of the updated data.
I was hoping this version of Dremio would solve the problem, but unfortunately I just upgraded and ran into the same issue. My one last hope is that I have configured something wrong… Would someone be able to help me troubleshoot this?
Can you say more about what query you are pushing into ES?
Are you using Data Reflections? If so, those are updated according to the refresh policy you have configured and will not always have the freshest data.
After I add my sources, I click on one of the folders. Once I get into that table I just sort the date column in descending order. On some of the datasets the newest data is 3-4 months old, while in reality there should be data from today. Let me know what other info I can provide…
When you open a dataset we preview the data by default (instead of showing you the entire dataset) for performance reasons. Can you try “Running” the query (top right icon)? You can also double-check by applying a filter and running that as well.
So… is it a known issue with this new update that, after saving a dataset, trying to RUN a query just keeps spinning? I have tried on three occasions and the “spinning wheel” keeps going for 10+ minutes. The initial RUN before saving the dataset only took 30 seconds or so.
Just to add to this, the UX of the Preview button definitely needs to be reviewed and challenged. Even when limiting the number of records retrieved, it is very slow (in my case I limit to 10 records and preview queries on some datasets - around 100 million rows - still take 30-90 seconds). To my mind the concept of Preview is completely pointless as it stands (perhaps we could do something like SQL Server / Oracle, where by default the client just runs SELECT * FROM TABLE LIMIT 1000 and it is quite obvious to the Data Analyst what’s going on).
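For reference, this is the kind of explicit, visible row cap being described; the table name below is just a placeholder and the exact syntax differs per engine:

```sql
-- SQL Server style: the "show me the top 1000 rows" convention
SELECT TOP 1000 * FROM my_table;

-- LIMIT-keyword style used by many other engines
SELECT * FROM my_table LIMIT 1000;
```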
To be fair to the Dremio team, maybe this is something that we, the community, should go and implement in the code? (Shouldn’t be that hard - or maybe not.)
BTW, the number of issues generated by this simple UX/usability problem with the “Preview” button is shocking (assuming this is indeed a UX problem and not behaviour driven by design and/or complexity).
@akikax The purpose of previewing the data (instead of showing you the entire dataset by default) is UX performance. If it takes 30s just to preview the data, it may take 3+ minutes to run the full query, which is an even worse experience just to see some records.
If your retrieval is very slow, a job profile would help us investigate.
Lastly, Dremio does exactly what you said about SQL Server/Oracle. When you are previewing an RDBMS source, we automatically insert a SELECT TOP or WHERE ROWNUM <= to limit the data retrieval. You can verify this by going to the raw query profile and looking at the query we push down.
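For illustration only (the table names below are made up; the actual SQL can be confirmed in the raw query profile as described above), the injected limit looks roughly like this:

```sql
-- SQL Server source: Dremio caps the rows with a TOP clause
SELECT TOP 1000 * FROM dbo.orders;

-- Oracle source: the same cap is expressed as a ROWNUM predicate
SELECT * FROM orders WHERE ROWNUM <= 1000;
```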
@anthony, you are completely missing the point here (you are looking at it with a techy hat). Hear me out for a second:
Dremio is NOT doing exactly what SQL Server, Oracle, or similar client tools do. When using any of those tools it is quite clear, from a UX perspective, what the product is doing (see image below). There is no need to look at a query profile to learn how the function works or what it is doing, because it is quite OBVIOUS where the result set came from (clearly the user knows what’s going on).
With the preview functionality in Dremio I am left wondering: is this a LIMIT 100, or 10000? Is it a completely different thing? Is it being sorted? What is it?
Finally, trust me on this one: I ran multiple tests and was able to fully replicate the issue. Doing a preview on a SQL Server table with 35 million records took over a minute, whereas the top 1000 took only seconds in SQL Server itself. (Yeah, maybe there is an issue with the product here, but that is potentially a technical issue, not a usability issue.) You have TWO issues going on here (a technical one and a UX one).
Perhaps something as simple as renaming the Preview button to “Preview N”, where N is the user-defined record count configured in Dremio’s settings, would do the trick.
Sorry for misinterpreting your post. I understand where you are coming from and you do make a good point - it is not as clear and obvious as in other tools (comparing with your screenshot) what Preview actually does. I have funneled your feedback to our PM, thanks.
Make sure the keyword “LIMIT N” is included in the preview query, e.g. SELECT * FROM table LIMIT 1000.
That, together with the renaming suggestion above, would make it quite obvious to everyone what the intended behaviour of the tool is.
BTW, kudos to the Dremio team. Again, I feel this is something that we, the community, should go and enhance ourselves. We are benefiting from this awesome open source initiative; it is the least we can do.
Hey @akikax, thanks for the feedback and sorry you’re running into issues here. Improving the preview experience (across all the sources we support) is top of mind for us and central to the Dremio experience.
A few things about how previews work:
Dremio doesn’t do previews using the standard SQL LIMIT because this ends up being expensive if you have a join/aggregation/sort in the virtual dataset/query. In most cases, this would require us to complete that whole operation (i.e. blocking operations) before doing a LIMIT due to SQL semantics.
Instead, we put a limit on how much data we scan at the SCAN operator level (i.e. scan 1000 rows from Oracle). This has the advantage of being quicker in scenarios where you have blocking operations. The downside is that sometimes you get empty results because, within the sample sets, join keys don’t match, values in a filter are not present, etc. (see the rough sketch below).
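To make that concrete, here is a rough sketch (table and column names are invented, and the 10000 figure mirrors the default per-thread leaf limit mentioned below) of why a plain LIMIT doesn’t help once blocking operations are involved, while a scan-level cap does:

```sql
-- Plain LIMIT: the join and aggregation must consume ALL rows of both inputs
-- before the LIMIT can be applied, because LIMIT acts on the final result.
SELECT c.region, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.region
LIMIT 1000;

-- Scan-level cap: each input is sampled before the blocking operations,
-- which is conceptually closer to the following and returns much faster.
SELECT c.region, SUM(o.amount) AS total
FROM (SELECT * FROM orders    LIMIT 10000) o
JOIN (SELECT * FROM customers LIMIT 10000) c ON c.id = o.customer_id
GROUP BY c.region;
-- Trade-off: the two samples may not share join keys (or may miss filter
-- values), which is why previews sometimes come back empty.
```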
To improve performance, one option is decreasing the sample size we use for previews. You can configure this by going to Admin > Advanced Settings > Dremio Support and entering this key: planner.leaf_limit_size. It defaults to 10000 records per thread; we determine thread count based on dataset size. You can also configure thread count using planner.leaf_limit_width; however, I’d recommend starting with the size parameter first.
I’ve also included your feedback in our internal repo; appreciate all the details. We have some thoughts around improving preview caching, better sample size calculations per data source type, and providing more fine-grained control over when and how we do previews.
Hope this helps, let me know if you have any questions.