I’m running Dremio on a regular Digital Ocean Droplet and the “SF Incidents” query from the tutorial is taking 4-5+ minutes. The same query took just a few seconds running locally on my Mac, so I suspect something is off with my setup. Here’s the profile:
You can see it’s the “JSON_SUB_SCAN” operator, specifically the “4m47s” wait time (which is the same for the average, min, and max).
Does that tell me anything about my setup that I should change to improve this? I assume my Mac is a little more powerful than a standard DO droplet, but not by much, and the internet connection on my droplet is confirmed to be very fast (800 Mbps down / 200 Mbps up).
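(For anyone checking the same thing, something like speedtest-cli gives a quick read on the droplet’s raw throughput; this is just one way to measure it, not necessarily how I did:)

sudo apt-get install -y speedtest-cli    # or: pip install speedtest-cli
speedtest-cli --simple                   # prints ping, download, and upload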
EDIT: I just resized my instance to 4 GB of RAM and it’s maybe a little faster for some queries but not others. Do I just need to throw RAM at the problem?
What are the exact specs (CPU/memory) of the Digital Ocean box where you are doing your testing? I’m guessing the JSON_SUB_SCAN may be taking a while due to an inability to multi-thread.
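If it’s easier, the output of these standard Linux commands from the droplet would tell us what we need (nothing Dremio-specific here):

lscpu       # CPU model, core count, threads per core
free -h     # total and available memory
nproc       # number of usable cores/threads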
Totally makes sense. I’m mostly trying to figure out the “minimum setup where this will work for some light testing”. My MacBook seems to work fine, so for POC purposes it would be great to get a hosted version basically at parity with my local setup.
Just upgraded to these specs:
8 GB RAM
4 vCPUs
25 GB disk
And I’m still getting really laggy, long queries for the same reason: over 4 minutes for simple stuff that takes 4-5 seconds locally.
It is a shared machine, though. Do you think it just won’t work with that setup? Maybe I need to go to a dedicated-CPU droplet or something? It could be the specs, but upgrading pretty significantly twice now has made roughly no difference, so I’m wondering if something else is going on, like multi-threading somehow being disabled on my setup.
In case it’s helpful, here’s my cpuinfo at the moment:
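(Side note, since it’s a shared machine: I’m told the “st” column from vmstat shows CPU steal, i.e. time the hypervisor gives to other tenants, so a quick check like this should show whether noisy neighbors are a factor:)

vmstat 1 5    # watch the last column ("st"); consistently high steal = contended host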
Just upgraded again, to a dedicated-CPU droplet with these specs:
8 GB RAM
4 vCPUs
50 GB disk
And it’s still the same situation. That should be a better setup than my laptop, so I feel like something is actually wrong with my config, or possibly with the program itself.
OK, I had a minor breakthrough at least. I had previously been running the newest tarball of Dremio (2.0.5) using these instructions, which were written for v1.4.4 but worked fine:
I had a hunch, so I destroyed my Droplet and started a new one with 8 GB of RAM, 4 vCPUs, and 50 GB of disk (the same specs I had been using unsuccessfully), but installed v1.4.4 instead, and things are much better. The query from the SF Incidents tutorial now takes about 30-40 seconds, which is still slow but similar to my laptop (about 10-20 seconds).
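For reference, the Gist’s steps boil down to roughly this (the download URL is a placeholder and the paths are from memory, so treat this as approximate):

wget -O dremio-community.tar.gz https://download.dremio.com/...   # the community-server tarball
tar -xzf dremio-community.tar.gz
cd dremio-community-*/
./bin/dremio start      # single node: coordinator + executor in one process, UI on port 9047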
Is it possible that:
A. the setup instructions I’m using (the Gist) are leading me to do something in the config that is breaking things,
OR
B. something in v2.0.5 is materially different and less performant on Ubuntu 16.x machines, or with some other aspect of my dev setup?
I ran another droplet with the same specs, on CentOS with 2.0.5 installed via the RPM, and got the same bad performance as my previous 2.0.5 installations. Upgrading the CentOS installation to:
32 GB RAM
16 vCPUs
160 GB disk
CPU Optimized Droplet
brings 2.0.5 on CentOS to a good state, with queries taking 2-10 seconds.
I guess this answers my question? Is that essentially the “minimum viable droplet”?
I can’t replicate getting this setup to run quickly. I’ve rebuilt it a few times to be sure, and it becomes slow every time. I thought I had the solution, but clearly something else is going on.
32 GB RAM
16 vCPUs
160 GB disk
CPU Optimized Droplet
Still extremely slow. As I mentioned above, I got this setup to work well one time, but after rebuilding the same setup a few more times, none of the subsequent installations works as expected.
The only other info I have is that it’s only JSON that has this issue, which makes sense since it’s the JSON_SUB_SCAN that is the blocker.
Man, I’ve really tried everything I can think of at this point: I’ve installed every version of Dremio going back to 1.4.4, tried every Droplet size possible, CentOS vs. Ubuntu, etc. Is there some configuration I’m missing? This is my simplified build process; maybe I’m missing something? In every single configuration and setup, the JSON queries take 30 seconds to 10 minutes, except for two times when, for some reason, the server could process them quickly. I thought I had a deterministic explanation for that, but I couldn’t repeat it.
sudo yum install java-1.8.0-openjdk
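The rest is roughly just the standard RPM install and service start from the Dremio docs (the RPM filename/version below is approximate):

sudo rpm -ivh dremio-community-2.0.5*.rpm   # the matching version's community RPM
sudo systemctl start dremio                 # the RPM installs a "dremio" service
sudo systemctl status dremio                # confirm it's up, then hit the UI on port 9047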
I know you mentioned you are running a “query from the SF Incidents tutorial”. I assume you are following our tutorial here? If so, that dataset is actually quite small (50 MB, 150,491 records), small enough that your original setup (4 vCPU / 8 GB, as I recall) should run it in less than 10 seconds, and small enough that multi-threading isn’t necessary.
A big random guess here: you are running Dremio in DigitalOcean and trying to connect to an AWS S3 dataset. Those two cloud providers are direct competitors. Maybe some type of network throttling?
Here is the direct download link to the SF JSON. It may be worth downloading it, uploading it into your Dremio instance on DigitalOcean via our UI, and testing that directly rather than going through AWS S3.
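One quick way to separate network from Dremio is to time the raw download from the Droplet itself (the URL below is a placeholder for the link above):

curl -o /dev/null -w 'total: %{time_total}s  avg speed: %{speed_download} B/s\n' \
  https://s3.amazonaws.com/<bucket>/<sf-incidents>.json
# slow here = the network path to S3; fast here = the bottleneck is elsewhere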
I agree it’s weird! I’ll post a profile below. It’s only JSON, and it’s always the “JSON_SUB_SCAN” operation taking up all the time. I’m going to try putting that data source in my own S3 bucket, as well as uploading it into Dremio on Digital Ocean directly via the UI, and see if it makes a difference. I’ll also share the full profile, not just the text.
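Roughly the plan for the S3 test (bucket and file names here are just placeholders):

aws s3 cp SF_incidents.json s3://my-own-test-bucket/SF_incidents.json   # copy the tutorial JSON into a bucket I control, then add that bucket as a Dremio source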
@anthony here is the profile of the query from the tutorial that I just ran.
Of course, right when I’m about to post the profile, it runs the fastest it has run in 10 deployments, at 1 minute 6 seconds, but that’s still 10x longer than it takes on my Mac, so I assume it still shares the problem this thread is about.
Wow, OK, I think I can stop the search. Running this same data from my own S3 bucket solved the problem 100%. What a wild goose chase!
I think something is up with serving the sample data from your public S3 bucket: either Dremio, S3, or Digital Ocean (it must be them?) may be rate limiting it. That’s crazy. It’s one of the weirdest bugs I’ve encountered; I should have tried my own dataset way earlier in this process.