Thanks for responding @kelly. It’s not clear to me what nodes are responsible for what when it comes to reflections, but here’s the topology:
1 master node
3 coordinator nodes
3 executor nodes
All are running in Docker containers on ECS, using host networking. Each has a local EBS volume and a shared EFS attached.
The master, coordinator, and executor nodes all use the same paths in dremio.conf:
paths: {
  local: "/host", # /host is a Docker volume that points to the local EBS volume
  dist: "pdfs:///share", # /share is a Docker volume that points to the shared EFS drive
  db: "pdfs:///share/db",
  accelerator: "pdfs:///share/accelerator",
  downloads: "pdfs:///share/downloads",
  uploads: "pdfs:///share/downloads",
  results: "pdfs:///share/results",
  scratch: "pdfs:///share/scratch"
}
It wasn’t clear to me from the docs which nodes should use distributed storage and which shouldn’t. This page, https://docs.dremio.com/deployment/distributed-storage.html, only covers configuring the different distributed storage types. I’m also not clear on how service discovery works in Dremio, or how nodes discover each other.
Also, if you tell us more about your deployment we may be able to weigh in on the number of coordinators/executors you have configured. There may be a better use of resources, or maybe you’ve got it all figured out.
My assumption (untested, unverified, just a hunch) was that latency would be a problem with reflections stored on S3 rather than on a NAS. I can certainly try S3 though.
Our goal is very simple: analytic queries must respond in under a second. We have a moving window of a year’s worth of data that we need to query in an ad hoc fashion. We have a table of about 550M posts and a table of about 50M profiles that will need to be joined. There may be join tables to denormalize things like hashtags associated with a post, but I haven’t gotten that far yet with Dremio.
So whatever hardware, topology, or configuration changes we need to make that sub-second query goal happen, we’re all ears!
Sounds like the data can be partitioned by time? In what increment does your yearly window move? Daily, weekly, monthly?
1.5 will have some important features that I think may be useful for your use case, and that’s coming pretty soon. Meanwhile, I think it’s worth trying S3 instead of EFS to store your reflections. Also, I think you’ll want to create one or more aggregation reflections to support your group by queries. You may not need raw reflections at all if all queries are group by, but we can explore that as you get going.
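To illustrate why an aggregation reflection can help with group by queries over a moving time window, here is a minimal Python sketch of the general idea: pre-aggregate rows by day once, then answer group by queries from the much smaller rollup. This is not how Dremio implements reflections; the sample rows, the `likes` measure, and the function names are all hypothetical and only there for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample rows: (post_date, profile_id, likes)
POSTS = [
    (date(2017, 1, 1), "a", 3),
    (date(2017, 1, 1), "b", 5),
    (date(2017, 1, 2), "a", 2),
    (date(2018, 1, 1), "a", 7),
]


def build_rollup(rows):
    """Pre-aggregate rows by (day, profile) into [post_count, total_likes]."""
    rollup = defaultdict(lambda: [0, 0])
    for day, profile, likes in rows:
        cell = rollup[(day, profile)]
        cell[0] += 1
        cell[1] += likes
    return dict(rollup)


def likes_per_profile(rollup, window_end, window_days=365):
    """Answer 'total likes GROUP BY profile' from the rollup only,
    restricted to the window (window_end - window_days, window_end]."""
    window_start = window_end - timedelta(days=window_days)
    totals = defaultdict(int)
    for (day, profile), (_count, likes) in rollup.items():
        if window_start < day <= window_end:
            totals[profile] += likes
    return dict(totals)


ROLLUP = build_rollup(POSTS)
RESULT = likes_per_profile(ROLLUP, date(2018, 1, 1))
print(RESULT)
```

Because the rollup is keyed by day, moving the window forward only changes which daily cells are read; the raw posts never need to be rescanned for this kind of query.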