We are running a legacy version of Dremio Community Edition, build 3.0.6-201812082352540436-1f684f9, and are hitting a "Too many open files" error:
java.lang.RuntimeException: java.nio.file.FileSystemException: /opt/dremio/data/db/search/dac-namespace/core/_3giz_Lucene54_0.dvd: Too many open files
I think it is related to the topic Error - Too many open files and to the fact that our Dremio instance has been running for a long time (more than a year).
I could not find any other relevant results on Google. For now, we have raised the NOFILE limit and restarted Dremio.
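For reference, a quick way to double-check that the raised limit actually applies to the running process (a rough sketch; it assumes a Linux host and, for the second command, a systemd unit named dremio, so adjust names as needed):

# limits seen by the running JVM (assumes a single dremio process)
cat /proc/$(pgrep -f dremio | head -n 1)/limits | grep "open files"
# if Dremio is managed by systemd, the limit configured on the unit
systemctl show dremio -p LimitNOFILE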
My question is: why does Dremio open so many files (more than 65535)? Is this expected?
We found that a very large number of pipe/eventpoll descriptors are open (nearly 60000):
19930 [eventpoll]
39860 pipe
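Counts like the ones above can be gathered with something along these lines (a sketch; the pgrep pattern is an assumption, adjust it to however your Dremio process is named):

# group the open descriptors of the dremio JVM by the lsof NAME column
DREMIO_PID=$(pgrep -f dremio | head -n 1)
lsof -p "$DREMIO_PID" | awk '{print $NF}' | sort | uniq -c | sort -rn | head
ls /proc/"$DREMIO_PID"/fd | wc -l   # total number of open fds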
Can we find out which operations cause so many file descriptors to be opened, and is there a fix for this in a later version?
By the way, we have a cron job that queries Dremio once an hour through ODBC, and I found many log entries showing that retrieving an S3 bucket failed (IOException), plus a few entries (once per hour) about a JDBC query failing because a column data type is not supported (the JDBC failure is probably unrelated to the fd issue).
The earliest log entry I can find is from 12/01, for example:
grep IOException syslog.7 | grep dremio | wc -l
5466
grep IOException syslog.7 | grep dremio | head -n 1
Dec 1 06:34:11 dremio[26910]: java.io.IOException: Unable to retrieve bucket named xxxx.
grep IOException syslog.7 | grep dremio | tail -n 1
Dec 2 05:54:44 dremio[26910]: Caused by: java.io.IOException: Unable to find bucket named xxxx.
The full stack trace is:
java.io.IOException: Unable to retrieve bucket named xxxx.
com.dremio.plugins.util.ContainerFileSystem.getFileSystemForPath(ContainerFileSystem.java:305)
com.dremio.plugins.util.ContainerFileSystem.getFileStatus(ContainerFileSystem.java:479)
com.dremio.exec.store.dfs.FileSystemWrapper.getFileStatusSafe(FileSystemWrapper.java:304)
com.dremio.exec.store.dfs.FileSystemPlugin.checkMultifileStatus(FileSystemPlugin.java:734)
com.dremio.exec.store.dfs.FileSystemPlugin.checkReadSignature(FileSystemPlugin.java:694)
com.dremio.exec.catalog.SourceMetadataManager.refreshDataset(SourceMetadataManager.java:514)
com.dremio.exec.catalog.ManagedStoragePlugin.refreshDataset(ManagedStoragePlugin.java:395)
com.dremio.exec.catalog.CatalogImpl.refreshDataset(CatalogImpl.java:455)
com.dremio.exec.catalog.DelegatingCatalog.refreshDataset(DelegatingCatalog.java:164)
com.dremio.exec.planner.sql.handlers.direct.RefreshTableHandler.toResult(RefreshTableHandler.java:58)
com.dremio.exec.planner.sql.handlers.commands.DirectCommand.plan(DirectCommand.java:61)
com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:303)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_292]
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_292]
java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
Caused by: java.io.IOException: Unable to find bucket named xxxx.
At the same time, the S3 bucket can be accessed from the VM successfully.
We switched the S3 source from an old bucket to a new one about a month ago via the GUI, without restarting Dremio. I'm not sure whether this is related to the issue.
I don't think we have set "store.accurate.partition_stats", since the support keys list on the Support GUI page is empty.
The fd leak (let's call the underlying issue that) is the reason Dremio stopped working today. After the restart, the fd count has grown only slightly over the past 2-3 hours, from about 1400 to 1500 (maybe 30-50 more), even though the IOExceptions above have already occurred 2-3 times.
I think the cron job is one way the IOException gets triggered. The actual work Dremio does is an internal refresh of the S3/Postgres metadata, the same thing its own periodic refresh does (interval metadata refresh).
The fd types that grew the most are "pipe" and "eventpoll", at a 2:1 ratio:
48 [eventpoll]
96 pipe
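To watch the growth over time, something like the following loop could log the counts periodically (a rough sketch; the pgrep pattern, 30-second interval, and output file are illustrative):

DREMIO_PID=$(pgrep -f dremio | head -n 1)
while true; do
  total=$(ls /proc/"$DREMIO_PID"/fd | wc -l)
  epoll=$(lsof -p "$DREMIO_PID" | grep -c '\[eventpoll\]')
  pipes=$(lsof -p "$DREMIO_PID" | grep -c ' pipe$')
  echo "$(date +%Y-%m-%d,%H:%M:%S),$total,$epoll,$pipes" >> fd_counts.csv
  sleep 30
done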
I'm not sure whether there is any relationship between the two issues; they just happened at the same time.
The code behind the stack trace suggests the IOException may be expected, and I'm not sure whether any real I/O is involved.
I simply decompiled the class from the stack trace:
private ContainerHolder getFileSystemForPath(Path path) throws IOException {
  String name = getContainer(path);
  ContainerHolder container = (ContainerHolder) this.containerMap.get(name);
  if (container == null) {
    try {
      synchronized (this.refreshLock) {
        // re-check under the refresh lock
        container = (ContainerHolder) this.containerMap.get(name);
        if (container == null) {
          container = getUnknownContainer(name);
          if (container == null) {
            // source of "Unable to find bucket named xxxx."
            throw new IOException(String.format("Unable to find %s named %s.", new Object[] { this.containerName, name }));
          }
          // copy-on-write update of the container map
          ImmutableMap.Builder<String, ContainerHolder> newMap = ImmutableMap.builder();
          newMap.putAll((Map) this.containerMap);
          newMap.put(name, container);
          this.containerMap = newMap.build();
        }
      }
    } catch (Exception ex) {
      // wraps the exception above into "Unable to retrieve bucket named xxxx."
      throw new IOException(String.format("Unable to retrieve %s named %s.", new Object[] { this.containerName, name }), ex);
    }
  }
  return container;
}
@qingdaojunzuo The problem is that you are on such an old version of Dremio that the code base has entirely changed. Unrelated to this issue, is there any reason you are still on version 3.x?
@balaji.ramaswamy Yes, the code base has changed a lot. Since we have built so many PDS/VDS on this version, moving to a new version and retesting would be quite expensive for us right now, so…
We may plan an upgrade in the long term, but not for now.
We have been monitoring the leak over the past few days and found that the pipe/eventpoll growth seems to accompany the following error when refreshing Redshift data: a socket timeout during metadata refresh.
date, time, total, epoll, pipe
2022-12-12,05:24:46,1978,201,402
2022-12-12,05:25:18,1987,202,404
Dec 12 13:25:15 ip-10-2-41-53 dremio[7009]: 2022-12-12 13:25:15,593 [metadata-refresh-Redshift xxx] WARN c.d.e.store.jdbc.JdbcSchemaFetcher - Took longer than 5 seconds to query row count for xxx.xxx, Using default value of 1000000000.
java.sql.SQLTimeoutException: [Amazon](500053) The TCP Socket has timed out while waiting for response
It seems the row-count statistics query against the table failed because it timed out. I found the following parameter that may be related:
public static final String JDBC_ROW_COUNT_QUERY_TIMEOUT = "store.jdbc.row_count_query_timeout_seconds";
// the arguments appear to be (name, max, default); the 5L default matches the "longer than 5 seconds" warning above
public static final TypeValidators.LongValidator JDBC_ROW_COUNT_QUERY_TIMEOUT_VALIDATOR = (TypeValidators.LongValidator) new TypeValidators.PositiveLongValidator("store.jdbc.row_count_query_timeout_seconds", 2147483647L, 5L);
Is there a way to configure this value from a configuration file or in some other way, without a code change, so that we can test whether it is related to the issue?
I know the codebase has changed a lot; I just want to see whether the value can be set without a code change.