On our server, we get this error from time to time:
2023-05-15 08:27:07,244 [pool-15-thread-2] ERROR c.d.s.c.CacheEvictionThread - eviction-thread exiting on error, cm-state ERROR
java.lang.IllegalStateException: request to lock already locked key 1764785267 in in-flight-key-map.
at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
at com.dremio.service.cachemanager.CacheMemoryLockController.lambda$getInFlightRefOnKey$0(CacheMemoryLockController.java:357)
at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
at com.dremio.service.cachemanager.CacheMemoryLockController.getInFlightRefOnKey(CacheMemoryLockController.java:350)
at com.dremio.service.cachemanager.CacheEvictionThread.doWork(CacheEvictionThread.java:553)
at com.dremio.service.cachemanager.CacheEvictionThread.run(CacheEvictionThread.java:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
2023-05-15 08:27:07,394 [pool-15-thread-1] INFO c.d.s.cachemanager.CacheManager - check-point thread exiting, cm-state ERROR
After some research, this is probably related to the way the Dremio cache manager computes hash keys for evicted entries (or files?):
- In CacheMemoryLockController#getInFlightRefOnKey(), the function CacheTranslationKey#hashCodeUnsignedInt() is called on the key.
- In short, hashCodeUnsignedInt() generates a hash key on 31 bits (Math.abs() removes 1 bit).
- This causes a HUGE probability of collision on the key space (even more so when using only 31 bits)!
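To illustrate the 31-bit point, here is a minimal sketch. It assumes hashCodeUnsignedInt() boils down to Math.abs() on a 32-bit hash, as described above; AbsHashSketch and absHash are hypothetical names, not Dremio code.

```java
// Hypothetical stand-in for hashCodeUnsignedInt(); the real Dremio code may differ.
class AbsHashSketch {
    // Assumed shape: fold a 32-bit hash into a non-negative int with Math.abs().
    static int absHash(int rawHash) {
        return Math.abs(rawHash);
    }

    public static void main(String[] args) {
        // Two distinct 32-bit hashes end up on the same 31-bit key.
        System.out.println(absHash(42) == absHash(-42));        // true
        // Math.abs(Integer.MIN_VALUE) overflows and stays negative.
        System.out.println(absHash(Integer.MIN_VALUE) < 0);     // true
    }
}
```

So every positive hash shares its key with its negative counterpart, which halves the key space compared to a plain 32-bit hash.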
I suggest changing the hash key to 128 bits, or at least 64 bits (or 63 if you absolutely want positive values). For 64 bits, this would mean changing the key type used by the map from Integer to Long.
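A rough sketch of what a Long-keyed map could look like. Everything here is an assumption for illustration: LongKeyInFlightSketch, hash64, lock and unlock are hypothetical names, FNV-1a is just one example of a 64-bit hash, and the real in-flight-key-map certainly stores more than a Boolean.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Dremio code.
class LongKeyInFlightSketch {
    // In-flight map keyed by a 64-bit hash instead of a 31-bit one.
    private final ConcurrentHashMap<Long, Boolean> inFlight = new ConcurrentHashMap<>();

    // FNV-1a 64-bit, used only as an example of a 64-bit hash function.
    static long hash64(byte[] data) {
        long h = 0xcbf29ce484222325L;   // FNV offset basis
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;        // FNV prime
        }
        return h;
    }

    // Mimics the check seen in the stack trace: locking an already locked key fails.
    void lock(long key) {
        inFlight.compute(key, (k, v) -> {
            if (v != null) {
                throw new IllegalStateException("request to lock already locked key " + k);
            }
            return Boolean.TRUE;
        });
    }

    void unlock(long key) {
        inFlight.remove(key);
    }

    public static void main(String[] args) {
        LongKeyInFlightSketch map = new LongKeyInFlightSketch();
        long a = hash64("file-a:0".getBytes(StandardCharsets.UTF_8));
        long b = hash64("file-b:0".getBytes(StandardCharsets.UTF_8));
        map.lock(a);
        map.lock(b);   // different inputs give different 64-bit keys here
        map.unlock(a);
        map.unlock(b);
        System.out.println("locked and unlocked 2 keys");
    }
}
```

A 64-bit key does not make collisions impossible, it only makes them far less likely, so the map would still need to cope with two different entries hashing to the same key.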
As you can see with the test below, we get a collision on average after:
- ~6500 keys using a 31-bit hash code
- ~8500 keys using a 32-bit hash code
- no collision at all (at least with my test below) for 63-bit and 64-bit hash codes
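For comparison, the usual birthday-problem approximation gives the expected number of uniformly random keys before the first collision in a space of 2^bits values as roughly sqrt(pi/2 * 2^bits). The sketch below just evaluates that formula; BirthdayEstimate and expectedDrawsToCollision are names made up for this post.

```java
// Hypothetical helper evaluating the birthday-problem approximation.
class BirthdayEstimate {
    // Approximate expected number of random draws before the first collision
    // among 2^bits equally likely values: sqrt(pi/2 * 2^bits).
    static double expectedDrawsToCollision(int bits) {
        return Math.sqrt(Math.PI / 2.0 * Math.pow(2.0, bits));
    }

    public static void main(String[] args) {
        for (int bits : new int[] {31, 32, 63, 64}) {
            System.out.printf("%d bits: ~%.0f keys%n", bits, expectedDrawsToCollision(bits));
        }
    }
}
```

This gives roughly 58,000 keys for 31 bits, 82,000 for 32 bits, and several billion for 63 and 64 bits. The estimate assumes an empty set at the start of every run. The test below keeps one set across all tries, so its measured averages come out lower, but the orders of magnitude tell the same story.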
Code used for my tests:
package com.dpd.dremio.test;

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class HashCollision {
    public static void main(String[] args) {
        // Holds boxed Integer or Long values, depending on which variant below is enabled.
        // Note: the set is not cleared between tries, so hashes accumulate across tries.
        Set<Number> hashes = new HashSet<>();
        Random random = new Random();
        long sum = 0;       // sum of the indexes at which a collision occurred
        int count = 100;    // number of tries
        int effCount = 0;   // number of tries that ended with a collision
        int data = 100000;  // maximum number of keys generated per try
        for (int j = 0; j < count; ++j) {
            for (int i = 0; i < data; ++i) {
                // long hashCodeUnsignedInt = random.nextLong(); // 64 bits
                // long hashCodeUnsignedInt = Math.abs(random.nextLong()); // 63 bits
                // int hashCodeUnsignedInt = random.nextInt(); // 32 bits
                int hashCodeUnsignedInt = Math.abs(random.nextInt()); // 31 bits, "Dremio code"
                if (!hashes.add(hashCodeUnsignedInt)) {
                    System.out.printf("%03d - collision at index %d%n", j, i);
                    sum += i;
                    effCount++;
                    break;
                }
            }
        }
        if (effCount > 0) {
            System.out.printf("collision average %.3f (%d data - %d tries)%n", (double) sum / effCount, data, effCount);
        } else {
            // Report the number of tries actually run (count), not effCount, which is 0 here.
            System.out.printf("no collision detected after %d tries for %d keys%n", count, data);
        }
    }
}
And its output (test with int and abs, so 31 bits):
000 - collision at index 54759
001 - collision at index 28397
002 - collision at index 5088
...
096 - collision at index 309
097 - collision at index 1981
098 - collision at index 2357
099 - collision at index 4917
collision average 6722.700 (100000 data - 100 tries)
best regards
fred