Cache manager eviction thread exit

From time to time, our server hits this error:

2023-05-15 08:27:07,244 [pool-15-thread-2] ERROR c.d.s.c.CacheEvictionThread - eviction-thread exiting on error, cm-state ERROR
java.lang.IllegalStateException: request to lock already locked key 1764785267 in in-flight-key-map.
	at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
	at com.dremio.service.cachemanager.CacheMemoryLockController.lambda$getInFlightRefOnKey$0(CacheMemoryLockController.java:357)
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
	at com.dremio.service.cachemanager.CacheMemoryLockController.getInFlightRefOnKey(CacheMemoryLockController.java:350)
	at com.dremio.service.cachemanager.CacheEvictionThread.doWork(CacheEvictionThread.java:553)
	at com.dremio.service.cachemanager.CacheEvictionThread.run(CacheEvictionThread.java:193)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
2023-05-15 08:27:07,394 [pool-15-thread-1] INFO  c.d.s.cachemanager.CacheManager - check-point thread exiting, cm-state ERROR

After some research, it is probably related to the way the Dremio cache manager computes hash keys for evicted entries (or files?):

  1. In CacheMemoryLockController#getInFlightRefOnKey(), the function CacheTranslationKey#hashCodeUnsignedInt() is called on the key.
  2. In short, hashCodeUnsignedInt() generates a 31-bit hash key (Math.abs() drops 1 bit).
  3. This causes a HUGE probability of collision across the key space (even more so with only 31 bits)!
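
To illustrate the failure mode described in the steps above, here is a minimal sketch, not Dremio's actual code. The class InFlightLockSketch and its methods are hypothetical; the only assumptions taken from the stack trace are that the in-flight map is a ConcurrentHashMap updated through compute() and that a state check rejects a key that is already present. The strings "Aa" and "BB" are used because they are different keys with the same String.hashCode().

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an in-flight key map; not Dremio's actual code.
class InFlightLockSketch {
  // The map is keyed by the reduced hash, not by the full key (assumption).
  private final ConcurrentHashMap<Integer, Object> inFlight = new ConcurrentHashMap<>();

  // Reduces a key to a (nominally) 31-bit value, mirroring the Math.abs() idea.
  static int hash31(Object key) {
    return Math.abs(key.hashCode());
  }

  // Registers the key as in flight; fails if another key with the same hash is present.
  void lock(Object key) {
    inFlight.compute(hash31(key), (hash, owner) -> {
      if (owner != null) {
        throw new IllegalStateException("request to lock already locked key " + hash);
      }
      return key;
    });
  }

  // Releases the entry for this key's hash.
  void unlock(Object key) {
    inFlight.remove(hash31(key));
  }

  public static void main(String[] args) {
    InFlightLockSketch locks = new InFlightLockSketch();
    locks.lock("Aa");
    try {
      locks.lock("BB"); // a different key that hashes to the same value as "Aa"
    } catch (IllegalStateException e) {
      System.out.println("collision: " + e.getMessage());
    }
  }
}
```

In this sketch the second lock() fails even though "BB" was never locked, which is the same shape as the exception in the log: two distinct keys are treated as one because only their hash is stored.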

I suggest changing the hash key to 128 bits, or at least 64 bits (or 63 if you absolutely want positive values). For 63 or 64 bits, that means changing the type used by the map from Integer to Long.
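
As a rough sketch of what a non-negative 63-bit key could look like: the example below hashes a byte representation of the key with 64-bit FNV-1a and clears the sign bit. FNV-1a is only a stand-in chosen for brevity, and the class Hash63Sketch, its methods, and the sample key strings are hypothetical; how the real key would be turned into bytes is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; FNV-1a is a stand-in for whatever 64-bit hash is preferred.
class Hash63Sketch {
  private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
  private static final long FNV_PRIME = 0x100000001b3L;

  // 64-bit FNV-1a over the given bytes.
  static long fnv1a64(byte[] data) {
    long hash = FNV_OFFSET_BASIS;
    for (byte b : data) {
      hash ^= (b & 0xff);
      hash *= FNV_PRIME;
    }
    return hash;
  }

  // Clears the sign bit so the result is always non-negative (63 usable bits).
  // Unlike Math.abs(), this has no Long.MIN_VALUE edge case.
  static long hash63(byte[] data) {
    return fnv1a64(data) & Long.MAX_VALUE;
  }

  public static void main(String[] args) {
    // Made-up key strings, for illustration only.
    String[] keys = {"file-a:0", "file-a:1", "file-b:0"};
    for (String key : keys) {
      long h = hash63(key.getBytes(StandardCharsets.UTF_8));
      System.out.printf("%s -> %d%n", key, h);
    }
  }
}
```

A map keyed by these values would be declared with Long instead of Integer. Collisions remain possible with any fixed-width hash, so storing the full key alongside the hash would be the stricter option.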

As you can see with the test below, we get a collision on average after roughly:

  • ~6500 keys using a 31-bit hash code
  • ~8500 keys using a 32-bit hash code
  • no collision at all (at least with my test below) using a 63-bit or 64-bit hash code

Code used for my tests:

package com.dpd.dremio.test;

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class HashCollision {

  public static void main(String[] args) {
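    // Note: this set is shared by all tries and never cleared, so later tries collide sooner.
    Set<Number> hashes = new HashSet<>();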
    Random random = new Random();
    long sum = 0;
    int count = 100;
    int effCount = 0;
    int data = 100000;
    for (int j = 0; j < count; ++j) {
      for (int i = 0; i < data; ++i) {
//        long hashCodeUnsignedInt = random.nextLong();  // 64 bits
//        long hashCodeUnsignedInt = Math.abs(random.nextLong()); // 63 bits
//        int hashCodeUnsignedInt = random.nextInt();  // 32 bits
        int hashCodeUnsignedInt = Math.abs(random.nextInt());  // 31 bits, "Dremio code"
        if (!hashes.add(hashCodeUnsignedInt)) {
          System.out.printf("%03d - collision at index %d%n", j, i);
          sum += i;
          effCount++;
          break;
        }
      }
    }

    if (effCount > 0) {
      System.out.printf("collision average %.3f (%d data - %d tries)%n", (double) sum / effCount, data, effCount);
    } else {
      System.out.printf("no collision detected after %d tries for %d keys%n", count, data);
    }
  }

}

And its output (test with int and abs, so 31 bits):

000 - collision at index 54759
001 - collision at index 28397
002 - collision at index 5088
... 
096 - collision at index 309
097 - collision at index 1981
098 - collision at index 2357
099 - collision at index 4917
collision average 6722.700 (100000 data - 100 tries)
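
As a rough cross-check of these numbers, the birthday approximation says that with N equally likely values, a first collision is expected after about sqrt(pi/2 * N) draws into an empty set. That is roughly 58,000 draws for 31 bits and 82,000 for 32 bits, and several billion for 63 or 64 bits. The averages measured by the test are lower because it keeps the same HashSet across all 100 tries, so each try starts with the values of the previous ones already stored. The sketch below (class and method names are hypothetical) prints the approximation and measures the 31-bit case with a fresh set per try.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Illustrative cross-check only; class and method names are hypothetical.
class BirthdayBoundSketch {

  // Birthday approximation: expected draws before the first collision
  // among 2^bits equally likely values is about sqrt(pi/2 * 2^bits).
  static double expectedDraws(int bits) {
    return Math.sqrt(Math.PI / 2.0 * Math.pow(2.0, bits));
  }

  // Draws 31-bit values into a fresh set and returns how many distinct
  // values were stored before the first repeat.
  static int drawsUntilCollision(Random random) {
    Set<Integer> seen = new HashSet<>();
    while (seen.add(Math.abs(random.nextInt()))) {
      // keep drawing until add() reports a duplicate
    }
    return seen.size();
  }

  public static void main(String[] args) {
    for (int bits : new int[] {31, 32, 63, 64}) {
      System.out.printf("%d bits: ~%.0f draws expected%n", bits, expectedDraws(bits));
    }
    Random random = new Random();
    int tries = 100;
    long sum = 0;
    for (int t = 0; t < tries; t++) {
      sum += drawsUntilCollision(random);
    }
    System.out.printf("measured (31 bits, fresh set per try): %.1f%n", (double) sum / tries);
  }
}
```

Either way, a collision is expected after tens of thousands of keys at most with a 31-bit or 32-bit hash, which a busy cache can reach, while 63 or 64 bits pushes that point into the billions.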

best regards
fred

@fred This looks like an issue we have already seen, but I also see that it has not been seen in 23.x. What version are you on?

@balaji.ramaswamy The error shown here was thrown on Dremio OSS 22.1.1.

I checked the latest v24.0.0, and the code for the cache manager is still the same (CacheTranslationKey & CacheMemoryLockController).
So if there are too many keys to process, this exception will be thrown again.

Same issue and observation here; we are using Dremio software v22.1.8 on AWS EKS.

@taylorliqixin It's fixed in v24.1; they now use a long (absolute value, so 63 bits).
There's still code using the 31-bit hash, but not on the dangerous path.