
Use SIEVE as underlying LRU? #446

Open · Swatinem opened this issue Jul 21, 2024 · 4 comments

@Swatinem
Contributor

I recently read about SIEVE (https://cachemon.github.io/SIEVE-website/, https://www.usenix.org/system/files/nsdi24-zhang-yazhuo.pdf) which is advertised as a simpler and better alternative for LRU.

It is also mentioned that it can serve as an underlying primitive for more advanced eviction algorithms.


So I wanted to give it a shot to actually implement it within moka but got sidetracked (#445 😅).

I’m still trying to fully piece together all the moving parts in my head, and I stumbled across the access-order and write-order deques. Now I’m wondering whether any other piece of functionality has a hard requirement on keeping the elements sorted in access order, as opposed to write order? SIEVE does not reorder elements at all, so it maintains write order naturally.
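For reference, here is a rough sketch of the SIEVE eviction loop as I understand it from the paper. This is purely illustrative and not tied to moka's internals; a real implementation would pair a hash index with an intrusive doubly-linked list instead of a Vec with linear lookups:

```rust
// Illustrative SIEVE sketch (not moka code). Index 0 is the oldest
// entry; new entries are pushed at the back. A hit only sets a
// `visited` bit and never reorders anything.
struct Entry<K, V> {
    key: K,
    value: V,
    visited: bool,
}

struct Sieve<K: PartialEq, V> {
    entries: Vec<Entry<K, V>>, // insertion (write) order
    hand: usize,               // next eviction candidate, persists across evictions
    capacity: usize,
}

impl<K: PartialEq, V> Sieve<K, V> {
    fn get(&mut self, key: &K) -> Option<&V> {
        let entry = self.entries.iter_mut().find(|e| &e.key == key)?;
        entry.visited = true; // the only state change on a hit
        Some(&entry.value)
    }

    fn insert(&mut self, key: K, value: V) {
        // Duplicate keys are not handled in this sketch.
        if self.entries.len() >= self.capacity {
            self.evict();
        }
        self.entries.push(Entry { key, value, visited: false });
    }

    fn evict(&mut self) {
        if self.entries.is_empty() {
            return;
        }
        loop {
            if self.hand >= self.entries.len() {
                self.hand = 0; // wrap from the newest back to the oldest
            }
            if self.entries[self.hand].visited {
                // Second chance: clear the bit and keep scanning.
                self.entries[self.hand].visited = false;
                self.hand += 1;
            } else {
                // Evict. `remove` shifts later entries down, so the hand
                // now points at the next newer entry, as in the paper.
                self.entries.remove(self.hand);
                return;
            }
        }
    }
}
```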

Does it make sense to continue with this effort? Is there a possibility this could be worthwhile?

@tatsuya6502
Member

tatsuya6502 commented Jan 14, 2025

Hi. Sorry for the long delay. I was very busy with my work between July and September last year. I also had a health problem between August and December. Now I feel better and can work on moka again.

Thanks for the information about SIEVE. I will use the Caffeine simulator to measure the performance (hit rate) of SIEVE and moka's policies (TinyLFU, LRU and Window-TinyLFU). The Caffeine simulator implements many cache policies including SIEVE and S3-FIFO, and it can run moka with each policy.

As for moka's internals, the access-order and write-order deques are used for the following purposes:

  • The access-order deque:
    • This is the LRU double-ended queue.
    • Used for:
      • All policies: LRU, TinyLFU, Window-TinyLFU.
      • The TTI (Time-To-Idle) expiration policy, to find the least recently used entry in O(1) time.
      • However, this can also be done by the hierarchical timer wheels, which were added for the per-entry expiration policy.
  • The write-order deque:
    • Used for:
      • The TTL (Time-To-Live) expiration policy, to find the oldest entry in O(1) time.
        • However, this can be done by the hierarchical timer wheels.
      • The invalidate_entries_if method.
      • The invalidate_all method.
        • If the write-order deque is disabled, the invalidate_all method will use the access-order deque instead.
    • The write-order deque is only enabled if:
      • The TTL expiration policy is configured.
      • And/or, the cache builder's support_invalidation_closures method is called (to enable invalidate_entries_if).

These deques are implemented as doubly-linked lists. A much-simplified sketch of the two roles follows below.
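To make the contrast concrete, here is a hypothetical sketch of the two roles. The names are illustrative only; moka's real deques are doubly-linked lists over the cache entries, so both operations are O(1) there:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Hypothetical record type for illustration only.
struct Written<K> {
    key: K,
    written_at: Instant,
}

// Access-order deque: a read unlinks the entry and re-pushes it at the
// MRU end, so the front is always the LRU / TTI expiration candidate.
// (O(n) with a VecDeque; O(1) with a real doubly-linked list.)
fn record_read<K: PartialEq>(access_order: &mut VecDeque<K>, key: &K) {
    if let Some(pos) = access_order.iter().position(|k| k == key) {
        let k = access_order.remove(pos).unwrap();
        access_order.push_back(k);
    }
}

// Write-order deque: entries are pushed on insert and never reordered,
// so the front is always the oldest write. TTL expiration just peeks it.
fn ttl_expired_front<K>(
    write_order: &VecDeque<Written<K>>,
    ttl: Duration,
) -> Option<&K> {
    write_order
        .front()
        .filter(|w| w.written_at.elapsed() >= ttl)
        .map(|w| &w.key)
}
```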

@tatsuya6502
Member

tatsuya6502 commented Jan 15, 2025

I ran the Caffeine simulator using two different cache trace files, ARC S3 and Corda.

How to run

https://github.com/moka-rs/caffeine-sim-drivers/tree/w-tinylfu

Results

ARC S3

  • Search engine (disk access)
  • Biased towards popularity

[Chart: hit_rate-arc-s3-2025-01-15]

Hit rate (%) by cache size:

| Policy | 100,000 | 200,000 | 300,000 | 400,000 | 500,000 | 600,000 | 700,000 | 800,000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| opt.Clairvoyant | 25.42 | 39.79 | 50.92 | 59.96 | 67.09 | 72.97 | 77.57 | 81.27 |
| product.Moka (Window-TinyLFU, 1%) | 10.41 | 22.46 | 32.92 | 42.17 | 50.53 | 58.17 | 64.62 | 69.94 |
| product.Moka (TinyLFU) | 10.44 | 22.57 | 33.15 | 42.47 | 50.85 | 58.44 | 64.88 | 70.33 |
| product.Moka (LRU) | 2.33 | 4.63 | 7.59 | 12.04 | 22.77 | 34.63 | 46.04 | 56.60 |
| linked.Sieve | 10.12 | 17.93 | 23.94 | 29.36 | 37.51 | 46.27 | 55.40 | 63.76 |
| two-queue.S3Fifo | 11.36 | 20.43 | 28.54 | 31.86 | 37.53 | 45.35 | 54.27 | 62.69 |
  • Clairvoyant is not a cache policy. It is here to show the theoretical upper bound.
  • For S3-FIFO, the default configuration of Caffeine simulator was used.
  • Scan-resistant policies (Window-TinyLFU, TinyLFU, S3-FIFO) perform better than those without scan resistance (SIEVE, LRU).

Corda

  • Blockchain (streaming-style data processing?)
  • Biased towards recency

[Chart: hit_rate-corda-2025-01-15]

Hit rate (%) by cache size:

| Policy | 200,000 | 400,000 | 600,000 | 800,000 | 1,000,000 | 1,200,000 | 1,400,000 | 1,600,000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| opt.Clairvoyant | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| product.Moka (Window-TinyLFU, 1%) | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| product.Moka (TinyLFU) | 8.22 | 16.14 | 24.11 | 31.94 | 32.02 | 33.33 | 33.33 | 33.33 |
| product.Moka (LRU) | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| linked.Sieve | 10.68 | 21.36 | 32.04 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| two-queue.S3Fifo | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
  • Plain LRU and the policies with an LRU window (Window-TinyLFU, S3-FIFO) perform better than the others.

@tatsuya6502
Member

> Use SIEVE as underlying LRU?

In terms of performance (hit rate), I think it will be better to keep the underlying LRU as is. This is because of the second result above (Corda). Replacing the underlying LRU with SIEVE will not improve the hit rate.

In terms of simplicity, SIEVE could eliminate the need for batched application of reads and writes to the underlying LRU, especially for reads. However, I would rather keep the current moka implementation: supporting rich features like expiration and eviction listeners will still need the batched application, so replacing the underlying LRU alone would probably not be enough.

CC: @ben-manes

@ben-manes

Corda is blockchain, so I think it is recent transactions being validated by multiple miners. That makes it very recency-biased, like stream processing where there is a small fan-out per item. The workload only needs a small LRU to become optimal, so I mainly use it for a stress test (corda => 5x loop => corda) to force the cache to reconfigure itself between LRU and MRU patterns.

The simulator's versions of these algorithms were verified against the authors' simulator. Note that after publication and promotion, they changed the algorithm in their code base but kept the same name. That option has not been reimplemented, so it is v0 in their new terminology. The difference is only in the initial warm-up: "This version (S3FIFO.c) differs from the original S3-FIFO (S3FIFOv0.c) in that when the small queue is full, but the cache is not full, the original S3-FIFO will insert into the small queue".

Sieve is a very minor adjustment to the Clock algorithm from the 1960s (A Paging Experiment with the Multics System), which mimicked LRU for the OS page buffer. In that era the miss penalty was a drum-core transfer (8–20 milliseconds) of program memory. Recent memory pages were therefore important, and programmers actively designed for that, because page faults were incredibly costly to a program's running time. This small amount of memory, managed by very predictable algorithms, was important for fitting within a scheduled allocation of computer time (pre-timesharing). Sieve adjusts Clock slightly by decoupling the insertion hand from the victim hand, a well-known technique called a two-handed clock in Solaris (1990s), to more aggressively evict recent arrivals that are inactive. The intent of the original authors was the same as Sieve's: recent arrivals may be "one-hit wonders", so evict them more aggressively.

S3-FIFO is a very minor adjustment to the 2Q algorithm from the early 1990s (A Low Overhead High Performance Buffer Management Replacement Algorithm). That algorithm was used broadly in databases and introduced the concept of using recent eviction history in the replacement decision. The only slight change is to use Clock queues instead of LRU queues, for Clock's well-known concurrency characteristics. That wasn't a concern for the original authors, and it isn't noteworthy: developers typically adjust an algorithm for their needs, so this is a pragmatic change one would expect to see in existing systems.

Clock-based concurrency is very nice for reads, at the penalty of O(n) eviction scans on write. That write is typically under a lock, but the S3-FIFO authors instead suggest using lock-free queues. This becomes challenging because of (a) queue head/tail contention, (b) moving "weighted" entries between queues breaking the capacity accounting, and (c) poor composition with other features like expiration. Instead of using clock or random-sampling policies to cope with concurrency, Caffeine/Moka use request buffering to batch and replay the events. This avoids contention, does not require that writers wait or block on eviction, offers similar performance, and allows for more efficient O(1) algorithms without race concerns. It is a more complicated technique for high performance, but more flexible than the 1960s approach (in fact, it's a 1970s idea: the write-ahead log).
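As a rough sketch of the buffering idea (hypothetical code, not Caffeine's or Moka's actual implementation, which is lock-free and far more careful about contention):

```rust
use std::sync::Mutex;

// Readers append access events to a bounded buffer instead of touching
// the eviction policy directly; a maintenance pass later drains the
// batch and replays it against the policy under a single lock.
struct ReadBuffer<K> {
    events: Mutex<Vec<K>>,
    capacity: usize,
}

impl<K> ReadBuffer<K> {
    fn record(&self, key: K) -> bool {
        // If the buffer is full or contended, it is fine to drop the
        // event: losing a few recency hints barely affects the hit rate,
        // and readers never wait on eviction.
        if let Ok(mut events) = self.events.try_lock() {
            if events.len() < self.capacity {
                events.push(key);
                return true;
            }
        }
        false
    }

    fn drain_and_replay(&self, mut apply: impl FnMut(K)) {
        let batch = std::mem::take(&mut *self.events.lock().unwrap());
        for key in batch {
            apply(key); // e.g. bump the entry in the eviction policy
        }
    }
}
```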

S3-FIFO does not use Sieve for its queues, as the overall 2Q structure of probation and protected queues serves the same purpose. W-TinyLFU, used in Moka, has a very similar structure, but replaces ghost entries with a more compact frequency sketch to allow for a larger history and better replacement decisions. All of these require some heuristic, such as the hand speed or the queue sizes, to determine how aggressively they bias for or against recency. We later solved this by using hill climbing rather than a static configuration, which allows for higher hit rates across a larger set of workloads without manual tuning.
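A minimal sketch of the hill-climbing idea, with made-up step size and sampling details (Caffeine's real adaptive policy is more involved):

```rust
// Periodically sample the hit rate, and keep moving the admission-window
// size in whichever direction last improved it; reverse on regression.
struct HillClimber {
    prev_hit_rate: f64,
    step: f64, // signed: positive grows the window, negative shrinks it
}

impl HillClimber {
    fn adjust(&mut self, hit_rate: f64, window_size: f64, max_window: f64) -> f64 {
        if hit_rate < self.prev_hit_rate {
            self.step = -self.step; // the last move hurt: reverse direction
        }
        self.prev_hit_rate = hit_rate;
        (window_size + self.step).clamp(0.0, max_window)
    }
}
```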

There's really no benefit to adopting Sieve / S3-FIFO, because they are renames of historic algorithms and modern techniques work better. Hit-rate simulations and concurrency benchmarks have shown this; e.g., Otter attempted to implement S3-FIFO's concurrency with its author's help, but had to replace it with Caffeine/Moka's approach due to poor performance. Research papers are often biased or incorrect because the authors may have financial motives, so performing your own analysis, as @tatsuya6502 did here, is critical for making good engineering decisions.
