
Use SIEVE as underlying LRU? #446

Open · Swatinem opened this issue Jul 21, 2024 · 4 comments

@Swatinem
Contributor

I recently read about SIEVE (https://cachemon.github.io/SIEVE-website/, https://www.usenix.org/system/files/nsdi24-zhang-yazhuo.pdf) which is advertised as a simpler and better alternative for LRU.

It is also mentioned that it can serve as an underlying primitive for more advanced eviction algorithms.


So I wanted to give it a shot to actually implement it within moka but got sidetracked (#445 😅).

I’m still trying to fully piece together all the moving parts in my head, and I stumbled across the access-order and write-order deques. Now I’m wondering whether any other piece of functionality has a hard requirement on keeping the elements sorted in access order, as opposed to write order? SIEVE does not reorder elements at all, so it maintains write order naturally.
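For reference, here is a rough sketch of the SIEVE eviction loop as I understand it from the paper. This is purely illustrative and not tied to moka's internals; a real implementation would pair a hash index with an intrusive doubly-linked list instead of a Vec with linear lookups:

```rust
// Illustrative SIEVE sketch (not moka code). Index 0 is the oldest
// entry; new entries are pushed at the back. A hit only sets a
// `visited` bit and never reorders anything.
struct Entry<K, V> {
    key: K,
    value: V,
    visited: bool,
}

struct Sieve<K: PartialEq, V> {
    entries: Vec<Entry<K, V>>, // insertion (write) order
    hand: usize,               // next eviction candidate, persists across evictions
    capacity: usize,
}

impl<K: PartialEq, V> Sieve<K, V> {
    fn get(&mut self, key: &K) -> Option<&V> {
        let entry = self.entries.iter_mut().find(|e| &e.key == key)?;
        entry.visited = true; // the only state change on a hit
        Some(&entry.value)
    }

    fn insert(&mut self, key: K, value: V) {
        // Duplicate keys are not handled in this sketch.
        if self.entries.len() >= self.capacity {
            self.evict();
        }
        self.entries.push(Entry { key, value, visited: false });
    }

    fn evict(&mut self) {
        if self.entries.is_empty() {
            return;
        }
        loop {
            if self.hand >= self.entries.len() {
                self.hand = 0; // wrap from the newest back to the oldest
            }
            if self.entries[self.hand].visited {
                // Second chance: clear the bit and keep scanning.
                self.entries[self.hand].visited = false;
                self.hand += 1;
            } else {
                // Evict. `remove` shifts later entries down, so the hand
                // now points at the next newer entry, as in the paper.
                self.entries.remove(self.hand);
                return;
            }
        }
    }
}
```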

Does it make sense to continue with this effort? Is there a possibility this could be worthwhile?

@tatsuya6502
Member

tatsuya6502 commented Jan 14, 2025

Hi. Sorry for the long delay. I was very busy with my work between July and September last year. I also had a health problem between August and December. Now I feel better and can work on moka again.

Thanks for the information about SIEVE. I will use the Caffeine simulator to measure the performance (hit rate) of SIEVE and moka's policies (TinyLFU, LRU and Window-TinyLFU). The Caffeine simulator implements many cache policies including SIEVE and S3-FIFO, and it can run moka with each policy.

As for moka's internals, the access-order and write-order deques are used for the following purposes:

  • The access-order deque:
    • This is the LRU double-ended queue.
    • Used for:
      • All policies: LRU, TinyLFU, Window-TinyLFU.
      • The TTI (Time-To-Idle) expiration policy, to find the least recently used entry in O(1) time.
      • However, this can also be done by the hierarchical timer wheels, which were added for the per-entry expiration policy.
  • The write-order deque:
    • Used for:
      • The TTL (Time-To-Live) expiration policy, to find the oldest entry in O(1) time.
        • However, this can be done by the hierarchical timer wheels.
      • The invalidate_entries_if method.
      • The invalidate_all method.
        • If the write-order deque is disabled, the invalidate_all method will use the access-order deque instead.
    • The write-order deque is only enabled if:
      • The TTL expiration policy is configured.
      • And/or, the cache builder's support_invalidation_closures method is called (to enable invalidate_entries_if).

These deques are implemented as doubly-linked lists. A much-simplified sketch of the two roles follows below.
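To make the contrast concrete, here is a hypothetical sketch of the two roles. The names are illustrative only; moka's real deques are doubly-linked lists over the cache entries, so both operations are O(1) there:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Hypothetical record type for illustration only.
struct Written<K> {
    key: K,
    written_at: Instant,
}

// Access-order deque: a read unlinks the entry and re-pushes it at the
// MRU end, so the front is always the LRU / TTI expiration candidate.
// (O(n) with a VecDeque; O(1) with a real doubly-linked list.)
fn record_read<K: PartialEq>(access_order: &mut VecDeque<K>, key: &K) {
    if let Some(pos) = access_order.iter().position(|k| k == key) {
        let k = access_order.remove(pos).unwrap();
        access_order.push_back(k);
    }
}

// Write-order deque: entries are pushed on insert and never reordered,
// so the front is always the oldest write. TTL expiration just peeks it.
fn ttl_expired_front<K>(
    write_order: &VecDeque<Written<K>>,
    ttl: Duration,
) -> Option<&K> {
    write_order
        .front()
        .filter(|w| w.written_at.elapsed() >= ttl)
        .map(|w| &w.key)
}
```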

@tatsuya6502
Member

tatsuya6502 commented Jan 15, 2025

I ran the Caffeine simulator using two different cache trace files, ARC S3 and Corda.

How to run

https://github.com/moka-rs/caffeine-sim-drivers/tree/w-tinylfu

Results

ARC S3

  • Search engine (disk access)
  • Biased towards popularity

[Chart: hit_rate-arc-s3-2025-01-15]

Hit rate (%) by cache size:

| Policy | 100,000 | 200,000 | 300,000 | 400,000 | 500,000 | 600,000 | 700,000 | 800,000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| opt.Clairvoyant | 25.42 | 39.79 | 50.92 | 59.96 | 67.09 | 72.97 | 77.57 | 81.27 |
| product.Moka (Window-TinyLFU, 1%) | 10.41 | 22.46 | 32.92 | 42.17 | 50.53 | 58.17 | 64.62 | 69.94 |
| product.Moka (TinyLFU) | 10.44 | 22.57 | 33.15 | 42.47 | 50.85 | 58.44 | 64.88 | 70.33 |
| product.Moka (LRU) | 2.33 | 4.63 | 7.59 | 12.04 | 22.77 | 34.63 | 46.04 | 56.60 |
| linked.Sieve | 10.12 | 17.93 | 23.94 | 29.36 | 37.51 | 46.27 | 55.40 | 63.76 |
| two-queue.S3Fifo | 11.36 | 20.43 | 28.54 | 31.86 | 37.53 | 45.35 | 54.27 | 62.69 |
  • Clairvoyant is not a cache policy. It is here to show the theoretical upper bound.
  • For S3-FIFO, the default configuration of Caffeine simulator was used.
  • Scan-resistant policies (Window-TinyLFU, TinyLFU, S3-FIFO) perform better than those without scan resistance (SIEVE, LRU).

Corda

  • Blockchain (streaming-style data processing?)
  • Biased towards recency

[Chart: hit_rate-corda-2025-01-15]

Hit rate (%) by cache size:

| Policy | 200,000 | 400,000 | 600,000 | 800,000 | 1,000,000 | 1,200,000 | 1,400,000 | 1,600,000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| opt.Clairvoyant | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| product.Moka (Window-TinyLFU, 1%) | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| product.Moka (TinyLFU) | 8.22 | 16.14 | 24.11 | 31.94 | 32.02 | 33.33 | 33.33 | 33.33 |
| product.Moka (LRU) | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| linked.Sieve | 10.68 | 21.36 | 32.04 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| two-queue.S3Fifo | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
  • Plain LRU and the policies with an LRU window (Window-TinyLFU, S3-FIFO) perform better than the others.

@tatsuya6502
Member

> Use SIEVE as underlying LRU?

In terms of performance (hit rate), I think it will be better to keep the underlying LRU as is. This is because of the second result above (Corda). Replacing the underlying LRU with SIEVE will not improve the hit rate.

In terms of simplicity, SIEVE could eliminate the need for batched application of reads and writes to the underlying LRU, especially for reads. However, I would rather keep the current moka implementation: supporting rich features like expiration and eviction listeners will still need the batched application, so replacing the underlying LRU alone would probably not be enough.

CC: @ben-manes

@ben-manes

Corda is blockchain, so I think it is recent transactions being validated by multiple miners. That makes it very recency-biased, like stream processing where there is a small fan-out per item. The workload only needs a small LRU to become optimal, so I mainly use it for a stress test (corda => 5x loop => corda) to force the cache to reconfigure itself between LRU and MRU patterns.

The simulator's versions of these algorithms were verified against the authors' simulator. Note that after publication and promotion, they changed the algorithm in their code base but kept the same name. That option has not been reimplemented, so it is v0 in their new terminology. The difference is only in the initial warm-up: "This version (S3FIFO.c) differs from the original S3-FIFO (S3FIFOv0.c) in that when the small queue is full, but the cache is not full, the original S3-FIFO will insert into the small queue".

Sieve is a very minor adjustment to the Clock algorithm from the 1960s (A Paging Experiment with the Multics System), which mimicked LRU for the OS page buffer. In that era the miss penalty was a drum-core transfer (8–20 milliseconds) of program memory. Recent memory pages were therefore important, and programmers actively designed for that, because page faults were incredibly costly to a program's running time. This small amount of memory, managed by very predictable algorithms, was important for fitting within a scheduled allocation of computer time (pre-timesharing). Sieve adjusts Clock slightly by decoupling the insertion hand from the victim hand, a well-known technique called a two-handed clock in Solaris (1990s), to more aggressively evict recent arrivals that are inactive. The intent of the original authors was the same as Sieve's: recent arrivals may be "one-hit wonders", so evict them more aggressively.

S3-FIFO is a very minor adjustment to the 2Q algorithm from the early 1990s (A Low Overhead High Performance Buffer Management Replacement Algorithm). That algorithm was used broadly in databases and introduced the concept of using recent eviction history in the replacement decision. The only slight change is to use Clock queues instead of LRU queues, for Clock's well-known concurrency characteristics. That wasn't a concern for the original authors, and it isn't noteworthy: developers typically adjust an algorithm for their needs, so this is a pragmatic change one would expect to see in existing systems.

Clock-based concurrency is very nice for reads, at the penalty of O(n) eviction scans on write. That write is typically under a lock, but the S3-FIFO authors instead suggest using lock-free queues. This becomes challenging because of (a) queue head/tail contention, (b) moving "weighted" entries between queues breaking the capacity accounting, and (c) poor composition with other features like expiration. Instead of using clock or random-sampling policies to cope with concurrency, Caffeine/Moka use request buffering to batch and replay the events. This avoids contention, does not require that writers wait or block on eviction, offers similar performance, and allows for more efficient O(1) algorithms without race concerns. It is a more complicated technique for high performance, but more flexible than the 1960s approach (in fact, it's a 1970s idea: the write-ahead log).
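As a rough sketch of the buffering idea (hypothetical code, not Caffeine's or Moka's actual implementation, which is lock-free and far more careful about contention):

```rust
use std::sync::Mutex;

// Readers append access events to a bounded buffer instead of touching
// the eviction policy directly; a maintenance pass later drains the
// batch and replays it against the policy under a single lock.
struct ReadBuffer<K> {
    events: Mutex<Vec<K>>,
    capacity: usize,
}

impl<K> ReadBuffer<K> {
    fn record(&self, key: K) -> bool {
        // If the buffer is full or contended, it is fine to drop the
        // event: losing a few recency hints barely affects the hit rate,
        // and readers never wait on eviction.
        if let Ok(mut events) = self.events.try_lock() {
            if events.len() < self.capacity {
                events.push(key);
                return true;
            }
        }
        false
    }

    fn drain_and_replay(&self, mut apply: impl FnMut(K)) {
        let batch = std::mem::take(&mut *self.events.lock().unwrap());
        for key in batch {
            apply(key); // e.g. bump the entry in the eviction policy
        }
    }
}
```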

S3-FIFO does not use Sieve for its queues, as the overall 2Q structure of probation and protected queues serves the same purpose. W-TinyLFU, used in Moka, has a very similar structure, but replaces ghost entries with a more compact frequency sketch to allow for a larger history and better replacement decisions. All of these require some heuristic, such as the hand speed or the queue sizes, to determine how aggressively they bias for or against recency. We later solved this by using hill climbing rather than a static configuration, which allows for higher hit rates across a larger set of workloads without manual tuning.
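A minimal sketch of the hill-climbing idea, with made-up step size and sampling details (Caffeine's real adaptive policy is more involved):

```rust
// Periodically sample the hit rate, and keep moving the admission-window
// size in whichever direction last improved it; reverse on regression.
struct HillClimber {
    prev_hit_rate: f64,
    step: f64, // signed: positive grows the window, negative shrinks it
}

impl HillClimber {
    fn adjust(&mut self, hit_rate: f64, window_size: f64, max_window: f64) -> f64 {
        if hit_rate < self.prev_hit_rate {
            self.step = -self.step; // the last move hurt: reverse direction
        }
        self.prev_hit_rate = hit_rate;
        (window_size + self.step).clamp(0.0, max_window)
    }
}
```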

There's really no benefit to adopting Sieve / S3-FIFO, because they are renames of historic algorithms and modern techniques work better. Hit-rate simulations and concurrency benchmarks have shown this; e.g., Otter attempted to implement S3-FIFO's concurrency with its author's help, but had to replace it with Caffeine/Moka's approach due to poor performance. Research papers are often biased or incorrect because the authors may have financial motives, so performing your own analysis, as @tatsuya6502 did here, is critical for making good engineering decisions.
