
trie: reduce allocations in stacktrie #30743

Merged · 11 commits · Jan 23, 2025
64 changes: 64 additions & 0 deletions trie/bytepool.go
@@ -0,0 +1,64 @@
// Copyright 2024 The go-ethereum Authors
// This file is part of the go-ethereum library.
//
// The go-ethereum library is free software: you can redistribute it and/or modify
// it under the terms of the GNU Lesser General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// The go-ethereum library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public License
// along with the go-ethereum library. If not, see <http://www.gnu.org/licenses/>.

package trie

// bytesPool is a pool for byte slices. It is safe for concurrent use.
type bytesPool struct {
c chan []byte
w int
}

// newBytesPool creates a new bytesPool. The sliceCap sets the capacity of
// newly allocated slices, and nitems determines the maximum number of items
// the pool will hold.
func newBytesPool(sliceCap, nitems int) *bytesPool {
return &bytesPool{
c: make(chan []byte, nitems),
w: sliceCap,
}
}

// Get returns a slice. Safe for concurrent use.
func (bp *bytesPool) Get() []byte {
select {
case b := <-bp.c:
return b
default:
return make([]byte, 0, bp.w)

Member:

I believe there is an opportunity to reduce allocations even further:

  1. Pre-allocate a ~page-sized buffer of slices:

// assuming sliceCap is a package-level constant:
var nitems = syscall.Getpagesize()/sliceCap - 1 // reserve 1 for the slice header
var buf = make([]byte, nitems*sliceCap)

  2. Use subslices of it; you can still feed all these subslices into a channel if that's your favorite synchronization primitive.

func addBuf(bp *bytesPool, buf []byte) {
	for i := 0; i < nitems; i++ {
		bp.c <- buf[i*bp.w : (i+1)*bp.w]
	}
}

  3. If the channel is empty, allocate another buffer:

func (bp *bytesPool) Get() []byte {
	// add a new buffer if all buffers are handed out
	if len(bp.c) == 0 {
		newbuf := make([]byte, bp.w*nitems)
		addBuf(bp, newbuf)
	}
	return <-bp.c
}

Advantages and issues of that approach:

  • It's possible to hold references to several almost-empty buffers, which could cause OOM if they are kept alive over a long time. This is unlikely here, since a) the stacktrie is a transient structure, b) it only "expands" one branch at a time, and c) the average depth at this time is ~13, which is small enough not to need an extra buffer.
  • Fewer allocations. Note that the benchmarks only insert 100K keys, so it might not make a dent in the reported allocations... but if you benchmark snapshot allocation (over 1 billion keys now), it should.
  • It's easy to change the target size of the buffer allocation, depending on where it's used and how often it needs to allocate a new buf.


Contributor Author:


The stacktrie is already nearly alloc-free. I think you can hit it with a billion keys and not have any allocations after the first 100K elements or so. I might be wrong, but I don't think there's any room for meaningful further improvement (by which I mean: other than changing a constant alloc-count to a different constant).

}
}

// GetWithSize returns a slice of the specified size from the pool.
func (bp *bytesPool) GetWithSize(s int) []byte {
b := bp.Get()
if cap(b) < s {
return make([]byte, s)
}
return b[:s]
}

// Put returns a slice to the pool. Safe for concurrent use. This method
// will ignore slices that are too small or too large (>3x the cap).
func (bp *bytesPool) Put(b []byte) {
if c := cap(b); c < bp.w || c > 3*bp.w {
return
}
select {
case bp.c <- b:
default:
}
}
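
A hypothetical caller (not part of the diff), showing the intended borrow/return cycle of the pool:

func withPooledBuffer(pool *bytesPool) {
	buf := pool.GetWithSize(32) // borrow a 32-byte buffer
	defer pool.Put(buf)         // return it; caps outside [w, 3*w] are dropped
	_ = buf                     // ... fill and use buf here ...
}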
12 changes: 12 additions & 0 deletions trie/encoding.go
@@ -104,6 +104,18 @@ func keybytesToHex(str []byte) []byte {
return nibbles
}

// writeHexKey writes the hex key into the given slice.
// OBS! This method omits the termination flag.
// OBS! The dst slice must be at least 2x as large as the key.
func writeHexKey(dst []byte, key []byte) []byte {
_ = dst[2*len(key)-1]
for i, b := range key {
dst[i*2] = b / 16
dst[i*2+1] = b % 16
}
return dst[:2*len(key)]
}
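
A worked example (hypothetical helper, not part of the diff): each key byte expands into two nibbles, high nibble first.

func exampleWriteHexKey() []byte {
	dst := make([]byte, 4)
	// 0xab -> 0x0a, 0x0b; 0xcd -> 0x0c, 0x0d
	return writeHexKey(dst, []byte{0xab, 0xcd}) // [10 11 12 13]
}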

// hexToKeybytes turns hex nibbles into key bytes.
// This can only be used for keys of even length.
func hexToKeybytes(hex []byte) []byte {
Expand Down
8 changes: 8 additions & 0 deletions trie/hasher.go
@@ -188,6 +188,14 @@ func (h *hasher) hashData(data []byte) hashNode {
return n
}

// hashDataTo hashes the provided data to the given destination buffer. The caller
// must ensure that the dst buffer is of appropriate size.
func (h *hasher) hashDataTo(dst, data []byte) {
h.sha.Reset()
h.sha.Write(data)
h.sha.Read(dst)
}
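
A sketch of the intended usage (assumed caller, not part of the diff; bPool is the byte pool added in trie/stacktrie.go below): hashing into a pooled 32-byte buffer instead of allocating a fresh hash each time.

func hashToPooled(h *hasher, blob []byte) []byte {
	dst := bPool.GetWithSize(32) // Keccak-256 output is 32 bytes
	h.hashDataTo(dst, blob)
	return dst // the caller hands dst back via bPool.Put when done
}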

// proofHash is used to construct trie proofs, and returns the 'collapsed'
// node (for later RLP encoding) as well as the hashed node -- unless the
// node is smaller than 32 bytes, in which case it will be returned as is.
35 changes: 22 additions & 13 deletions trie/node.go
@@ -45,6 +45,27 @@ type (
}
hashNode []byte
valueNode []byte

// fullnodeEncoder is a type used exclusively for encoding fullNode.
// Briefly instantiating a fullnodeEncoder and initializing it with
// existing slices is less memory-intensive than using the fullNode type.
fullnodeEncoder struct {
Children [17][]byte
}

// extNodeEncoder is a type used exclusively for encoding extension nodes.
// Briefly instantiating an extNodeEncoder and initializing it with existing
// slices is less memory-intensive than using the shortNode type.
extNodeEncoder struct {
Key []byte
Val []byte
}

// leafNodeEncoder is a type used exclusively for encoding leaf nodes.
leafNodeEncoder struct {
Key []byte
Val []byte
}
)
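
A sketch (not part of the diff) of why the encoder types are cheaper: fullNode stores children as `node` interface values, so assigning a []byte requires boxing it in an interface (one allocation per child), whereas fullnodeEncoder stores plain []byte slices, so assignment is just a slice-header copy. It uses the encode method added in trie/node_enc.go further down.

func encodeBranchSketch(h *hasher, children [17][]byte) []byte {
	var enc fullnodeEncoder
	for i, c := range children {
		enc.Children[i] = c // no interface boxing, no copy
	}
	enc.encode(h.encbuf)
	return h.encodedBytes()
}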

// nilValueNode is used when collapsing internal trie nodes for hashing, since
@@ -89,6 +110,7 @@ func (n *fullNode) fstring(ind string) string {
}
return resp + fmt.Sprintf("\n%s] ", ind)
}

func (n *shortNode) fstring(ind string) string {
return fmt.Sprintf("{%x: %v} ", n.Key, n.Val.fstring(ind+" "))
}
@@ -99,19 +121,6 @@ func (n valueNode) fstring(ind string) string {
return fmt.Sprintf("%x ", []byte(n))
}

- // rawNode is a simple binary blob used to differentiate between collapsed trie
- // nodes and already encoded RLP binary blobs (while at the same time store them
- // in the same cache fields).
- type rawNode []byte
-
- func (n rawNode) cache() (hashNode, bool) { panic("this should never end up in a live trie") }
- func (n rawNode) fstring(ind string) string { panic("this should never end up in a live trie") }
-
- func (n rawNode) EncodeRLP(w io.Writer) error {
- _, err := w.Write(n)
- return err
- }

// mustDecodeNode is a wrapper of decodeNode and panic if any error is encountered.
func mustDecodeNode(hash, buf []byte) node {
n, err := decodeNode(hash, buf)
45 changes: 41 additions & 4 deletions trie/node_enc.go
@@ -40,6 +40,20 @@ func (n *fullNode) encode(w rlp.EncoderBuffer) {
w.ListEnd(offset)
}

func (n *fullnodeEncoder) encode(w rlp.EncoderBuffer) {

Member:

Can you estimate the performance and allocation differences between using this encoder and the standard full node encoder?

The primary distinction seems to be that this encoder inlines the children encoding directly, rather than invoking the children encoder recursively. I’m curious about the performance implications of this approach.


Contributor Author:


Sure. If I change it back (adding back rawNode, and switching the case branchNode: back so that it looks like it did earlier), then the difference is:

[user@work go-ethereum]$ benchstat pr.bench pr.bench..2
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │   pr.bench   │          pr.bench..2           │
             │    sec/op    │    sec/op     vs base          │
Insert100K-8   88.67m ± 33%   88.20m ± 11%  ~ (p=0.579 n=10)

             │   pr.bench   │                pr.bench..2                 │
             │     B/op     │      B/op        vs base                   │
Insert100K-8   3.451Ki ± 3%   2977.631Ki ± 0%  +86178.83% (p=0.000 n=10)

             │  pr.bench  │                pr.bench..2                 │
             │ allocs/op  │   allocs/op     vs base                    │
Insert100K-8   22.00 ± 5%   126706.50 ± 0%  +575838.64% (p=0.000 n=10)

When we build the children struct, all the values are copied; the encoder type, by contrast, just references the same child slices without triggering a copy.

offset := w.List()
for _, c := range n.Children {
if c == nil {
w.Write(rlp.EmptyString)
} else if len(c) < 32 {
w.Write(c) // rawNode
} else {
w.WriteBytes(c) // hashNode
}
}
w.ListEnd(offset)
}

func (n *shortNode) encode(w rlp.EncoderBuffer) {
offset := w.List()
w.WriteBytes(n.Key)
@@ -51,14 +65,37 @@ func (n *shortNode) encode(w rlp.EncoderBuffer) {
w.ListEnd(offset)
}

func (n *extNodeEncoder) encode(w rlp.EncoderBuffer) {
offset := w.List()
w.WriteBytes(n.Key)

if n.Val == nil {
w.Write(rlp.EmptyString)
} else if len(n.Val) < 32 {
w.Write(n.Val) // rawNode
} else {
w.WriteBytes(n.Val) // hashNode
}
w.ListEnd(offset)
}

func (n *leafNodeEncoder) encode(w rlp.EncoderBuffer) {
offset := w.List()

// Encode the key in compact format. The assumption here is
// that it's safe to modify the key in place.
n.Key = append(n.Key, 16) // termination flag
n.Key = hexToCompactInPlace(n.Key)

w.WriteBytes(n.Key) // Compact format key
w.WriteBytes(n.Val) // Value node, must be non-nil
w.ListEnd(offset)
}
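
A worked example (illustrative, not part of the diff) of the two steps above, following the hex-prefix encoding rules from the Yellow Paper:

// leaf key nibbles [1, 2, 3] + terminator 16
//   -> odd-length leaf, flag nibble 3
//   -> compact form [0x31, 0x23]
func exampleLeafKeyCompact() []byte {
	key := []byte{1, 2, 3, 16}      // hex nibbles plus terminator
	return hexToCompactInPlace(key) // [0x31, 0x23]
}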

func (n hashNode) encode(w rlp.EncoderBuffer) {
w.WriteBytes(n)
}

func (n valueNode) encode(w rlp.EncoderBuffer) {
w.WriteBytes(n)
}

- func (n rawNode) encode(w rlp.EncoderBuffer) {
- w.Write(n)
- }
75 changes: 51 additions & 24 deletions trie/stacktrie.go
@@ -27,6 +27,7 @@ import (

var (
stPool = sync.Pool{New: func() any { return new(stNode) }}
bPool = newBytesPool(32, 100)

Member:

Any particular reason not to use sync.Pool?


Contributor Author:

Yes! You'd be surprised, but that was the whole problem I had with extra allocations:

[screenshot: trie test alloc_space profile]

I solved that by not using a sync.Pool. I suspect it's due to the interface conversion, but I don't know the deeper details of why.


Member (@rjl493456442, Jan 21, 2025):

We could try using the pool with slice pointers, e.g.:

// slicePool is a shared pool of hash slices, for reducing GC pressure.
var slicePool = sync.Pool{
	New: func() interface{} {
		slice := make([]byte, 0, 32) // Pre-allocate a slice with a reasonable capacity.
		return &slice
	},
}

// getSlice obtains the hash slice from the shared pool.
func getSlice(n int) []byte {
	slice := *slicePool.Get().(*[]byte)
	if cap(slice) < n {
		slice = make([]byte, 0, n)
	}
	slice = slice[:n]
	return slice
}

// returnSlice returns the hash slice back to the shared pool for following usage.
func returnSlice(slice []byte) {
	slicePool.Put(&slice)
}


Member (@rjl493456442, Jan 21, 2025):

Okay, it doesn't work:

(base) ➜  go-ethereum git:(stacktrie_allocs_1) ✗ benchstat bench1.txt bench2.txt
goos: darwin
goarch: arm64
pkg: github.com/ethereum/go-ethereum/trie
             │  bench1.txt  │           bench2.txt            │
             │    sec/op    │    sec/op     vs base           │
Insert100K-8   28.41m ± ∞ ¹   29.94m ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

             │  bench1.txt   │             bench2.txt              │
             │     B/op      │       B/op        vs base           │
Insert100K-8   3.276Ki ± ∞ ¹   2972.266Ki ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

             │ bench1.txt  │             bench2.txt             │
             │  allocs/op  │    allocs/op     vs base           │
Insert100K-8   21.00 ± ∞ ¹   126692.00 ± ∞ ¹  ~ (p=1.000 n=1) ²

bench1: channel
bench2: slice pointer pool


Member (@rjl493456442, Jan 22, 2025):

The issue with using the native sync.Pool is:

  • we do reuse the underlying byte slice
  • but the slice descriptor (metadata, 24 bytes) must be passed around whenever pool.Get is called

The key difference is: in the channel approach, the slice is passed as a reference, with no descriptor construction involved; in the sync.Pool approach, the slice is passed as a value, so the descriptor must be re-created for every Get.
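
A minimal benchmark pair (a sketch, not from this thread) that isolates the descriptor cost: sync.Pool traffics in interface values, so Put must box the 24-byte slice header on every call (the pattern staticcheck flags as SA6002), while a buffered channel of []byte copies the header without boxing.

func BenchmarkPoolBoxing(b *testing.B) {
	p := sync.Pool{New: func() any { return make([]byte, 0, 32) }}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := p.Get().([]byte)
		p.Put(buf) // boxes the slice header into an interface: one alloc
	}
}

func BenchmarkChanRoundtrip(b *testing.B) {
	c := make(chan []byte, 1)
	c <- make([]byte, 0, 32)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := <-c
		c <- buf // plain header copy, no boxing
	}
}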

_ = types.TrieHasher((*StackTrie)(nil))
)

@@ -47,6 +48,8 @@ type StackTrie struct {
h *hasher
last []byte
onTrieNode OnTrieNode
kBuf []byte // buf space used for hex-key during insertions
pBuf []byte // buf space used for path during insertions
}

// NewStackTrie allocates and initializes an empty trie. The committed nodes
@@ -56,6 +59,17 @@ func NewStackTrie(onTrieNode OnTrieNode) *StackTrie {
root: stPool.Get().(*stNode),
h: newHasher(false),
onTrieNode: onTrieNode,
kBuf: make([]byte, 64),
pBuf: make([]byte, 64),
}
}

func (t *StackTrie) grow(key []byte) {
if cap(t.kBuf) < 2*len(key) {
t.kBuf = make([]byte, 2*len(key))
}
if cap(t.pBuf) < 2*len(key) {
t.pBuf = make([]byte, 2*len(key))
}
}

@@ -64,7 +78,8 @@ func (t *StackTrie) Update(key, value []byte) error {
if len(value) == 0 {
return errors.New("trying to insert empty (deletion)")
}
- k := t.TrieKey(key)
+ t.grow(key)
+ k := writeHexKey(t.kBuf, key)
if bytes.Compare(t.last, k) >= 0 {
return errors.New("non-ascending key order")
}
@@ -73,7 +88,7 @@ func (t *StackTrie) Update(key, value []byte) error {
} else {
t.last = append(t.last[:0], k...) // reuse key slice
}
- t.insert(t.root, k, value, nil)
+ t.insert(t.root, k, value, t.pBuf[:0])
return nil
}

@@ -129,6 +144,12 @@ const (
)

func (n *stNode) reset() *stNode {
if n.typ == hashedNode {

Member:

What's the difference between a hashedNode and an embedded tiny node?

For an embedded node, we also allocate the buffer from bPool, and the buffer is owned by the node itself, right?


Contributor Author:


Well, the thing is that there's only one place in the stacktrie where we convert a node into a hashedNode.

	st.typ = hashedNode
	st.key = st.key[:0]

	st.val = nil // Release reference to potentially externally held slice.

This is the only place. So we know that once something has been turned into a hashedNode, the val is never externally held, and thus it can be returned to the pool. The big problem we had earlier is that we need to overwrite the val with a hash, but we are not allowed to mutate val. So this was a big cause of runaway allocs.

But now we reclaim those values-which-are-hashes since we know that they're "ours".


Contributor Author:


> For an embedded node, we also allocate the buffer from bPool, and the buffer is owned by the node itself, right?

To answer your question: yes. So from the perspective of slice-reuse, there is zero difference.


Contributor Author:


Btw, in my follow-up PR to reduce allocs in DeriveSha (#30747), I remove this trick again.

In that PR, I always copy the value as it enters the stacktrie, so we always own the val slice and are free to reuse it via pooling.

Doing so is less hacky: we get rid of the special case "val is externally held unless it's a hashedNode, because then we own it".

That makes it possible for the DeriveSha method to reuse the input buffer.

// On hashnodes, we 'own' the val: it is guaranteed to be not held
// by external caller. Hence, when we arrive here, we can put it back
// into the pool
bPool.Put(n.val)
}
n.key = n.key[:0]
n.val = nil
for i := range n.children {
@@ -150,8 +171,12 @@ func (n *stNode) getDiffIndex(key []byte) int {
return len(n.key)
}

- // Helper function to that inserts a (key, value) pair into
- // the trie.
+ // Helper function that inserts a (key, value) pair into the trie.
+ //
+ //   - The key is not retained by this method, but always copied if needed.
+ //   - The value is retained by this method, as long as the leaf that it represents
+ //     remains unhashed. However: it is never modified.
+ //   - The path is not retained by this method.
func (t *StackTrie) insert(st *stNode, key, value []byte, path []byte) {
switch st.typ {
case branchNode: /* Branch */
@@ -283,7 +308,7 @@ func (t *StackTrie) insert(st *stNode, key, value []byte, path []byte) {

case emptyNode: /* Empty */
st.typ = leafNode
- st.key = key
+ st.key = append(st.key, key...) // deep-copy the key as it's volatile
st.val = value

case hashedNode:
@@ -318,35 +343,32 @@ func (t *StackTrie) hash(st *stNode, path []byte) {
return

case branchNode:
- var nodes fullNode
+ var nodes fullnodeEncoder
for i, child := range st.children {
if child == nil {
- nodes.Children[i] = nilValueNode
continue
}
t.hash(child, append(path, byte(i)))
-
- if len(child.val) < 32 {
- nodes.Children[i] = rawNode(child.val)
- } else {
- nodes.Children[i] = hashNode(child.val)
+ nodes.Children[i] = child.val
}
+ nodes.encode(t.h.encbuf)
+ blob = t.h.encodedBytes()
+ for i, child := range st.children {
+ if child == nil {
+ continue
+ }
st.children[i] = nil
stPool.Put(child.reset()) // Release child back to pool.
}
- nodes.encode(t.h.encbuf)
- blob = t.h.encodedBytes()

case extNode:
// recursively hash and commit child as the first step
t.hash(st.children[0], append(path, st.key...))

- // encode the extension node
- n := shortNode{Key: hexToCompactInPlace(st.key)}
- if len(st.children[0].val) < 32 {
- n.Val = rawNode(st.children[0].val)
- } else {
- n.Val = hashNode(st.children[0].val)
+ n := extNodeEncoder{
+ Key: hexToCompactInPlace(st.key),
+ Val: st.children[0].val,
}
n.encode(t.h.encbuf)
blob = t.h.encodedBytes()
@@ -355,9 +377,10 @@ func (t *StackTrie) hash(st *stNode, path []byte) {
st.children[0] = nil

case leafNode:
- st.key = append(st.key, byte(16))
- n := shortNode{Key: hexToCompactInPlace(st.key), Val: valueNode(st.val)}
-
+ n := leafNodeEncoder{
+ Key: st.key,
+ Val: st.val,
+ }
n.encode(t.h.encbuf)
blob = t.h.encodedBytes()

Expand All @@ -368,15 +391,19 @@ func (t *StackTrie) hash(st *stNode, path []byte) {
st.typ = hashedNode
st.key = st.key[:0]

- st.val = nil // Release reference to potentially externally held slice.

Member:

But if it's created e.g. on L400, it should be returned to the pool. I'm not worried that it will leak, since there is a garbage collector, but it feels like we could avoid extra allocations by ensuring that the passed values are always allocated from the bytes pool?


Contributor Author:


Note line 392. That is the only place where we ever set the type to hashedNode. Only for hashedNode types do we "own" the val. And if the type was already hashedNode, it will exit early (at line 334).

For hashedNodes, the return-to-pool happens during reset.


Member (@gballet, Jan 21, 2025):

You mean leafNode on line 382? nvm, I see what you mean.


Member:

The point I'm making only stands if you're also using the pool in between calls to Update. And it seems that this is already what you're saying here: #30743 (comment)


// Skip committing the non-root node if the size is smaller than 32 bytes
// as tiny nodes are always embedded in their parent except root node.
if len(blob) < 32 && len(path) > 0 {
- st.val = common.CopyBytes(blob)
+ st.val = bPool.GetWithSize(len(blob))
+ copy(st.val, blob)
return
}
// Write the hash to the 'val'. We allocate a new val here to not mutate
// input values.
- st.val = t.h.hashData(blob)
+ st.val = bPool.GetWithSize(32)
+ t.h.hashDataTo(st.val, blob)

// Invoke the callback if it's provided. Notably, the path and blob slices are
// volatile, please deep-copy the slices in the callback if the contents need
// to be retained.
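
For context, a hypothetical end-to-end use of the API touched by this PR (the function name and the persistence step are illustrative; the OnTrieNode signature and the common package import are from go-ethereum): keys must arrive in ascending order, and the callback must copy whatever it retains, since the path and blob buffers are reused internally.

func buildTrieSketch(keys, vals [][]byte) common.Hash {
	st := NewStackTrie(func(path []byte, hash common.Hash, blob []byte) {
		node := common.CopyBytes(blob) // deep-copy: blob is volatile
		_ = node                       // e.g. persist keyed by (path, hash)
	})
	for i, k := range keys {
		if err := st.Update(k, vals[i]); err != nil {
			panic(err) // empty value or non-ascending key order
		}
	}
	return st.Hash()
}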