-
Notifications
You must be signed in to change notification settings - Fork 20.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
trie: reduce allocations in stacktrie #30743
Conversation
Update
|
New progress
|
f332426
to
4a2a72b
Compare
Ah wait 24B that is the size of a slice: a pointer and two ints. It is the slice passed by value, escaped to heap (correctly so). But how to avoid it?? |
Solved!
|
aacb8e6
to
af2feef
Compare
258987d
to
b34643f
Compare
Did a sync on |
@holiman I think we should compare the relevant metrics (memory allocation, etc) during the snap sync and to see the overall impact brought by this change. |
@@ -40,6 +40,20 @@ func (n *fullNode) encode(w rlp.EncoderBuffer) { | |||
w.ListEnd(offset) | |||
} | |||
|
|||
func (n *fullnodeEncoder) encode(w rlp.EncoderBuffer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you estimate the performance and allocation differences between using this encoder and the standard full node encoder?
The primary distinction seems to be that this encoder inlines the children encoding directly, rather than invoking the children encoder recursively. I’m curious about the performance implications of this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, so if I change it back (adding back rawNode, and then switching back the case branchNode:
so that it looks like it did earlier, then the difference is:
[user@work go-ethereum]$ benchstat pr.bench pr.bench..2
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
│ pr.bench │ pr.bench..2 │
│ sec/op │ sec/op vs base │
Insert100K-8 88.67m ± 33% 88.20m ± 11% ~ (p=0.579 n=10)
│ pr.bench │ pr.bench..2 │
│ B/op │ B/op vs base │
Insert100K-8 3.451Ki ± 3% 2977.631Ki ± 0% +86178.83% (p=0.000 n=10)
│ pr.bench │ pr.bench..2 │
│ allocs/op │ allocs/op vs base │
Insert100K-8 22.00 ± 5% 126706.50 ± 0% +575838.64% (p=0.000 n=10)
When we build the children struct, all the values are copied, as opposed if we just use the encoder-type which just uses the same child without triggering a copy.
trie/stacktrie.go
Outdated
n := shortNode{Key: hexToCompactInPlace(st.key), Val: valueNode(st.val)} | ||
|
||
n.encode(t.h.encbuf) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also use shortNodeEncoder here?
I would recommend not to do the inline encoding directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but here we know we have a valueNode
. And the valueNode just does:
func (n valueNode) encode(w rlp.EncoderBuffer) {
w.WriteBytes(n)
}
If we change that to a method which checks the size and theoretically does different things, then the semantics become slightly changed.
trie/node.go
Outdated
//shortNodeEncoder is a type used exclusively for encoding. Briefly instantiating | ||
// a shortNodeEncoder and initializing with existing slices is less memory | ||
// intense than using the shortNode type. | ||
shortNodeEncoder struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq, does shortNodeEncoder
need to implement the node interface?
It's just an encoder which coverts the node object to byte format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just have a try, it's technically possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -27,6 +27,7 @@ import ( | |||
|
|||
var ( | |||
stPool = sync.Pool{New: func() any { return new(stNode) }} | |||
bPool = newBytesPool(32, 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any particular reason to not use sync.Pool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! You'd be surprised, but the whole problem I had with extra alloc :
I solved that by not using a sync.Pool
. I suspect it's due to the interface-conversion, but I don't know any deeper details about the why of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have a try to use the pool for slice pointer,
e.g.
// slicePool is a shared pool of hash slice, for reducing the GC pressure.
var slicePool = sync.Pool{
New: func() interface{} {
slice := make([]byte, 0, 32) // Pre-allocate a slice with a reasonable capacity.
return &slice
},
}
// getSlice obtains the hash slice from the shared pool.
func getSlice(n int) []byte {
slice := *slicePool.Get().(*[]byte)
if cap(slice) < n {
slice = make([]byte, 0, n)
}
slice = slice[:n]
return slice
}
// returnSlice returns the hash slice back to the shared pool for following usage.
func returnSlice(slice []byte) {
slicePool.Put(&slice)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, it doesn't work
(base) ➜ go-ethereum git:(stacktrie_allocs_1) ✗ benchstat bench1.txt bench2.txt
goos: darwin
goarch: arm64
pkg: github.com/ethereum/go-ethereum/trie
│ bench1.txt │ bench2.txt │
│ sec/op │ sec/op vs base │
Insert100K-8 28.41m ± ∞ ¹ 29.94m ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ bench1.txt │ bench2.txt │
│ B/op │ B/op vs base │
Insert100K-8 3.276Ki ± ∞ ¹ 2972.266Ki ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ bench1.txt │ bench2.txt │
│ allocs/op │ allocs/op vs base │
Insert100K-8 21.00 ± ∞ ¹ 126692.00 ± ∞ ¹ ~ (p=1.000 n=1) ²
bench1: channel
bench2: slice pointer pool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue for using native sync pool is:
- we indeed reuse the underlying byte slice
- the slice descriptor (metadata, 24bytes) must be passed around when pool.Get is called
The key difference is: in channel approach, the slice is passed as a reference, without descriptor construction involved; in the sync pool approach, the slice is passed as a value, descriptor must be re-created for every Get
@@ -129,6 +143,12 @@ const ( | |||
) | |||
|
|||
func (n *stNode) reset() *stNode { | |||
if n.typ == hashedNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between HashedNode and Embedded tiny node?
For embedded node, we also allocate the buffer from the bPool and the buffer is owned by the node itself right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the thing is, that there's only one place in the stacktrie where we convert a node into a hashedNode
.
st.typ = hashedNode
st.key = st.key[:0]
st.val = nil // Release reference to potentially externally held slice.
This is the only place. So we know that once something has been turned into a hashedNode, the val
is never externally held and thus it can be returned to the pool. The big problem we had earlier, is that we need to overwrite the val
with a hash, but we are not allowed to mutate val
. So this was a big cause of runaway allocs.
But now we reclaim those values-which-are-hashes since we know that they're "ours".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For embedded node, we also allocate the buffer from the bPool and the buffer is owned by the node itself right?
To answer your question: yes. So from the perspective of slice-reuse, there is zero difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, in my follow-up PR to reduce allocs in derivesha:#30747 , I remove this trick again.
In that PR, I always copy the value as it enters the stacktrie. So we always own the val
slice, and are free to reuse via pooling.
Doing so is less hacky, we get rid of this special case "val is externally held unless it's an hashedNode because then we own it".
That makes it possible for the derivesha-method to reuse the input-buffer
lint is read after the latest commit |
goos: linux goarch: amd64 pkg: github.com/ethereum/go-ethereum/trie cpu: 12th Gen Intel(R) Core(TM) i7-1270P │ stacktrie.3 │ stacktrie.4 │ │ sec/op │ sec/op vs base │ Insert100K-8 69.50m ± 12% 74.59m ± 14% ~ (p=0.128 n=7) │ stacktrie.3 │ stacktrie.4 │ │ B/op │ B/op vs base │ Insert100K-8 4.640Mi ± 0% 3.112Mi ± 0% -32.93% (p=0.001 n=7) │ stacktrie.3 │ stacktrie.4 │ │ allocs/op │ allocs/op vs base │ Insert100K-8 226.7k ± 0% 126.7k ± 0% -44.11% (p=0.001 n=7)
fc0705f
to
bfc3aae
Compare
Running a snapsync benchmarkers now, using
|
So, here are some charts. The first third of the charts is the snapsync, the latter two thirds are snap-gen + block by block sync. The yellow (this PR) finished slightly faster (13 minutes or so). This is despite that the The Memory charts doesn't show much, except that maybe the lead that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be curious to find out how that performs over a full snapshot regeneration? 100K is a big number for a benchmark, but the scale of what the stacktrie is being used for is
case b := <-bp.c: | ||
return b | ||
default: | ||
return make([]byte, 0, bp.w) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that there is an opportunity for reducing the amount of allocations even further:
- Pre-allocate a ~page-sized buffer of slices:
const nitems = syscall.Getpagesize()/sliceCap - 1 // reserve 1 for the slice header
var buf = make([]byte, nitems * sliceCap)
- Use subslices of it, you can still feed all these subslices into a channel if that's your favorite synchronization primitive.
func addBuf(bp *bytesPool, buf []byte) {
for (i := 0; i < nitems; i++) {
bp.c <- buf[i*bp.w:(i+1)*bp.w]
}
}
- If the channel is empty, allocate another buffer:
func (bp *bytesPool) Get() []byte {
// add a new buffer if all buffers are allocated
if len(c) == 0 {
var newbuf = make([]byte, bp.w * nitems)
addBuf(bp, newbuf[:])
}
return <- bp.c
}
Advantages and issues of that approach:
- it's possible to hold references to several almost-empty buffers, which if they are kept over a long time could cause to OOM. This is unlikely here, since a) the stacktrie is a transient structure b) it only "expands" one branch at a time c) the average depth at this time is ~13, which is enough not to need an extra buffer
- less allocations. Note that the benchmarks are of only 100K, so it might not make a dent in the reported allocations... but if you benchmark snapshot allocation (over 1 billion keys now), it should.
- It's easy to change the target size of the buffer allocation, depending on where it's used and how often it needs to allocate a new buf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stacktrie is already nearly alloc-free. I think you can hit it with a billion keys and not have any allocations after the first 100K elements or so. I might be wrong, but I don't think there's any room for meaningful further improvement (by which I mean: other than changing a constant alloc-count to a different constant).
trie/stacktrie.go
Outdated
kBuf: make([]byte, 0, 64), | ||
pBuf: make([]byte, 0, 32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it not make sense to reuse the bytes allocator here as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are two small slices that are used for the duration of the stacktrie runtime. Also, during 'Update', they are continuously in use. Returning them to the pool after use is pointless: at the next call to Update
, the first thing we will do is borrow them back again, and hold them for the duration of the op.
So no, I don't see any benefit in having them be pooled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at the next call to Update, the first thing we will do is borrow them back again, and hold them for the duration of the op.
that's the point I'm making: you can save allocations between calls to Update
.
@@ -368,15 +392,23 @@ func (t *StackTrie) hash(st *stNode, path []byte) { | |||
st.typ = hashedNode | |||
st.key = st.key[:0] | |||
|
|||
st.val = nil // Release reference to potentially externally held slice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if it's created e.g. on L400, it should be returned to the pool. I'm not worried that it will leak, since there is a garbage collector, but if feels like we could avoid extra allocations by maybe ensuring that the passed values are always allocated with the bytes pool ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note line 392
. That is the only place we ever set the type to be a hashNode
. Only on hashNode
types, do we "own" the val
. And if the type was already hashedNode
, it will exit early (at line 334).
For hashNode
s, the return-to-pool happens during reset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean leafnode on line 382? nvm I see what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point I'm making only stands if you're also using the pool in between calls to Update. And it seems that this is already what you're saying here: #30743 (comment)
Also worth noting: #30747 . Whereas this PR takes ownership of Which means that it's less elaborate: the stacktrie always owns the |
Actually, for the most part, we use the stacktrie to validate sequences of accounts / slots during snap sync. These are typically on the order of 10K elements, IIRC. We don't typically feed the entire trie into a stacktrie: you'd only do that if you want to do some of the deep verifications. When we generate snapshot data from the trie data, I'm not sure the stacktrie is involved, really. |
I will redeploy it for a quick snap sync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K, now that I see the context with the derivesha pr, I guess we don't need to futher optimize.
This PR uses various tweaks and tricks to make the stacktrie near alloc-free.
Also improves derivesha: