-
Notifications
You must be signed in to change notification settings - Fork 448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up ZooKeeper tooling #5192
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not finished reviewing, but here are the comments I have so far.
core/src/main/java/org/apache/accumulo/core/clientImpl/ClientContext.java
Show resolved
Hide resolved
core/src/main/java/org/apache/accumulo/core/fate/zookeeper/ZooCache.java
Show resolved
Hide resolved
193e99a
to
f3b8454
Compare
As it turns out, there was another thing that ZooSession was doing, that I didn't realize, that I found with @dlmarion 's help through testing. Specifically, when the session is "expired" upon reconnection to a ZK cluster after something like a ZK cluster restart, or a session timeout on the server-side, the ZooKeeper object would be replaced with a new one when requested. However, that's not entirely safe to do automatically, and I will create a new issue about that once I've had a chance to write it up in more detail. |
I ended up addressing my last comment here, so I won't create a follow-on issue. Basically, the concern was that Watchers will not trigger if the session is dropped and we have to replace it with a new session (a new ZooKeeper client instance, since the session is tied to the instance). But Watchers will still see the session expired event. Poorly written watchers are at risk of not being fired ever again, if they don't monitor for session expired events, but well-written watchers should be fine. They just need to use a new ZooKeeper client object to re-set the Watcher if they want to start getting new event notifications. The old ZooSession kind of did that behind the scenes, but it was a little messy, because there was a lot of criss-crossing and tangled APIs. This PR changes that now to abstract ZooKeeper objects behind a new ZooSession object. This new object can be used in place of ZooKeeper, and holds a delegate ZooKeeper internally. When that one's session expires, it creates a replacement seamlessly internally. The methods are a subset of the methods on ZooKeeper. We can add more, as needed, and we can consider rolling in the APIs from ZooReader and ZooReaderWriter into this. A lot of other changes were included to remove the direct use of ZooKeeper objects everywhere in the code, and to replace them with this. The other major changes included here are a cleanup of ClientInfo/ServerInfo in the way they bootstrap ClientContext/ServerContext, so they can more easily reuse ZooKeeper objects (specifically, this new ZooSession object) efficiently, and all the different entry points into ServerInfo, in particular, to support the various use cases (normal server, testing, server-side utility with client properties config file, initialize etc.) are streamlined in the implementation and all lazily loaded via Guava Supplier memoization, with care taken to ensure that there are no cyclic dependencies on the instanceId/instanceName lookups and other things, in particular during Initialize where that's the greatest risk. The lazy loading also ensures that we're not wasting resources on establishing any connections to HDFS/ZK for AccumuloClient objects that are created, but not used (in case the user creates a lot of them or something). This change also uses separate ZooSession objects for the instanceName/instanceId lookups in ZK, on demand, as needed, from the persistent ZooSession used for the ServerContext/ClientContext after they are established. The reason for that is to support future work for chroot'ing the majority of the code (#4809). The logic for handling those lookups was moved to ZooUtil and simplified, replacing multiple similar implementations scattered throughout the code. Some follow-on work that could be done after this is fully tested and merged:
|
c15e334
to
859ff86
Compare
* Replace existing ZooSession with a new one that is a facade for ZooKeeper. The methods from ZooKeeper we need are placed on it and internally it maintains a delegate ZooKeeper instance that handles automatically creating a new ZooKeeper session (client object) when the old session has died. This enables us to maintain fewer ZooKeeper objects by keeping only one at a time per ClientContext/ServerContext, and to reduce complexity substantially, where it was hard to reason about which ZooKeeper session we were using at any given moment. This no longer requires passing around ZooKeeper objects via ZooReader that called to the old ZooSession to construct a ZooKeeper session on demand. The new ZooSession is now a singleton field whose lifecycle is part of the context, and ZooReader/ZooReaderWriter are substantially simplified. * Lazily construct objects in ServerInfo/ClientInfoImpl to simplify the implementation for the various use cases, and to ensure that we don't create more objects than needed. * Improve debugging information to tie the ZooSession instances with their purpose (for example, for use with ServerContext, or for use with ClientContext, with the user's name) * Get rid of ZooCacheFactory in favor of a lazily constructed ZooCache instance in the ClientContext, to make it more clear when a ZooCache is being shared or reused, and to remove a static singleton * Move instanceId and instanceName lookup logic into ZooUtil * Make ZookeeperLockChecker use its own ZooSession, because it is still a static singleton and must continue to operate after the context that constructed it is closed (should be fixed when apache#2301 is done) * Perform some minor improvements to ZooCache to simplify its constructors now that it uses ZooSession, and to change the type of external watchers to Consumer, so they aren't as easily confused with actual ZooKeeper watchers * Improve a lot of ZooKeeper related test code Potential future work after this could include: * Roll ZooReader and ZooReaderWriter functions directly into ZooSession * Add support to ZooSession for more ZooKeeper APIs * Handle KeeperException thrown from delegate that signals the session is disconnected, instead of relying only on the verifyConnected() method in ZooSession to update the delegate * Handle InterruptedExceptions directly in ZooSession, so they don't propagate everywhere in the code in ways that are inconvenient to handle This fixes apache#5124, apache#2299, and apache#2298
859ff86
to
de9e4c5
Compare
I updated the description above, and squashed the changes to make rebasing subsequent work onto this branch a bit easier. |
All ITs passed again (only two pre-existing flakes: CompactionIT and SuspendedTabletsIT, failed once, but passed on subsequent run) |
core/src/main/java/org/apache/accumulo/core/fate/zookeeper/ZooCache.java
Show resolved
Hide resolved
core/src/main/java/org/apache/accumulo/core/fate/zookeeper/ZooUtil.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/accumulo/core/fate/zookeeper/ZooUtil.java
Outdated
Show resolved
Hide resolved
server/base/src/test/java/org/apache/accumulo/server/conf/SystemConfigurationTest.java
Show resolved
Hide resolved
replay(sconf, zk, fs); | ||
assertThrows(IOException.class, () -> Initialize.checkInit(zrw, fs, initConfig)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could change this to the following to remove the zrw
instance variable:
expect(zk.asReaderWriter()).andReturn(new ZooReaderWriter(zk));
replay(sconf, zk, fs);
assertThrows(IOException.class, () -> Initialize.checkInit(zk.asReaderWriter(), fs, initConfig));
I see now why you still need the ZR and ZRW constructors to be public. However, I think you could probably make them protected
so that they are only constructed in ZooSession
. For the tests, you could subclass them.
In other tests you are using a mock ZRW, not sure if this is possible here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No matter what we do, it's going to be a big change on its own. But I think those ZR/ZRW changes can be done as a follow-on, since 1) I'm not sure exactly which direction we should go (collapse their APIs into ZooSession vs. relocate them and make their constructors protected, etc., and 2) this change is already big enough.
I did notice that a lot of tests don't do things the same way... some create a custom ZooReader around a mock ZooSession, some create a mock ZooReader, and others use the MockServerContext and pass around ZooKeeper that way. I tried to make some of these tests consistent, but that takes you down a very big rabbit hole, and I came to the realization that the reason why all these are so different is because our internal code passes around different things. If we can simplify the internal code by reducing the number of objects passed around (collapsing on a ServerContext or a ZooSession, and not passing around ZR/ZRW), the test code also becomes simpler. So, I'd like to leave that as follow-on work and not go down that rabbit hole (again) in this PR.
@@ -678,7 +678,7 @@ public long getFlushID() throws NoNodeException { | |||
try { | |||
String zTablePath = tabletServer.getContext().getZooKeeperRoot() + Constants.ZTABLES + "/" | |||
+ extent.tableId() + Constants.ZTABLE_FLUSH_ID; | |||
String id = new String(context.getZooReaderWriter().getData(zTablePath), UTF_8); | |||
String id = new String(context.getZooSession().asReaderWriter().getData(zTablePath), UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need ReaderWriter to getData()
? There is a getData()
on ZooSession
. There may actually be a lot of places where methods on ZooReader
are being called where there is a similar method on ZooSession
. I'm wondering if it might make sense to get rid of ZooReader
and use only ZooSession
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a difference between these methods, though. ZR/ZRW's purpose was to simplify ZK interactions by providing similar methods that 1) reduce the need to provide useless required parameters that were always passed with a specific value (often null
or false
for a watcher or -1
for a version), and 2) to provide an automatic retry mechanism on transient failures.
So, the code that is using ZR/ZRW should be assumed to be written with the retry behavior in mind (either because the developer explicitly wanted retry or doesn't care), and the code calling the similar methods on ZooSession directly (previously it would have called ZooKeeper directly) should be assumed to be written with no desire to retry (often because it has its own retry behavior) or because it needs to pass specific values. Changing from one to the other should be done very carefully, and should be done in a way that reviews the calling code that is being changed to examine the implications. I didn't change any of those, because that's too much work for this PR. However, that was one of my suggestions for follow-on work, to perhaps collapse ZR/ZRW with ZooSession to provide retry-on-transient failures to all methods automatically on ZooSession. And also possibly to provide the overloaded method versions from ZR/ZRW to ZooSession that simplify the need to pass parameters with common values. Doing those could possibly lead to the removal of ZR/ZRW, but might diverge ZooSession from being a ZooKeeper API look-alike to a more general utility around ZooKeeper for all our needs. I'm not sure how far we want to take that, but it can be done as a follow-on either way, with care taken to examine the needs of the calling code in each case.
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java
Show resolved
Hide resolved
test/src/main/java/org/apache/accumulo/test/conf/store/PropStoreZooKeeperIT.java
Outdated
Show resolved
Hide resolved
This was merged to main in e745a5d ; compare the differences in main branch with |
I checked CI and I am some failures in a few ITs in main after this merge.
|
Clean up ZooKeeper tooling
ZooKeeper. The methods from ZooKeeper we need are placed on it and
internally it maintains a delegate ZooKeeper instance that handles
automatically creating a new ZooKeeper session (client object)
when the old session has died.
This enables us to maintain fewer ZooKeeper objects by keeping only
one at a time per ClientContext/ServerContext, and to reduce
complexity substantially, where it was hard to reason about
which ZooKeeper session we were using at any given moment. This
no longer requires passing around ZooKeeper objects via ZooReader
that called to the old ZooSession to construct a ZooKeeper session
on demand. The new ZooSession is now a singleton field whose lifecycle
is part of the context, and ZooReader/ZooReaderWriter are
substantially simplified.
implementation for the various use cases, and to ensure that we don't
create more objects than needed.
their purpose (for example, for use with ServerContext, or for use
with ClientContext, with the user's name)
instance in the ClientContext, to make it more clear when a ZooCache
is being shared or reused, and to remove a static singleton
a static singleton and must continue to operate after the context that
constructed it is closed (should be fixed when Explore refactoring TabletLocator code #2301 is done)
constructors now that it uses ZooSession, and to change the type of
external watchers to Consumer, so they aren't as easily confused with
actual ZooKeeper watchers
Potential future work after this could include:
is disconnected, instead of relying only on the verifyConnected()
method in ZooSession to update the delegate
propagate everywhere in the code in ways that are inconvenient to
handle
This fixes #5124, #2299, and #2298