Checks for corruption earlier and always report errors #5227
Was looking into an issue where a mutation was corrupted on the network. The corrupted mutation was written to the write ahead log, and then a failure occurred in the tablet server which left the tablet in an inconsistent state. Modified the tablet server code to deserialize mutations as early as possible (it used to deserialize after writing to the walog, now it does it before).

Wrote an IT to recreate this problem and found another bug. Writing data to Accumulo does the following.

1. Make a `startUpdate` RPC to create an update session.
2. Make one or more `applyUpdates` RPCs to add data to the session. These RPCs are thrift oneway calls, so nothing is reported back.
3. Call `closeUpdate` on the session to see what happened with all of the `applyUpdates` RPCs done in step 2.

If an unexpected exception happened in step 2 above, it was not reported back to the client. These changes fix that and test it as part of testing the corrupt mutation. After these changes, if there was an error in step 2, then step 3 throws an exception.
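The pattern described above can be sketched in plain Java. This is only an illustration, not the tablet server code: the class `UpdateSessionSketch`, the `validate` check (a stand-in for deserializing a mutation), the in-memory list used as the walog, and the choice of `IllegalStateException` are all hypothetical. It shows two things: every mutation in a batch is checked before anything is written to the log, and a failure inside the oneway-style `applyUpdates` call is remembered so that `closeUpdate` can report it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of an update session; not Accumulo's real classes. */
public class UpdateSessionSketch {

    // Stand-in for the write ahead log: serialized mutations that were accepted.
    private final List<byte[]> walog = new ArrayList<>();

    // First failure seen by a oneway-style call, kept until the session is closed.
    private RuntimeException firstError;

    /**
     * Hypothetical corruption check standing in for deserialization.
     * Assumption for this sketch only: a valid mutation starts with the byte 1.
     */
    static void validate(byte[] serialized) {
        if (serialized == null || serialized.length == 0 || serialized[0] != 1) {
            throw new IllegalArgumentException("corrupt mutation");
        }
    }

    /**
     * Models a oneway call: it returns nothing and never throws to the caller,
     * so any failure has to be recorded on the session instead.
     */
    public void applyUpdates(List<byte[]> mutations) {
        try {
            // Check the whole batch first, so nothing corrupt reaches the log.
            for (byte[] m : mutations) {
                validate(m);
            }
            walog.addAll(mutations);
        } catch (RuntimeException e) {
            if (firstError == null) {
                firstError = e;
            }
        }
    }

    /** Reports what happened in earlier applyUpdates calls; throws if any of them failed. */
    public int closeUpdate() {
        if (firstError != null) {
            throw new IllegalStateException("update session failed", firstError);
        }
        return walog.size();
    }

    /** Number of mutations written to the stand-in log. */
    public int walogSize() {
        return walog.size();
    }
}
```

In this sketch, a session that receives a batch containing a corrupt mutation writes nothing from that batch to the log, and the later `closeUpdate` call throws instead of returning normally.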
@keith-turner - the milestone and target branch don't match on this PR
Good catch, maybe that is why the build was failing. I switched the branch.
Being on the wrong branch was causing the fast build to fail. I tried rerunning the build and it kept using the old branch even though I had switched. I pushed a new commit, which caused the build to switch to the 2.1 branch and get further.