Checks for corruption earlier and always report errors #5227

Merged
keith-turner merged 9 commits into apache:2.1 from corrupt-mutation
Jan 8, 2025

keith-turner
Contributor

Was looking into an issue where a mutation was corrupted on the network. The corrupt mutation was written to the write ahead log, and then a failure occurred in the tablet server which left the tablet in an inconsistent state. Modified the tablet server code to deserialize mutations as early as possible (it used to deserialize after writing to the walog, now it does it before).
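The ordering change described above can be illustrated with a minimal sketch. This is not the tablet server code: the class `EarlyValidationSketch`, the toy wire format (a 4-byte length followed by the payload), and the in-memory list standing in for the walog are all assumptions made for illustration. The point is only that every mutation in a batch is decoded before anything is logged, so a corrupt one is rejected while the log is still untouched.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Accumulo code: decode every mutation before logging any of them.
class EarlyValidationSketch {
  // Stand-in for the write ahead log; holds the raw bytes of accepted mutations.
  final List<byte[]> wal = new ArrayList<>();

  // Toy wire format (an assumption): 4-byte length, then exactly that many payload bytes.
  static byte[] encode(byte[] payload) {
    return ByteBuffer.allocate(4 + payload.length).putInt(payload.length).put(payload).array();
  }

  // Throws IllegalArgumentException if the bytes do not match the toy format.
  static byte[] decode(byte[] wire) {
    if (wire.length < 4) {
      throw new IllegalArgumentException("corrupt mutation: truncated header");
    }
    ByteBuffer buf = ByteBuffer.wrap(wire);
    int len = buf.getInt();
    if (len != buf.remaining()) {
      throw new IllegalArgumentException("corrupt mutation: declared length " + len);
    }
    byte[] payload = new byte[len];
    buf.get(payload);
    return payload;
  }

  // Decode first, then log. A corrupt batch fails before the log is modified.
  void apply(List<byte[]> batch) {
    for (byte[] wire : batch) {
      decode(wire); // throws on corruption, nothing has been logged yet
    }
    wal.addAll(batch);
  }
}
```

With this ordering, a batch containing one corrupt entry leaves `wal` unchanged, which is the property the reordering in the tablet server is after.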

Wrote an IT to recreate this problem and found another bug. Writing data to Accumulo involves the following steps.

  1. Make a startUpdate RPC to create an update session
  2. Make one or more applyUpdates RPCs to add data to the session. These RPCs are thrift oneway calls, so nothing is reported back.
  3. Call closeUpdate on the session to see what happened with all of the applyUpdates RPCs done in step 2.

If an unexpected exception happened in step 2 above, it was not reported back to the client. These changes fix that and test it as part of testing the corrupt mutation. After these changes, if there was an error in step 2, step 3 throws an exception.
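The three steps and the fixed error reporting can be modeled roughly as follows. This is a simplified, in-memory sketch and not Accumulo's thrift API: the class `UpdateSessionSketch`, its method signatures, and the use of strings as mutations are assumptions. The method names only mirror the RPC names from the list above. Because `applyUpdates` is oneway, it cannot return or throw anything to the caller, so the sketch records the failure on the session and surfaces it in `closeUpdate`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the three-step write flow; not Accumulo's actual API.
class UpdateSessionSketch {
  static class Session {
    final List<String> applied = new ArrayList<>();
    RuntimeException failure; // first unexpected error seen during applyUpdates
  }

  private final Map<Long, Session> sessions = new HashMap<>();
  private long nextId = 1;

  // Step 1: create an update session and hand back its id.
  long startUpdate() {
    long id = nextId++;
    sessions.put(id, new Session());
    return id;
  }

  // Step 2: oneway call. Nothing is reported back, so errors are remembered on the session.
  void applyUpdates(long id, List<String> mutations) {
    Session s = sessions.get(id);
    if (s == null || s.failure != null) {
      return; // unknown or already failed session: ignore further data
    }
    try {
      for (String m : mutations) {
        if (m.isEmpty()) { // an empty string stands in for a corrupt mutation
          throw new IllegalStateException("corrupt mutation");
        }
      }
      s.applied.addAll(mutations);
    } catch (RuntimeException e) {
      s.failure = e;
    }
  }

  // Step 3: report what happened. Throws if any applyUpdates call failed.
  int closeUpdate(long id) {
    Session s = sessions.remove(id);
    if (s == null) {
      throw new IllegalArgumentException("no such session " + id);
    }
    if (s.failure != null) {
      throw new IllegalStateException("update session failed", s.failure);
    }
    return s.applied.size();
  }
}
```

A client that calls `startUpdate()`, sends a batch containing a bad mutation through `applyUpdates`, and then calls `closeUpdate` gets an exception from the last call instead of a silent success.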

@keith-turner keith-turner added this to the 2.1.4 milestone Jan 7, 2025
@dlmarion
Contributor

dlmarion commented Jan 7, 2025

@keith-turner - the milestone and target branch don't match on this PR

@keith-turner keith-turner changed the base branch from main to 2.1 January 7, 2025 15:52
@keith-turner
Contributor Author

@keith-turner - the milestone and target branch don't match on this PR

Good catch, maybe that is why the build was failing. I switched the branch.

@keith-turner
Contributor Author

Being on the wrong branch was causing the fast build to fail. I tried rerunning the build and it kept using the old branch even though I switched. I pushed a new commit and that caused the build to switch to the 2.1 branch and get further.

@keith-turner keith-turner merged commit 36c9740 into apache:2.1 Jan 8, 2025
8 checks passed
@keith-turner keith-turner deleted the corrupt-mutation branch January 8, 2025 16:07
Successfully merging this pull request may close these issues.

RuntimeException in Tablet.commit does not decrement writes in progress
2 participants