Fix job-master leak memory when submitting distributed jobs #18639

Merged

Conversation

liiuzq-xiaobai
Contributor

@liiuzq-xiaobai liiuzq-xiaobai commented Jun 27, 2024

Fix a job master memory leak that occurs when a large number of distributed jobs (DIST_LOAD, DIST_CP, and Persist jobs) are submitted.

What changes are proposed in this pull request?

Start a periodic thread in CmdJobTracker that clears expired job information which can no longer be traced by the client. The default retention time is 1 day, the same as the LoadV2 configuration.
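
The cleanup described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the class `JobInfoRetentionSketch`, its map layout, and its method names are hypothetical. Only the idea (a periodic task that drops entries older than a retention time) comes from the description.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a job tracker; NOT the actual CmdJobTracker. */
class JobInfoRetentionSketch {
  // Assumed layout: job id -> time the job finished, in epoch milliseconds.
  private final Map<Long, Long> mFinishedJobs = new ConcurrentHashMap<>();
  private final long mRetentionMs;

  JobInfoRetentionSketch(long retentionMs) {
    mRetentionMs = retentionMs;
  }

  /** Records that a job finished at the given time. */
  void recordFinished(long jobId, long finishedAtMs) {
    mFinishedJobs.put(jobId, finishedAtMs);
  }

  /** Removes entries older than the retention time and returns how many were removed. */
  int purgeExpired(long nowMs) {
    int removed = 0;
    Iterator<Map.Entry<Long, Long>> it = mFinishedJobs.entrySet().iterator();
    while (it.hasNext()) {
      if (nowMs - it.next().getValue() > mRetentionMs) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }

  int size() {
    return mFinishedJobs.size();
  }

  /** Starts a daemon thread that purges expired entries at a fixed period. */
  ScheduledExecutorService startPeriodicPurge(long periodMs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "job-info-purge-sketch");
      t.setDaemon(true);
      return t;
    });
    scheduler.scheduleAtFixedRate(
        () -> purgeExpired(System.currentTimeMillis()),
        periodMs, periodMs, TimeUnit.MILLISECONDS);
    return scheduler;
  }
}
```

Passing the current time into `purgeExpired` keeps the expiry check easy to test without waiting for the scheduler; the caller is expected to shut the returned scheduler down when the tracker stops.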

Why are the changes needed?

When many jobs are submitted, the job master eventually runs out of memory (OOM). The CmdJobTracker retains residual job information that is never cleaned up, which results in a memory leak.

Does this PR introduce any user facing changes?

Yes, one user-facing change:
1. Adds the configuration property alluxio.job.master.job.trace.retention.time=xx, with a default value of 1d.
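
To illustrate how a value such as 1d could translate into a retention window, here is a small sketch. It is an assumption-laden example: `RetentionTimeSketch` and its parser are hypothetical, accept only a simplified set of unit suffixes, and do not reproduce Alluxio's own configuration parsing. Only the property name and its default come from this PR.

```java
import java.util.Locale;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; NOT Alluxio's configuration API. */
class RetentionTimeSketch {
  // Property name and default value as stated in the PR description.
  static final String KEY = "alluxio.job.master.job.trace.retention.time";
  static final String DEFAULT_VALUE = "1d";

  // Simplified grammar (an assumption): a number followed by ms, s, min, h, or d.
  private static final Pattern DURATION = Pattern.compile("(\\d+)\\s*(ms|s|min|h|d)");

  /** Converts a duration string such as "1d" or "12h" to milliseconds. */
  static long parseToMillis(String value) {
    Matcher m = DURATION.matcher(value.trim().toLowerCase(Locale.ROOT));
    if (!m.matches()) {
      throw new IllegalArgumentException("Unsupported duration: " + value);
    }
    long amount = Long.parseLong(m.group(1));
    switch (m.group(2)) {
      case "ms": return amount;
      case "s": return TimeUnit.SECONDS.toMillis(amount);
      case "min": return TimeUnit.MINUTES.toMillis(amount);
      case "h": return TimeUnit.HOURS.toMillis(amount);
      default: return TimeUnit.DAYS.toMillis(amount); // remaining unit is "d"
    }
  }

  /** Reads the retention time from properties, falling back to the 1d default. */
  static long retentionMillis(Properties props) {
    return parseToMillis(props.getProperty(KEY, DEFAULT_VALUE));
  }
}
```

With no override set, `retentionMillis(new Properties())` yields the number of milliseconds in one day.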

Related issue: #18635

@alluxio-bot
Contributor

Thank you for your pull request.
In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement (CLA).
It's all electronic and will take just a few minutes. Please download CLA form here, sign, and e-mail back to [email protected]

@jja725 jja725 self-requested a review June 27, 2024 04:54
@liiuzq-xiaobai liiuzq-xiaobai changed the title from "fix:fix job-master leak memory when submitting distributed jobs" to "Fix job-master leak memory when submitting distributed jobs" Jun 27, 2024
@alluxio-bot
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: FAIL
    • It looks like your commits can't be linked to a valid Github account.
      Your commits are made with the email [email protected], which does not allow your contribution to be tracked by Github.
      See this link for possible reasons this might be happening.
      To change the author email address that your most recent commit was made under, you can run:
      git -c user.name="Name" -c user.email="Email" commit --amend --reset-author
      See this answer for more details about how to change commit email addresses.
      Once the author email address has been updated, update the pull request by running:
      git push --force https://github.com/liiuzq-xiaobai/alluxio.git fix/job-master_oom_fix

Some checks failed. Please fix the reported issues and reply
alluxio-bot, check this please
to re-run checks.

@liiuzq-xiaobai liiuzq-xiaobai force-pushed the fix/job-master_oom_fix branch from c176c77 to f444eec June 27, 2024 05:30
@alluxio-bot
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: PASS

All checks passed!


@YichuanSun YichuanSun added the type-bug This issue is about a bug label Jul 2, 2024
@liiuzq-xiaobai liiuzq-xiaobai force-pushed the fix/job-master_oom_fix branch from f444eec to d159bfb July 3, 2024 15:15
@liiuzq-xiaobai liiuzq-xiaobai force-pushed the fix/job-master_oom_fix branch from d159bfb to 9729ae2 July 4, 2024 07:03
JobDoesNotExistException, ResourceExhaustedException {
checkActiveSetReplicaJobs(jobConfig);
if (removeFinished()) {
if (mCoordinators.size() < mCapacity) {
Contributor

Why is there a capacity limit here? The running-job limit is already enforced here:

Contributor Author

It is kept for compatibility with some unit tests.

@ccy00808
Contributor

ccy00808 commented Jul 5, 2024

@jja725 Could you help us review this PR?

Contributor

@jja725 jja725 left a comment

LGTM, thanks for the improvement!

@liiuzq-xiaobai liiuzq-xiaobai force-pushed the fix/job-master_oom_fix branch 2 times, most recently from 674269c to cac30e1 July 8, 2024 03:09
fix:Only the DIST_LOAD cmdInfo will be deleted.
@ccy00808 ccy00808 force-pushed the fix/job-master_oom_fix branch from cac30e1 to 149db94 July 8, 2024 03:26
@ccy00808
Contributor

ccy00808 commented Jul 8, 2024

alluxio-bot, merge this please.

@alluxio-bot alluxio-bot merged commit a4ec456 into Alluxio:master-2.x Jul 8, 2024
19 checks passed
@ccy00808
Contributor

ccy00808 commented Jul 8, 2024

alluxio-bot, cherry-pick this to branch-2.10 please

alluxio-bot pushed a commit that referenced this pull request Jul 8, 2024
Fix a job master memory leak that occurs when a large number of distributed jobs (DIST_LOAD, DIST_CP, and Persist jobs) are submitted.

### What changes are proposed in this pull request?

Start a periodic thread in CmdJobTracker that clears expired job information which can no longer be traced by the client. The default retention time is 1 day, the same as the LoadV2 configuration.

### Why are the changes needed?

When many jobs are submitted, the job master eventually runs out of memory (OOM). The CmdJobTracker retains residual job information that is never cleaned up, which results in a memory leak.

### Does this PR introduce any user facing changes?

Yes, one user-facing change:
1. Adds the configuration property alluxio.job.master.job.trace.retention.time=xx, with a default value of 1d.

Related issue: #18635
pr-link: #18639
change-id: cid-d4e5853a1818a22c8a0411a27bfe1141c6f24ebd
@alluxio-bot
Contributor

Auto cherry-pick to branch branch-2.10 successfully opened PR: #18651

alluxio-bot added a commit that referenced this pull request Jul 8, 2024
Cherry-pick of existing commit.
orig-pr: #18639
orig-commit: a4ec456
orig-commit-author: Echo🌟 <[email protected]>

pr-link: #18651
change-id: cid-d4e5853a1818a22c8a0411a27bfe1141c6f24ebd