Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] Use .clone() for sampling params and deepcopy XGrammarLogitsProcessor #11380

Conversation

tjohnson31415
Copy link
Contributor

@tjohnson31415 tjohnson31415 commented Dec 20, 2024

XGrammar is now the default decoding backend, but it breaks when doing parallel decoding with n>1 in Server mode. There is internal state in self.prefilled and per-sequence state in the matchers, but parallel decoding uses the same XGrammarLogitsProcessor instance for every parallel sequence.

The fix here for prefilled is just to remove the flag and use the length of input_ids as the check to avoid an IndexError: tuple index out of range. The fix for the matchers state is to deepcopy the processor for each sequence instead of sharing the reference to prevent an AssertionError at assert self.matchers[i].accept_token(sampled_token).

NOTE: I noticed the batch_size field on the processor and the creation of mulitple matchers based on that, but those don't seem to be used (i.e. batch_size==1 even if n>1). I'm not sure if the "batch" was intended to handle parallel decoding instead of doing a deepcopy like I do in this fix, but I also didn't see a good way to index into the batch based on the sequence in the sequence group.

DRAFT: Looking to add a test that would have caught this and I'm looking to understand the differences between LP processing for server mode vs oflfine mode.

FIX #11312

Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@tjohnson31415
Copy link
Contributor Author

Closing since #11637 was merged with a fix.

@tjohnson31415 tjohnson31415 deleted the fix-xgrammar-parallel-decoding branch January 2, 2025 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Chat with n>1 breaks xgrammar
1 participant