Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tokenizers] String type support in Tokenizers #781

Merged
merged 14 commits into from
Dec 20, 2023

Conversation

slyalin
Copy link
Contributor

@slyalin slyalin commented Dec 12, 2023

Tensors with ov::element::string are supported as input and output of tokenizer and detokenizer correspondingly. It is implemented via modifying StringTensorPack and StringTensorUnpack ops.

TODO:

  • Python support, requires [PyOV] String type support openvino#21532
  • Sentence piece tokenizer op should accept string tensors in addition to u8 packed tensors
  • Migrate tests from pack/unpack strings to native string tensor support

@slyalin slyalin requested a review from a team as a code owner December 12, 2023 07:27
@github-actions github-actions bot added the category: custom operations OpenVINO Runtime Extension with custom operations label Dec 12, 2023
@slyalin slyalin marked this pull request as draft December 12, 2023 07:29
@@ -90,7 +90,7 @@ void set_ragged_output(Node* node, size_t output_index, const PartialShape& shap
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally -- yes. I would like to avoid it now because we still support this legacy format as an input and OVMS adopted this format for MUSE model with sentence piece tokenizer. When all dependent components will use string tensors we can remove it. As an alternative, or the next step, we can move these functions to OVMS side.

@dtrawins, do you think we can do that in this release? Do we know about native users of MUSE model besides OVMS? Do we have other extensions based on that u8 packed tensor in OVMS besides ops in openvino_contrib?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old users can stick to old commits, while in master we can drop all these things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dtrawins, is this a requirement from OVMS still support u8 packed string format instead of supporting native strings? As this is contrib repo we allow breaking changes and we would like to make it as clean as possible.

@ilya-lavrenov
Copy link
Contributor

ilya-lavrenov commented Dec 12, 2023

@apaniukov apaniukov changed the title String type support in Tokenizers [Tokenizers] String type support in Tokenizers Dec 12, 2023
@slyalin
Copy link
Contributor Author

slyalin commented Dec 13, 2023

@@ -40,7 +40,7 @@ compiled_tokenzier = compile_model(ov_tokenizer)
text_input = "Test string"

hf_output = hf_tokenizer([text_input], return_tensors="np")
ov_output = compiled_tokenzier(pack_strings([text_input]))
ov_output = compiled_tokenzier([[text_input]]) # TODO: Remove the second pair of square brackets when Python API is ready
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiwaszki, left TODO for now with the hope you mange to provide a way to avoid this extra pair of brackets.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me!

@slyalin
Copy link
Contributor Author

slyalin commented Dec 19, 2023

Pass rate has been completely fixed with openvinotoolkit/openvino#21761.

@slyalin slyalin marked this pull request as ready for review December 19, 2023 16:21
@slyalin
Copy link
Contributor Author

slyalin commented Dec 19, 2023

@apaniukov, please merge it after openvinotoolkit/openvino#21761 is merged (expected tomorrow morning). Will require changes in llm_bench and genai. @Wovchena, @eaidova, FYI.

Comment on lines +12 to +20
# Left these two methods for convenient transition from legay u8 representation to native string tensors
# TODO: Remove the methods when transition is over
def pack_strings(strings):
return strings

def unpack_strings(strings):
return list(strings)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Del all (un)pack_strings functions from tests. I think we also should delete them from openvino_tokenizers.__init__.py file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do it in a separate commit please. I'm on vacation. This testing functionality -- I really found it useful to keep these two functions for debugging purposes as gates for all string tensors.

@ilya-lavrenov
Copy link
Contributor

This PR is unblocked

@ilya-lavrenov
Copy link
Contributor

build_jenkins

@ilya-lavrenov ilya-lavrenov merged commit 80f9bc1 into openvinotoolkit:master Dec 20, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: custom operations OpenVINO Runtime Extension with custom operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants