[Tokenizers] String type support in Tokenizers #781
@@ -90,7 +90,7 @@ void set_ragged_output(Node* node, size_t output_index, const PartialShape& shap
}
Should we also remove https://github.com/openvinotoolkit/openvino_contrib/blob/352a4e3ff9bac77a68062371faa5c1b7c1c2bbac/modules/custom_operations/user_ie_extensions/include/openvino_extensions/strings.hpp and similar things in the Python code?
Eventually, yes. I would like to avoid it for now because we still support this legacy format as an input, and OVMS adopted this format for the MUSE model with the SentencePiece tokenizer. Once all dependent components use string tensors, we can remove it. As an alternative, or as the next step, we can move these functions to the OVMS side.
@dtrawins, do you think we can do that in this release? Do we know of any native users of the MUSE model besides OVMS? Are there other extensions in OVMS based on that u8 packed tensor besides the ops in openvino_contrib?
old users can stick to old commits, while in master we can drop all these things.
@dtrawins, is it still a requirement from OVMS to support the u8 packed string format instead of native strings? As this is a contrib repo, we allow breaking changes, and we would like to make it as clean as possible.
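For readers unfamiliar with the u8 packed string format discussed in this thread, here is a minimal sketch of what such a packing might look like. The layout is an assumption for illustration and is not spelled out in the conversation: a little-endian int32 batch size, then batch size + 1 int32 offsets, then the concatenated UTF-8 bytes. The helper names `pack_strings_u8` and `unpack_strings_u8` are hypothetical and are not the functions from strings.hpp or the Python package.

```python
import struct


def pack_strings_u8(strings):
    """Hypothetical helper: pack strings into one flat byte buffer.

    Assumed layout (not confirmed by the thread):
    int32 batch size, batch size + 1 int32 offsets, then UTF-8 bytes.
    """
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    # Header holds the batch size followed by every offset.
    header = struct.pack(f"<{len(offsets) + 1}i", len(encoded), *offsets)
    return header + b"".join(encoded)


def unpack_strings_u8(buffer):
    """Hypothetical inverse of pack_strings_u8 under the same assumed layout."""
    (batch_size,) = struct.unpack_from("<i", buffer, 0)
    offsets = struct.unpack_from(f"<{batch_size + 1}i", buffer, 4)
    data_start = 4 * (batch_size + 2)  # skip batch size and offsets
    return [
        buffer[data_start + begin:data_start + end].decode("utf-8")
        for begin, end in zip(offsets, offsets[1:])
    ]


packed = pack_strings_u8(["Test string", "héllo"])
restored = unpack_strings_u8(packed)
```

With native string tensors, none of this bookkeeping is needed on the caller's side, which is the cleanup this thread is weighing against OVMS compatibility.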
Tests haven't been migrated yet.
This reverts commit e01ddbf.
@@ -40,7 +40,7 @@ compiled_tokenzier = compile_model(ov_tokenizer)
 text_input = "Test string"

 hf_output = hf_tokenizer([text_input], return_tensors="np")
-ov_output = compiled_tokenzier(pack_strings([text_input]))
+ov_output = compiled_tokenzier([[text_input]])  # TODO: Remove the second pair of square brackets when Python API is ready
@jiwaszki, I left a TODO for now in the hope that you manage to provide a way to avoid this extra pair of brackets.
To check with: openvinotoolkit/openvino#21734
Works for me!
Pass rate has been completely fixed with openvinotoolkit/openvino#21761.
@apaniukov, please merge it after openvinotoolkit/openvino#21761 is merged (expected tomorrow morning). Will require changes in llm_bench and genai. @Wovchena, @eaidova, FYI.
# Left these two methods for convenient transition from legacy u8 representation to native string tensors
# TODO: Remove the methods when transition is over
def pack_strings(strings):
    return strings


def unpack_strings(strings):
    return list(strings)
Del all (un)pack_strings functions from tests. I think we should also delete them from the openvino_tokenizers.__init__.py file.
Do it in a separate commit please. I'm on vacation. As for the testing functionality, I found it really useful to keep these two functions for debugging purposes, as gates for all string tensors.
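A short sketch of how the two transitional pass-through functions from the diff behave when called the way the old packing-based tests called them. The function bodies mirror the diff; the `batch` and `round_trip` names are only illustrative.

```python
def pack_strings(strings):
    # Pass-through: native string tensors need no packing step.
    return strings


def unpack_strings(strings):
    # Pass-through that materializes any iterable of strings as a list.
    return list(strings)


batch = ["Test string", "another one"]

# Old call sites wrapped inputs in pack_strings() and outputs in
# unpack_strings(); with the shims in place they keep working unchanged.
round_trip = unpack_strings(pack_strings(batch))
```

Because every string tensor in the tests passes through these two calls, they are a single place to add a print or a breakpoint, which appears to be the "gates" use described above.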
This PR is unblocked
build_jenkins
Tensors with ov::element::string are supported as the input of the tokenizer and the output of the detokenizer, respectively. This is implemented by modifying the StringTensorPack and StringTensorUnpack ops.
TODO: