[Tokenizer] Add CLI Tokenizer Converter #792

Merged
merged 4 commits into from
Dec 20, 2023

apaniukov
Contributor

@apaniukov apaniukov commented Dec 18, 2023

usage: convert_tokenizer [-h] [-o OUTPUT] [--with-detokenizer] [--skip_special_tokens] [--use-fast-false] [--trust-remote-code]
                         [--tokenizer-output-type {i32,i64}] [--detokenizer-input-type {i32,i64}] [--streaming-detokenizer STREAMING_DETOKENIZER]
                         name

Converts tokenizers from Huggingface Hub to OpenVINO Tokenizer model.

positional arguments:
  name                  The model id of a tokenizer hosted inside a model repo on huggingface.co or a path to a saved Huggingface tokenizer
                        directory

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output directory
  --with-detokenizer    Add a detokenizer model to the output
  --skip_special_tokens
                        Produce detokenizer that will skip special tokens during decoding, similar to huggingface_tokenizer.decode(token_ids,
                        skip_special_tokens=True).
  --use-fast-false      Pass `use_fast=False` to `AutoTokenizer.from_pretrained`. It will initialize a legacy HuggingFace tokenizer and then
                        convert it to OpenVINO. Might result in a slightly different tokenizer. See models with the _slow suffix at
                        https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/custom_operations/user_ie_extensions/tokenizer/python#coverage-by-model-type
                        to check the potential difference between the original and OpenVINO tokenizers
  --trust-remote-code   Pass `trust_remote_code=True` to `AutoTokenizer.from_pretrained`. It will execute code present on the Hub on your local
                        machine
  --tokenizer-output-type {i32,i64}
                        Type of the output tensors for tokenizer
  --detokenizer-input-type {i32,i64}
                        Type of the input tensor for detokenizer
  --streaming-detokenizer STREAMING_DETOKENIZER
                        [Experimental] Modify SentencePiece-based detokenizer to keep leading spaces. Can be used to stream a model output
                        without TextStreamer buffer
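
To make the flag surface above easier to scan, here is a minimal `argparse` sketch that reproduces the options listed in the help text. It is a reconstruction from the help output only, not the code from this PR: the function name `build_parser`, the `i64` defaults, and the model id in the example call are assumptions chosen for illustration, and the parser does not perform any conversion.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the help text above (not the PR's actual code)."""
    parser = argparse.ArgumentParser(
        prog="convert_tokenizer",
        description="Converts tokenizers from Huggingface Hub to OpenVINO Tokenizer model.",
    )
    # Positional argument: Hub model id or local tokenizer directory.
    parser.add_argument("name", help="Model id on huggingface.co or path to a saved tokenizer directory")
    parser.add_argument("-o", "--output", default=None, help="Output directory")
    # Boolean switches shown in the usage line without a metavar.
    parser.add_argument("--with-detokenizer", action="store_true")
    parser.add_argument("--skip_special_tokens", action="store_true")
    parser.add_argument("--use-fast-false", action="store_true")
    parser.add_argument("--trust-remote-code", action="store_true")
    # Tensor element types; the "i64" defaults are an assumption, the help text shows none.
    parser.add_argument("--tokenizer-output-type", choices=["i32", "i64"], default="i64")
    parser.add_argument("--detokenizer-input-type", choices=["i32", "i64"], default="i64")
    # The help text shows this flag taking a value (STREAMING_DETOKENIZER).
    parser.add_argument("--streaming-detokenizer", default=None)
    return parser


# Example invocation with an explicit argument list (the model id is illustrative).
example = build_parser().parse_args(
    ["bert-base-uncased", "-o", "ov_tokenizer", "--with-detokenizer", "--tokenizer-output-type", "i32"]
)
print(example.name, example.output, example.with_detokenizer, example.tokenizer_output_type)
```

Note that `argparse` maps dashes in option names to underscores, so `--with-detokenizer` is read as `args.with_detokenizer`, while `--skip_special_tokens` keeps its name unchanged.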

@apaniukov apaniukov requested a review from a team as a code owner December 18, 2023 18:30
@github-actions github-actions bot added the category: custom operations OpenVINO Runtime Extension with custom operations label Dec 18, 2023
@apaniukov apaniukov requested a review from Wovchena December 18, 2023 18:43
Contributor

@ilya-lavrenov ilya-lavrenov left a comment

Maybe we should integrate this package into HF and use optimum-cli instead of a custom tool?

@apaniukov apaniukov marked this pull request as draft December 20, 2023 15:02
@apaniukov apaniukov marked this pull request as ready for review December 20, 2023 16:45
Contributor

@ilya-lavrenov ilya-lavrenov left a comment

Let's also port this to the release branch.

@ilya-lavrenov ilya-lavrenov merged commit 9c6cce9 into openvinotoolkit:master Dec 20, 2023
6 checks passed
@ilya-lavrenov ilya-lavrenov added the port to 2023.3 Need port from master to 2023.3 LTS label Dec 21, 2023
@ilya-lavrenov ilya-lavrenov added this to the 2024.0 milestone Dec 21, 2023
ilya-lavrenov pushed a commit to ilya-lavrenov/openvino_contrib that referenced this pull request Dec 21, 2023
* Add CLI Tokenizer Converter

* Fix space

* Add more flags to CLI tool
@ilya-lavrenov ilya-lavrenov removed the port to 2023.3 Need port from master to 2023.3 LTS label Dec 21, 2023
ilya-lavrenov added a commit that referenced this pull request Dec 22, 2023
* [TOKENIZERS] Disabled  C4703 (#796)

* disabled error C4703

* Update modules/custom_operations/user_ie_extensions/tokenizer/CMakeLists.txt

---------

Co-authored-by: Ilya Lavrenov

* [Tokenizer] Add CLI Tokenizer Converter (#792)

* Add CLI Tokenizer Converter

* Fix space

* Add more flags to CLI tool

* [TOKENIZERS] Update license field and version (#793)

* update license field and version

* flex dependency in master

* removed version of openvino in master

* [TOKENIZERS] Extended extension searching paths (#797)

* extended site-packages pathes

* Apply suggestions from code review

Co-authored-by: Artur Paniukov

---------

Co-authored-by: Artur Paniukov

* [TOKENIZERS] added build of tokenizers in wheel (#798)

* added build tokenizers for wheel

* fixed azure pipeline

* [Tokenizers] Update README.md (#799)

* Fix --streaming-detokenizer flag

* Rewrite README.md

* Rewrite README.md

* Rewrite README.md

---------

Co-authored-by: Mikhail Ryzhov
Co-authored-by: Artur Paniukov
4 participants