Add Tokenizers #687
Closed
apaniukov wants to merge 142 commits into openvinotoolkit:master from apaniukov:tokenizer-fix-decode
Commits
70f867a slyalin: Added string tensor implementation with explicit pointer unpack
1fac3de slyalin: Merged from master
821dee5 slyalin: Started to migrate to extension-only support of string operations wit…
b9b0693 slyalin: Started to merge string/tokenizer related stuff from a dedicated OV b…
c785ec1 slyalin: Rename CaseFoldUTF8 to name from opset proposal: CaseFold, added Norm…
1d129ac slyalin: Added a stub for RegexNormalization operation, WA for CPU bug with em…
71bc5bf slyalin: Implemented Reshape for decomposed string tensors
6c5eec0 slyalin: Added RaggedTensorPack, sophisticated stup for RegexSplit and overrid…
29dfe38 slyalin: Fixes for both master and element::string branches of OpenVINO; bette…
40063c1 slyalin: Debug output of indices in RaggedTensorPack
cc47b12 slyalin: Implemented a stub for WordpieceTokenizer. Supported conversion of a …
7644231 slyalin: Disabled debug output
80b8023 slyalin: Define default values for custom operations attributes to make attrib…
46c82b8 slyalin: Added fast_tokenizer lib to the build. Implemented CaseFold based on …
d7ca2ab slyalin: Removed debug output
2baac3d slyalin: Implemented RaggedToDense always in pad_right=true mode and with bool…
d270dd6 slyalin: Provided real implementations for NormalizeUnicode, RegexNormalizatio…
119d6e9 slyalin: Implemented WordpieceTokenizer with fast_tokenizer library
4d4ad89 slyalin: Renamed behaviours to be verbs instead of adjectives
f4eee84 slyalin: Added modified version of HF tokenizer parser from Artur; implemented…
1e50352 slyalin: Renamed apply_tokenizer to connect_tokeniser and removed obsolete han…
0966b8a slyalin: CombineSegments is implemented, used in HF converter. Stitching of to…
61d7983 slyalin: Fixed stitching of two models by connecting with names of inputs/outp…
5609ee6 slyalin: WA for CPU bug with scalar inputs, correct truncation and dynamic pad…
062acf3 slyalin: Fixed conversion of HF tokenizer if part of outputs are omitted. Disa…
0f772dc apaniukov: Add BPE Tokenizer
10e3d18 apaniukov: Add BytesToChars Node for BBPE
c413cb6 apaniukov: Delete print
8c8994c apaniukov: Clip max value for max_length to int32
8750ae6 apaniukov: Fix RegexNormalization and Splitter, Add Digits Splitter
be6dc3f apaniukov: Bug fixes
e4dcdda apaniukov: Add decoding step, BytesToChars refactoring
b45e5ec apaniukov: Fix some regex bugs for byte-level splitter
5f03ed0 apaniukov: Fix bug with VocabDecoder shape
2a65502 slyalin: Minor changes for natively supported strings
2e34b92 slyalin: Merge remote-tracking branch 'artur/string_tensors_add_bpe' into stri…
a6f9110 slyalin: Suppressed minor^Carnings about int32 -> unsigned implicit
5c29254 slyalin: Restructured sentence_piece directory to tokenizer directory: split a…
f8d0e0d apaniukov: Add regex to detokenizer pipeline, all splitters have 5 inputs
10c10c5 apaniukov: Add Caching for RegexNormalization
4eb12f8 apaniukov: Add Caching for RegexSplit
c5efaf0 apaniukov: Add Wordpiece Cache
239acc4 apaniukov: Add NodeFactory
38552b0 apaniukov: Fix regex nodes init
597ccd4 apaniukov: Fix Wordpiece Cache
e6933b7 apaniukov: Add BPE Cache
bd7f9d9 apaniukov: Fix RegexNormalization
99c603f apaniukov: Refactor CombineSegments and Padding
6cc9b36 apaniukov: Refactoring
973c52d apaniukov: Clean-up commented code
1fa02b2 apaniukov: Sentencepiece Model Encoder from Transformers Tokenizer
e37f89d apaniukov: Add tests for tokenizers
88bf7c6 apaniukov: Add detokenizer for Sentencepiece models
bb1b57a apaniukov: Update README.md
6b4be05 apaniukov: Update README.md
539797f apaniukov: Update README.md
79c3e09 apaniukov: OVTokenizer as python package
203ffbb apaniukov: Merge branch 'openvinotoolkit:master' into tokenizer-fix-decode
45c0068 apaniukov: Update README.md
372465b apaniukov: Merge branch 'master' into tokenizer-fix-decode
64567ea apaniukov: Add sentencepiece detokenizer test
f54076e apaniukov: Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
c42d1bd apaniukov: Unified interface for fast and sentencepiece tokenizers
8b29443 apaniukov: Add Full Pipeline example for Sentencepiece
2ee3707 apaniukov: Update third-party-programs.txt
4b57fcc apaniukov: Merge branch 'master' into tokenizer-fix-decode
803d831 apaniukov: Add Constants
72f6d9f apaniukov: Add CPP pack/unpack_strings functions
386cb02 apaniukov: Merge branch 'master' into tokenizer-fix-decode
79bd05f apaniukov: Move tests to tokenizer dir
24a60b3 apaniukov: Fix import
f01afee apaniukov: Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
b22569f apaniukov: Fix imports
96673f5 apaniukov: Sort Imports
0e7ae87 apaniukov: Add Streaming Sentencepiece Decoder
5ebdb1f apaniukov: Change Authors
6a55877 apaniukov: Update modules/custom_operations/user_ie_extensions/tokenizer/utils.cpp
06d5159 apaniukov: Configure tests
fa5360d apaniukov: Skip Java Tests
e855193 apaniukov: Add Regression Test
d495d3b apaniukov: Skip traceback
d7bebd0 apaniukov: Add Win64 Fast Tokenizer lib
b2e35ed apaniukov: Fix WorkingDir
f81bd18 apaniukov: Return TB
0bd23b5 apaniukov: Fix dependencies install
12ac9f8 apaniukov: Add byte tokens handling for sentencepiece
9e6ae6f apaniukov: Drop black, use ruff format instead
f5d2d4c apaniukov: Temp remove tokenizers from windows CI
cf039b9 apaniukov: CI check
795306d ilya-lavrenov: Compile fast_tokenizers from source code
9c200c2 Wovchena: Export pack_strings() and unpack_strings()
0e9b960 apaniukov: Merge pull request #1 from ilya-lavrenov/tokenizer-fix-decode
95aa47c apaniukov: Merge branch 'master' into tokenizer-fix-decode
e1de338 Wovchena: Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
f23e59b apaniukov: Build tokenizer target on windows
dbec117 Wovchena: Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
ce25397 apaniukov: Add icu4c patch
d46f594 ilya-lavrenov: Added include dir to nlohmann headers
6f213ab ilya-lavrenov: Fixed compilation on ubuntu 18.04 arm64
6ed52e4 ilya-lavrenov: Fixed Windows
ca62321 apaniukov: Merge pull request #3 from ilya-lavrenov/nlohmann
52bfe5a ilya-lavrenov: Supported prebuild Fast Tokenizers on all platforms
b504013 apaniukov: Merge branch 'master' into tokenizer-fix-decode
cc663dc apaniukov: Add tiktoken support WIP
4c9ceed apaniukov: Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
745e969 apaniukov: Unskip java tests
48564b7 apaniukov: Merge pull request #4 from ilya-lavrenov/prebuilt-fast-tokenizers
056eb9f ilya-lavrenov: Fixed compilation with re2 on Windows
309b8e9 apaniukov: Merge pull request #5 from ilya-lavrenov/windows-re2
b193cb2 Wovchena: Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
debcb5d Wovchena: Move unpack_strings(), create sepparate include dir
b739ffd Wovchena: openvino_extensions
e70a3f2 ilya-lavrenov: Fixed link stage on Windows
2ce27cd apaniukov: Merge pull request #6 from ilya-lavrenov/windows-linkage
3022a5a apaniukov: i64 is default tokenizer output type
c467a8c apaniukov: Add support for more tiktoken tokenizers
1ec4c5f Wovchena: Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
8505b51 apaniukov: Check Azure CI
82639e6 apaniukov: Fix Azure Win CI
fb37580 apaniukov: Merge pull request #2 from Wovchena/export-pack_strings-and-unpack_st…
a45b826 apaniukov: Define python version for setupvars.bat
35cc136 apaniukov: Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
244a593 apaniukov: Add support for tiktoken detokenizers
84686f4 andrei-kochin: Merge branch 'master' into tokenizer-fix-decode
ad1c589 apaniukov: Add ChatGLM tokenization support.
0f63c3d apaniukov: Add ChatGLM detokenization and tests
0f1c1cc apaniukov: Add ChatGLM detokenization and tests
3edb73b apaniukov: Fix mac sha256
48bba34 apaniukov: Skip Lin Java Tests
fe507ff apaniukov: Add Mac Tokenziers Tests and Skip Mac Java Step
4b0c4ec apaniukov: Fix Mac SHA
4656238 apaniukov: Del WA for CPU Bug
2f5cc1c apaniukov: Fix Mac CI Pipeline
1568727 apaniukov: Change Mac CI
fa822c2 ilya-lavrenov: Fixed compilation
6ddb2a6 apaniukov: Merge pull request #7 from ilya-lavrenov/compilation-fix
14f993b apaniukov: Add setupvars to mac CI
cae3098 apaniukov: Merge branch 'master' into tokenizer-fix-decode
b59204d apaniukov: Change detokenizer output type
e54b42e apaniukov: Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
6c3bae3 apaniukov: Fix SegFault on AddedTokens For BPE tokenizer
d34d401 apaniukov: Add SP Space handling for decoder
89 changes: 89 additions & 0 deletions
modules/custom_operations/user_ie_extensions/cmake/platforms.cmake
@@ -0,0 +1,89 @@

# Copyright (C) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#

if(CMAKE_CL_64)
    set(MSVC64 ON)
endif()

if(WIN32 AND CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    execute_process(COMMAND ${CMAKE_CXX_COMPILER} -dumpmachine
                    OUTPUT_VARIABLE OPENVINO_GCC_TARGET_MACHINE
                    OUTPUT_STRIP_TRAILING_WHITESPACE)
    if(OPENVINO_GCC_TARGET_MACHINE MATCHES "amd64|x86_64|AMD64")
        set(MINGW64 ON)
    endif()
endif()

if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "amd64.*|x86_64.*|AMD64.*")
    set(OV_HOST_ARCH X86_64)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "i686.*|i386.*|x86.*|amd64.*|AMD64.*")
    set(OV_HOST_ARCH X86)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "^(arm64.*|aarch64.*|AARCH64.*|ARM64.*)")
    set(OV_HOST_ARCH AARCH64)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "^(arm.*|ARM.*)")
    set(OV_HOST_ARCH ARM)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "^riscv64$")
    set(OV_HOST_ARCH RISCV64)
endif()

macro(_ov_user_ext_detect_arch_by_processor_type)
    if(CMAKE_OSX_ARCHITECTURES AND APPLE)
        if(CMAKE_OSX_ARCHITECTURES STREQUAL "arm64")
            set(OV_ARCH AARCH64)
        elseif(CMAKE_OSX_ARCHITECTURES STREQUAL "x86_64")
            set(OV_ARCH X86_64)
        elseif(CMAKE_OSX_ARCHITECTURES MATCHES ".*x86_64.*" AND CMAKE_OSX_ARCHITECTURES MATCHES ".*arm64.*")
            set(OV_ARCH UNIVERSAL2)
        else()
            message(FATAL_ERROR "Unsupported value: CMAKE_OSX_ARCHITECTURES = ${CMAKE_OSX_ARCHITECTURES}")
        endif()
    elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "amd64.*|x86_64.*|AMD64.*")
        set(OV_ARCH X86_64)
    elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "i686.*|i386.*|x86.*|amd64.*|AMD64.*|wasm")
        set(OV_ARCH X86)
    elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm64.*|aarch64.*|AARCH64.*|ARM64.*|armv8)")
        set(OV_ARCH AARCH64)
    elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm.*|ARM.*)")
        set(OV_ARCH ARM)
    elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^riscv64$")
        set(OV_ARCH RISCV64)
    endif()
endmacro()

macro(_ov_user_ext_process_msvc_generator_platform)
    # if cmake -A <ARM|ARM64|x64|Win32> is passed
    if(CMAKE_GENERATOR_PLATFORM STREQUAL "ARM64")
        set(OV_ARCH AARCH64)
    elseif(CMAKE_GENERATOR_PLATFORM STREQUAL "ARM")
        set(OV_ARCH ARM)
    elseif(CMAKE_GENERATOR_PLATFORM STREQUAL "x64")
        set(OV_ARCH X86_64)
    elseif(CMAKE_GENERATOR_PLATFORM STREQUAL "Win32")
        set(OV_ARCH X86)
    else()
        _ov_user_ext_detect_arch_by_processor_type()
    endif()
endmacro()

if(MSVC64 OR MINGW64)
    _ov_user_ext_process_msvc_generator_platform()
elseif(MINGW OR (MSVC AND NOT CMAKE_CROSSCOMPILING))
    _ov_user_ext_process_msvc_generator_platform()
else()
    _ov_user_ext_detect_arch_by_processor_type()
endif()

set(HOST_${OV_HOST_ARCH} ON)
set(${OV_ARCH} ON)

unset(OV_ARCH)

if(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
    set(EMSCRIPTEN ON)
endif()

if(UNIX AND NOT (APPLE OR ANDROID OR EMSCRIPTEN OR CYGWIN))
    set(LINUX ON)
endif()
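
The elseif chain in `_ov_user_ext_detect_arch_by_processor_type` depends on rule order: the X86 pattern contains `x86.*`, which would also match `x86_64`, so the X86_64 rule has to be tried first. Below is a rough C++ analogue of that chain for experimenting with processor strings. It is a sketch, not part of this PR: `classify_arch` and `ArchRule` are hypothetical names, and `std::regex_search` stands in for CMake's unanchored `MATCHES` (an assumption that holds for these simple patterns, not for CMake regex syntax in general).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical rule table: patterns copied from the CMAKE_SYSTEM_PROCESSOR
// branch of platforms.cmake, in the same order as the elseif chain.
struct ArchRule {
    const char* pattern;
    const char* arch;
};

// Returns the first matching architecture name, or an empty string when no
// rule matches (the CMake code leaves OV_ARCH unset in that case).
inline std::string classify_arch(const std::string& processor) {
    static const ArchRule rules[] = {
        {"amd64.*|x86_64.*|AMD64.*", "X86_64"},
        {"i686.*|i386.*|x86.*|amd64.*|AMD64.*|wasm", "X86"},
        {"^(arm64.*|aarch64.*|AARCH64.*|ARM64.*|armv8)", "AARCH64"},
        {"^(arm.*|ARM.*)", "ARM"},
        {"^riscv64$", "RISCV64"},
    };
    for (const auto& rule : rules) {
        // regex_search matches anywhere in the string unless the pattern is
        // anchored with ^ or $, like CMake's MATCHES.
        if (std::regex_search(processor, std::regex(rule.pattern))) {
            return rule.arch;
        }
    }
    return "";
}
```

Swapping the first two rules in the table would classify `x86_64` as X86, which is the case the ordering guards against.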
61 changes: 61 additions & 0 deletions
modules/custom_operations/user_ie_extensions/include/openvino_extensions/strings.hpp
@@ -0,0 +1,61 @@
// Copyright (C) 2023 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include <openvino/runtime/tensor.hpp>

namespace openvino_extensions {
// Pack any container with string to ov::Tensor with element type u8
// Requirements for BatchOfStrings: .size() with size and .begin(), .end() as iterators, elements with .begin(), .end() and .size()
// so basically any STL container with std::string is compatible
// Tensor destination will be reshaped according the input data
template <typename BatchOfStrings>
void pack_strings(const BatchOfStrings& strings, ov::Tensor& destination) {
    auto batch_size = strings.size();

    // First run over all elements: calculate total memory required to hold all strings
    size_t symbols_size = std::accumulate(
        strings.begin(), strings.end(), size_t(0),
        [](size_t accum, typename BatchOfStrings::const_reference str)
        { return accum + str.size(); });

    size_t total_size = 4 * (1 + 1 + batch_size) + symbols_size;
    destination.set_shape({total_size});

    int32_t* pindices = reinterpret_cast<int32_t*>(destination.data<uint8_t>());
    pindices[0] = batch_size;
    pindices[1] = 0;
    pindices += 2;
    char* psymbols = reinterpret_cast<char*>(pindices + batch_size);
    size_t current_symbols_pos = 0;

    for (const auto& str: strings) {
        psymbols = std::copy(str.begin(), str.end(), psymbols);
        current_symbols_pos += str.size();
        *pindices = current_symbols_pos;
        ++pindices;
    }
}

std::vector<std::string> unpack_strings(const ov::Tensor& source) {
    int32_t length = source.get_byte_size();
    // check the format of the input bitstream representing the string tensor
    OPENVINO_ASSERT(length >= 4, "Incorrect packed string tensor format: no batch size in the packed string tensor");
    const int32_t* pindices = reinterpret_cast<const int32_t*>(source.data<const uint8_t>());
    int32_t batch_size = pindices[0];
    OPENVINO_ASSERT(length >= 4 + 4 + 4 * batch_size,
        "Incorrect packed string tensor format: the packed string tensor must contain first string offset and end indices");
    const int32_t* begin_ids = pindices + 1;
    const int32_t* end_ids = pindices + 2;
    const char* symbols = reinterpret_cast<const char*>(pindices + 2 + batch_size);

    std::vector<std::string> result;
    result.reserve(batch_size);
    for (int32_t idx = 0; idx < batch_size; ++idx) {
        result.emplace_back(symbols + begin_ids[idx], symbols + end_ids[idx]);
    }
    return result;
}
}
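
To make the packed layout easier to follow without an OpenVINO build, here is a minimal sketch of the same byte format on a plain `std::vector<uint8_t>`: an int32 batch size, an int32 zero (start offset of the first string), one int32 end offset per string, then the raw characters. The names `pack_to_bytes` and `unpack_from_bytes` are hypothetical and not part of the header, and the bounds checks on offsets are an addition for illustration. The sketch also uses `std::memcpy` instead of `reinterpret_cast` and spells out the standard includes, which a header like the one above would presumably also need (e.g. `<numeric>` for `std::accumulate`) unless they arrive transitively.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper mirroring pack_strings(), but into a plain byte buffer.
// Layout: [int32 batch_size][int32 0][int32 end_0]...[int32 end_{n-1}][chars]
inline std::vector<uint8_t> pack_to_bytes(const std::vector<std::string>& strings) {
    const std::size_t batch_size = strings.size();
    std::size_t symbols_size = 0;
    for (const auto& s : strings) symbols_size += s.size();

    std::vector<uint8_t> buffer(4 * (1 + 1 + batch_size) + symbols_size);
    auto write_i32 = [&buffer](std::size_t index, int32_t value) {
        std::memcpy(buffer.data() + 4 * index, &value, sizeof(value));
    };
    write_i32(0, static_cast<int32_t>(batch_size));
    write_i32(1, 0);  // start offset of the first string

    const std::size_t symbols_begin = 4 * (2 + batch_size);
    std::size_t pos = 0;
    for (std::size_t i = 0; i < batch_size; ++i) {
        if (!strings[i].empty()) {
            std::memcpy(buffer.data() + symbols_begin + pos, strings[i].data(), strings[i].size());
        }
        pos += strings[i].size();
        write_i32(2 + i, static_cast<int32_t>(pos));  // end offset of string i
    }
    return buffer;
}

// Hypothetical inverse of pack_to_bytes(); throws on malformed input.
inline std::vector<std::string> unpack_from_bytes(const std::vector<uint8_t>& buffer) {
    auto read_i32 = [&buffer](std::size_t index) {
        int32_t value = 0;
        std::memcpy(&value, buffer.data() + 4 * index, sizeof(value));
        return value;
    };
    if (buffer.size() < 4) throw std::invalid_argument("no batch size");
    const int32_t batch_size = read_i32(0);
    if (batch_size < 0) throw std::invalid_argument("negative batch size");
    const std::size_t header_size = 4 * (2 + static_cast<std::size_t>(batch_size));
    if (buffer.size() < header_size) throw std::invalid_argument("missing offsets");

    const std::size_t symbols_size = buffer.size() - header_size;
    const char* symbols = reinterpret_cast<const char*>(buffer.data()) + header_size;
    std::vector<std::string> result;
    result.reserve(static_cast<std::size_t>(batch_size));
    for (int32_t i = 0; i < batch_size; ++i) {
        // The end offset of string i doubles as the begin offset of string i + 1.
        const int32_t begin = read_i32(1 + i);
        const int32_t end = read_i32(2 + i);
        if (begin < 0 || end < begin || static_cast<std::size_t>(end) > symbols_size) {
            throw std::invalid_argument("offset out of range");
        }
        result.emplace_back(symbols + begin, symbols + end);
    }
    return result;
}
```

For a batch of `{"ab", "", "cde"}` the buffer holds 4 * (1 + 1 + 3) = 20 header bytes plus 5 character bytes, matching the `total_size` formula in `pack_strings`.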
I can see Windows compiles now. But ov_tokenizer.init_extension() fails for me: