Add Tokenizers #687

Closed
wants to merge 142 commits
Changes from 75 commits
Commits
70f867a
Added string tensor implementation with explicit pointer unpack
slyalin Apr 25, 2023
1fac3de
Merged from master
slyalin Apr 28, 2023
821dee5
Started to migrate to extension-only support of string operations wit…
slyalin May 2, 2023
b9b0693
Started to merge string/tokenizer related stuff from a dedicated OV b…
slyalin May 10, 2023
c785ec1
Rename CaseFoldUTF8 to name from opset proposal: CaseFold, added Norm…
slyalin May 10, 2023
1d129ac
Added a stub for RegexNormalization operation, WA for CPU bug with em…
slyalin May 11, 2023
71bc5bf
Implemented Reshape for decomposed string tensors
slyalin May 11, 2023
6c5eec0
Added RaggedTensorPack, sophisticated stub for RegexSplit and overrid…
slyalin May 12, 2023
29dfe38
Fixes for both master and element::string branches of OpenVINO; bette…
slyalin May 15, 2023
40063c1
Debug output of indices in RaggedTensorPack
slyalin May 16, 2023
cc47b12
Implemented a stub for WordpieceTokenizer. Supported conversion of a …
slyalin May 17, 2023
7644231
Disabled debug output
slyalin May 17, 2023
80b8023
Define default values for custom operations attributes to make attrib…
slyalin May 18, 2023
46c82b8
Added fast_tokenizer lib to the build. Implemented CaseFold based on …
slyalin May 20, 2023
d7ca2ab
Removed debug output
slyalin May 20, 2023
2baac3d
Implemented RaggedToDense always in pad_right=true mode and with bool…
slyalin May 20, 2023
d270dd6
Provided real implementations for NormalizeUnicode, RegexNormalizatio…
slyalin May 23, 2023
119d6e9
Implemented WordpieceTokenizer with fast_tokenizer library
slyalin May 23, 2023
4d4ad89
Renamed behaviours to be verbs instead of adjectives
slyalin May 25, 2023
f4eee84
Added modified version of HF tokenizer parser from Artur; implemented…
slyalin May 25, 2023
1e50352
Renamed apply_tokenizer to connect_tokeniser and removed obsolete han…
slyalin May 25, 2023
0966b8a
CombineSegments is implemented, used in HF converter. Stitching of to…
slyalin May 31, 2023
61d7983
Fixed stitching of two models by connecting with names of inputs/outp…
slyalin May 31, 2023
5609ee6
WA for CPU bug with scalar inputs, correct truncation and dynamic pad…
slyalin Jun 1, 2023
062acf3
Fixed conversion of HF tokenizer if part of outputs are omitted. Disa…
slyalin Jun 1, 2023
0f772dc
Add BPE Tokenizer
apaniukov Jun 19, 2023
10e3d18
Add BytesToChars Node for BBPE
apaniukov Jun 20, 2023
c413cb6
Delete print
apaniukov Jun 20, 2023
8c8994c
Clip max value for max_length to int32
apaniukov Jun 20, 2023
8750ae6
Fix RegexNormalization and Splitter, Add Digits Splitter
apaniukov Jun 22, 2023
be6dc3f
Bug fixes
apaniukov Jun 23, 2023
e4dcdda
Add decoding step, BytesToChars refactoring
apaniukov Jun 29, 2023
b45e5ec
Fix some regex bugs for byte-level splitter
apaniukov Jun 30, 2023
5f03ed0
Fix bug with VocabDecoder shape
apaniukov Jul 7, 2023
2a65502
Minor changes for natively supported strings
slyalin Jul 10, 2023
2e34b92
Merge remote-tracking branch 'artur/string_tensors_add_bpe' into stri…
slyalin Jul 10, 2023
a6f9110
Suppressed minor warnings about int32 -> unsigned implicit
slyalin Jul 10, 2023
5c29254
Restructured sentence_piece directory to tokenizer directory: split a…
slyalin Jul 10, 2023
f8d0e0d
Add regex to detokenizer pipeline, all splitters have 5 inputs
apaniukov Jul 17, 2023
10c10c5
Add Caching for RegexNormalization
apaniukov Jul 27, 2023
4eb12f8
Add Caching for RegexSplit
apaniukov Jul 27, 2023
c5efaf0
Add Wordpiece Cache
apaniukov Jul 28, 2023
239acc4
Add NodeFactory
apaniukov Jul 31, 2023
38552b0
Fix regex nodes init
apaniukov Aug 4, 2023
597ccd4
Fix Wordpiece Cache
apaniukov Aug 10, 2023
e6933b7
Add BPE Cache
apaniukov Aug 10, 2023
bd7f9d9
Fix RegexNormalization
apaniukov Aug 11, 2023
99c603f
Refactor CombineSegments and Padding
apaniukov Sep 6, 2023
6cc9b36
Refactoring
apaniukov Sep 7, 2023
973c52d
Clean-up commented code
apaniukov Sep 8, 2023
1fa02b2
Sentencepiece Model Encoder from Transformers Tokenizer
apaniukov Sep 27, 2023
e37f89d
Add tests for tokenizers
apaniukov Sep 27, 2023
88bf7c6
Add detokenizer for Sentencepiece models
apaniukov Oct 2, 2023
bb1b57a
Update README.md
apaniukov Oct 4, 2023
6b4be05
Update README.md
apaniukov Oct 4, 2023
539797f
Update README.md
apaniukov Oct 4, 2023
79c3e09
OVTokenizer as python package
apaniukov Oct 4, 2023
203ffbb
Merge branch 'openvinotoolkit:master' into tokenizer-fix-decode
apaniukov Oct 4, 2023
45c0068
Update README.md
apaniukov Oct 5, 2023
372465b
Merge branch 'master' into tokenizer-fix-decode
apaniukov Oct 6, 2023
64567ea
Add sentencepiece detokenizer test
apaniukov Oct 6, 2023
f54076e
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Oct 6, 2023
c42d1bd
Unified interface for fast and sentencepiece tokenizers
apaniukov Oct 9, 2023
8b29443
Add Full Pipeline example for Sentencepiece
apaniukov Oct 11, 2023
2ee3707
Update third-party-programs.txt
apaniukov Oct 11, 2023
4b57fcc
Merge branch 'master' into tokenizer-fix-decode
apaniukov Oct 12, 2023
803d831
Add Constants
apaniukov Oct 13, 2023
72f6d9f
Add CPP pack/unpack_strings functions
apaniukov Oct 16, 2023
386cb02
Merge branch 'master' into tokenizer-fix-decode
apaniukov Oct 17, 2023
79bd05f
Move tests to tokenizer dir
apaniukov Oct 17, 2023
24a60b3
Fix import
apaniukov Oct 18, 2023
f01afee
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Oct 18, 2023
b22569f
Fix imports
apaniukov Oct 18, 2023
96673f5
Sort Imports
apaniukov Oct 18, 2023
0e7ae87
Add Streaming Sentencepiece Decoder
apaniukov Oct 19, 2023
5ebdb1f
Change Authors
apaniukov Oct 19, 2023
6a55877
Update modules/custom_operations/user_ie_extensions/tokenizer/utils.cpp
apaniukov Oct 23, 2023
06d5159
Configure tests
apaniukov Oct 23, 2023
fa5360d
Skip Java Tests
apaniukov Oct 24, 2023
e855193
Add Regression Test
apaniukov Oct 24, 2023
d495d3b
Skip traceback
apaniukov Oct 24, 2023
d7bebd0
Add Win64 Fast Tokenizer lib
apaniukov Oct 24, 2023
b2e35ed
Fix WorkingDir
apaniukov Oct 24, 2023
f81bd18
Return TB
apaniukov Oct 24, 2023
0bd23b5
Fix dependencies install
apaniukov Oct 24, 2023
12ac9f8
Add byte tokens handling for sentencepiece
apaniukov Oct 24, 2023
9e6ae6f
Drop black, use ruff format instead
apaniukov Oct 25, 2023
f5d2d4c
Temp remove tokenizers from windows CI
apaniukov Oct 26, 2023
cf039b9
CI check
apaniukov Oct 26, 2023
795306d
Compile fast_tokenizers from source code
ilya-lavrenov Oct 28, 2023
9c200c2
Export pack_strings() and unpack_strings()
Wovchena Oct 30, 2023
0e9b960
Merge pull request #1 from ilya-lavrenov/tokenizer-fix-decode
apaniukov Oct 31, 2023
95aa47c
Merge branch 'master' into tokenizer-fix-decode
apaniukov Oct 31, 2023
e1de338
Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
Wovchena Oct 31, 2023
f23e59b
Build tokenizer target on windows
apaniukov Oct 31, 2023
dbec117
Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
Wovchena Nov 2, 2023
ce25397
Add icu4c patch
apaniukov Nov 3, 2023
d46f594
Added include dir to nlohmann headers
ilya-lavrenov Nov 8, 2023
6f213ab
Fixed compilation on ubuntu 18.04 arm64
ilya-lavrenov Nov 8, 2023
6ed52e4
Fixed Windows
ilya-lavrenov Nov 8, 2023
ca62321
Merge pull request #3 from ilya-lavrenov/nlohmann
apaniukov Nov 8, 2023
52bfe5a
Supported prebuild Fast Tokenizers on all platforms
ilya-lavrenov Nov 8, 2023
b504013
Merge branch 'master' into tokenizer-fix-decode
apaniukov Nov 9, 2023
cc663dc
Add tiktoken support WIP
apaniukov Nov 9, 2023
4c9ceed
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Nov 9, 2023
745e969
Unskip java tests
apaniukov Nov 9, 2023
48564b7
Merge pull request #4 from ilya-lavrenov/prebuilt-fast-tokenizers
apaniukov Nov 9, 2023
056eb9f
Fixed compilation with re2 on Windows
ilya-lavrenov Nov 10, 2023
309b8e9
Merge pull request #5 from ilya-lavrenov/windows-re2
apaniukov Nov 10, 2023
b193cb2
Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
Wovchena Nov 10, 2023
debcb5d
Move unpack_strings(), create separate include dir
Wovchena Nov 10, 2023
b739ffd
openvino_extensions
Wovchena Nov 10, 2023
e70a3f2
Fixed link stage on Windows
ilya-lavrenov Nov 13, 2023
2ce27cd
Merge pull request #6 from ilya-lavrenov/windows-linkage
apaniukov Nov 13, 2023
3022a5a
i64 is default tokenizer output type
apaniukov Nov 14, 2023
c467a8c
Add support for more tiktoken tokenizers
apaniukov Nov 14, 2023
1ec4c5f
Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
Wovchena Nov 15, 2023
8505b51
Check Azure CI
apaniukov Nov 15, 2023
82639e6
Fix Azure Win CI
apaniukov Nov 15, 2023
fb37580
Merge pull request #2 from Wovchena/export-pack_strings-and-unpack_st…
apaniukov Nov 15, 2023
a45b826
Define python version for setupvars.bat
apaniukov Nov 15, 2023
35cc136
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Nov 15, 2023
244a593
Add support for tiktoken detokenizers
apaniukov Nov 15, 2023
84686f4
Merge branch 'master' into tokenizer-fix-decode
andrei-kochin Nov 16, 2023
ad1c589
Add ChatGLM tokenization support.
apaniukov Nov 16, 2023
0f63c3d
Add ChatGLM detokenization and tests
apaniukov Nov 16, 2023
0f1c1cc
Add ChatGLM detokenization and tests
apaniukov Nov 17, 2023
3edb73b
Fix mac sha256
apaniukov Nov 17, 2023
48bba34
Skip Lin Java Tests
apaniukov Nov 17, 2023
fe507ff
Add Mac Tokenizers Tests and Skip Mac Java Step
apaniukov Nov 17, 2023
4b0c4ec
Fix Mac SHA
apaniukov Nov 17, 2023
4656238
Del WA for CPU Bug
apaniukov Nov 17, 2023
2f5cc1c
Fix Mac CI Pipeline
apaniukov Nov 18, 2023
1568727
Change Mac CI
apaniukov Nov 18, 2023
fa822c2
Fixed compilation
ilya-lavrenov Nov 20, 2023
6ddb2a6
Merge pull request #7 from ilya-lavrenov/compilation-fix
apaniukov Nov 20, 2023
14f993b
Add setupvars to mac CI
apaniukov Nov 20, 2023
cae3098
Merge branch 'master' into tokenizer-fix-decode
apaniukov Nov 20, 2023
b59204d
Change detokenizer output type
apaniukov Nov 20, 2023
e54b42e
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Nov 20, 2023
6c3bae3
Fix SegFault on AddedTokens For BPE tokenizer
apaniukov Nov 20, 2023
d34d401
Add SP Space handling for decoder
apaniukov Nov 20, 2023
5 changes: 3 additions & 2 deletions modules/custom_operations/user_ie_extensions/CMakeLists.txt
ov::Core core;
core.add_extension("/home/wov/r/openvino.genai/llm/build/thirdparty/openvino_contrib/modules/custom_operations/user_ie_extensions/libuser_ov_extensions.so");
ov::InferRequest tokenizer = core.compile_model("/home/wov/r/openvino.genai/llm/build/tokenizer.xml", "CPU", {ov::cache_dir("llm-cache")}).create_infer_request();
ov::Tensor tensor = tokenizer.get_input_tensor();
pack_strings(std::vector<std::string>{"asdf"}, tensor);
tokenizer.infer();

causes a segmentation fault. Removing , {ov::cache_dir("llm-cache")} fixes the problem. Strangely, CACHE_DIR works for Python.

Original problem is described here: https://github.com/openvinotoolkit/openvino.genai/blob/4c6c6cb9cf2b64584c797f2ff4ca0b7b658dabca/llm/llm.cpp#L85

Here's my attempt to simplify the reproducer, but it requires a Debug build:

#include <openvino/openvino.hpp>

int main(int argc, char* argv[]) try {
    if (argc != 4) {
        throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <openvino_model.xml> <tokenizer.xml> '<prompt>'");
    }
    ov::Core core;
    core.add_extension(USER_OV_EXTENSIONS_PATH);  // USER_OV_EXTENSIONS_PATH is defined in CMakeLists.txt
    ov::InferRequest tokenizer = core.compile_model(argv[2], "CPU").create_infer_request();
    std::shared_ptr<ov::Model> model = core.read_model(argv[1]);
    constexpr size_t BATCH_SIZE = 1;
    std::map<std::string, ov::PartialShape> shapes = {
        {"input_ids", ov::PartialShape{
            BATCH_SIZE, {1, std::numeric_limits<ov::Dimension::value_type>::max()}
        }},
        {"attention_mask", ov::PartialShape{
            BATCH_SIZE, {1, std::numeric_limits<ov::Dimension::value_type>::max()}
        }}
    };
    for (const ov::Output<ov::Node>& input : model->inputs()) {
        for (const std::string& name : input.get_names()) {
            if (name.rfind("past_key_values", 0) == 0) {
                ov::PartialShape shape = input.get_partial_shape();
                shape[0] = BATCH_SIZE;
                shapes.emplace(name, shape);
                break;
            }
        }
    }
    model->reshape(shapes);
    ov::preprocess::PrePostProcessor p3(model);
    p3.input("input_ids").tensor().set_element_type(ov::element::i32);  // cast to the type of the tokenizer's output
    p3.input("attention_mask").tensor().set_element_type(ov::element::i32);
    p3.input("input_ids").preprocess().convert_element_type(ov::element::i64);
    p3.input("attention_mask").preprocess().convert_element_type(ov::element::i64);
    model = p3.build();
    ov::InferRequest ireq = core.compile_model(model, "CPU", {ov::cache_dir("llm-cache")}).create_infer_request();
    for (const ov::Output<ov::Node>& input : model->inputs()) {
        for (const std::string& name : input.get_names()) {
            if (name.rfind("past_key_values", 0) == 0) {
                ireq.get_tensor(input).set_shape(input.get_partial_shape().get_min_shape());
                break;
            }
        }
    }
    float* logits;
    size_t n_vocab;
    int32_t out_token;
    {
        ov::Tensor inp{ov::element::i32, {1, 1}};
        inp.data<int32_t>()[0] = 29528;
        ireq.get_tensor("input_ids").set_shape({1, 1});
        ireq.set_tensor("input_ids", inp);
        ireq.get_tensor("attention_mask").set_shape({1, 1});
        std::fill_n(ireq.get_tensor("attention_mask").data<int32_t>(), inp.get_size(), 1);
        ireq.infer();
        n_vocab = ireq.get_tensor("logits").get_shape().back();
        logits = ireq.get_tensor("logits").data<float>() + (inp.get_size() - 1) * n_vocab;
        out_token = int32_t(std::max_element(logits, logits + n_vocab) - logits);
    }
    ireq.get_tensor("input_ids").set_shape({BATCH_SIZE, 1});
    ireq.get_tensor("attention_mask").set_shape({BATCH_SIZE, 1});
    ireq.get_tensor("attention_mask").data<int32_t>()[0] = 1;
    constexpr int32_t SPECIAL_EOS_TOKEN = 2;
    while (out_token != SPECIAL_EOS_TOKEN) {
        std::cout << out_token << ' ' << std::flush;
        for (const ov::Output<ov::Node>& input : model->inputs()) {
            for (const std::string& name : input.get_names()) {
                if (name.rfind("past_key_values", 0) == 0) {
                    ireq.set_tensor(input, ireq.get_tensor("present" + name.substr(15)));
                    break;
                }
            }
        }
        ireq.get_tensor("input_ids").data<int32_t>()[0] = 29528;
        ireq.infer();
        logits = ireq.get_tensor("logits").data<float>();
        out_token = int32_t(std::max_element(logits, logits + n_vocab) - logits);
    }
    std::cout << '\n';
} catch (const std::exception& error) {
    std::cerr << error.what() << '\n';
    return 1;
} catch (...) {
    std::cerr << "Non-exception object thrown\n";
    return 1;
}

cmake -DCMAKE_BUILD_TYPE=Debug .. && cmake --build . -j && ./llm ~/r/openvino.genai/llm/open_llama_3b_v2/openvino_model.xml tokenizer.xml "Hi" randomly results in a segmentation fault. Tested with l_openvino_toolkit_ubuntu20_2023.2.0.dev20230922_x86_64.

Using a different Core object for llama fixes the problem; removing the ov::preprocess::PrePostProcessor usage fixes the problem; storing the CompiledModel instead of the tokenizer's InferRequest fixes the problem; and moving ov::Tensor inp{ov::element::i32, {1, 1}}; to an outer scope also fixes the problem.

@@ -80,8 +80,9 @@ if(TBB_FOUND)
target_link_libraries(${TARGET_NAME} PRIVATE TBB::tbb TBB::tbbmalloc)
endif()

if(sentence_piece IN_LIST CUSTOM_OPERATIONS)
add_subdirectory(sentence_piece)
# Left sentence_piece for backward compatibility
if(tokenizer IN_LIST CUSTOM_OPERATIONS)
add_subdirectory(tokenizer)
endif()

target_link_libraries(${TARGET_NAME} PRIVATE openvino::runtime)
34 changes: 29 additions & 5 deletions modules/custom_operations/user_ie_extensions/ov_extension.cpp
@@ -52,14 +52,38 @@
# define S_CONV_EXT
#endif

#ifdef sentence_piece
# include "sentence_piece/sentence_piece.hpp"
# define SENTENSE_PIECE_EXT \
#ifdef tokenizer
# include "tokenizer/tokenizer.hpp"
# define TOKENIZER_EXT \
std::make_shared<ov::OpExtension<StringTensorPack>>(), \
std::make_shared<ov::OpExtension<RaggedTensorPack>>(), \
std::make_shared<ov::OpExtension<StringTensorUnpack>>(), \
std::make_shared<ov::OpExtension<CaseFold>>(), \
std::make_shared<ov::frontend::ConversionExtension>("CaseFoldUTF8", translate_case_fold_utf8), \
std::make_shared<ov::OpExtension<NormalizeUnicode>>(), \
std::make_shared<ov::frontend::ConversionExtension>("NormalizeUTF8", translate_normalize_utf8), \
std::make_shared<ov::OpExtension<RegexNormalization>>(), \
std::make_shared<ov::frontend::ConversionExtension>("StaticRegexReplace", translate_static_regex_replace), \
std::make_shared<ov::OpExtension<RegexSplit>>(), \
std::make_shared<ov::frontend::ConversionExtension>("RegexSplitWithOffsets", translate_regex_split_with_offsets), \
std::make_shared<ov::OpExtension<WordpieceTokenizer>>(), \
std::make_shared<ov::frontend::ConversionExtension>("WordpieceTokenizeWithOffsets", translate_wordpiece_tokenize_with_offsets), \
std::make_shared<ov::OpExtension<BPETokenizer>>(), \
std::make_shared<ov::OpExtension<BytesToChars>>(), \
std::make_shared<ov::frontend::ConversionExtension>("LookupTableFindV2", translate_lookup_table_find_v2), \
std::make_shared<ov::OpExtension<CombineSegments>>(), \
std::make_shared<ov::OpExtension<RaggedToDense>>(), \
std::make_shared<ov::OpExtension<VocabDecoder>>(), \
std::make_shared<ov::OpExtension<CharsToBytes>>(), \
std::make_shared<ov::frontend::ConversionExtension>("Reshape", translate_reshape), \
std::make_shared<ov::frontend::ConversionExtension>("Const", translate_const), \
std::make_shared<ov::OpExtension<TemplateExtension::SentencepieceTokenizer>>(), \
std::make_shared<ov::OpExtension<TemplateExtension::SentencepieceDetokenizer>>(), \
std::make_shared<ov::OpExtension<TemplateExtension::SentencepieceStreamDetokenizer>>(), \
std::make_shared<ov::frontend::ConversionExtension>("SentencepieceOp", translate_sentencepiece_op), \
std::make_shared<ov::frontend::ConversionExtension>("RaggedTensorToSparse", translate_sentencepiece_tokenizer),
#else
# define SENTENSE_PIECE_EXT
# define TOKENIZER_EXT
#endif

OPENVINO_CREATE_EXTENSIONS(std::vector<ov::Extension::Ptr>(
@@ -69,5 +93,5 @@ OPENVINO_CREATE_EXTENSIONS(std::vector<ov::Extension::Ptr>(
S_CONV_TRANSPOSE_EXT
S_CONV_EXT
COMPLEX_MUL_EXT
SENTENSE_PIECE_EXT
TOKENIZER_EXT
}));

This file was deleted.

This file was deleted.

@@ -20,6 +20,12 @@ FetchContent_Declare(
URL_HASH SHA256=a7c105aca0131b4a899155a6c44ea9728e63514edaa8d71fa92e7a5de53b6ca0
)

FetchContent_Declare(
fast_tokenizer
URL https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.2.tgz
URL_HASH SHA256=843a8299b55ef2e06ea50ba0d4ab4cb05b9e4cdb7cb8e29f3d55c494a1b7aecc
)

if(CMAKE_COMPILER_IS_GNUCXX OR CMAKE_CXX_COMPILER_ID MATCHES "^(Apple)?Clang$")
set(cxx_flags "-Wno-undef")
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC")
@@ -36,6 +42,9 @@ endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${cxx_flags}")

FetchContent_MakeAvailable(sentencepiece)
FetchContent_MakeAvailable(fast_tokenizer)

include("${fast_tokenizer_SOURCE_DIR}/FastTokenizer.cmake")

# set include dirs for specific source files
target_include_directories(${TARGET_NAME} PRIVATE
@@ -44,13 +53,15 @@ target_include_directories(${TARGET_NAME} PRIVATE
"${sentencepiece_SOURCE_DIR}/third_party/protobuf-lite"
"${sentencepiece_SOURCE_DIR}"
"${sentencepiece_SOURCE_DIR}"
"${sentencepiece_BINARY_DIR}")
"${sentencepiece_BINARY_DIR}"
"${FAST_TOKENIZER_INCS}")


if(CMAKE_CL_64)
target_compile_definitions(sentencepiece-static PRIVATE _CRT_SECURE_NO_WARNINGS _SCL_SECURE_NO_WARNINGS)
endif()

target_link_libraries(${TARGET_NAME} PRIVATE sentencepiece-static)
target_link_libraries(${TARGET_NAME} PRIVATE sentencepiece-static ${FAST_TOKENIZER_LIBS})

# string_view is used from cxx17
string(REPLACE " " ";" cxx_flags "${cxx_flags}")