LLAMA_CPP notebook with Qwen-1.5-7B-Chat #901
Conversation
Force-pushed from caa42dc to 3246d0e
Force-pushed from 3246d0e to 36ef503
LGTM, thanks.
@@ -0,0 +1,281 @@
{
@vshampor, I suggest specifying the high-level requirements for this notebook at the beginning. Is an NVIDIA GPU with CUDA required? Will it work on Windows?
@MaximProshin A CUDA-capable GPU is not necessary for the basic execution of the notebook, although the user may uncomment a CMake option to compile with CUDA support, in which case they are naturally expected to have CUDA and a GPU on board. Windows is not supported because the notebook uses Linux paths and direct shell calls to build the plugin from source.
" curr_token_ids = np.argmax(curr_logits[:, -1, :], axis=1).reshape([1, 1])\n", | ||
" last_token_id = curr_token_ids[0][0]\n", | ||
"\n", | ||
"ov_model.create_infer_request().reset_state()" |
Do all InferRequest objects share state, so that one can reset it like this, or does this line create a new request and not affect the original?
The state is shared across all infer requests to the same model object.
Is this a LLAMA_CPP plugin feature? Maybe the code sample should create an explicit InferRequest object anyway, because this behaviour does not align with the rest of OpenVINO (where each request has its own state) and might cause confusion in the future.
The InferRequest object is explicitly created at the line you highlighted, during the evaluation of the ov_model.create_infer_request() subexpression. On second look, it might be possible to have separate KV caches associated with separate infer requests, with memory consumption scaling accordingly with the number of live infer requests, but then I don't see how the syntax in line 248 could work in a stateful fashion.
The first (implicit) infer request contains the KV cache state created during the first inference on line 229. It populates the CompiledModel._infer_request attribute and, as I understand, is used by every __call__ method invocation (line 248), so this state lives as long as the CompiledModel object lives.
ov_model.create_infer_request() creates a new request with a blank state, so the reset_state() call on line 253 does not affect the state of CompiledModel._infer_request.
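A minimal sketch of that distinction; the model file, device and input are placeholders, and _infer_request is an internal attribute rather than a documented API:

import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("model.xml", "CPU")  # placeholder model and device

# __call__ lazily creates and then reuses a hidden request; any state
# (e.g. a KV cache) accumulated here lives in compiled._infer_request
_ = compiled(np.zeros((1, 1), dtype=np.int64))

# this creates a *second* request with its own blank state ...
extra_request = compiled.create_infer_request()
extra_request.reset_state()
# ... so resetting it leaves the state used by __call__ untouched;
# that would require compiled._infer_request.reset_state()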
@akuporos does this mean that the __call__
API for OV inference is incompatible with stateful execution and all OV docs and notebooks should be adjusted to explicitly state this?
This could be easily fixed by providing a getter for the internal infer request in the CompiledModel
object, I think.
> This could be easily fixed by providing a getter for the internal infer request in the CompiledModel object, I think.
CompiledModel.reset_state
might be more consistent with the existing API. Otherwise, there is not much sense in a hidden InferRequest
in the first place.
@vshampor @apaniukov Yes, CompiledModel.reset_state could be more consistent, and you can create a feature request for it. CompiledModel.__call__ was added months ago as a simplified method of working with OV, not as a functional replacement for all of the InferRequest APIs; it was introduced because of some voices saying that the APIs "should align", and that "call" was the most that could be done, so let's keep that in mind.
> there is not much sense in a hidden InferRequest in the first place.
It is/was: the simplified API is exactly what the name states.
> providing a getter for the internal infer request
If you need access right now, CompiledModel._infer_request will work as a workaround. If it is None (i.e. no inference has been run yet), just skip the call, or create the placeholder beforehand:
if compiled_model._infer_request is None:
    compiled_model._infer_request = compiled_model.create_infer_request()
So the Python API team would be happy to extend the CompiledModel interface if there is a direct need, but it requires a story in JIRA to back it up.
@akuporos, giving the floor back to you.
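To make the workaround above concrete, a sketch of how the notebook could apply it; the helper name is illustrative, and _infer_request remains an internal attribute:

def reset_call_state(compiled_model):
    # make sure the hidden request used by __call__ exists, then reset its
    # state (the KV cache in the LLAMA_CPP plugin case)
    if compiled_model._infer_request is None:
        compiled_model._infer_request = compiled_model.create_infer_request()
    compiled_model._infer_request.reset_state()

reset_call_state(ov_model)  # e.g. between prompts in the notebook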
@jiwaszki in the context of stateful execution this is not a feature request, but a bug report against the new API. I will submit the corresponding ticket. The user does not care about the reasons for introducing this particular flavour of inference; they care about fulfilling their use case. I think I have illustrated the direct need in this notebook. Also, the semi-casual user, familiar with the semantics of other inference frameworks, will try the __call__ API first, since it is, as you've said, "aligned" with what those frameworks provide. Having CompiledModel.reset_state is fine by me.
> If you need access right now, CompiledModel._infer_request will work as WA.
That is understood, but surely you can't be recommending coding against internal APIs in a user-facing example.
Meanwhile I rewrote the code to use InferRequest objects explicitly. Separate states for different infer request objects are introduced in #908.
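A condensed sketch of the explicit-request pattern the rewrite uses; the variable names, input layout and stopping condition are illustrative rather than copied from the final notebook:

import numpy as np

infer_request = ov_model.create_infer_request()

token_ids = prompt_token_ids  # [1, prompt_len] array from the tokenizer
generated = []
for _ in range(max_new_tokens):
    # stateful inference: the KV cache is kept inside this infer request
    logits = infer_request.infer([token_ids])[0]
    next_token = int(np.argmax(logits[:, -1, :], axis=1)[0])
    generated.append(next_token)
    token_ids = np.array([[next_token]], dtype=np.int64)  # feed only the new token

infer_request.reset_state()  # clear the KV cache before the next conversation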
Adds a notebook that demonstrates the usage of the
LLAMA_CPP
plugin to run LLM inference with the Qwen-1.5-7B-Chat model.
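For context, a minimal sketch of how such a notebook drives the plugin, assuming the plugin is registered under the LLAMA_CPP device name and accepts a GGUF file path directly (the model file name and the plugin library name are illustrative):

import openvino as ov

core = ov.Core()
# if the freshly built plugin is not auto-discovered, register it explicitly
# (library path is an assumption):
# core.register_plugin("libllama_cpp_plugin.so", "LLAMA_CPP")

ov_model = core.compile_model("qwen1_5-7b-chat-q4_0.gguf", "LLAMA_CPP")
infer_request = ov_model.create_infer_request()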