LLAMA_CPP notebook with Qwen-1.5-7B-Chat #901
Conversation
Force-pushed from caa42dc to 3246d0e
Force-pushed from 3246d0e to 36ef503
LGTM, thanks.
@@ -0,0 +1,281 @@
{
@vshampor, I suggest specifying the high-level requirements for this notebook at the beginning. Is an NVIDIA GPU with CUDA required? Will it work on Windows?
@MaximProshin A CUDA-capable GPU is not necessary for the basic execution of the notebook, although the user may uncomment a CMake option to compile with CUDA support, in which case they are naturally expected to have CUDA and a GPU on board. Windows is not supported because the notebook uses Linux paths and direct shell calls to build the plugin from source.
" curr_token_ids = np.argmax(curr_logits[:, -1, :], axis=1).reshape([1, 1])\n", | ||
" last_token_id = curr_token_ids[0][0]\n", | ||
"\n", | ||
"ov_model.create_infer_request().reset_state()" |
Do all InferRequest objects share state, so that one can reset it like this, or does this line create a new request and not affect the original?
The state is shared across all infer requests to the same model object.
Is this a LLAMA_CPP plugin feature? Maybe the code sample should create an explicit InferRequest object anyway, because this behaviour does not align with the rest of OpenVINO (where each request has its own state) and might cause confusion in the future.
The InferRequest object is explicitly created at the line you highlighted, during the evaluation of the ov_model.create_infer_request() subexpression. On second look, it might be possible to have separate KV caches associated with separate infer requests, with memory consumption scaling accordingly with the number of live infer requests, but then I don't see how the syntax in line 248 could work in a stateful fashion.
The first (implicit) infer request contains the KV cache state created during the first inference on line 229. It populates the CompiledModel._infer_request attribute and, as I understand, is used by every __call__ method invocation (line 248), so this state lives as long as the CompiledModel object lives.
ov_model.create_infer_request() creates a new request with a blank state, so the reset_state() call on line 253 does not affect the state of CompiledModel._infer_request.
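A minimal sketch of that distinction; the model file, device and input are placeholders, and _infer_request is an internal attribute rather than a documented API:

import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("model.xml", "CPU")  # placeholder model and device

# __call__ lazily creates and then reuses a hidden request; any state
# (e.g. a KV cache) accumulated here lives in compiled._infer_request
_ = compiled(np.zeros((1, 1), dtype=np.int64))

# this creates a *second* request with its own blank state ...
extra_request = compiled.create_infer_request()
extra_request.reset_state()
# ... so resetting it leaves the state used by __call__ untouched;
# that would require compiled._infer_request.reset_state()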
@akuporos does this mean that the __call__
API for OV inference is incompatible with stateful execution and all OV docs and notebooks should be adjusted to explicitly state this?
This could be easily fixed by providing a getter for the internal infer request in the CompiledModel
object, I think.
> This could be easily fixed by providing a getter for the internal infer request in the CompiledModel object, I think.
CompiledModel.reset_state
might be more consistent with the existing API. Otherwise, there is not much sense in a hidden InferRequest
in the first place.
@vshampor @apaniukov Yes, CompiledModel.reset_state could be more consistent, and you can create a feature request for it. CompiledModel.__call__ was added months ago as a simplified method of working with OV, not as a functional replacement for all of the InferRequest APIs; it was introduced because of some voices saying that the APIs "should align", and that "call" was the most that could be done, so let's keep that in mind.
> there is not much sense in a hidden InferRequest in the first place.
It is/was: the simplified API is exactly what the name states.
> providing a getter for the internal infer request
If you need access right now, CompiledModel._infer_request will work as a workaround. If it is None (i.e. no inference has been run yet), just skip the call, or create the placeholder beforehand:
if compiled_model._infer_request is None:
    compiled_model._infer_request = compiled_model.create_infer_request()
So the Python API team would be happy to extend the CompiledModel interface if there is a direct need, but it requires a story in JIRA to back it up.
@akuporos, giving the floor back to you.
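To make the workaround above concrete, a sketch of how the notebook could apply it; the helper name is illustrative, and _infer_request remains an internal attribute:

def reset_call_state(compiled_model):
    # make sure the hidden request used by __call__ exists, then reset its
    # state (the KV cache in the LLAMA_CPP plugin case)
    if compiled_model._infer_request is None:
        compiled_model._infer_request = compiled_model.create_infer_request()
    compiled_model._infer_request.reset_state()

reset_call_state(ov_model)  # e.g. between prompts in the notebook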
@jiwaszki in the context of stateful execution this is not a feature request, but a bug report against the new API. I will submit the corresponding ticket. The user does not care about the reasons for introducing this particular flavour of inference; they care about fulfilling their use case. I think I have illustrated the direct need in this notebook. Also, the semi-casual user, familiar with the semantics of other inference frameworks, will try the __call__ API first, since it is, as you've said, "aligned" with what those frameworks provide. Having CompiledModel.reset_state is fine by me.
> If you need access right now, CompiledModel._infer_request will work as WA.
That is understood, but surely you can't be recommending coding against internal APIs in a user-facing example.
Meanwhile I rewrote the code to use InferRequest objects explicitly. Separate states for different infer request objects are introduced in #908.
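A condensed sketch of the explicit-request pattern the rewrite uses; the variable names, input layout and stopping condition are illustrative rather than copied from the final notebook:

import numpy as np

infer_request = ov_model.create_infer_request()

token_ids = prompt_token_ids  # [1, prompt_len] array from the tokenizer
generated = []
for _ in range(max_new_tokens):
    # stateful inference: the KV cache is kept inside this infer request
    logits = infer_request.infer([token_ids])[0]
    next_token = int(np.argmax(logits[:, -1, :], axis=1)[0])
    generated.append(next_token)
    token_ids = np.array([[next_token]], dtype=np.int64)  # feed only the new token

infer_request.reset_state()  # clear the KV cache before the next conversation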
Adds a notebook that demonstrates the usage of the
LLAMA_CPP
plugin to run LLM inference with the Qwen-1.5-7B-Chat model.
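For context, a minimal sketch of how such a notebook drives the plugin, assuming the plugin is registered under the LLAMA_CPP device name and accepts a GGUF file path directly (the model file name and the plugin library name are illustrative):

import openvino as ov

core = ov.Core()
# if the freshly built plugin is not auto-discovered, register it explicitly
# (library path is an assumption):
# core.register_plugin("libllama_cpp_plugin.so", "LLAMA_CPP")

ov_model = core.compile_model("qwen1_5-7b-chat-q4_0.gguf", "LLAMA_CPP")
infer_request = ov_model.create_infer_request()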