[GPU] Update docs related to KV-cache quantization (#27821)
### Details:
 - Update the docs related to KV-cache quantization on GPU
 - Allow `element::u8` as the data type for KV-cache quantization, to align with the CPU plugin
sshlyapn authored Nov 29, 2024
1 parent 71d9463 commit 7670715
Showing 2 changed files with 4 additions and 3 deletions.
@@ -276,9 +276,10 @@ includes **Dynamic quantization** of activations of 4/8-bit quantized MatMuls an
ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
)
-.. note::
+.. note::
Currently, for KV-cache quantization, GPU ignores the DYNAMIC_QUANTIZATION_GROUP_SIZE property, using ``group_size = head_size``. Additionally, it does not support the ``get_state()`` and ``set_state()`` APIs when KV-cache quantization is enabled.

Currently, both Dynamic quantization and KV-cache quantization are available for CPU device.
For GPU, KV-cache quantization is enabled by default on platforms without XMX support, and can be disabled by setting KV_CACHE_PRECISION to ``undefined``.


Working with Models Tuned with LoRA
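
As a minimal sketch of the note in this hunk, the two GPU configurations could be written as plain dictionaries. The property names and the `u8` and `undefined` values are taken from the diff; the helper `make_gpu_ov_config` is hypothetical, and it is assumed that the resulting dictionary is passed as the `ov_config=` argument shown in the hunk. `DYNAMIC_QUANTIZATION_GROUP_SIZE` is left out because, per the note, GPU ignores it for KV-cache quantization.

```python
# Hypothetical helper (not part of OpenVINO): builds an ov_config dictionary
# using only the property names and values that appear in the diff.
def make_gpu_ov_config(kv_cache_quantization: bool) -> dict:
    config = {"PERFORMANCE_HINT": "LATENCY"}
    # "u8" requests a quantized KV-cache; "undefined" disables it (per the note).
    config["KV_CACHE_PRECISION"] = "u8" if kv_cache_quantization else "undefined"
    return config


quantized = make_gpu_ov_config(True)
plain = make_gpu_ov_config(False)
print(quantized)
print(plain)
```

Keeping the choice in one place makes it easier to switch quantization off when `get_state()` or `set_state()` is needed, since the note says those APIs are unsupported while KV-cache quantization is enabled.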
@@ -133,7 +133,7 @@ class KVCacheCompressionMatcher : public ov::pass::MatcherPass {
KVCacheCompressionMatcher::KVCacheCompressionMatcher(ov::element::Type compression_dt) {
using namespace ov::pass::pattern;

-if (compression_dt != element::i8)
+if (compression_dt != element::i8 && compression_dt != element::u8)
return;

const auto quantization_type = ov::op::internal::DynamicQuantize::QuantizationType::Asymmetric;
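
To illustrate what asymmetric quantization to an unsigned 8-bit type means for a single group, here is a rough, generic sketch in pure Python. It is not the GPU plugin's implementation: the function names, the way the scale and zero point are stored, and the toy `head` vector are all assumptions made for illustration. One group covers the whole vector, mirroring the `group_size = head_size` behaviour described in the docs hunk.

```python
from typing import List, Tuple


def quantize_u8_asymmetric(values: List[float]) -> Tuple[List[int], float, float]:
    """Quantize one group (here: one head vector) to codes in [0, 255]."""
    lo, hi = min(values), max(values)
    # Guard against a constant group, where hi == lo would give a zero scale.
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = lo  # assumption: the offset is kept as the group minimum
    codes = [min(255, max(0, round((v - zero_point) / scale))) for v in values]
    return codes, scale, zero_point


def dequantize_u8(codes: List[int], scale: float, zero_point: float) -> List[float]:
    """Map codes back to approximate floating-point values."""
    return [c * scale + zero_point for c in codes]


# Toy head vector; real head sizes are larger.
head = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 2.0, 3.0]
codes, scale, zero_point = quantize_u8_asymmetric(head)
restored = dequantize_u8(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(head, restored))
print(codes, scale, max_error)
```

Because the range is shifted by the group minimum, the codes are non-negative and fit an unsigned type such as `u8`; the rounding error per element stays within about half a quantization step.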