NPUW: Hotfix - delay the original weight memory deallocation (#27886)
### Details:
- As of #27767, NPUW drops links to the original weights to avoid
memory duplication;
- The drop happens after the LazyTensor evaluation, which is assumed to
create a new tensor;
- That is not always the case: sometimes tensors stay in their original
format/precision, and all that needs to be done is to copy the buffer
into L0 memory as-is;
- Previously, `.detach()` only destroyed the associated `Constant` node
that held the weight buffer; in the case of memory-mapped weights, the
buffer was kept alive until the reference to `ov::Model` was destroyed
at the end of the `ov::npuw::CompiledModel` constructors;
- A case was found where, before reaching the NPUW partitioning &
transformation pipeline, the model weights were first altered from
`BF16` to `FP16` precision. The relevant `Constant` nodes in the IR
then referred to their own weight buffers, not ones shared via mmap, so
detaching a lazy tensor caused these buffers to be destroyed
prematurely, leading to a segfault when the "evaluated" tensor (in this
case, just a reference to the original one) was copied to L0;
- Moving the `.detach()` to the very end of `eval_and_alloc` fixes this
problem.

### Tickets:
 - *ticket-id*
dmatveev authored Dec 3, 2024
1 parent 395340e commit f0925bc
Showing 1 changed file with 6 additions and 3 deletions.

src/plugins/intel_npu/src/plugin/npuw/weights_bank.cpp

@@ -110,9 +110,6 @@ ov::Tensor Bank::eval_and_alloc(const LazyTensor& tensor,
         return transformed_tensor;
     }
 
-    // Non-CPU case: detach the evaluated LazyTensor from its memory
-    const_cast<LazyTensor&>(tensor).detach();
-
     ov::SoPtr<ov::ITensor> remote_tensor;
     ov::Tensor allocated_tensor;
 
@@ -124,6 +121,12 @@ ov::Tensor Bank::eval_and_alloc(const LazyTensor& tensor,
     guard.unlock();  // Unlock the guard, map update is done - copy can continue in parallel
 
     transformed_tensor.copy_to(allocated_tensor);
+
+    // Detach the evaluated LazyTensor from its memory here - when it is 100%
+    // not needed anymore (transformations, if any, and copies are done)
+    // Note: this is the non-CPU path!
+    const_cast<LazyTensor&>(tensor).detach();
+
     return allocated_tensor;
 }
