NVMe namespace management
Describe how to view total NVMe space available, used, or leaked. Describe how
to reclaim leaked NVMe space.

Signed-off-by: Dean Roehrich <[email protected]>
roehrich-hpe committed Jul 10, 2024
1 parent 703a2e8 commit 5391920
Showing 2 changed files with 75 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/guides/index.md
@@ -16,6 +16,7 @@
* [Lustre External MGT](external-mgs/readme.md)
* [Global Lustre](global-lustre/readme.md)
* [Directive Breakdown](directive-breakdown/readme.md)
* [NVMe Namespaces](nvme-namespaces/readme.md)

## NNF User Containers

@@ -24,3 +25,4 @@
## Node Management

* [Draining A Node](node-management/drain.md)
* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
73 changes: 73 additions & 0 deletions docs/guides/node-management/nvme-namespaces.md
@@ -0,0 +1,73 @@
# Debugging NVMe Namespaces

## Total Space Available or Used

Find the total space available, and the total space used, on a Rabbit node using the Redfish API. One way to access the API is to use the `nnf-node-manager` pod on that node.

To view the space on node `ee50`, first find its `nnf-node-manager` pod:

```console
[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager
nnf-system nnf-node-manager-jhglm 1/1 Running 0 61m 10.85.71.11 ee50 <none> <none>
```

Then exec into that pod and query the Redfish API to view the `AllocatedBytes` and `GuaranteedBytes`:

```console
[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq
{
  "@odata.id": "/redfish/v1/StorageServices/NNF/CapacitySource",
  "@odata.type": "#CapacitySource.v1_0_0.CapacitySource",
  "Id": "0",
  "Name": "Capacity Source",
  "ProvidedCapacity": {
    "Data": {
      "AllocatedBytes": 128849888,
      "ConsumedBytes": 128849888,
      "GuaranteedBytes": 307132496928,
      "ProvisionedBytes": 307261342816
    },
    "Metadata": {},
    "Snapshot": {}
  },
  "ProvidedClassOfService": {},
  "ProvidingDrives": {},
  "ProvidingPools": {},
  "ProvidingVolumes": {},
  "Actions": {},
  "ProvidingMemory": {},
  "ProvidingMemoryChunks": {}
}
```
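
The two fields can be pulled out of a saved response and compared with a small helper. This is only a sketch: the function names `capacity_field` and `capacity_summary` are hypothetical, and treating `GuaranteedBytes - AllocatedBytes` as the remaining space is an assumption about how these fields relate, not something the Redfish output states. It uses `sed` rather than `jq` so that it needs nothing beyond a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helpers for summarizing a CapacitySource response read from stdin.

# Print the integer value of the named JSON field (first occurrence only).
capacity_field() {
    sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Print AllocatedBytes, GuaranteedBytes, and their difference in bytes.
# ASSUMPTION: the difference approximates the space still available.
capacity_summary() {
    json=$(cat)
    allocated=$(printf '%s\n' "$json" | capacity_field AllocatedBytes)
    guaranteed=$(printf '%s\n' "$json" | capacity_field GuaranteedBytes)
    if [ -z "$allocated" ] || [ -z "$guaranteed" ]; then
        echo "capacity fields not found in input" >&2
        return 1
    fi
    echo "allocated=$allocated guaranteed=$guaranteed remaining=$((guaranteed - allocated))"
}
```

For example, save the `curl` output from the pod to a file and run `capacity_summary < capacity.json`.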

## Total Orphaned or Leaked Space

To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations, there should be no `NnfNodeBlockStorages` in the Kubernetes namespace that has the Rabbit's name:

```console
[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -n ee50
No resources found in ee50 namespace.
```

To check that there are no orphaned namespaces, use the `nvme` command while logged into that Rabbit node:

```console
[root@ee50:~]# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S666NN0TB11877 SAMSUNG MZ1L21T9HCLS-00A07 1 8.57 GB / 1.92 TB 512 B + 0 B GDC7302Q
```

There should be no namespaces on the KIOXIA drives:

```console
[root@ee50:~]# nvme list | grep -i kioxia
[root@ee50:~]#
```
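
The same check can be turned into a count that is easier to use from a script. A minimal sketch, assuming every KIOXIA namespace appears as one line of `nvme list` output with the vendor name in the model column; the function name `count_kioxia_namespaces` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count lines of `nvme list` output (read from stdin)
# that mention a KIOXIA drive, ignoring case.
count_kioxia_namespaces() {
    # grep -c still prints 0 when nothing matches, but exits non-zero,
    # so the exit status is ignored here.
    grep -ci 'kioxia' || true
}
```

On the Rabbit node this would be run as `nvme list | count_kioxia_namespaces`; a result of `0` matches the empty `grep` output shown above.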

If namespaces are listed and there are no `NnfNodeBlockStorages` on the node, the namespaces must be deleted through the Rabbit software. The `NnfNodeECData` resource is a persistent data store for the allocations that should exist on the Rabbit. Deleting it, and then deleting the `nnf-node-manager` pod, causes `nnf-node-manager` to delete the orphaned namespaces. This can take a few minutes after the pod is deleted:

```console
kubectl delete nnfnodeecdata ec-data -n ee50
kubectl delete pod -n nnf-system nnf-node-manager-jhglm
```
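
Because this reclaim step is only appropriate when the Rabbit has no allocations, the two deletes can be wrapped in a guard. The following is a sketch under stated assumptions, not part of the Rabbit software: `reclaim_orphans` is a hypothetical name, and it relies on `kubectl get --no-headers` writing nothing to stdout when no resources exist.

```shell
#!/bin/sh
# Hypothetical helper: reclaim orphaned namespaces on one Rabbit, but only
# when no NnfNodeBlockStorages remain in that Rabbit's namespace.
reclaim_orphans() {
    rabbit=$1   # Rabbit node name, also its Kubernetes namespace (e.g. ee50)
    pod=$2      # name of that node's nnf-node-manager pod

    # ASSUMPTION: stdout is empty when there are no resources to list.
    if ! remaining=$(kubectl get nnfnodeblockstorage -n "$rabbit" --no-headers 2>/dev/null); then
        echo "could not list NnfNodeBlockStorages in $rabbit" >&2
        return 1
    fi
    if [ -n "$remaining" ]; then
        echo "refusing: allocations still exist on $rabbit" >&2
        return 1
    fi

    # Same two commands as above, run only if the first one succeeds.
    kubectl delete nnfnodeecdata ec-data -n "$rabbit" &&
        kubectl delete pod -n nnf-system "$pod"
}
```

With the names from this guide, the call would be `reclaim_orphans ee50 nnf-node-manager-jhglm`.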
