From 6a9df4e362fcc2b26f8d85a6115c044f95b67783 Mon Sep 17 00:00:00 2001
From: Ian Kretz <44385082+ikretz@users.noreply.github.com>
Date: Mon, 16 Dec 2024 16:22:45 +0100
Subject: [PATCH 1/3] Clarify meaning of empty version list

---
 README.md | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 5e4c8051..91deeaa9 100644
--- a/README.md
+++ b/README.md
@@ -12,17 +12,18 @@ Current ecosystems:
 
 ## Usage
 
-Malicious samples are available under the **[samples/](samples/)** folder and compressed as an encrypted ZIP file with the password `infected`. The date indicated as part of the file name is the
-discovery date, not necessarily the package publication date.
+Malicious samples are available under the **[samples/](samples/)** folder and compressed as an encrypted ZIP file with the password `infected`. The date indicated as part of the file name is the discovery date, not necessarily the package publication date.
 
 You can use the script [extract.sh](./samples/pypi/extract.sh) to automatically extract all the samples to perform local analysis on them. Alternatively, you can extract a single sample using:
 
 ```
-$ unzip -o -P infected samples/pypi/2023-03-20-pydefender-v1.0.0.zip -d /tmp/
-Archive: samples/pypi/2023-03-20-pydefender-v1.0.0.zip
+$ unzip -o -P infected samples/pypi/pydefender/1.0.0/2023-03-20-pydefender-v1.0.0.zip -d /tmp/
+Archive: samples/pypi/pydefender/1.0.0/2023-03-20-pydefender-v1.0.0.zip
  creating: /tmp/2023-03-20-pydefender-v1.0.0/
 ```
 
+Each [samples/](samples/) subdirectory contains a `manifest.json` file that identifies the packages, and the versions of those packages, that comprise the samples collected for each ecosystem. You can use these files to quickly search the dataset for particular samples.
+
 ## License
 
 This dataset is released under the Apache-2.0 license. You're welcome to use it with attribution.
@@ -63,6 +64,12 @@ We will be regularly adding new packages to the dataset.
 
 Every single software package included in this dataset has been manually triaged by a human.
 
+### What does it mean when the `manifest.json` entry for a package has an empty version list?
+
+Around 250 packages in the PyPI subset do not have any affected versions listed in their `manifest.json` entries. These cases are holdovers from the earliest days of the project before version information was attached to the sample names.
+
+In such cases, it should be assumed that **all** versions of the package are malicious.
+
 ### How are you clustering these packages?
 
 At the time, we did not make available the clustering algorithm we use internally to group similar samples and ease analysis. If you have interest, please reach out at securitylabs@datadoghq.com
-

From e3e5fba1c99b7b5968c36a44875fdac88bdd0eee Mon Sep 17 00:00:00 2001
From: Ian Kretz <44385082+ikretz@users.noreply.github.com>
Date: Mon, 16 Dec 2024 16:30:20 +0100
Subject: [PATCH 2/3] Improve language

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 91deeaa9..15e847d5 100644
--- a/README.md
+++ b/README.md
@@ -68,7 +68,7 @@ Every single software package included in this dataset has been manually triaged
 
 Around 250 packages in the PyPI subset do not have any affected versions listed in their `manifest.json` entries. These cases are holdovers from the earliest days of the project before version information was attached to the sample names.
 
-In such cases, it should be assumed that **all** versions of the package are malicious.
+If you intend to use this dataset to screen packages for known-maliciousness, then **all** versions of packages with empty version lists should be considered malicious.
 
 ### How are you clustering these packages?
 

From e2d2f48e74ab475280caab4be147ca6eaf55620a Mon Sep 17 00:00:00 2001
From: Ian Kretz <44385082+ikretz@users.noreply.github.com>
Date: Mon, 16 Dec 2024 16:42:36 +0100
Subject: [PATCH 3/3] Shorten question

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 15e847d5..040ba0e5 100644
--- a/README.md
+++ b/README.md
@@ -64,7 +64,7 @@ We will be regularly adding new packages to the dataset.
 
 Every single software package included in this dataset has been manually triaged by a human.
 
-### What does it mean when the `manifest.json` entry for a package has an empty version list?
+### What if the `manifest.json` entry for a package has an empty version list?
 
 Around 250 packages in the PyPI subset do not have any affected versions listed in their `manifest.json` entries. These cases are holdovers from the earliest days of the project before version information was attached to the sample names.
 
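
Note on the screening rule these patches document: the sketch below shows one way a consumer might apply it, treating an empty version list as "all versions malicious". The patches do not spell out the `manifest.json` schema, so the layout used here (a JSON object mapping each package name to a list of version strings) is an assumption, and the helper names `load_manifest` and `is_known_malicious` and the sample package names are hypothetical.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Load a manifest.json file.

    Assumed layout (not confirmed by the patches):
    {"package-name": ["1.0.0", "1.0.1"], "other-package": []}
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))


def is_known_malicious(manifest, package, version):
    """Return True if (package, version) should be treated as malicious."""
    if package not in manifest:
        # The package does not appear in the dataset at all.
        return False
    versions = manifest[package]
    # An empty list (or a null value, handled defensively) carries no version
    # information, so every version of the package is considered malicious.
    if not versions:
        return True
    return version in versions


# Illustrative data only; these package names and versions are made up.
example_manifest = {
    "example-pkg-with-versions": ["1.0.0", "1.0.1"],
    "example-pkg-without-versions": [],
}

for name, ver in [
    ("example-pkg-with-versions", "1.0.0"),
    ("example-pkg-with-versions", "2.0.0"),
    ("example-pkg-without-versions", "9.9.9"),
    ("not-in-dataset", "1.0.0"),
]:
    print(name, ver, is_known_malicious(example_manifest, name, ver))
```

Under the assumed layout, a real manifest could be loaded with something like `load_manifest("samples/pypi/manifest.json")`. Checking for an empty list before the membership test matters: a plain `version in versions` lookup would report every version of those packages as clean.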