Merge pull request #138 from DataDog/ikretz/no-versions
Clarify meaning of empty version list
ikretz authored Dec 16, 2024
2 parents 3af5036 + e2d2f48 commit 98fdbcc
Showing 1 changed file with 11 additions and 4 deletions.
15 changes: 11 additions & 4 deletions README.md
@@ -12,17 +12,18 @@ Current ecosystems:

## Usage

Malicious samples are available under the **[samples/](samples/)** folder and compressed as an encrypted ZIP file with the password `infected`. The date indicated as part of the file name is the
discovery date, not necessarily the package publication date.
Malicious samples are available under the **[samples/](samples/)** folder and compressed as an encrypted ZIP file with the password `infected`. The date indicated as part of the file name is the discovery date, not necessarily the package publication date.

You can use the script [extract.sh](./samples/pypi/extract.sh) to automatically extract all the samples to perform local analysis on them. Alternatively, you can extract a single sample using:

```
$ unzip -o -P infected samples/pypi/2023-03-20-pydefender-v1.0.0.zip -d /tmp/
Archive: samples/pypi/2023-03-20-pydefender-v1.0.0.zip
$ unzip -o -P infected samples/pypi/pydefender/1.0.0/2023-03-20-pydefender-v1.0.0.zip -d /tmp/
Archive: samples/pypi/pydefender/1.0.0/2023-03-20-pydefender-v1.0.0.zip
creating: /tmp/2023-03-20-pydefender-v1.0.0/
```

Each [samples/](samples/) subdirectory contains a `manifest.json` file that identifies the packages, and the versions of those packages, that comprise the samples collected for each ecosystem. You can use these files to quickly search the dataset for particular samples.
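
As a rough illustration of such a search, the sketch below loads a manifest and lists the entries whose names contain a query string. It assumes, which the text above does not spell out, that `manifest.json` is a JSON object mapping each package name to a list of version strings. The `samples/pypi/manifest.json` path and the `load_manifest` and `find_packages` helpers are hypothetical names used for illustration only.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Load a manifest file; assumed shape: {"package-name": ["1.0.0", ...]}."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_packages(manifest, query):
    """Return {name: versions} for every package whose name contains `query`."""
    needle = query.lower()
    return {
        name: versions
        for name, versions in manifest.items()
        if needle in name.lower()
    }


if __name__ == "__main__":
    # Hypothetical location; adjust to the ecosystem subdirectory you need.
    manifest_path = Path("samples/pypi/manifest.json")
    if manifest_path.exists():
        manifest = load_manifest(manifest_path)
        matches = find_packages(manifest, "pydefender")
        for name, versions in sorted(matches.items()):
            print(name, versions)
```

The substring match is case-insensitive so that a partial name is enough to locate a sample; swap in an exact comparison if you need strict lookups.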

## License

This dataset is released under the Apache-2.0 license. You're welcome to use it with attribution.
@@ -63,6 +64,12 @@ We will be regularly adding new packages to the dataset.

Every single software package included in this dataset has been manually triaged by a human.

### What if the `manifest.json` entry for a package has an empty version list?

Around 250 packages in the PyPI subset do not have any affected versions listed in their `manifest.json` entries. These cases are holdovers from the earliest days of the project before version information was attached to the sample names.

If you intend to use this dataset to screen packages for known-maliciousness, then **all** versions of packages with empty version lists should be considered malicious.
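
A minimal sketch of that screening rule follows. It assumes a manifest has already been parsed into a mapping from package name to a list of version strings, which is an assumption about the file layout rather than something stated here. The `is_known_malicious` helper is hypothetical and not part of the dataset's tooling.

```python
def is_known_malicious(manifest, name, version):
    """Screen one package release against a loaded manifest.

    An empty version list means no version information was recorded,
    so every version of that package is treated as malicious.
    """
    if name not in manifest:
        return False  # package is not in the dataset
    versions = manifest[name]
    if not versions:
        return True  # empty list: flag all versions
    return version in versions


# Illustrative data only; these package names are made up.
example = {"legacy-sample": [], "versioned-sample": ["1.0.0"]}
print(is_known_malicious(example, "legacy-sample", "9.9.9"))     # True
print(is_known_malicious(example, "versioned-sample", "1.0.0"))  # True
print(is_known_malicious(example, "versioned-sample", "2.0.0"))  # False
print(is_known_malicious(example, "unlisted", "1.0.0"))          # False
```

Name matching in this sketch is an exact string comparison; normalizing package names before the lookup is left out for brevity.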

### How are you clustering these packages?

At the time, we did not make available the clustering algorithm we use internally to group similar samples and ease analysis. If you have interest, please reach out at [email protected] -