Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify meaning of empty version list #138

Merged
merged 3 commits into from
Dec 16, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 11 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,17 +12,18 @@ Current ecosystems:

## Usage

Malicious samples are available under the **[samples/](samples/)** folder and compressed as an encrypted ZIP file with the password `infected`. The date indicated as part of the file name is the
discovery date, not necessarily the package publication date.
Malicious samples are available under the **[samples/](samples/)** folder and compressed as an encrypted ZIP file with the password `infected`. The date indicated as part of the file name is the discovery date, not necessarily the package publication date.

You can use the script [extract.sh](./samples/pypi/extract.sh) to automatically extract all the samples to perform local analysis on them. Alternatively, you can extract a single sample using:

```
$ unzip -o -P infected samples/pypi/2023-03-20-pydefender-v1.0.0.zip -d /tmp/
Archive: samples/pypi/2023-03-20-pydefender-v1.0.0.zip
$ unzip -o -P infected samples/pypi/pydefender/1.0.0/2023-03-20-pydefender-v1.0.0.zip -d /tmp/
Archive: samples/pypi/pydefender/1.0.0/2023-03-20-pydefender-v1.0.0.zip
creating: /tmp/2023-03-20-pydefender-v1.0.0/
```

Each [samples/](samples/) subdirectory contains a `manifest.json` file that identifies the packages, and the versions of those packages, that comprise the samples collected for each ecosystem. You can use these files to quickly search the dataset for particular samples.

## License

This dataset is released under the Apache-2.0 license. You're welcome to use it with attribution.
Expand Down Expand Up @@ -63,6 +64,12 @@ We will be regularly adding new packages to the dataset.

Every single software package included in this dataset has been manually triaged by a human.

### What if the `manifest.json` entry for a package has an empty version list?

Around 250 packages in the PyPI subset do not have any affected versions listed in their `manifest.json` entries. These cases are holdovers from the earliest days of the project before version information was attached to the sample names.

If you intend to use this dataset to screen packages for known-maliciousness, then **all** versions of packages with empty version lists should be considered malicious.

### How are you clustering these packages?

At the time, we did not make available the clustering algorithm we use internally to group similar samples and ease analysis. If you have interest, please reach out at [email protected] -
Expand Down