Add datadog certifier #2366

Open
wants to merge 1 commit into
base: main
from

Conversation

robert-cronin
Contributor

@robert-cronin robert-cronin commented Dec 12, 2024

Description of the PR

Fixes #2345

I am not sure if there is a need for a parser or attestation since we're just ingesting CertifyBad for a particular pURL, but if there is a need to represent the source information in a predicate, I'd be happy to try and figure out how to add that in.

PR Checklist

  • All commits have a Developer Certificate of Origin (DCO) -- they are generated using -s flag to git commit.
  • All new changes are covered by tests
  • If GraphQL schema is changed, make generate has been run
  • If GraphQL schema is changed, GraphQL client updates/additions have been made
  • If OpenAPI spec is changed, make generate has been run
  • If ent schema is changed, make generate has been run
  • If collectsub protobuf has been changed, make proto has been run
  • All CI checks are passing (tests and formatting)
  • All dependent PRs have already been merged

@funnelfiasco
Contributor

As a general comment, I wonder if we want to call it something more specific than "DataDog"? "DataDog Malicious Packages DataSet" is unwieldy, but I'm concerned that there might be some future thing that pulls from DataDog proper and the name is already taken. I don't have any great ideas and this may not be a concern worth worrying about right now, but I wanted to raise it.

@robert-cronin
Contributor Author

robert-cronin commented Dec 13, 2024

> As a general comment, I wonder if we want to call it something more specific than "DataDog"? "DataDog Malicious Packages DataSet" is unwieldy, but I'm concerned that there might be some future thing that pulls from DataDog proper and the name is already taken. I don't have any great ideas and this may not be a concern worth worrying about right now, but I wanted to raise it.

Yeah, that is a solid point. If DataDog eventually spins out other datasets, I can see how that might cause some confusion. The data itself mostly comes from GuardDog, but I think not exclusively. Maybe we can go with something like datadog-malware-dataset or datadog-mspd, though mspd is not a known acronym. The alternative is datadog-malicious-software-packages-dataset but, like you said, that is a bit unwieldy.
The datadog-malware-dataset one sounds like the best compromise to me between clarity and brevity.

@pxp928
Collaborator

pxp928 commented Dec 19, 2024

Thanks @robert-cronin! Sorry for the delay. We will review this soon!

@robert-cronin
Contributor Author

> Thanks @robert-cronin! Sorry for the delay. We will review this soon!

No problems, thanks @pxp928!

@pxp928 pxp928 added the needs-review Needs writer LGTM label Jan 6, 2025
Contributor

@lumjjb lumjjb left a comment

This is a super cool addition. I wasn't aware of this dataset but this was a really cool implementation and it's such a good example of how to add another data source easily (or at least you made it look easy! - any feedback on how to make this easier would be super great as well, or any particular frictions you had). Thanks so much for yet another great contribution! 🙌

opt(d)
}

if err := d.fetchManifests(); err != nil {
Contributor

It looks like manifests are fetched once on initialization. Given that the database will be updated regularly, it would be helpful to refresh the manifests at some regular interval. Is this feasible?

Contributor Author

Yes, I will add it in!
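
One way the periodic refresh discussed in this thread could look is sketched below. It is only an illustration under stated assumptions: `manifestCache`, `newManifestCache`, `refresh`, `contains`, and `refreshLoop` are hypothetical names, and the `fetch` field stands in for whatever `d.fetchManifests()` does in the PR. The sketch swaps in a new manifest on a `time.Ticker` and keeps the previous copy when a fetch fails.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// manifestCache holds the most recently fetched set of malicious package
// names. All names in this sketch are hypothetical, not the PR's types.
type manifestCache struct {
	mu       sync.RWMutex
	packages map[string]bool
	fetch    func() (map[string]bool, error) // stand-in for the real manifest fetcher
}

func newManifestCache(fetch func() (map[string]bool, error)) *manifestCache {
	return &manifestCache{packages: map[string]bool{}, fetch: fetch}
}

// refresh fetches a new manifest and swaps it in. On error the previous
// manifest is kept, so lookups continue against slightly stale data.
func (c *manifestCache) refresh() error {
	fresh, err := c.fetch()
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.packages = fresh
	c.mu.Unlock()
	return nil
}

// contains reports whether name is present in the current manifest.
func (c *manifestCache) contains(name string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.packages[name]
}

// refreshLoop re-fetches the manifest every interval until ctx is cancelled.
func (c *manifestCache) refreshLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := c.refresh(); err != nil {
				log.Printf("manifest refresh failed, keeping previous copy: %v", err)
			}
		}
	}
}
```

A caller would presumably run one `refresh()` during initialization and then start `go cache.refreshLoop(ctx, interval)`, with the interval exposed as a certifier option. What interval suits this dataset is not something the sketch tries to answer.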

if pkgInput.Namespace != nil && *pkgInput.Namespace != "" {
namespace := strings.TrimPrefix(*pkgInput.Namespace, "@")
namespace = strings.TrimPrefix(namespace, "%40")
fullName = "@" + namespace + "/" + pkgInput.Name
Contributor

It looks like packages in the dataset don't always start with "@". Could we add a check here after the trim to see whether the namespace had the prefix, and add the "@" only if a prefix was actually trimmed?

Contributor Author

Certainly, I've added something to this effect 👍
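
The check requested in this thread might look roughly like the following. This is a sketch, not the code that landed in the PR: `npmFullName` is a hypothetical helper, and joining an unscoped namespace to the name with "/" is an assumption about how such entries appear in the dataset.

```go
package main

import "strings"

// npmFullName builds the name used to look a package up in the dataset.
// The "@" is restored only when the namespace originally carried a scope
// prefix ("@" or its URL-encoded form "%40"). Hypothetical helper; the
// PR's actual implementation may differ.
func npmFullName(namespace, name string) string {
	if namespace == "" {
		return name
	}
	trimmed := strings.TrimPrefix(namespace, "@")
	trimmed = strings.TrimPrefix(trimmed, "%40")
	// If trimming changed the string, the namespace had a scope prefix.
	hadScope := trimmed != namespace
	if hadScope {
		return "@" + trimmed + "/" + name
	}
	// Assumption: unscoped namespaces are joined to the name as-is.
	return namespace + "/" + name
}
```

With this helper, `npmFullName("%40angular", "core")` and `npmFullName("@angular", "core")` both yield `@angular/core`, while `npmFullName("angular", "core")` yields `angular/core` with no "@" added.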

}

// NewDatadogMalwareCertifier initializes the Datadog Malicious Software Packages certifier
func NewDatadogMalwareCertifier(ctx context.Context, assemblerFunc assemblerFuncType, opts ...CertifierOption) (certifier.Certifier, error) {
Contributor

Can you add a bit of documentation here on what the Datadog malicious software packages are, for those who are not familiar?

In addition, could you add some details on:

  • The added predicates on the graph
  • The recommended interval times (considering that the current certifier will generate a certifyBad each time indefinitely).
  • Any caveats: see comment on periodic fetching of manifest.

Contributor Author

Sure thing, I'll add some documentation along these lines

@robert-cronin
Contributor Author

> This is a super cool addition. I wasn't aware of this dataset but this was a really cool implementation and it's such a good example of how to add another data source easily (or at least you made it look easy! - any feedback on how to make this easier would be super great as well, or any particular frictions you had). Thanks so much for yet another great contribution! 🙌

Thanks @lumjjb! I really appreciate your encouraging words 😃 In terms of frictions, I think there are some options for improving the scalability of adding new data sources, be they collectors or certifiers. Perhaps one idea is to define a common interface that any collector or certifier must implement and then have a registrar, similar to how the backend works today, in the spirit of deduplication. There is also some common logic in the certifiers/collectors, like initialising NATS, calling the ingestion flow, emitters, etc. Not sure how much of that will be changing in v2.0, but it might be worth looking into.
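
To make the registrar idea above a little more concrete, here is a minimal sketch of what a shared interface plus a name-keyed registry could look like. Everything in it is hypothetical: `Certifier`, `Factory`, `Register`, `NewCertifier`, `Registered`, and `noopCertifier` are illustrative names and do not reflect the project's existing certifier interface or backend registration code.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

// Certifier is a hypothetical minimal interface that every data source
// would implement; the project's real interface has a different shape.
type Certifier interface {
	Name() string
	Run(ctx context.Context) error
}

// Factory constructs a Certifier on demand.
type Factory func() (Certifier, error)

var (
	registryMu sync.Mutex
	registry   = map[string]Factory{}
)

// Register adds a factory under a unique name and rejects duplicates.
func Register(name string, f Factory) error {
	registryMu.Lock()
	defer registryMu.Unlock()
	if _, exists := registry[name]; exists {
		return fmt.Errorf("certifier %q already registered", name)
	}
	registry[name] = f
	return nil
}

// NewCertifier looks up a factory by name and builds the certifier.
func NewCertifier(name string) (Certifier, error) {
	registryMu.Lock()
	f, ok := registry[name]
	registryMu.Unlock()
	if !ok {
		return nil, fmt.Errorf("unknown certifier %q", name)
	}
	return f()
}

// Registered returns the registered names in sorted order.
func Registered() []string {
	registryMu.Lock()
	defer registryMu.Unlock()
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// noopCertifier is a placeholder implementation used only for illustration.
type noopCertifier struct{ name string }

func (n noopCertifier) Name() string                  { return n.name }
func (n noopCertifier) Run(ctx context.Context) error { return ctx.Err() }
```

Under this design each data source would call `Register` once (for example from an `init` function), and shared setup such as messaging or ingestion wiring could live behind the registry rather than being repeated in every certifier.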

Signed-off-by: robert-cronin <[email protected]>
@robert-cronin robert-cronin force-pushed the feat/datadog-certifier branch from d2f86e2 to 4aebd31 Compare January 16, 2025 02:35
@robert-cronin
Contributor Author

Hello @lumjjb! Your suggestions have been implemented and all outstanding changes addressed. If you have any other suggestions, let me know.

Labels
needs-review Needs writer LGTM size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature] Add support for DataDog's malicious software package dataset
4 participants