-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Receiver resource filtering with CEL #948
base: main
Are you sure you want to change the base?
Conversation
@stefanprodan @matheuscscp I reworked it to filter the resources using CEL. I'm not quite sure that this is really that valuable, I suspect label filtering is also more efficient? There's some optimisation of the CEL that could be done within |
4969c46
to
1fe996b
Compare
d87dbfd
to
b1c4e7e
Compare
45bd724
to
1e6929c
Compare
1e6929c
to
d658d4a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for all the work here, @bigkevmcd! I'm very excited for this feature 😄
This is just a preliminary review. I will try to review this again soon.
5a22c4f
to
bddfd56
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @bigkevmcd 🥇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @bigkevmcd! Pretty cool contribution 👍
8e892ef
to
18c1f31
Compare
@bigkevmcd this is good to go, please signoff all your commits and force push. |
8a9e435
to
57c36b0
Compare
3f15050
to
9d7f593
Compare
internal/server/cel.go
Outdated
return nil, err | ||
} | ||
|
||
out, _, err := prg.Eval(map[string]any{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using ContextEval()
and pass a context to this so that it can be cancelled?
https://pkg.go.dev/github.com/google/[email protected]/cel#Program
As per their docs, this also requires setting InterruptCheckFrequency ProgramOption to be effective. There's an example in https://github.com/google/cel-go/blob/v0.21.0/cel/cel_test.go#L127.
Thinking more about preventing long running expensive evaluations, I'm not sure what would be a good way to set a limit on the evaluation.
We can set a fixed context timeout for just this operation, regardless of the timeout of the incoming request. But deciding on the timeout value is difficult. (UPDATE: Thinking more about it, I don't think it's difficult anymore. There's no reason to allow a single evaluation to take a long time. We can set a limit of 10-15s. For worst case on slow machines to be safe, it could be set to 30s. If that's not enough, we could provide a flag to make this configurable, similar to how we provide flag options for configuring the size of helm index and chart max size limit in source-controller. I think it's reasonable to at least set this to limit any potential abuse.)
There may be some other ways to do it. I see there's some options like cost estimator, recursion limit, expression size limits, etc which may be useful to limit the evaluation. I don't know much about them yet. Need to read more.
err := server.ValidateCELExpression(filter) | ||
if err != nil { | ||
conditions.MarkFalse(obj, meta.ReadyCondition, apiv1.InvalidCELExpressionReason, "%s", err) | ||
obj.Status.WebhookPath = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we perform this validation above first in this function, then we don't need to remove this status value here.
I think that ordering is more correct. We write the status only once we have performed all the validations.
In the case where the Receiver was good at first and got a bad configuration in a subsequent version, since the receiver handler checks the ready status of the receiver before processing requests, it should be okay to have a stale webhook path in the status. We do something similar in other flux APIs, keeping the previously successful result to just have a record, next to a failing status. But I don't see any issue in resetting it with empty value considering how the webhook path is used.
if err != nil { | ||
conditions.MarkFalse(obj, meta.ReadyCondition, apiv1.InvalidCELExpressionReason, "%s", err) | ||
obj.Status.WebhookPath = "" | ||
logger.Error(err, "parsing CEL resource filter expression") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the objects get stalled, in addition to logging, we generally also emit a kubernetes warning event. Something like
r.Event(obj, corev1.EventTypeWarning, apiv1.InvalidCELExpressionReason, "parsing CEL resource filter expression")
Should result in something like
LAST SEEN TYPE REASON OBJECT MESSAGE
3s Warning InvalidCELExpression receiver/test-receiver parsing CEL resource filter expression
if filter := obj.Spec.ResourceFilter; filter != "" { | ||
err := server.ValidateCELExpression(filter) | ||
if err != nil { | ||
conditions.MarkFalse(obj, meta.ReadyCondition, apiv1.InvalidCELExpressionReason, "%s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking the Ready condition to false results in the following status:
status:
conditions:
- lastTransitionTime: "2024-11-04T16:07:03Z"
message: Reconciliation in progress
observedGeneration: 8
reason: ProgressingWithRetry
status: "True"
type: Reconciling
- lastTransitionTime: "2024-11-04T16:07:03Z"
message: |-
failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
| request.body.tag.contains(
| ..........................^
observedGeneration: 8
reason: InvalidCELExpression
status: "False"
type: Ready
observedGeneration: 6
It says that the object is still being reconciled. This is due to how the reconciliation result is being interpreted in patch()
at the bottom of this file. In addition to setting Ready=False, marking the object as stalled with
conditions.MarkStalled(obj, apiv1.InvalidCELExpressionReason, "%s", err)
should delete any existing Reconciling condition, which will prevent the reconciling with retry condition reason to be set, refer
notification-controller/internal/controller/receiver_controller.go
Lines 215 to 218 in 2ebbf48
conditions.Has(obj, meta.ReconcilingCondition) { | |
rc := conditions.Get(obj, meta.ReconcilingCondition) | |
rc.Reason = meta.ProgressingWithRetryReason | |
conditions.Set(obj, rc) |
This should result in the status to look like:
status:
conditions:
- lastTransitionTime: "2024-11-04T16:33:55Z"
message: |-
failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
| request.body.tag.contains(
| ..........................^
observedGeneration: 8
reason: InvalidCELExpression
status: "True"
type: Stalled
- lastTransitionTime: "2024-11-04T16:07:03Z"
message: |-
failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
| request.body.tag.contains(
| ..........................^
observedGeneration: 8
reason: InvalidCELExpression
status: "False"
type: Ready
observedGeneration: 6
Ready=False and Stalled=True looks good above. But observedGeneration: 6
is not correct for stalled scenario. We are not retrying anymore, we have processed this version of the object, hence observed generation should equal the object generation. To update the observed generation, we can set a patch option if stalled condition is true before patching in
notification-controller/internal/controller/receiver_controller.go
Lines 220 to 221 in 2ebbf48
// Patch the object status, conditions and finalizers. |
// Update observed generation when stalled.
if conditions.IsStalled(obj) {
patchOpts = append(patchOpts, patch.WithStatusObservedGeneration{})
}
This should result in the final status to be
status:
conditions:
- lastTransitionTime: "2024-11-04T16:33:55Z"
message: |-
failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
| request.body.tag.contains(
| ..........................^
observedGeneration: 8
reason: InvalidCELExpression
status: "True"
type: Stalled
- lastTransitionTime: "2024-11-04T16:07:03Z"
message: |-
failed to parse expression request.body.tag.contains(: ERROR: <input>:1:27: Syntax error: mismatched input '<EOF>' expecting {'[', '{', '(', ')', '.', '-', '!', 'true', 'false', 'null', NUM_FLOAT, NUM_INT, NUM_UINT, STRING, BYTES, IDENTIFIER}
| request.body.tag.contains(
| ..........................^
observedGeneration: 8
reason: InvalidCELExpression
status: "False"
type: Ready
observedGeneration: 8
Which I believe is correct and in alignment with what we do in other Flux APIs.
Most of these conventions aren't documented anywhere, except for some github gists I created when we were trying to fit kstatus in flux API, for example https://gist.github.com/darkowlzz/969c90b2f309908a6d71dd861ba69653 which may be out of date by now.
Please ask if anything looks incorrect above or any clarification is needed.
348c595
to
2636e01
Compare
checked, issues := env.Check(parsed) | ||
if issues != nil && issues.Err() != nil { | ||
return nil, fmt.Errorf("expression %v check failed: %w", expr, issues.Err()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the expression is expected to return a boolean result, as checked in newCELEvaluator()
below, we can check the expression output type right after parsing and checking to make sure only expressions that result in boolean results are accepted.
In all the tutorials of cel-go they use reflection to do the same, but I think something like the following should be enough
if checked.OutputType() != types.BoolType {
return nil, fmt.Errorf("invalid expression output type %v", checked.OutputType())
}
internal/server/cel.go
Outdated
mapStrDyn := decls.NewMapType(decls.String, decls.Dyn) | ||
return cel.NewEnv( | ||
celext.Strings(), | ||
notifications(), | ||
cel.Declarations( | ||
decls.NewVar("resource", mapStrDyn), | ||
decls.NewVar("request", mapStrDyn), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going through cel-go's common built-in types, I came across checkedWellKnowns
which has some json specific types. "google.protobuf.Struct"
seems to be very similar to what's defined above as mapStrDyn
, a map of string and a dynamic value. I tried using that instead with:
return cel.NewEnv(
celext.Strings(),
notifications(),
cel.Variable("resource", cel.ObjectType("google.protobuf.Struct")),
cel.Variable("request", cel.ObjectType("google.protobuf.Struct")),
)
and it seems to work. Can we use it instead?
internal/server/cel.go
Outdated
cel.Function("first", | ||
cel.MemberOverload("first_list", []*cel.Type{listStrDyn}, cel.DynType, | ||
cel.UnaryBinding(listFirst))), | ||
cel.Function("last", | ||
cel.MemberOverload("last_list", []*cel.Type{listStrDyn}, cel.DynType, | ||
cel.UnaryBinding(listLast))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These library functions are tested as part of the event handler.
If it's not a lot to ask, being able to test these functions independently would make it easier to maintain and also use as a reference for future additions. Something like a typical table test with lists of different element types would be nice to have.
@darkowlzz It'll be a couple of weeks before I can devote more time to these things, but I will get to them. |
891a42c
to
757f96d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost concluding my CEL research with this. I don't think there's any hurry. This can be addressed whenever you're back to it. 🙂
if err != nil { | ||
return nil, err | ||
} | ||
parsed, issues := env.Parse(expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In continuation to my comment in #948 (comment) regarding setting some restrictions and limits, I think I have some clarity now. I would like to discuss this with others as well, below and also in next week's Flux dev meeting.
In addition to setting a fixed timeout for the evaluation, CEL accepts limits on the parse for:
- max recursion depth
- expression size code point limit
and limits on the program cost, which is indicative of the CPU usage.
As per https://github.com/google/cel-go/blob/v0.22.0/parser/parser.go#L51-L70, max recursion depth and expression size code point limit have a default value when not set. I think we should make these configurable as we don't set any expression length limit in the Receiver API field and some admins may want to only allow low recursion depth in the expression. Refer https://github.com/google/cel-go/blob/v0.22.0/cel/options.go#L642-L656 for EnvOption
for them.
For the program cost limit, refer https://github.com/google/cel-go/blob/v0.22.0/cel/options.go#L512 for ProgramOption
, there doesn't seem to be any default value, which means the evaluation can potentially consume a lot of CPU by default. It would be nice to provide ability to set a limit on it.
If the above and previous comment is agreed, I believe we can introduce four new flags for setting this at the controller level. Leaving suggestion for potential flag names below
--receiver-cel-parser-max-recursion-depth
--receiver-cel-parser-expression-size-point-limit
--receiver-cel-program-cost-limit
--receiver-cel-eval-timeout
In addition, in the CEL-spec, I found some relevant details about the space and time complexity of the expressions under the performance section, refer https://github.com/google/cel-spec/blob/v0.18.0/doc/langdef.md#performance for details.
I hope these would provide enough controls for the admins to address any potential abuse of arbitrary code execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@darkowlzz have you looked into how Kubernetes provides these controls? Maybe we can use the same args as them.
Here are their default limits: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/apis/cel/config.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I didn't look at kubernetes' CEL use yet. I was mostly trying to independently understand CEL and find justifiable reasons to consider adding such limits.
Thanks for sharing the CEL configuration kubernetes uses. I think this provides enough reasons to consider adding the limits in our use as well. I see that they have a limit for the incoming request size that's evaluated by CEL. I think we need to consider that as well. I need to read and learn more about how kubernetes is using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on leaning on k8s' default limits as a starting point. It might be worth just confirming whether the payload and expression sizes for this use case are equivalent, so that those limits would not be too forgiving.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't gone through the kubernetes CEL usage yet but I would like to share some updates from last week's discussion about this in the dev meeting.
We discussed that it may be better to not provide all the limit controls flags at the beginning, but instead have reasonable default limits. Only if the users ask to raise some limits, we can add flag for that. The default limit values could be based on what kubernetes uses but maybe even lower that fits better for our use case.
Regarding the CEL evaluation timeout that I proposed initially, instead of a timeout per CEL evaluation, it would be better to have a fixed timeout for the receiver handler, similar to what we have for the event handler, refer
ctx, cancel := context.WithTimeout(r.Context(), 15*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bigkevmcd really good stuff, I am looking forward to seeing this landing. 🙇
internal/server/receiver_handlers.go
Outdated
return fmt.Errorf("matchLabels field not set when using wildcard '*' as name") | ||
} | ||
|
||
logger.V(1).Info(fmt.Sprintf("annotate resources by matchLabel for kind '%s' in '%s'", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.V(1).Info(fmt.Sprintf("annotate resources by matchLabel for kind '%s' in '%s'", | |
logger.V(1).Info(fmt.Sprintf("annotate resources by matchLabel for kind %q in %q", |
|
||
var evaluator func(context.Context, client.Object) (*bool, error) | ||
if receiver.Spec.ResourceFilter != "" { | ||
evaluator, err = newCELEvaluator(receiver.Spec.ResourceFilter, r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any chance that a maliciously crafted filter could result in sensitive information from the request being exposed - perhaps via the returned error?
if err != nil { | ||
return nil, err | ||
} | ||
parsed, issues := env.Parse(expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on leaning on k8s' default limits as a starting point. It might be worth just confirming whether the payload and expression sizes for this use case are equivalent, so that those limits would not be too forgiving.
That's 3MiB, I've definitely seen GitHub hook notifications that were > 1.5MiB. |
757f96d
to
1645796
Compare
CEL for Receiver notification filtering This introduces CEL for filtering CEL resources in a Receiver. Users can define a CEL expression that is applied as a filter for resources that are identified for notification. A CEL expression that returns false means that a resource will not be annotated. Signed-off-by: Kevin McDermott <[email protected]>
This moves the JSON body to request.body from request to allow for future expansion with the headers. Add documentation for the CEL functionality to the receivers doc. Signed-off-by: Kevin McDermott <[email protected]>
Signed-off-by: Kevin McDermott <[email protected]>
Signed-off-by: Kevin McDermott <[email protected]>
Signed-off-by: Kevin McDermott <[email protected]>
Signed-off-by: Kevin McDermott <[email protected]>
Refactor to reduce duplication. Signed-off-by: Kevin McDermott <[email protected]>
If the .spec.resourceFilter is provided, and can't be parsed by the CEL environment, the resource will not be ready, and an appropriate error indicated. Signed-off-by: Kevin McDermott <[email protected]>
Signed-off-by: Kevin McDermott <[email protected]>
In the event of a CEL expresion error, this doesn't return the error to the controller-runtime reconciler mechanism. This means that the reconciliation will not be retried until there's a change to the resource. Signed-off-by: Kevin McDermott <[email protected]>
* Don't pass log through - get it from the context * Use ContextEval - need to set a timeout on the context that's passed Signed-off-by: Kevin McDermott <[email protected]>
This adds more testing for the CEL evaluation mechanism for resource filtering. Signed-off-by: Kevin McDermott <[email protected]>
Signed-off-by: Kevin McDermott <[email protected]>
1645796
to
f79599e
Compare
This implementation allows filtering of wildcarded resources for receivers using CEL expressions.
Closes: #491
Spec: #491 (comment)