Circuit breaker #266

Draft: wants to merge 6 commits into master.
Conversation

Kamil-Lontkowski (Contributor):

This draft implements CircuitBreaker with features based on those provided by the circuit breaker from resilience4j.

The features, for a count-based window (the last n operations):

  • a threshold for the failure rate
  • a threshold for the slow-call rate, plus a threshold for what is considered slow
  • a minimum number of operations before the thresholds are evaluated
  • a HalfOpen state that allows a number of operations to pass; after they complete, their rates are checked against the thresholds to decide whether the breaker closes or goes back to Open
  • a wait time after which the breaker goes from Open to HalfOpen

(A configuration sketch covering these options follows below.)

Things that are not implemented here but are available in resilience4j:

  • a timeout for the HalfOpen state, after which the breaker goes back to Open
  • the ability to turn off the automatic transition from Open to HalfOpen
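
To make the feature list above concrete, a hypothetical configuration could look as follows; all names and defaults are illustrative (loosely mirroring resilience4j), not the PR's actual API:

```scala
import scala.concurrent.duration.*

// Illustrative only - names and defaults are assumptions, not the PR's API
case class CircuitBreakerConfig(
    failureRateThreshold: Int = 50,                         // % of failures in the window that trips the breaker
    slowCallThreshold: Int = 50,                            // % of slow calls in the window that trips the breaker
    slowCallDurationThreshold: FiniteDuration = 10.seconds, // a call slower than this counts as "slow"
    minimumNumberOfCalls: Int = 10,                         // thresholds are not evaluated below this count
    numberOfCallsInHalfOpenState: Int = 5,                  // calls let through in HalfOpen before deciding
    waitDurationOpenState: FiniteDuration = 30.seconds      // delay before Open transitions to HalfOpen
)
```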

Kamil-Lontkowski (Contributor Author):

For now this is incomplete, since I have some questions.

  1. Right now I am not keeping the results of calls made in the HalfOpen state separate. Maybe I should, because a different number of operations can push the rate more or less, depending on the difference.
  2. Should I add those 2 missing features from resilience4j? The flag seems necessary, since then we would not have to worry about a background thread. The HalfOpen timeout might be useful too, but it would complicate the process of registering operations started in the HalfOpen state, so that we don't end up with wrong metrics.
  3. Right now only a runOrDrop operation is defined. Since resilience4j always throws an exception, I am not sure what other interfaces would make sense.
  4. Would it make sense to have the ability to completely wipe the current state of the circuit breaker?

Right now only the count-based sliding window is implemented, but I wanted to discuss those questions right away.

For time-based sliding windows, I think a background loop evicting older entries from a queue, similar to the SlidingWindow RateLimiterAlgorithm, should be sufficient.
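
A minimal sketch of that eviction idea, assuming timestamped results in a queue and a periodic background task (all names are illustrative):

```scala
import scala.collection.mutable
import scala.concurrent.duration.*

// A sketch only: results are enqueued with timestamps; a background loop
// periodically drops entries that fell out of the time window.
class TimeWindowSketch(window: FiniteDuration):
  private val entries = mutable.Queue.empty[(Long, Boolean)] // (nanoTime, failed)

  def register(failed: Boolean): Unit = synchronized {
    entries.enqueue((System.nanoTime(), failed))
  }

  // intended to be called periodically from a background fork
  def evictOld(): Unit = synchronized {
    val cutoff = System.nanoTime() - window.toNanos
    while entries.nonEmpty && entries.head._1 < cutoff do entries.dequeue()
  }
```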

if numOfOperations >= minimumNumberOfCalls && (failuresRate >= failureRateThreshold || slowRate >= slowCallThreshold) then
  // Start schedule to switch to HalfOpen after waitDurationOpenState passed
  forkDiscard(
    scheduled(ScheduledConfig[Throwable, Unit](Schedule.InitialDelay(waitDurationOpenState)))(
Kamil-Lontkowski (Contributor Author):

Effects in updateAndGet should be avoided, but I think scheduling the state change twice (in case the function is reapplied) should not break things, especially since the breaker stays open for some time. But maybe it can create a race condition, where we complete enough calls in the HalfOpen state to change it to Closed, only for the second scheduled task to complete and change it back to HalfOpen.

Member:

Hm, this looks fragile. I think we should change the design a little. An actor seems like a natural choice here: the actor state, that is, the state machine, would then always be accessed from a single thread. We could send the results of operation invocations to the actor, and based on them, the internal state would be updated.

One place where we'd have to deviate from the actor pattern is checking whether the circuit breaker is open: always going through the actor would be a bottleneck, as all operations would have to synchronize on a single thread. So the actor would also have to update some shared mutable state, which every thread could quickly check (not necessarily immediately consistent; it's fine if we let through one or two additional operations when the CB is closing).
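
A minimal sketch of the split described above, deliberately using plain JDK primitives rather than ox's Actor API (all names are illustrative): a single thread owns the state machine, while an atomic snapshot serves the fast open-check.

```scala
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicReference

enum CBState:
  case Closed, Open, HalfOpen

class ActorPatternSketch:
  // fast-path snapshot: any thread may read it without going through the actor;
  // it may be slightly stale, which is acceptable per the discussion above
  private val snapshot = new AtomicReference(CBState.Closed)
  private val inbox = new LinkedBlockingQueue[Boolean]() // true = failure

  def isCallPermitted: Boolean = snapshot.get() != CBState.Open
  def registerResult(failed: Boolean): Unit = inbox.put(failed)

  // the single "actor" thread: owns the state machine, so no locks are needed here
  def run(): Unit =
    var failuresInRow = 0
    while true do
      failuresInRow = if inbox.take() then failuresInRow + 1 else 0
      if failuresInRow >= 5 then snapshot.set(CBState.Open) // publish for fast checks
```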

Member:

Then each circuit breaker state could be modelled as an immutable value, so we'd only have one top-level mutable overall state, managed by the actor

Kamil-Lontkowski (Contributor Author):

In the last commits I changed to this approach, but in the second test it seems that on my machine the circuit breaker allows 26-27 operations to run, while it should open after 10. If I change the actor's buffer capacity to 100, it is able to complete all operations before it opens. When f takes 10ms, only one operation slips through. I think with the actor I can get rid of AtomicCircularBuffer; maybe it creates enough overhead that "fast" operations are able to start faster than the state changes.

Kamil-Lontkowski (Contributor Author):

I also need the actorRef to schedule a state update from within the state machine, but I can't seem to figure out how to pass it, since they depend on each other.
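
One common way to break such a cycle is a deferred self-reference, completed once the actor exists. A minimal sketch with hypothetical stand-in types (ActorRef and StateMachine here are illustrative, not the PR's actual definitions):

```scala
import java.util.concurrent.CompletableFuture

// hypothetical stand-ins for the PR's types
trait ActorRef[T]:
  def tell(f: T => Unit): Unit

class StateMachine(self: () => ActorRef[StateMachine]):
  // the deferred reference is only dereferenced later, when scheduling fires
  def scheduleHalfOpen(): Unit = self().tell(_.onHalfOpen())
  def onHalfOpen(): Unit = ()

def wire(createActor: StateMachine => ActorRef[StateMachine]): ActorRef[StateMachine] =
  val deferred = new CompletableFuture[ActorRef[StateMachine]]()
  val machine = StateMachine(() => deferred.get())
  val actor = createActor(machine) // the actor now exists...
  deferred.complete(actor)         // ...so the machine's self-reference can be completed
  actor
```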

adamw (Member) commented Jan 10, 2025:

As for the questions:

  1. Hm, I don't know - how do circuit breakers normally work? Maybe you could find some articles describing typical designs and link to them here or in the comments; I think we should aim for whatever is "industry standard".
  2. Again, that's a question of how circuit breakers work in general. Intuitively, the half-open state should transition to open/closed after certain conditions are met, but I'm sure there are plenty of edge cases to consider.
  3. Yeah, runOrDrop is fine.
  4. I think any use case for wiping can be served by simply creating a new CB, so let's omit wiping for now.

Kamil-Lontkowski (Contributor Author) commented Jan 13, 2025:

Answers based on what is available in Pekko/Akka, Monix, rezilience (ZIO), circuit (cats-effect) and Polly (C#):

  1. resilience4j provides more configuration in this regard than the other libs. I think it would be best to calculate the metrics for different states separately; it is more intuitive how the rates are calculated. The other libs allow just one call, and if it succeeds, the breaker closes.
  2. This also seems like extra flexibility on resilience4j's side.

Monix, circuit and Pekko/Akka work exactly the same: they count failures (or slow calls) in a row, not a rate based on a window, then wait before going to half-open, and then decide based on a single operation's result. The wait duration before transitioning to half-open is configured as a backoff.

rezilience provides maxFailures in a row, just like Monix, and also a count-based sliding window. It supports different schedules for waiting before going to the half-open state, and it also allows only one call to decide whether the breaker goes back to open or closed.

Polly is a little different, but only in a few cases. It provides threshold rates for a sampling window of some duration; as I understand it, this effectively means a sliding window (but maybe it is simpler and works just like a fixed window). It also supports a minimum number of calls in a sample before the breaker can trip, allows dynamically determining the break duration before switching to half-open, and provides the ability to set the state manually and to read the current state through CircuitBreakerStateProvider.

ZIO, resilience4j and rezilience also provide the ability to consume different events, like state changes or metrics.

case Slow

case class Metrics(
    failureRate: Int,
Member:

it would be good to name it, or at least add a comment, to be more precise on what "rate" is - I assume calls per second? or another unit? :)

Kamil-Lontkowski (Contributor Author):

It's the percentage of failed calls in the last window. I wanted to move the state logic to a parent trait and leave only the metrics calculation to the different implementations.

)

enum SlidingWindow:
  case CountBased(windowSize: Int)
Member:

these Ints are quite opaque ... maybe we could use type aliases or opaque types to make them more meaningful?
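
For illustration, a sketch of the suggestion using an opaque type (names are assumptions, not the PR's code):

```scala
// an illustration of the suggestion: a validated opaque type instead of a bare Int
opaque type WindowSize = Int

object WindowSize:
  def apply(n: Int): WindowSize =
    require(n > 0, "window size must be positive")
    n

extension (ws: WindowSize) def toInt: Int = ws

enum SlidingWindow:
  case CountBased(windowSize: WindowSize)
```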

end if
end runOrDrop

def runEitherOrDrop[E, T](resultPolicy: ResultPolicy[E, T] = ResultPolicy.default[E, T])(
Member:

shouldn't this just delegate to runOrDrop using the either error mode?

Also, we need the "basic" case of runOrDrop(op: => T): T, where failures are exceptions

Kamil-Lontkowski (Contributor Author):

I don't really know why I didn't do that, will fix.
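
A sketch of what that delegation could look like; the generic runOrDrop signature, the use of EitherMode, and the Option-returning result type are assumptions here, loosely mirroring ox's retry/error-mode API:

```scala
// Assumed generic variant, parametrized with an ErrorMode (EitherMode and ResultPolicy
// come from ox's retry/error-mode support):
//   def runOrDrop[E, F[_], T](em: ErrorMode[E, F])(resultPolicy: ResultPolicy[E, T])(op: => F[T]): Option[F[T]]

def runEitherOrDrop[E, T](resultPolicy: ResultPolicy[E, T] = ResultPolicy.default[E, T])(
    op: => Either[E, T]
): Option[Either[E, T]] =
  runOrDrop(EitherMode[E])(resultPolicy)(op)

// The "basic" case suggested above, where failures are exceptions; whether a dropped
// call should return Option[T] or throw was still open at this point in the discussion.
def runOrDrop[T](op: => T): Option[T] =
  runEitherOrDrop[Throwable, T]() {
    try Right(op)
    catch case e: Throwable => Left(e)
  }.map(_.fold(e => throw e, identity))
```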

if acquiredResult.acquired then
  val before = System.nanoTime()
  val result = op
  val after = System.nanoTime()
Member:

we should have a timed utility method in ox, I think we have such time measurements in a couple of places

Kamil-Lontkowski (Contributor Author):

Honestly, I don't think we do. We have a couple of similar methods as test utilities.
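
For reference, such a utility could be as simple as the following sketch (hypothetical; per the reply above, ox only had similar helpers as test utilities at this point):

```scala
import scala.concurrent.duration.*

// measures the wall-clock duration of op and returns it together with the result
def timed[T](op: => T): (FiniteDuration, T) =
  val before = System.nanoTime()
  val result = op
  val after = System.nanoTime()
  ((after - before).nanos, result)
```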

val after = System.nanoTime()
val duration = (after - before).nanos
// Check the result against the result policy
if em.isError(result) && resultPolicy.isWorthRetrying(em.getError(result)) then
Member:

I'm wondering if resultPolicy applies here. If it's an error, it's an error. Does it matter if the error is retryable when it comes to CB?

private val callResults: Array[Option[CircuitBreakerResult]] = Array.fill[Option[CircuitBreakerResult]](windowSize)(None)
private var writeIndex = 0

private var _state: CircuitBreakerState = CircuitBreakerState.Closed
Member:

this is accessed externally, so it needs to be concurrency-protected? with a large disclaimer that although this does break the actor's internal state protection, we know what we're doing ;)

Kamil-Lontkowski (Contributor Author):

I was wondering if it needs to be atomic. We read concurrently but write only from the same actor's thread

Member:

Concurrent reads need to be protected as well. If you don't know why, it might be good to read up on memory barriers, Java's memory model and concurrency primitives :)
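
Concretely, the minimal protection being suggested is a memory barrier on the field, e.g. (a sketch against the PR's field shown above):

```scala
// without @volatile there is no happens-before edge between the actor thread's write
// and other threads' reads, so readers may see a stale state indefinitely
@volatile private var _state: CircuitBreakerState = CircuitBreakerState.Closed
```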

def state: CircuitBreakerState = _state

def registerResult(result: CircuitBreakerResult, acquired: AcquireResult): Unit =
  callResults(writeIndex) = Some(result)
Member:

in fact, it might be good to separate the "pure" state machine update function - which takes the current state + call result as parameters, and outputs the new state + optional self-callbacks to run (?) - from the mutable actor state, which can be queried externally.
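
A sketch of that separation, using illustrative stand-ins for the PR's types: the pure function maps the current state and observed metrics to the next state plus an optional self-callback (here, scheduling the Open -> HalfOpen transition):

```scala
import scala.concurrent.duration.*

enum CBState:
  case Closed, Open, HalfOpen

// pure: current state + observed metrics in, next state + optional callback out;
// the actor applies the result to its single mutable field and runs the callback
def update(
    state: CBState,
    failureRatePct: Int,
    completedCalls: Int,
    threshold: Int,
    minCalls: Int,
    openFor: FiniteDuration
): (CBState, Option[FiniteDuration]) =
  state match
    case CBState.Closed if completedCalls >= minCalls && failureRatePct >= threshold =>
      (CBState.Open, Some(openFor)) // the actor should schedule Open -> HalfOpen after openFor
    case other => (other, None)
```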

import scala.concurrent.duration.*
import ox.*

class CircuitBreakerTest extends AnyFlatSpec with Matchers:
Member:

probably too late to do proper TDD, but it might be a good idea to invest some time into a strong test suite, where all (or most) tests fail initially, but which tests the various configuration options that we want to support (thresholds, opening/closing, count-based and time-based windows, etc.)

Kamil-Lontkowski (Contributor Author):

I wanted to test the state machine separately, so now that the interface is more or less clarified, it should be easier to write those tests.

Member:

Ah yes, this might be a good idea - but I would then double down on separating the "pure" and side-effecting/mutating parts, and write a lot of tests for the pure part, plus some "integration" ones for the whole package
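
Tests for the pure part could then be plain function assertions; a sketch in the PR's ScalaTest style, reusing the illustrative update function and CBState enum from the earlier sketch:

```scala
import scala.concurrent.duration.*
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class StateMachineTest extends AnyFlatSpec with Matchers:
  "the state machine" should "open once the failure rate crosses the threshold" in {
    val (next, callback) =
      update(CBState.Closed, failureRatePct = 60, completedCalls = 10, threshold = 50, minCalls = 10, openFor = 1.second)
    next shouldBe CBState.Open
    callback shouldBe Some(1.second) // the actor is asked to schedule Open -> HalfOpen
  }

  it should "stay closed below the minimum number of calls" in {
    val (next, callback) =
      update(CBState.Closed, failureRatePct = 100, completedCalls = 3, threshold = 50, minCalls = 10, openFor = 1.second)
    next shouldBe CBState.Closed
    callback shouldBe None
  }
```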

adamw (Member) commented Jan 14, 2025:

> Answers based on what is available in Pekko/Akka, Monix, rezilience (ZIO), circuit (cats-effect) and Polly (C#)

Good analysis, thanks :) So based on that, what design would you propose? What would be the configuration options, and what would be the algorithm for transitioning between the closed/half-open/open states?

Not sure if we need both count-based and windowed variants - isn't the count-based variant a windowed variant, but with window duration = Inf?

Kamil-Lontkowski (Contributor Author):

There is a difference: if we just treated count-based as a sliding window with an infinite duration, we would always have to count all results. The window size defines how many of the last n operations we want to include in the metrics. I wanted to move all the state machine logic to a base trait, so that the only difference between implementations would be how we calculate the metrics.

If we keep both variants, we can have all the functionalities (maybe apart from the ability to consume events). Given proper arguments, we can mimic Pekko's and Monix's behavior exactly. I am only debating whether we would want to support an infinite schedule for those durations, but I would rather have a proper implementation of all the other functionalities first, and then see if it fits.

adamw (Member) commented Jan 15, 2025:

But in a count-based approach, you're counting all results anyway?

Kamil-Lontkowski (Contributor Author):

Yeah, but callResults is a very basic implementation of a CircularBuffer, so we count over at most the last n call results and don't hold more results in memory than we need. The writeIndex is incremented when a result is registered.
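
A minimal sketch of that bookkeeping, simplified to a success/failure flag: only the last windowSize results are retained, and the write index wraps on registration:

```scala
class CountWindowSketch(windowSize: Int):
  // None = slot not yet filled; Some(true) = failed call
  private val callResults = Array.fill[Option[Boolean]](windowSize)(None)
  private var writeIndex = 0

  def register(failed: Boolean): Unit =
    callResults(writeIndex) = Some(failed)
    writeIndex = (writeIndex + 1) % windowSize // wrap around: old results are overwritten

  def failureRatePct: Int =
    val completed = callResults.flatten
    if completed.isEmpty then 0
    else completed.count(identity) * 100 / completed.length
```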
