# RFC-0006 Alternative Suspend Control

**Status:** provisional

**Creation date:** 2023-09-20

**Last update:** 2023-10-18

## Summary

This RFC proposes an alternative method to indicate the suspended state of
suspendable resources to Flux controllers through object metadata. It presents
an annotation key that can be used to suspend a resource from reconciliation as
an alternative to the `.spec.suspend` field. It does not address the
deprecation of this field from the resource APIs. This annotation can
optionally act as a vehicle for communicating contextual information about the
suspended resource to users.

## Motivation

The current implementation of suspending a resource from reconciliation uses
the `.spec.suspend` field. A change to this field increments the object's
generation number, which can be confusing when diffing.

Teams may wish to communicate information about the suspended resource, such as
the reason for the suspension, in the object itself.

### Goals

The Flux reconciliation loop will support recognizing a resource's suspend
status from either the API field or the designated metadata annotation key.
The Flux CLI will similarly recognize this state in `get` commands, but the
`suspend` command will alter only the metadata. The `resume` command will
still alter the API field and will additionally alter the metadata. The Flux
CLI will optionally support setting the suspend metadata annotation value to a
user-supplied string for a contextual message.

### Non-Goals

The deprecation plan for the `.spec.suspend` field is out of scope for this
RFC.

## Proposal

Register a Flux resource metadata key `reconcile.fluxcd.io/suspended` with a
suspend semantic to be interpreted by controllers and manipulated by the CLI.
The presence of the annotation key is an alternative to the `.spec.suspend` API
field setting when considering whether a resource is suspended. The annotation
key is set by a `flux suspend` command and removed by a `flux resume` command.
The annotation value is free-form and can carry a message or reason for the
object's suspension. The value can be set using a `--message` flag to the
`suspend` command.

### User Stories

#### Suspend/Resume without Generation Roll

Currently, when a resource is suspended or resumed, the `.spec.suspend` field
is mutated. This increments the `.metadata.generation` field and, after a
successful reconciliation, the `.status.observedGeneration` number. The
community believes that a generation change for this reason is not in
alignment with GitOps principles. In more detail, upon suspension the
generation increments but the observed generation lags behind, because
reconciliation does not complete successfully.

The Flux controllers should recognize that a resource is suspended or
unsuspended from the presence of a special metadata key. This key can be
added, removed, or changed without patching the object in such a way that the
generation number increments.

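As a rough illustration of a metadata-only change, the sketch below builds JSON merge patches that touch nothing but `.metadata.annotations`. It assumes the usual Kubernetes behavior for custom resources with a status subresource, where changes confined to `.metadata` do not increment `.metadata.generation`. The `annotationPatch` helper is hypothetical and not part of any Flux package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SuspendedAnnotation is the annotation key proposed by this RFC.
const SuspendedAnnotation = "reconcile.fluxcd.io/suspended"

// annotationPatch (hypothetical helper) builds a JSON merge patch that only
// touches .metadata.annotations. A nil value is encoded as null, which in a
// JSON merge patch removes the key.
func annotationPatch(value *string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]interface{}{
				SuspendedAnnotation: value,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	reason := "incident response"

	suspend, err := annotationPatch(&reason)
	if err != nil {
		panic(err)
	}
	resume, err := annotationPatch(nil)
	if err != nil {
		panic(err)
	}

	fmt.Println(string(suspend)) // patch that sets the annotation
	fmt.Println(string(resume))  // patch that removes the annotation
}
```

Neither patch body contains a `spec` key, which is the property this user story relies on.
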
#### Seeing Suspend State

Users should be able to see the effective suspend state of the resource with a
`flux get` command. The display should mirror what the controllers interpret
the suspend state to be. This story is included to capture current
functionality that should be preserved.

#### Suspend with a Reason

Often there is a purpose behind suspending a resource with the Flux CLI,
whether it be during incident response, source manifest cutovers, or various
other scenarios. The `flux diff` command provides an illustrative UX for
determining what will change if a suspended resource is resumed, but neither it
nor `flux get` helps explain _why_ something is paused or when it would be OK
to resume reconciliation. On distributed teams this can become a point of
friction, as it needs to be communicated among group stakeholders.

Flux users should have a way to succinctly signal to other users why a resource
is suspended on the resource itself.

#### Suspend without Cluster Access

How do users without cluster access ensure the application is suspended?

* A validated spec field `.spec.suspend` is typesafe and can be trusted to
  suspend a resource from reconciliation.

* Logs and metrics can reveal the suspend status for confirmation. Logs are
  not ideal for this use case. Metrics may be the only safe way to
  confirm an object is suspended without cluster access.

What other options are there?

* The existence of the `reconcile.fluxcd.io/suspended` metadata annotation is
  not typesafe and not a trustworthy way to suspend. It becomes more valid
  when reported by the CLI, by a controller metric/log/event, or by object
  status.

* The emission of an event from a controller upon a suspend or resume
  transition.

* The update of the object status with an indication of the suspended status.

### Alternatives

#### More `.spec`

The existing `.spec.suspend` could be expanded with fields for the above
semantics. This would drive more generation number changes and would require a
change to the APIs.

## Design Details

Implementing this RFC would involve the controllers and the CLI.

This feature would create an alternate path to suspending an object and would
not violate the current APIs.

### Common

The `reconcile.fluxcd.io/suspended` annotation key string and a getter function
would be made available for controllers and the CLI to recognize and manipulate
the suspend object metadata.

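A minimal sketch of what such a shared constant and getter might look like follows. The names `SuspendedAnnotation` and `GetSuspended` are assumptions made for illustration; the RFC does not fix the identifiers or the package they would live in.

```go
package main

import "fmt"

// SuspendedAnnotation is the annotation key proposed by this RFC.
// The Go identifier is an assumption for this sketch.
const SuspendedAnnotation = "reconcile.fluxcd.io/suspended"

// GetSuspended (hypothetical name) returns the annotation value and whether
// the key is present. Presence alone signals suspension, so an empty value
// still counts as suspended.
func GetSuspended(annotations map[string]string) (string, bool) {
	msg, ok := annotations[SuspendedAnnotation]
	return msg, ok
}

func main() {
	annotations := map[string]string{
		SuspendedAnnotation: "source manifest cutover in progress",
	}

	if msg, ok := GetSuspended(annotations); ok {
		fmt.Printf("suspended: %q\n", msg)
	}

	// Reading from a nil map is safe in Go and reports the key as absent.
	_, ok := GetSuspended(nil)
	fmt.Println("nil annotations suspended:", ok)
}
```

Returning the presence flag separately from the value keeps the "key exists" semantic distinct from the optional message.
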
### Controllers

Flux controllers would skip reconciling a resource based on an `OR` of (1) the
API field `.spec.suspend` and (2) the existence of the suspend metadata
annotation key. This would be implemented in the controller predicates to
completely skip any reconciliation cycle of suspended objects.

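The `OR` rule can be sketched as below. `Resource` is a hypothetical, pared-down stand-in for a Flux object, and `IsSuspended` and `ShouldReconcile` are illustrative names; a real implementation would sit in the controllers' event predicates rather than in plain functions.

```go
package main

import "fmt"

// SuspendedAnnotation is the annotation key proposed by this RFC.
const SuspendedAnnotation = "reconcile.fluxcd.io/suspended"

// Resource is a hypothetical stand-in holding only the two inputs
// the suspend decision depends on.
type Resource struct {
	Annotations map[string]string // .metadata.annotations
	SpecSuspend bool              // .spec.suspend
}

// IsSuspended returns true when either the spec field is set or the
// annotation key exists, regardless of the annotation's value.
func IsSuspended(r Resource) bool {
	_, annotated := r.Annotations[SuspendedAnnotation]
	return r.SpecSuspend || annotated
}

// ShouldReconcile is the filter a predicate would apply before an
// object is handed to the reconciler.
func ShouldReconcile(r Resource) bool {
	return !IsSuspended(r)
}

func main() {
	cases := []Resource{
		{},
		{SpecSuspend: true},
		{Annotations: map[string]string{SuspendedAnnotation: "incident response"}},
	}
	for _, c := range cases {
		fmt.Printf("spec=%v annotations=%v reconcile=%v\n",
			c.SpecSuspend, c.Annotations, ShouldReconcile(c))
	}
}
```
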
### CLI

The `get` command would recognize the suspend state from the union of the
`.spec.suspend` field and the presence of the suspended annotation.

The `suspend` command would add the suspend annotation but forgo modifying the
`.spec.suspend` field.

The `resume` command would remove the suspend annotation and set the
`.spec.suspend` field to `false`.

The suspend annotation would by default be set to a generic value. An optional
CLI flag (e.g. `--message`) would support setting the suspended annotation
value to a user-specified string.

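The three command behaviors above might be modeled as in the following sketch. It operates on an in-memory `Resource` stand-in rather than on the cluster, and `Suspend`, `Resume`, `EffectiveSuspended`, and the `defaultMessage` value are all assumptions; the RFC does not specify what the generic default value is.

```go
package main

import "fmt"

// SuspendedAnnotation is the annotation key proposed by this RFC.
const SuspendedAnnotation = "reconcile.fluxcd.io/suspended"

// defaultMessage is a placeholder for the unspecified generic default value.
const defaultMessage = "suspended"

// Resource is a hypothetical stand-in for a Flux object.
type Resource struct {
	Annotations map[string]string // .metadata.annotations
	SpecSuspend bool              // .spec.suspend
}

// Suspend models `flux suspend`: it sets the annotation, optionally with a
// user-supplied message, and leaves .spec.suspend untouched.
func Suspend(r *Resource, message string) {
	if message == "" {
		message = defaultMessage
	}
	if r.Annotations == nil {
		r.Annotations = map[string]string{}
	}
	r.Annotations[SuspendedAnnotation] = message
}

// Resume models `flux resume`: it removes the annotation and also sets
// .spec.suspend to false.
func Resume(r *Resource) {
	delete(r.Annotations, SuspendedAnnotation) // no-op on a nil map
	r.SpecSuspend = false
}

// EffectiveSuspended models what `flux get` would display: the union of
// the spec field and the presence of the annotation.
func EffectiveSuspended(r Resource) bool {
	_, annotated := r.Annotations[SuspendedAnnotation]
	return r.SpecSuspend || annotated
}

func main() {
	r := &Resource{}

	Suspend(r, "db migration in progress")
	fmt.Println(EffectiveSuspended(*r), r.Annotations[SuspendedAnnotation])

	Resume(r)
	fmt.Println(EffectiveSuspended(*r))
}
```

Because `Resume` clears both signals, a resource that was suspended through either path ends up unsuspended after a single `resume`.
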
## Breaking Changes - Version Skew and Suspend Honoring

An edge case exists under these proposed changes with regard to suspending
objects using a new version of the CLI while the controllers are running older
versions. Specifically, the user suspends the object with the CLI, which adds
the suspend annotation but leaves the `.spec.suspend` field unmodified. The
user sees in the CLI output that the object is suspended. The controllers,
however, do not recognize that the object is suspended.

A potential scenario where this case becomes very damaging is during Git
repository refactoring, where users suspend objects, relocate the manifest
sources and related references, and resume. The operation is meant to be a
no-op. However, with such a version skew and `Kustomizations` that have
`.spec.prune` enabled, major workload disruption could occur.

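The skew can be made concrete with a small sketch that compares a check consulting only `.spec.suspend`, standing in for an older controller, against the combined check proposed here. Both functions and the `Resource` type are hypothetical models, not code from any Flux release.

```go
package main

import "fmt"

// SuspendedAnnotation is the annotation key proposed by this RFC.
const SuspendedAnnotation = "reconcile.fluxcd.io/suspended"

// Resource is a hypothetical stand-in for a Flux object.
type Resource struct {
	Annotations map[string]string // .metadata.annotations
	SpecSuspend bool              // .spec.suspend
}

// legacySuspended models a controller that predates this RFC and only
// looks at .spec.suspend.
func legacySuspended(r Resource) bool {
	return r.SpecSuspend
}

// proposedSuspended models the check described in this RFC, as used by
// the new CLI and updated controllers.
func proposedSuspended(r Resource) bool {
	_, annotated := r.Annotations[SuspendedAnnotation]
	return r.SpecSuspend || annotated
}

func main() {
	// State left behind by a new CLI `suspend`: annotation set, spec untouched.
	r := Resource{
		Annotations: map[string]string{SuspendedAnnotation: "repo refactoring"},
	}

	fmt.Println("new CLI reports suspended:", proposedSuspended(r))
	fmt.Println("old controller honors it: ", legacySuspended(r))
}
```

The two checks disagree for exactly the state a new CLI produces, which is why the user sees a suspended object while an older controller keeps reconciling it.
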
## Implementation History

tbd