mirror of https://github.com/fluxcd/flux2.git
Add proposal for adding a gating mechanism to Flux
Signed-off-by: Stefan Prodan <stefan.prodan@gmail.com>
pull/3158/head
parent b8fd46d0df
commit 650bea497f
@@ -0,0 +1,252 @@

# RFC-XXXX Gating Flux reconciliation

**Status:** provisional

**Creation date:** 2022-09-28

**Last update:** 2022-09-28

## Summary

Flux should offer a mechanism for cluster admins and other teams involved in the release process
to manually approve the rollout of changes onto clusters. In addition, Flux should offer
a way to define maintenance time windows and other time-based gates, to allow better control
of application and infrastructure changes to critical systems.

## Motivation

Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and
automatically reconciles the changes onto clusters as described by Flux Kustomizations and HelmReleases.
The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered
to production by reviewing and approving the proposed changes collaboratively through pull requests.
Once a pull request is merged into a branch that defines the desired state of the production system,
Flux kicks off the reconciliation process.

There are situations when users want a gating mechanism after the desired state changes are merged in Git:

- Manual approval of container image updates (e.g. https://github.com/fluxcd/flux2/discussions/870)
- Manual approval of infrastructure upgrades (e.g. https://github.com/fluxcd/flux2/issues/959)
- Maintenance window (e.g. https://github.com/fluxcd/flux2/discussions/1004)
- Planned releases
- No Deploy Friday

### Goals

- Offer a dedicated API for defining time-based gates in a declarative manner.
- Introduce a `gating-controller` in the Flux suite that manages the `Gate` objects.
- Extend the current Flux APIs and controllers to support gating.

### Non-Goals

<!--
What is out of scope for this RFC? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

In order to support manual gating, Flux could be extended with a dedicated API and controller
that would allow users to define `Gate` objects and perform operations like `open` and `close`.

A `Gate` object could be referenced in sources (Buckets and Git, Helm, and OCI repositories)
and syncs (Kustomizations, HelmReleases, ImageUpdateAutomations)
to block the reconciliation until the gate is opened.

A `Gate` can be opened or closed by annotating the object with a timestamp or by
calling a specific webhook receiver exposed by notification-controller.

A `Gate` can be configured to automatically close or open based on a time window defined in the `Gate` spec.

The `Gate` API would replace Flagger's current
[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).
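
The proposal says sources could reference a `Gate` as well as syncs. A minimal sketch of what that might look like for a `GitRepository`, assuming sources would adopt the same `spec.gates` list that the `Kustomization` example in the user stories uses. The `gates` field on a source, the repository URL and the object names are assumptions made for illustration; none of them is part of a released Flux API.

```yaml
# Hypothetical: spec.gates does not exist on GitRepository today.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example/my-app  # placeholder repository
  ref:
    branch: main
  # Assumed to mirror the proposed Kustomization field: reconciliation
  # of this source would be blocked while a referenced gate is closed.
  gates:
    - name: sre-approval
```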

### User Stories

> As a member of the SRE team, I want to allow deployments to happen only
> in a particular time frame of my own choosing.

Define a gate that automatically closes 1h after it has been opened:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
spec:
  interval: 30s
  default: closed
  window: 1h
```

When the gate is created in-cluster, the `gating-controller` uses `spec.default` to set the `Opened` condition:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Gate closed by default"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

While the gate is closed, all the objects that reference it will wait for approval:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  gates:
    - name: sre-approval
    - name: qa-approval
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Reconciliation is waiting for approval, gate 'flux-system/sre-approval' is closed."
      reason: GateClosed
      status: "False"
      type: Approved
```

The SRE team can open the gate either by calling the notification-controller webhook or by annotating the gate:

```sh
kubectl -n flux-system annotate --overwrite gate/sre-approval \
  open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```
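
Because a gate is opened by annotating it, who may open it comes down to Kubernetes RBAC. The following is a sketch of a `Role` that limits an SRE group to the `sre-approval` gate. It assumes the `Gate` CRD would be served as the `gates` resource in the `gating.toolkit.fluxcd.io` group and that the cluster maps SRE users to a group named `sre`; both are assumptions, as are the object names.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sre-gate-operator            # hypothetical name
  namespace: flux-system
rules:
  - apiGroups: ["gating.toolkit.fluxcd.io"]
    resources: ["gates"]             # assumed plural of the Gate kind
    resourceNames: ["sre-approval"]  # restrict access to this one gate
    verbs: ["get", "patch"]          # what `kubectl annotate` needs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sre-gate-operator
  namespace: flux-system
subjects:
  - kind: Group
    name: sre                        # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: sre-gate-operator
  apiGroup: rbac.authorization.k8s.io
```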

The `gating-controller` extracts the ISO 8601 date from the `open.gate` annotation value,
sets `requestedAt` and `resetToDefaultAt` in the status, and opens the gate for the specified window:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-26T11:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
      reason: ReconciliationSucceeded
      status: "True"
      type: Opened
```

While the gate is open, all the objects that reference it are approved to reconcile at their configured interval.

The SRE team can close the gate ahead of schedule with:

```sh
kubectl -n flux-system annotate --overwrite gate/sre-approval \
  close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO 8601 date from the `close.gate` annotation value,
compares it with the `open.gate` annotation and the `requestedAt` date, and closes the gate:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:10:00Z"
  resetToDefaultAt: "2021-03-26T10:10:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:10:00Z"
      message: "Gate close requested"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

The objects that reference this gate will finish their ongoing reconciliation (if any), then pause.

> As a member of the SRE team, I want to block deployments in a particular time window.

To enforce a maintenance window of 24 hours, you can define a `Gate` that's opened by default:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
spec:
  interval: 30s
  default: opened
  window: 24h
```
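
Objects would opt in to the maintenance window by referencing the gate through the `spec.gates` field proposed earlier. A minimal sketch follows; the `infrastructure` name, the path and the source reference are placeholders chosen for illustration.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: infrastructure          # placeholder name
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure        # placeholder path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system           # placeholder source
  # Proposed field: reconciliation waits while the gate is closed.
  gates:
    - name: maintenance
```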

To start the maintenance window you can annotate the gate with:

```sh
kubectl -n flux-system annotate --overwrite gate/maintenance \
  close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO 8601 date from the `close.gate`
annotation value and closes the gate for the specified window:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-27T10:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for opening at 2021-03-27T10:00:00Z"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.
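
A sketch of such a CronJob, reusing the `kubectl annotate` command shown earlier for the `maintenance` gate. It assumes a container image that ships `kubectl` and a POSIX shell (the image below is a placeholder) and a `gate-operator` ServiceAccount, not defined here, that is allowed to get and patch `Gate` objects. With `window: 24h`, the gate would reset to opened roughly 24 hours after the job runs.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: no-deploy-friday        # hypothetical name
  namespace: flux-system
spec:
  # Fridays at 00:00, evaluated in the kube-controller-manager time zone
  # unless spec.timeZone is set.
  schedule: "0 0 * * FRI"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: gate-operator  # hypothetical, needs get/patch on gates
          restartPolicy: OnFailure
          containers:
            - name: close-gate
              image: registry.example.com/kubectl:latest  # placeholder image
              command:
                - /bin/sh
                - -c
                - |
                  kubectl -n flux-system annotate --overwrite gate/maintenance \
                    close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```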

### Alternatives

<!--
List plausible alternatives to the proposal and explain why the proposal is superior.

This is a good place to incorporate suggestions made during discussion of the RFC.
-->

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs and code snippets.

The design details should address at least the following questions:
- How can this feature be enabled / disabled?
- Does enabling the feature change any default behavior?
- Can the feature be disabled once it has been enabled?
- How can an operator determine if the feature is in use?
- Are there any drawbacks when enabling this feature?
-->

## Implementation History

<!--
Major milestones in the lifecycle of the RFC such as:
- The first Flux release where an initial version of the RFC was available.
- The version of Flux where the RFC graduated to general availability.
- The version of Flux where the RFC was retired or superseded.
-->