mirror of https://github.com/fluxcd/flux2.git
Add proposal for adding a gating mechanism to Flux
Signed-off-by: Stefan Prodan <stefan.prodan@gmail.com>pull/3158/head
parent
b8fd46d0df
commit
650bea497f
@ -0,0 +1,252 @@
|
||||
# RFC-XXXX Gating Flux reconciliation
|
||||
|
||||
**Status:** provisional
|
||||
|
||||
**Creation date:** 2022-09-28
|
||||
|
||||
**Last update:** 2022-09-28
|
||||
|
||||
## Summary
|
||||
|
||||
Flux should offer a mechanism for cluster admins and other teams involved in the release process
|
||||
to manually approve the rollout of changes onto clusters. In addition, Flux should offer
|
||||
a way to define maintenance time windows and other time-based gates, to allow a better control
|
||||
of applications and infrastructure changes to critical system.
|
||||
|
||||
## Motivation
|
||||
|
||||
Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and
|
||||
automatically reconciles the changes onto clusters as described with Flux Kustomizations and HelmReleases.
|
||||
The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered
|
||||
to production by reviewing and approving the proposed changes in a collaborative manner with pull request.
|
||||
Once a pull request is merged onto a branch that defines the desired state of the production system,
|
||||
Flux kicks off the reconciliation process.
|
||||
|
||||
There are situations when users want to have a gating mechanism after the desired state changes are merged in Git:
|
||||
|
||||
- Manual approval of container image updates (e.g. https://github.com/fluxcd/flux2/discussions/870)
|
||||
- Manual approval of infrastructure upgrades (e.g. https://github.com/fluxcd/flux2/issues/959)
|
||||
- Maintenance window (e.g. https://github.com/fluxcd/flux2/discussions/1004)
|
||||
- Planned releases
|
||||
- No Deploy Friday
|
||||
|
||||
### Goals
|
||||
|
||||
- Offer a dedicated API for defining time-based gates in a declarative manner.
|
||||
- Introduce a `gating-controller` in the Flux suite that manages the `Gate` objects.
|
||||
- Extend the current Flux APIs and controllers to support gating.
|
||||
|
||||
### Non-Goals
|
||||
|
||||
<!--
|
||||
What is out of scope for this RFC? Listing non-goals helps to focus discussion
|
||||
and make progress.
|
||||
-->
|
||||
|
||||
## Proposal
|
||||
|
||||
In order to support manual gating, Flux could be extended with a dedicated API and controller
|
||||
that would allow users to define `Gate` objects and perform operations like `open` and `close`.
|
||||
|
||||
A `Gate` object could be referenced in sources (Buckets, Git, Helm, OCI Repositories)
|
||||
and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation)
|
||||
to block the reconciliation until the gate is opened.
|
||||
|
||||
A `Gate` can be opened or closed by annotating the object with a timestamp or by
|
||||
calling a specific webhook receiver exposed by notification-controller.
|
||||
|
||||
A `Gate` can be configured to automatically close or open based on a time window defined in the `Gate` spec.
|
||||
|
||||
The `Gate` API would replace Flagger's current
|
||||
[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).
|
||||
|
||||
### User Stories
|
||||
|
||||
> As a member of the SRE team, I want to allow deployments to happen only
|
||||
> in a particular time frame of my own choosing.
|
||||
|
||||
Define a gate that automatically closes after 1h from the time it has been opened:
|
||||
|
||||
```yaml
|
||||
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
||||
kind: Gate
|
||||
metadata:
|
||||
name: sre-approval
|
||||
namespace: flux-system
|
||||
spec:
|
||||
interval: 30s
|
||||
default: closed
|
||||
window: 1h
|
||||
```
|
||||
|
||||
When the gate is created in-cluster, the `gating-controller` uses `spec.default` to set the `Opened` condition:
|
||||
|
||||
```yaml
|
||||
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
||||
kind: Gate
|
||||
metadata:
|
||||
name: sre-approval
|
||||
namespace: flux-system
|
||||
status:
|
||||
conditions:
|
||||
- lastTransitionTime: "2021-03-26T10:09:26Z"
|
||||
message: "Gate closed by default"
|
||||
reason: ReconciliationSucceeded
|
||||
status: "False"
|
||||
type: Opened
|
||||
```
|
||||
|
||||
While the gate is closed, all the objects that reference it will wait for an approval:
|
||||
|
||||
```yaml
|
||||
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
|
||||
kind: Kustomization
|
||||
metadata:
|
||||
name: my-app
|
||||
namespace: flux-system
|
||||
spec:
|
||||
gates:
|
||||
- name: sre-approval
|
||||
- name: qa-approval
|
||||
status:
|
||||
conditions:
|
||||
- lastTransitionTime: "2021-03-26T10:09:26Z"
|
||||
message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed."
|
||||
reason: GateClosed
|
||||
status: "False"
|
||||
type: Approved
|
||||
```
|
||||
|
||||
The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook:
|
||||
|
||||
```sh
|
||||
kubectl -n flux-system annotate --overwrite gate/sre-approval \
|
||||
open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
|
||||
```
|
||||
|
||||
The `gating-controller` extracts the ISO8601 date from the `open.gate` annotation value,
|
||||
sets the `requestedAt` & `resetToDefaultAt`, and opens the gate for the specified window:
|
||||
|
||||
```yaml
|
||||
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
||||
kind: Gate
|
||||
metadata:
|
||||
name: sre-approval
|
||||
namespace: flux-system
|
||||
status:
|
||||
requestedAt: "2021-03-26T10:00:00Z"
|
||||
resetToDefaultAt: "2021-03-26T11:00:00Z"
|
||||
conditions:
|
||||
- lastTransitionTime: "2021-03-26T10:00:00Z"
|
||||
message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
|
||||
reason: ReconciliationSucceeded
|
||||
status: "True"
|
||||
type: Opened
|
||||
```
|
||||
|
||||
While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval.
|
||||
|
||||
The SRE can decide to close the gate ahead of its schedule with:
|
||||
|
||||
```sh
|
||||
kubectl -n flux-system annotate --overwrite gate/sre-approval \
|
||||
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
|
||||
```
|
||||
|
||||
The `gating-controller` extracts the ISO8601 date from the `close.gate` annotation value,
|
||||
compares it with the `open.gate` & `requestedAt` date and closes the gate:
|
||||
|
||||
```yaml
|
||||
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
||||
kind: Gate
|
||||
metadata:
|
||||
name: sre-approval
|
||||
namespace: flux-system
|
||||
status:
|
||||
requestedAt: "2021-03-26T10:10:00Z"
|
||||
resetToDefaultAt: "2021-03-26T10:10:00Z"
|
||||
conditions:
|
||||
- lastTransitionTime: "2021-03-26T10:10:00Z"
|
||||
message: "Gate close requested"
|
||||
reason: ReconciliationSucceeded
|
||||
status: "False"
|
||||
type: Opened
|
||||
```
|
||||
|
||||
The objects that are referencing this gate, will finish their ongoing reconciliation (if any) then pause.
|
||||
|
||||
> As a member of the SRE team, I want to block deployments in a particular time window.
|
||||
|
||||
To enforce a maintenance window of 24 hours, you can define a `Gate` that's opened by default:
|
||||
|
||||
```yaml
|
||||
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
||||
kind: Gate
|
||||
metadata:
|
||||
name: maintenance
|
||||
namespace: flux-system
|
||||
spec:
|
||||
interval: 30s
|
||||
default: opened
|
||||
window: 24h
|
||||
```
|
||||
|
||||
To start the maintenance window you can annotate the gate with:
|
||||
|
||||
```sh
|
||||
kubectl -n flux-system annotate --overwrite gate/maintenance \
|
||||
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
|
||||
```
|
||||
|
||||
The `gating-controller` extracts the ISO8601 date from the `close.gate`
|
||||
annotation value and closes the gate for the specified window:
|
||||
|
||||
```yaml
|
||||
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
||||
kind: Gate
|
||||
metadata:
|
||||
name: maintenance
|
||||
namespace: flux-system
|
||||
status:
|
||||
requestedAt: "2021-03-26T10:00:00Z"
|
||||
resetToDefaultAt: "2021-03-27T10:00:00Z"
|
||||
conditions:
|
||||
- lastTransitionTime: "2021-03-26T10:00:00Z"
|
||||
message: "Gate scheduled for opening at 2021-03-27T11:00:00Z"
|
||||
reason: ReconciliationSucceeded
|
||||
status: "False"
|
||||
type: Opened
|
||||
```
|
||||
|
||||
You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.
|
||||
|
||||
### Alternatives
|
||||
|
||||
<!--
|
||||
List plausible alternatives to the proposal and explain why the proposal is superior.
|
||||
|
||||
This is a good place to incorporate suggestions made during discussion of the RFC.
|
||||
-->
|
||||
|
||||
## Design Details
|
||||
|
||||
<!--
|
||||
This section should contain enough information that the specifics of your
|
||||
change are understandable. This may include API specs and code snippets.
|
||||
|
||||
The design details should address at least the following questions:
|
||||
- How can this feature be enabled / disabled?
|
||||
- Does enabling the feature change any default behavior?
|
||||
- Can the feature be disabled once it has been enabled?
|
||||
- How can an operator determine if the feature is in use?
|
||||
- Are there any drawbacks when enabling this feature?
|
||||
-->
|
||||
|
||||
## Implementation History
|
||||
|
||||
<!--
|
||||
Major milestones in the lifecycle of the RFC such as:
|
||||
- The first Flux release where an initial version of the RFC was available.
|
||||
- The version of Flux where the RFC graduated to general availability.
|
||||
- The version of Flux where the RFC was retired or superseded.
|
||||
-->
|
Loading…
Reference in New Issue