Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>
RFC-XXXX Gating Flux reconciliation
Status: provisional
Creation date: 2022-09-28
Last update: 2022-10-04
Summary
Flux should offer a mechanism for cluster admins and other teams involved in the release process to manually approve the rollout of changes onto clusters. In addition, Flux should offer a way to define maintenance time windows and other time-based gates, to allow better control of application and infrastructure changes to critical systems.
Motivation
Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and automatically reconciles the changes onto clusters as described by Flux Kustomizations and HelmReleases. The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered to production by reviewing and approving the proposed changes collaboratively through pull requests. Once a pull request is merged into a branch that defines the desired state of the production system, Flux kicks off the reconciliation process.
There are situations when users want to have a gating mechanism after the desired state changes are merged in Git:
- Manual approval of container image updates (e.g. https://github.com/fluxcd/flux2/discussions/870)
- Manual approval of infrastructure upgrades (e.g. https://github.com/fluxcd/flux2/issues/959)
- Maintenance window (e.g. https://github.com/fluxcd/flux2/discussions/1004)
- Planned releases
- No Deploy Friday
Goals
- Offer a dedicated API for defining time-based gates in a declarative manner.
- Introduce a gating-controller in the Flux suite that manages the Gate objects.
- Extend the current Flux APIs and controllers to support gating.
Non-Goals
Proposal
In order to support manual gating, Flux could be extended with a dedicated API and controller that would allow users to define Gate objects and perform operations like open and close.
A Gate object could be referenced in sources (Buckets, Git, Helm, OCI Repositories) and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation) to block the reconciliation until the gate is opened.
A Gate can be opened or closed by annotating the object with a timestamp or by calling a specific webhook receiver exposed by notification-controller.
A Gate can be configured to automatically close or open based on a time window defined in the Gate spec.
The Gate API would replace Flagger's current manual gating mechanism.
User Stories
Story 1
As a member of the SRE team, I want to allow deployments to happen only in a particular time frame of my own choosing.
Define a gate that automatically closes after 1h from the time it has been opened:
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
spec:
  interval: 30s
  default: closed
  window: 1h
When the gate is created in-cluster, the gating-controller uses spec.default to set the Opened condition:
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Gate closed by default"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
While the gate is closed, all the objects that reference it will wait for an approval:
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  gates:
    - name: sre-approval
    - name: qa-approval
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed."
      reason: GateClosed
      status: "False"
      type: Approved
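A release script that has to wait for the same approval could block on the gates' Opened condition. The following is a minimal sketch, not part of the proposal: it assumes the proposed Gate CRD is installed and reachable as the gate resource, that every referenced gate must be open before reconciliation proceeds, and the helper name wait_for_gates and the 10m timeout are made up for illustration.

```shell
# Sketch only: "gate" is the resource name of the proposed Gate CRD.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to print the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# wait_for_gates NAMESPACE GATE...
# Blocks until every listed gate reports the condition Opened=True.
wait_for_gates() (
  namespace="$1"; shift
  # Rewrite each gate name into a gate/<name> argument for kubectl.
  for name in "$@"; do
    set -- "$@" "gate/${name}"
    shift
  done
  "${KUBECTL}" wait --namespace "${namespace}" \
    --for=condition=Opened --timeout=10m "$@"
)
```

For the Kustomization above this would be called as wait_for_gates flux-system sre-approval qa-approval.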
The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook:
kubectl -n flux-system annotate --overwrite gate/sre-approval \
  open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
The gating-controller extracts the ISO8601 date from the open.gate annotation value, sets the requestedAt & resetToDefaultAt, and opens the gate for the specified window:
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-26T11:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
      reason: ReconciliationSucceeded
      status: "True"
      type: Opened
While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval.
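The relationship between requestedAt, spec.window and resetToDefaultAt in the status above can be checked by hand. This is a sketch rather than the controller's implementation: it assumes GNU date (for the -d flag), a window written in whole hours such as 1h or 24h, and a half-open interval that starts at requestedAt. The helper names are hypothetical.

```shell
# Illustrative helpers, not part of Flux.

# reset_to_default_at REQUESTED_AT WINDOW
# Adds the window (whole hours, e.g. "1h") to the requestedAt timestamp.
reset_to_default_at() (
  requested_at="$1"
  window_hours="${2%h}"   # strip the trailing "h"
  requested_epoch="$(date -u -d "${requested_at}" +%s)" || exit 1
  date -u -d "@$((requested_epoch + window_hours * 3600))" +"%Y-%m-%dT%H:%M:%SZ"
)

# within_window NOW REQUESTED_AT RESET_AT
# Succeeds while NOW falls inside [REQUESTED_AT, RESET_AT).
# UTC timestamps in the same format sort lexicographically,
# so a plain string comparison is enough.
within_window() (
  now="$1"; requested_at="$2"; reset_at="$3"
  ! expr "${now}" \< "${requested_at}" >/dev/null &&
    expr "${now}" \< "${reset_at}" >/dev/null
)

reset_to_default_at "2021-03-26T10:00:00Z" "1h"   # prints 2021-03-26T11:00:00Z
```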
The SRE can decide to close the gate ahead of its schedule with:
kubectl -n flux-system annotate --overwrite gate/sre-approval \
  close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
The gating-controller extracts the ISO8601 date from the close.gate annotation value, compares it with the open.gate & requestedAt date and closes the gate:
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:10:00Z"
  resetToDefaultAt: "2021-03-26T10:10:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:10:00Z"
      message: "Gate close requested"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
The objects that reference this gate will finish their ongoing reconciliation (if any), then pause.
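The comparison between the close and open timestamps is not spelled out in this RFC. One plausible reading, sketched below purely as an assumption, is that a close request only takes effect when it is newer than the latest open request, so a stale annotation cannot close a freshly opened gate. The function name close_takes_effect is hypothetical.

```shell
# Illustrative only. Assumption (not stated in the RFC): the newer of the
# two annotation timestamps wins.

# close_takes_effect CLOSE_REQUESTED_AT OPEN_REQUESTED_AT
# Succeeds when the close request is strictly newer than the open request.
close_takes_effect() (
  # UTC ISO8601 timestamps in the same format sort lexicographically.
  expr "$1" \> "$2" >/dev/null
)

if close_takes_effect "2021-03-26T10:10:00Z" "2021-03-26T10:00:00Z"; then
  echo "close gate"
else
  echo "ignore stale close request"
fi
```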
As a member of the SRE team, I want to block deployments in a particular time window.
To enforce a maintenance window of 24 hours, you can define a Gate that's opened by default:
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
spec:
  interval: 30s
  default: opened
  window: 24h
To start the maintenance window you can annotate the gate with:
kubectl -n flux-system annotate --overwrite gate/maintenance \
  close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
The gating-controller extracts the ISO8601 date from the close.gate annotation value and closes the gate for the specified window:
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-27T10:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for opening at 2021-03-27T10:00:00Z"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
You could also schedule "No Deploy Fridays" with a CronJob that closes the maintenance gate at 0 0 * * FRI.
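Such a CronJob might look like the sketch below, which prints a manifest that runs the close annotation command from inside the cluster. Several names are assumptions for illustration: the image registry.example.com/kubectl stands in for any image that ships kubectl and date, and the gate-closer ServiceAccount is hypothetical and would need RBAC permission to patch Gate objects in flux-system.

```shell
# Sketch only: the Gate CRD is proposed, the image and ServiceAccount are placeholders.
# The heredoc is quoted so that $(date ...) is evaluated inside the job, not here.
no_deploy_friday_manifest() {
  cat <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: no-deploy-friday
  namespace: flux-system
spec:
  schedule: "0 0 * * FRI"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: gate-closer
          restartPolicy: OnFailure
          containers:
            - name: close-gate
              image: registry.example.com/kubectl:latest
              command:
                - /bin/sh
                - -c
                - >-
                  kubectl -n flux-system annotate --overwrite gate/maintenance
                  close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
EOF
}
```

The manifest can be applied with no_deploy_friday_manifest | kubectl apply -f -. With window: 24h on the maintenance gate, the gate would return to its default (opened) state 24 hours after the job runs.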
Story 2
As a member of the SRE team, I want existing deployments to still be reconciled during a change freeze.
Gates can be used to block Flux sources from being refreshed, so that Flux continues to reconcile the existing approved desired state whilst new changes are held at the Flux source gate.
Example:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  gates:
    - name: change-freeze # gate that enforces a change freeze time window
status:
  conditions:
    - lastTransitionTime: "2022-05-26T01:12:22Z"
      message: "Reconciliation is blocked as gate 'flux-system/change-freeze' is closed."
      reason: GateClosed
      status: "True"
      type: Blocked
This would ensure that Gate changes would not impact the eventual consistency of mid-flight reconciliations that were already deployed in the cluster. Flux would also continue to re-create Flux managed objects that were manually deleted from the cluster.
Alternatives
Users to implement gating outside of Flux
Before Flux source
Users could implement their own gating mechanisms as part of their development processes ensuring that their custom rules are applied before the changes reach their Flux sources (i.e. the target Git repository). For example, if deployments are not allowed on Fridays, no PRs would be merged on those days.
The disadvantage is that some source types may not provide easy ways for users to enforce such rules. When using different source types (e.g. Git, OCI, Helm), multiple implementations may be required.
CronJobs and Flux Suspend
Users can implement a gating mechanism within Kubernetes by leveraging CronJobs and using the built-in suspend feature in Flux that allows for a Flux object to stop being reconciled until it is resumed. This alternative does not scale well when considering hundreds of Flux objects.
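As a rough illustration of this alternative, the script a CronJob would run could look like the sketch below. It relies on the existing flux suspend and flux resume commands with their --all flag; the function name set_freeze and the namespace arguments are made up for the example.

```shell
# Sketch of the CronJob + suspend alternative.
# FLUX can be overridden (e.g. FLUX=echo) to print the commands instead of running them.
FLUX="${FLUX:-flux}"

# set_freeze ACTION NAMESPACE...
# ACTION is "suspend" or "resume"; applies it to every Kustomization and
# HelmRelease in each listed namespace.
set_freeze() (
  action="$1"; shift
  for namespace in "$@"; do
    for kind in kustomization helmrelease; do
      "${FLUX}" "${action}" "${kind}" --all --namespace "${namespace}"
    done
  done
)
```

Calling set_freeze suspend apps infra at the start of a freeze and set_freeze resume apps infra at the end shows where the approach strains: every namespace and kind has to be enumerated, and objects that were already suspended for other reasons get resumed along with the rest.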