
RFC-XXXX Gating Flux reconciliation

Status: provisional

Creation date: 2022-09-28

Last update: 2022-10-04

Summary

Flux should offer a mechanism for cluster admins and other teams involved in the release process to manually approve the rollout of changes onto clusters. In addition, Flux should offer a way to define maintenance time windows and other time-based gates, to allow better control over application and infrastructure changes to critical systems.

Motivation

Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and automatically reconciles the changes onto clusters as described by Flux Kustomizations and HelmReleases. The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered to production by reviewing and approving the proposed changes collaboratively through pull requests. Once a pull request is merged into a branch that defines the desired state of the production system, Flux kicks off the reconciliation process.

There are situations when users want to have a gating mechanism after the desired state changes are merged in Git:

Goals

  • Offer a dedicated API for defining time-based gates in a declarative manner.
  • Introduce a gating-controller in the Flux suite that manages the Gate objects.
  • Extend the current Flux APIs and controllers to support gating.

Non-Goals

Proposal

In order to support manual gating, Flux could be extended with a dedicated API and controller that would allow users to define Gate objects and perform operations like open and close.

A Gate object could be referenced in sources (Buckets, Git, Helm, OCI Repositories) and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation) to block the reconciliation until the gate is opened.

A Gate can be opened or closed by annotating the object with a timestamp or by calling a specific webhook receiver exposed by notification-controller.

A Gate can be configured to automatically close or open based on a time window defined in the Gate spec.

The Gate API would replace Flagger's current manual gating mechanism.

User Stories

Story 1

As a member of the SRE team, I want to allow deployments to happen only in a particular time frame of my own choosing.

Define a gate that automatically closes 1h after it has been opened:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
spec:
  interval: 30s
  default: closed
  window: 1h

When the gate is created in-cluster, the gating-controller uses spec.default to set the Opened condition:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Gate closed by default"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened

While the gate is closed, all the objects that reference it will wait for an approval:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  gates:
    - name: sre-approval
    - name: qa-approval
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed."
      reason: GateClosed
      status: "False"
      type: Approved

The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook:

kubectl -n flux-system annotate --overwrite gate/sre-approval \
open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"

The gating-controller extracts the ISO8601 date from the open.gate annotation value, sets requestedAt and resetToDefaultAt (requestedAt plus spec.window) in the status, and opens the gate for the specified window:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-26T11:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
      reason: ReconciliationSucceeded
      status: "True"
      type: Opened

While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval.

The SRE can decide to close the gate ahead of its schedule with:

kubectl -n flux-system annotate --overwrite gate/sre-approval \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"

The gating-controller extracts the ISO8601 date from the close.gate annotation value, compares it with the open.gate & requestedAt date and closes the gate:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:10:00Z"
  resetToDefaultAt: "2021-03-26T10:10:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:10:00Z"
      message: "Gate close requested"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened

Objects that reference this gate will finish their ongoing reconciliation (if any) and then pause.

As a member of the SRE team, I want to block deployments in a particular time window.

To enforce a maintenance window of 24 hours, you can define a Gate that's opened by default:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
spec:
  interval: 30s
  default: opened
  window: 24h

To start the maintenance window you can annotate the gate with:

kubectl -n flux-system annotate --overwrite gate/maintenance \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"

The gating-controller extracts the ISO8601 date from the close.gate annotation value and closes the gate for the specified window:

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-27T10:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for opening at 2021-03-27T10:00:00Z"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened

You could also schedule "No Deploy Fridays" with a CronJob that closes the maintenance gate at 0 0 * * FRI.
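As a rough sketch of that idea, the CronJob below runs the same close annotation command on that schedule. The Gate API exists only as a proposal in this RFC, and the `gate-closer` ServiceAccount and the container image are hypothetical placeholders; the ServiceAccount would need RBAC permission to patch Gate objects.

```yaml
# Sketch only: the Gate API is proposed, not implemented.
# The ServiceAccount and image names are assumptions, not part of the RFC.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: no-deploy-fridays
  namespace: flux-system
spec:
  # Every Friday at 00:00, in the kube-controller-manager's time zone by default.
  schedule: "0 0 * * FRI"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: gate-closer # hypothetical; must be allowed to patch gates
          restartPolicy: OnFailure
          containers:
            - name: close-gate
              image: example.com/tools/kubectl:latest # placeholder for any image with kubectl and a shell
              command:
                - /bin/sh
                - -c
                - |
                  kubectl -n flux-system annotate --overwrite gate/maintenance \
                    close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

With `window: 24h` on the maintenance gate, a close requested on Friday at 00:00 would reset to the default (opened) on Saturday at 00:00.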

Story 2

As a member of the SRE team, I want existing deployments to still be reconciled during a change freeze.

Gates can be used to block Flux sources from being refreshed, so that Flux continues to reconcile the existing, approved desired state whilst new changes are held at a Flux source gate.

Example:

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  gates:
    - name: change-freeze # gate that enforces a change freeze time window
status:
  conditions:
    - lastTransitionTime: "2022-05-26T01:12:22Z"
      message: "Reconciliation is blocked as gate 'flux-system/change-freeze' is closed."
      reason: GateClosed
      status: "True"
      type: Blocked

This would ensure that Gate changes do not affect the eventual consistency of mid-flight reconciliations for state that was already deployed in the cluster. Flux would also continue to re-create Flux-managed objects that were manually deleted from the cluster.
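The `change-freeze` gate referenced in the example is not defined in this story. Assuming it follows the same shape as the maintenance gate from Story 1, it might look like the following sketch; the `72h` window is an arbitrary value chosen for illustration.

```yaml
# Sketch only: reuses the spec fields proposed in Story 1.
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: change-freeze
  namespace: flux-system
spec:
  interval: 30s
  default: opened # sources refresh normally outside the freeze
  window: 72h     # hypothetical freeze duration
```

The freeze would then start with the same `close.gate.fluxcd.io/requestedAt` annotation used for the maintenance gate.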

Alternatives

Users implement gating outside of Flux

Before Flux source

Users could implement their own gating mechanisms as part of their development processes, ensuring that their custom rules are applied before the changes reach their Flux sources (e.g. the target Git repository). For example, if deployments are not allowed on Fridays, no PRs would be merged on those days.

The disadvantage is that some source types may not provide easy ways for users to enforce such rules. When using different source types (e.g. Git, OCI, Helm), multiple implementations may be required.

CronJobs and Flux Suspend

Users can implement a gating mechanism within Kubernetes by leveraging CronJobs and using the built-in suspend feature in Flux that allows for a Flux object to stop being reconciled until it is resumed. This alternative does not scale well when considering hundreds of Flux objects.
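For illustration, a minimal sketch of this alternative, assuming a Kustomization named `my-app` and hypothetical ServiceAccount and image names. It sets `spec.suspend`, an existing field on Flux Kustomizations, every Friday at midnight.

```yaml
# Sketch only: ServiceAccount, image, and target object names are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: suspend-my-app
  namespace: flux-system
spec:
  schedule: "0 0 * * FRI"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: flux-suspender # hypothetical; must be allowed to patch Kustomizations
          restartPolicy: OnFailure
          containers:
            - name: suspend
              image: example.com/tools/kubectl:latest # placeholder for any image with kubectl and a shell
              command:
                - /bin/sh
                - -c
                - |
                  # Stop reconciling my-app until spec.suspend is set back to false.
                  kubectl -n flux-system patch \
                    kustomizations.kustomize.toolkit.fluxcd.io/my-app \
                    --type merge -p '{"spec":{"suspend":true}}'
```

A second CronJob that sets `suspend` back to `false` would be needed to resume, and every affected object has to be covered by jobs like these.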

Design Details

Implementation History