# RFC-XXXX Gating Flux reconciliation

**Status:** provisional

**Creation date:** 2022-09-28

**Last update:** 2022-10-04

## Summary

Flux should offer a mechanism for cluster admins and other teams involved in the release process
to manually approve the rollout of changes onto clusters. In addition, Flux should offer a way to define
maintenance time windows and other time-based gates, to allow better control of application and
infrastructure changes to critical systems.

## Motivation

Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and
automatically reconciles the changes onto clusters as described with Flux Kustomizations and HelmReleases.
The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered
to production by reviewing and approving the proposed changes collaboratively through pull requests.
Once a pull request is merged onto a branch that defines the desired state of the production system,
Flux kicks off the reconciliation process.

There are situations when users want to have a gating mechanism after the desired state changes are merged in Git:

- Manual approval of container image updates (e.g. https://github.com/fluxcd/flux2/discussions/870)
- Manual approval of infrastructure upgrades (e.g. https://github.com/fluxcd/flux2/issues/959)
- Maintenance window (e.g. https://github.com/fluxcd/flux2/discussions/1004)
- Planned releases
- No Deploy Friday

### Goals

- Offer a dedicated API for defining time-based gates in a declarative manner.
- Introduce a `gating-controller` in the Flux suite that manages the `Gate` objects.
- Extend the current Flux APIs and controllers to support gating.

### Non-Goals

## Proposal

In order to support manual gating, Flux could be extended with a dedicated API and controller
that would allow users to define `Gate` objects and perform operations like `open` and `close`.

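A rough sketch of how the `open` and `close` operations might be wrapped for operators, assuming the timestamp-annotation approach proposed in this RFC. The helper names `gate_annotate_cmd`, `gate_open` and `gate_close` are hypothetical, and the functions only print the `kubectl` command instead of running it, so the sketch can be tried without a cluster or a `Gate` CRD:

```sh
#!/bin/sh
# Hypothetical helpers (not part of Flux): print the kubectl command that
# would request a gate to open or close. Only the annotation keys
# open.gate.fluxcd.io/requestedAt and close.gate.fluxcd.io/requestedAt
# are taken from this RFC; everything else is illustrative.

# gate_annotate_cmd ACTION NAMESPACE NAME [TIMESTAMP]
gate_annotate_cmd() {
  action="$1"; ns="$2"; name="$3"
  # Default to the current UTC time in the format used by the RFC examples.
  ts="${4:-$(date -u +"%Y-%m-%dT%H:%M:%SZ")}"
  case "$action" in
    open|close) ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
  printf 'kubectl -n %s annotate --overwrite gate/%s %s.gate.fluxcd.io/requestedAt=%s\n' \
    "$ns" "$name" "$action" "$ts"
}

gate_open()  { gate_annotate_cmd open  "$@"; }
gate_close() { gate_annotate_cmd close "$@"; }

# Example: print the command, review it, then pipe it to sh to apply it.
#   gate_open flux-system sre-approval
#   gate_close flux-system sre-approval | sh
```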
A `Gate` object could be referenced in sources (Buckets, Git, Helm, OCI Repositories)
and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation)
to block the reconciliation until the gate is opened.

A `Gate` can be opened or closed by annotating the object with a timestamp
or by calling a specific webhook receiver exposed by notification-controller.

A `Gate` can be configured to automatically close or open based on a time window defined in the `Gate` spec.

The `Gate` API would replace Flagger's current
[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).

### User Stories

#### Story 1

> As a member of the SRE team, I want to allow deployments to happen only
> in a particular time frame of my own choosing.

Define a gate that automatically closes 1h after it has been opened:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
spec:
  interval: 30s
  default: closed
  window: 1h
```

When the gate is created in-cluster, the `gating-controller` uses `spec.default` to set the `Opened` condition:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Gate closed by default"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

While the gate is closed, all the objects that reference it will wait for an approval:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  gates:
    - name: sre-approval
    - name: qa-approval
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed."
      reason: GateClosed
      status: "False"
      type: Approved
```

The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook:

```sh
kubectl -n flux-system annotate --overwrite gate/sre-approval \
open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO8601 date from the `open.gate` annotation value,
sets `requestedAt` and `resetToDefaultAt`, and opens the gate for the specified window:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-26T11:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
      reason: ReconciliationSucceeded
      status: "True"
      type: Opened
```

While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval.

The SRE team can close the gate ahead of schedule with:

```sh
kubectl -n flux-system annotate --overwrite gate/sre-approval \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO8601 date from the `close.gate` annotation value,
compares it with the `open.gate` and `requestedAt` dates, and closes the gate:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:10:00Z"
  resetToDefaultAt: "2021-03-26T10:10:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:10:00Z"
      message: "Gate close requested"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

Objects that reference this gate will finish their ongoing reconciliation (if any), then pause.

> As a member of the SRE team, I want to block deployments in a particular time window.

To enforce a maintenance window of 24 hours, you can define a `Gate` that's opened by default:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
spec:
  interval: 30s
  default: opened
  window: 24h
```

To start the maintenance window you can annotate the gate with:

```sh
kubectl -n flux-system annotate --overwrite gate/maintenance \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO8601 date from the `close.gate` annotation value
and closes the gate for the specified window:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-27T10:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for opening at 2021-03-27T10:00:00Z"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.

#### Story 2

> As a member of the SRE team, I want existing deployments to still be
> reconciled during a change freeze.

Gates can be used to block Flux sources from being refreshed, so that Flux
continues to reconcile the existing approved desired state while new changes are
held at the Flux source gate.

Example:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  gates:
    - name: change-freeze # gate that enforces a change freeze time window
status:
  conditions:
    - lastTransitionTime: "2022-05-26T01:12:22Z"
      message: "Reconciliation is blocked as gate 'flux-system/change-freeze' is closed."
      reason: GateClosed
      status: "True"
      type: Blocked
```

This would ensure that Gate changes would not impact the eventual consistency of mid-flight
reconciliations that were already deployed in the cluster.
Flux would also continue to re-create Flux managed objects that were manually deleted from the cluster.

### Alternatives

#### Users to implement gating outside of Flux

##### Before Flux source

Users could implement their own gating mechanisms as part of their development
processes ensuring that their custom rules are applied before the changes reach
their Flux sources (i.e. the target Git repository). For example, if deployments
are not allowed on Fridays, no PRs would be merged on those days.

The disadvantage is that some source types may not provide easy ways for users
to enforce such rules. When using different source types (e.g. Git, OCI, Helm),
multiple implementations may be required.

##### CronJobs and Flux Suspend

Users can implement a gating mechanism within Kubernetes by leveraging CronJobs and
using the built-in suspend feature in Flux that allows for a Flux object to stop
being reconciled until it is resumed.

This alternative does not scale well when considering hundreds of Flux objects.

## Design Details

## Implementation History