mirror of https://github.com/fluxcd/flux2.git
You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
302 lines
9.7 KiB
Markdown
302 lines
9.7 KiB
Markdown
# RFC-XXXX Gating Flux reconciliation
|
|
|
|
**Status:** provisional
|
|
|
|
**Creation date:** 2022-09-28
|
|
|
|
**Last update:** 2022-10-04
|
|
|
|
## Summary
|
|
|
|
Flux should offer a mechanism for cluster admins and other teams involved in the release process
|
|
to manually approve the rollout of changes onto clusters. In addition, Flux should offer
|
|
a way to define maintenance time windows and other time-based gates, to allow a better control
|
|
of applications and infrastructure changes to critical system.
|
|
|
|
## Motivation
|
|
|
|
Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and
|
|
automatically reconciles the changes onto clusters as described with Flux Kustomizations and HelmReleases.
|
|
The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered
|
|
to production by reviewing and approving the proposed changes in a collaborative manner with pull request.
|
|
Once a pull request is merged onto a branch that defines the desired state of the production system,
|
|
Flux kicks off the reconciliation process.
|
|
|
|
There are situations when users want to have a gating mechanism after the desired state changes are merged in Git:
|
|
|
|
- Manual approval of container image updates (e.g. https://github.com/fluxcd/flux2/discussions/870)
|
|
- Manual approval of infrastructure upgrades (e.g. https://github.com/fluxcd/flux2/issues/959)
|
|
- Maintenance window (e.g. https://github.com/fluxcd/flux2/discussions/1004)
|
|
- Planned releases
|
|
- No Deploy Friday
|
|
|
|
### Goals
|
|
|
|
- Offer a dedicated API for defining time-based gates in a declarative manner.
|
|
- Introduce a `gating-controller` in the Flux suite that manages the `Gate` objects.
|
|
- Extend the current Flux APIs and controllers to support gating.
|
|
|
|
### Non-Goals
|
|
|
|
<!--
|
|
What is out of scope for this RFC? Listing non-goals helps to focus discussion
|
|
and make progress.
|
|
-->
|
|
|
|
## Proposal
|
|
|
|
In order to support manual gating, Flux could be extended with a dedicated API and controller
|
|
that would allow users to define `Gate` objects and perform operations like `open` and `close`.
|
|
|
|
A `Gate` object could be referenced in sources (Buckets, Git, Helm, OCI Repositories)
|
|
and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation)
|
|
to block the reconciliation until the gate is opened.
|
|
|
|
A `Gate` can be opened or closed by annotating the object with a timestamp or by
|
|
calling a specific webhook receiver exposed by notification-controller.
|
|
|
|
A `Gate` can be configured to automatically close or open based on a time window defined in the `Gate` spec.
|
|
|
|
The `Gate` API would replace Flagger's current
|
|
[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).
|
|
|
|
### User Stories
|
|
|
|
#### Story 1
|
|
|
|
> As a member of the SRE team, I want to allow deployments to happen only
|
|
> in a particular time frame of my own choosing.
|
|
|
|
Define a gate that automatically closes after 1h from the time it has been opened:
|
|
|
|
```yaml
|
|
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
|
kind: Gate
|
|
metadata:
|
|
name: sre-approval
|
|
namespace: flux-system
|
|
spec:
|
|
interval: 30s
|
|
default: closed
|
|
window: 1h
|
|
```
|
|
|
|
When the gate is created in-cluster, the `gating-controller` uses `spec.default` to set the `Opened` condition:
|
|
|
|
```yaml
|
|
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
|
kind: Gate
|
|
metadata:
|
|
name: sre-approval
|
|
namespace: flux-system
|
|
status:
|
|
conditions:
|
|
- lastTransitionTime: "2021-03-26T10:09:26Z"
|
|
message: "Gate closed by default"
|
|
reason: ReconciliationSucceeded
|
|
status: "False"
|
|
type: Opened
|
|
```
|
|
|
|
While the gate is closed, all the objects that reference it will wait for an approval:
|
|
|
|
```yaml
|
|
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
|
|
kind: Kustomization
|
|
metadata:
|
|
name: my-app
|
|
namespace: flux-system
|
|
spec:
|
|
gates:
|
|
- name: sre-approval
|
|
- name: qa-approval
|
|
status:
|
|
conditions:
|
|
- lastTransitionTime: "2021-03-26T10:09:26Z"
|
|
message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed."
|
|
reason: GateClosed
|
|
status: "False"
|
|
type: Approved
|
|
```
|
|
|
|
The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook:
|
|
|
|
```sh
|
|
kubectl -n flux-system annotate --overwrite gate/sre-approval \
|
|
open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
|
|
```
|
|
|
|
The `gating-controller` extracts the ISO8601 date from the `open.gate` annotation value,
|
|
sets the `requestedAt` & `resetToDefaultAt`, and opens the gate for the specified window:
|
|
|
|
```yaml
|
|
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
|
kind: Gate
|
|
metadata:
|
|
name: sre-approval
|
|
namespace: flux-system
|
|
status:
|
|
requestedAt: "2021-03-26T10:00:00Z"
|
|
resetToDefaultAt: "2021-03-26T11:00:00Z"
|
|
conditions:
|
|
- lastTransitionTime: "2021-03-26T10:00:00Z"
|
|
message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
|
|
reason: ReconciliationSucceeded
|
|
status: "True"
|
|
type: Opened
|
|
```
|
|
|
|
While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval.
|
|
|
|
The SRE can decide to close the gate ahead of its schedule with:
|
|
|
|
```sh
|
|
kubectl -n flux-system annotate --overwrite gate/sre-approval \
|
|
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
|
|
```
|
|
|
|
The `gating-controller` extracts the ISO8601 date from the `close.gate` annotation value,
|
|
compares it with the `open.gate` & `requestedAt` date and closes the gate:
|
|
|
|
```yaml
|
|
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
|
kind: Gate
|
|
metadata:
|
|
name: sre-approval
|
|
namespace: flux-system
|
|
status:
|
|
requestedAt: "2021-03-26T10:10:00Z"
|
|
resetToDefaultAt: "2021-03-26T10:10:00Z"
|
|
conditions:
|
|
- lastTransitionTime: "2021-03-26T10:10:00Z"
|
|
message: "Gate close requested"
|
|
reason: ReconciliationSucceeded
|
|
status: "False"
|
|
type: Opened
|
|
```
|
|
|
|
The objects that are referencing this gate, will finish their ongoing reconciliation (if any) then pause.
|
|
|
|
> As a member of the SRE team, I want to block deployments in a particular time window.
|
|
|
|
To enforce a maintenance window of 24 hours, you can define a `Gate` that's opened by default:
|
|
|
|
```yaml
|
|
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
|
kind: Gate
|
|
metadata:
|
|
name: maintenance
|
|
namespace: flux-system
|
|
spec:
|
|
interval: 30s
|
|
default: opened
|
|
window: 24h
|
|
```
|
|
|
|
To start the maintenance window you can annotate the gate with:
|
|
|
|
```sh
|
|
kubectl -n flux-system annotate --overwrite gate/maintenance \
|
|
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
|
|
```
|
|
|
|
The `gating-controller` extracts the ISO8601 date from the `close.gate`
|
|
annotation value and closes the gate for the specified window:
|
|
|
|
```yaml
|
|
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
|
|
kind: Gate
|
|
metadata:
|
|
name: maintenance
|
|
namespace: flux-system
|
|
status:
|
|
requestedAt: "2021-03-26T10:00:00Z"
|
|
resetToDefaultAt: "2021-03-27T10:00:00Z"
|
|
conditions:
|
|
- lastTransitionTime: "2021-03-26T10:00:00Z"
|
|
message: "Gate scheduled for opening at 2021-03-27T11:00:00Z"
|
|
reason: ReconciliationSucceeded
|
|
status: "False"
|
|
type: Opened
|
|
```
|
|
|
|
You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.
|
|
|
|
#### Story 2
|
|
|
|
> As a member of the SRE team, I want existing deployments to still be
|
|
> reconciled during a change freeze.
|
|
|
|
Gates can be used to block Flux sources from being refreshed, resulting in Flux
|
|
to continue to reconcile existing approved desired states, whislt new changes
|
|
are held at a Flux source gate.
|
|
|
|
Example:
|
|
|
|
```yaml
|
|
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
|
|
kind: GitRepository
|
|
metadata:
|
|
name: flux-system
|
|
namespace: flux-system
|
|
spec:
|
|
gates:
|
|
- name: change-freeze # gate that enforces a change freeze time window
|
|
status:
|
|
conditions:
|
|
- lastTransitionTime: "2022-05-26T01:12:22Z"
|
|
message: "Reconciliation is blocked as gate 'flux-system/change-freeze' is closed."
|
|
reason: GateClosed
|
|
status: "True"
|
|
type: Blocked
|
|
```
|
|
|
|
This would ensure that Gate changes would not impact the eventual consistency of
|
|
mid-flight reconciliations that were already deployed in the cluster. Flux would also
|
|
continue to re-create Flux managed objects that were manually deleted from the cluster.
|
|
|
|
### Alternatives
|
|
|
|
#### Users to implement gating outside of Flux
|
|
|
|
##### Before Flux source
|
|
|
|
Users could implement their own gating mechanisms as part of their development processes
|
|
ensuring that their custom rules are applied before the changes reach their Flux sources
|
|
(i.e. the target Git repository). For example, if deployments are not allowed on Fridays,
|
|
no PRs would be merged on those days.
|
|
|
|
The disadvantage is that some source types may not provide easy ways for users to enforce
|
|
such rules. When using different source types (e.g. Git, OCI, Helm), multiple implementations
|
|
may be required.
|
|
|
|
##### CronJobs and Flux Suspend
|
|
|
|
Users can implement a gating mechanism within Kubernetes by leveraging CronJobs and using
|
|
the built-in suspend feature in Flux that allows for a Flux object to stop being reconciled
|
|
until it is resumed. This alternative does not scale well when considering hundreds of Flux
|
|
objects.
|
|
|
|
## Design Details
|
|
|
|
<!--
|
|
This section should contain enough information that the specifics of your
|
|
change are understandable. This may include API specs and code snippets.
|
|
|
|
The design details should address at least the following questions:
|
|
- How can this feature be enabled / disabled?
|
|
- Does enabling the feature change any default behavior?
|
|
- Can the feature be disabled once it has been enabled?
|
|
- How can an operator determine if the feature is in use?
|
|
- Are there any drawbacks when enabling this feature?
|
|
-->
|
|
|
|
## Implementation History
|
|
|
|
<!--
|
|
Major milestones in the lifecycle of the RFC such as:
|
|
- The first Flux release where an initial version of the RFC was available.
|
|
- The version of Flux where the RFC graduated to general availability.
|
|
- The version of Flux where the RFC was retired or superseded.
|
|
-->
|