* Change the workers managed instance group to health check nodes
via HTTP probe of the kube-proxy port 10256 /healthz endpoints
* Advantages: kube-proxy is a lower value target (in case there
were bugs in firewalls) that Kubelet, its more representative than
health checking Kubelet (Kubelet must run AND kube-proxy Daemonset
must be healthy), and its already used by kube-proxy liveness probes
(better discoverability via kubectl or alerts on pods crashlooping)
* Another motivator is that GKE clusters also use kube-proxy port
10256 checks to assess node health
* When a worker managed instance group's (MIG) instance template
changes (including machine type, disk size, or Butane snippets
but excluding new AMIs), use Google Cloud's rolling update features
to ensure instances match declared state
* Ignore new AMIs since Fedora CoreOS and Flatcar Linux nodes
already auto-update and reboot themselves
* Rolling updates will create surge instances, wait for health
checks, then delete old instances (0 unavilable instances)
* Instances are replaced to ensure new Ignition/Butane snippets
are respected
* Add managed instance group autohealing (i.e. health checks) to
ensure new instances' Kubelet is running
Renames
* Name apiserver and kubelet health checks consistently
* Rename MIG from `${var.name}-worker-group` to `${var.name}-worker`
Rel: https://cloud.google.com/compute/docs/instance-groups/rolling-out-updates-to-managed-instance-groups
* Add `node_taints` variable to worker modules to set custom
initial node taints on cloud platforms that support auto-scaling
worker pools of heterogeneous nodes (i.e. AWS, Azure, GCP)
* Worker pools could use custom `node_labels` to allowed workloads
to select among differentiated nodes, while custom `node_taints`
allows a worker pool's nodes to be tainted as special to prevent
scheduling, except by workloads that explicitly tolerate the
taint
* Expose `daemonset_tolerations` in AWS, Azure, and GCP kubernetes
cluster modules, to determine whether `kube-system` components
should tolerate the custom taint (advanced use covered in docs)
Rel: #550, #663
Closes #429
* In v1.18.3, the `os_stream` variable was added to select
a Fedora CoreOS image stream (stable, testing, next) on
AWS and Google Cloud (which publish official streams)
* Remove `os_image` variable deprecated in v1.18.3. Manually
uploaded images are no longer needed
* With Fedora CoreOS image stream support (#727), the latest
resolved image will change over the lifecycle of a cluster.
* Fix issue where an image diff proposed replacing a Fedora
CoreOS controller on GCP, introduced in #727 (unreleased)
* Also ignore image diffs to the GCP managed instance group
of workers. This aligns with worker AMI diffs being ignored
on AWS and similar on Azure, since workers update themselves.
Background:
* Controller nodes should strictly not be recreated by Terraform,
they are stateful (etcd) and should not be replaced
* Across cloud platforms, OS image diffs are ignored since both
Flatcar Linux and Fedora CoreOS nodes update themselves. For
workers, user-data or disk size diffs (where relevant) are allowed
to recreate workers templates/configs since these are considered
to be user-initiated declarations that a reprovision should be done
* Add Typhoon Fedora CoreOS on Google Cloud as alpha
* Add docs on uploading the Fedora CoreOS GCP gzipped tarball to
Google Cloud storage to create a boot disk image