What's new in Kubernetes 1.16?

Ephemeral containers

Stage: Alpha

Feature group: node

Ephemeral containers are a great way to debug running pods: you can’t add regular containers to a pod after creation, but you can run ephemeral containers (for general troubleshooting you can also use Sysdig tools like kubectl capture or kubectl trace).

Right now the steps to run an ephemeral container aren’t straightforward. Once this feature is stable you may be able to run them with just kubectl debug:

kubectl debug -c debug-shell --image=debian target-pod -- bash

These containers execute within the namespace of an existing pod and have access to the file systems of its individual containers.

Ephemeral containers aren’t meant to be used for regular deployments, so they have some limitations. For example, they will never be automatically restarted and you can’t configure them as a regular container. In particular, fields like ports, livenessProbe, readinessProbe or lifecycle that imply a role in a pod will be disallowed.
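
While the feature is in alpha and kubectl debug isn’t available yet, one way to attach an ephemeral container is to update the pod’s ephemeralcontainers subresource directly. The following is a minimal sketch of that workflow; the pod name (example-pod), container name (debugger) and image (busybox) are placeholders:

{
    "apiVersion": "v1",
    "kind": "EphemeralContainers",
    "metadata": {
        "name": "example-pod"
    },
    "ephemeralContainers": [{
        "name": "debugger",
        "command": ["sh"],
        "image": "busybox",
        "imagePullPolicy": "IfNotPresent",
        "stdin": true,
        "tty": true,
        "terminationMessagePolicy": "File"
    }]
}

# Save the above as ec.json, apply it to the running pod and attach to the new container
kubectl replace --raw /api/v1/namespaces/default/pods/example-pod/ephemeralcontainers -f ec.json
kubectl attach -it example-pod -c debugger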

Add IPv4/IPv6 dual-stack support

Stage: Alpha

Feature group: network

As the use of IPv6 increases it’s getting more common to manage clusters with mixed IPv4 and IPv6 network configurations.

Up until now, a Kubernetes cluster could only run in either IPv4 or IPv6 mode. You needed the assistance of plugins to assign dual-stack addresses to a pod, and it wasn’t a convenient solution, as Kubernetes would only be aware of one address per pod.

Now you can natively run your cluster in dual-stack mode. For example, you can have dual-stack pods (services still need to be either IPv4 or IPv6).

To use dual-stack you need to enable the IPv6DualStack feature gate in the relevant components of your cluster, and then set up your services. You can get the full steps here.
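
As a sketch of what this looks like, once the feature gate is enabled you can request an address family explicitly on a Service through the alpha ipFamily field (the Service name and selector below are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ipFamily: IPv6      # alpha field: IPv4 or IPv6
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080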

New endpoint API

Stage: Alpha

Feature group: network

Until now, all the endpoints for a service were stored in a single object. In large Services with many pods, this Endpoints object can grow too big and become problematic: very large objects cannot be stored in etcd, and propagating them to every kube-proxy is expensive.

In addition, every time there is a change in an endpoint, the whole Endpoints object is re-computed, stored and shared with all watchers. This process doesn’t scale well and can become a bottleneck in scenarios like rolling upgrades, where there is a burst of endpoint changes.

The new EndpointSlice API will split the endpoints of a service into several EndpointSlice resources, solving many of the current API problems. It’s also designed to support other future features, like multiple IPs per pod.
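
As a rough sketch, an EndpointSlice groups a subset of a service’s endpoints together with their ports and topology information. Field names follow the discovery.k8s.io/v1alpha1 API introduced with this feature and may change as it matures; the names, address and node below are placeholders:

apiVersion: discovery.k8s.io/v1alpha1
kind: EndpointSlice
metadata:
  name: example-abc
  labels:
    kubernetes.io/service-name: example
addressType: IP
ports:
- name: http
  protocol: TCP
  port: 80
endpoints:
- addresses:
  - "10.1.2.3"
  conditions:
    ready: true
  topology:
    kubernetes.io/hostname: node-1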

Pod overhead: account resources tied to the pod sandbox, but not specific containers

Stage: Alpha

Feature group: node

In addition to the requested resources, your pods need some extra resources just to maintain their runtime environment.

With the PodOverhead feature gate enabled, Kubernetes will take this overhead into account when scheduling a pod. The pod overhead is calculated and fixed at admission time, and it’s associated with the pod’s RuntimeClass. Get the full details here.
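
As a minimal sketch, the overhead is declared on the RuntimeClass itself. The handler name and values below are just an example for a hypothetical sandboxed runtime:

apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata
overhead:
  podFixed:
    cpu: "250m"
    memory: "120Mi"

Pods that set runtimeClassName: kata-containers will have this cpu and memory added to their resource accounting at admission time.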

Even pod spreading across failure domains

Stage: Alpha

Feature group: scheduling

One of the challenges of running a multi-zone cluster is spreading your pods evenly, so that high availability works correctly and resource utilization is efficient.

With topologySpreadConstraints you can distribute your pods across zones, with a maximum difference in pod count of maxSkew between any two zones. Zones are defined by grouping nodes that share the same value of the topologyKey label.

If we want to deploy this pod:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
…
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
…

In a cluster with this topology:

Label
        +---------------+---------------+
zone=   |     zoneA     |     zoneB     | 
        +-------+-------+-------+-------+
node=   | node1 | node2 | node3 | node4 |
        +-------+-------+-------+-------+
foo:bar | P     | P     | P _   | _     |
        +-------+-------+-------+-------+

The only way to comply with the topology constraints is for the pod to be deployed in node3 or in node4.

Add pod-startup liveness-probe holdoff for slow-starting pods

Stage: Alpha

Feature group: node

Probes allow Kubernetes to monitor the status of your applications. You can use a livenessProbe to periodically check if the application is still alive. For example, a container defines this probe:

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 3
  periodSeconds: 10

If the probe fails 3 times in a row (30 seconds), the container is restarted. But if the container is slow and needs more than 30 seconds to start, the probe will keep failing and the container will be restarted again and again.

This new feature lets you define a startupProbe that will hold off all the other probes until the pod finishes its startup:

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10

Now our slow container has up to 5 minutes (30 checks * 10 seconds = 300s) to finish its startup.

Extending RequestedToCapacityRatio priority function to support resource bin packing of extended resources

Stage: Alpha

Feature group: scheduling

The RequestedToCapacityRatioPriority function allows you to schedule pods depending on the relative resource usage of each node. That way you can choose whether to schedule pods on the least used nodes, or to keep filling the ones that are already in use.

The new resources property lets you further define the relative usage of a node. By assigning weights to the node resources, you can define scenarios like “CPU usage is 3 times more important than used memory”, and then schedule more pods on nodes with idle CPUs even if they don’t have that much free memory.

{
  "kind": "Policy",
  "apiVersion": "v1",
  …
  "priorities": [
    …
    {
      "name": "RequestedToCapacityRatioPriority",
      "weight": 2,
      "argument": {
        "requestedToCapacityRatioArguments": {
          "shape": [
            {"utilization": 0, "score": 0},
            {"utilization": 100, "score": 10}
          ],
          "resources": [
            {"name": "intel.com/foo", "weight": 5},
            {"name": "CPU", "weight": 3},
            {"name": "Memory", "weight": 1}
          ]
        }
      }
    }
  ]
}

RuntimeClass scheduling

Stage: Graduating to Beta

Feature group: node

The initial RuntimeClass implementation was meant for homogeneous clusters, where every node supports every RuntimeClass.

This upgrade improves scheduling in heterogeneous clusters, with specialized nodes that only support a subset of the runtime classes.

In these clusters, pods are now automatically scheduled only to the nodes that have support for their RuntimeClass.
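
A minimal sketch of how this looks on the RuntimeClass object; the handler and the node label are assumptions for a cluster where only some nodes run a sandboxed runtime:

apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: sandboxed
handler: kata
scheduling:
  nodeSelector:
    runtime: kata    # pods using this RuntimeClass are only scheduled to nodes with this label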

Kubeadm for Windows

Stage: Alpha

Feature group: cluster-lifecycle

Support for Windows nodes was introduced in Kubernetes 1.14; however, there wasn’t an easy way to join Windows nodes to a cluster.

Starting in Kubernetes 1.16, kubeadm join will be available to Windows users with partial functionality. It will lack some features, like kubeadm init or kubeadm join --control-plane.
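
The join step itself is expected to look the same as on Linux nodes. A sketch, where the endpoint, token and hash are placeholders generated on the control plane:

kubeadm join 10.0.0.1:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>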

RunAsUserName for Windows

Stage: Alpha

Feature group: windows

Now that Kubernetes has support for Group Managed Service Accounts, we can use the Windows-specific runAsUserName property to define which user will run a container’s entrypoint.

The property is inside the PodSecurityContext and SecurityContext structs, and it needs to follow the format DOMAIN\USER, where the domain part is optional.

apiVersion: v1
kind: Pod
…
spec:
  securityContext:
    windowsOptions:
      runAsUserName: "NT AUTHORITY\\NETWORK SERVICE"

Support GMSA for Windows workloads

Stage: Graduating to Beta

Feature group: windows

This will allow an operator to choose a GMSA at deployment time, and run containers using it to connect to existing applications such as a database or API server without changing how the authentication and authorization are managed inside the organization.
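
As a sketch, once the cluster administrator has created a GMSA credential spec resource, a pod (or an individual container) selects it by name through the Windows security context options; gmsa-webapp1 is a placeholder name:

apiVersion: v1
kind: Pod
…
spec:
  securityContext:
    windowsOptions:
      gmsaCredentialSpecName: gmsa-webapp1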

Admission webhook

Stage: Graduating to Stable

Feature group: API

Up until Kubernetes 1.15, mutating webhooks were only called once, in alphabetical order. This changed in 1.15, allowing webhook re-invocation if another webhook later in the chain modifies the same object.
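
Re-invocation is opt-in per webhook. A sketch with the now-stable admissionregistration.k8s.io/v1 API (the webhook name is a placeholder):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
…
webhooks:
- name: my-webhook.example.com
  reinvocationPolicy: IfNeeded   # the default is Never
  …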

Add watch bookmarks support

Stage: Graduating to Beta

Feature group: API

The “bookmark“ watch event is used as a checkpoint, indicating that all objects up to a given resourceVersion requested by the client have already been sent. The API can skip sending all these events, avoiding unnecessary processing on both sides.
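
Clients have to ask for bookmarks explicitly. A sketch of a watch request with the allowWatchBookmarks parameter (the namespace and resourceVersion value are placeholders):

GET /api/v1/namespaces/default/pods?watch=1&allowWatchBookmarks=true&resourceVersion=10245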

Hardware support

Node topology manager

Stage: Alpha

Feature group: node

Machine learning, scientific computing and financial services are examples of systems that are computationally intensive or require ultra low latency. These kinds of workloads benefit from proper resource allocation.

For example, performance is improved if a process runs on one isolated CPU core rather than jumping between cores or sharing time with other processes. Parallel processes also run better on cores inside the same CPU socket (in multi socket systems).

The node topology manager is a kubelet component that centralizes the coordination of hardware resource assignments. Currently this task is done by independent components (CPU manager, device manager, CNI), which sometimes results in suboptimal allocations.

Only pods in the Guaranteed QoS class with integer CPU requests are considered by the topology manager, like the one in this example:

…
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"

Configuration management

Advanced configurations with kubeadm (using Kustomize)

Stage: Alpha

Feature group: cluster-lifecycle

kubeadm works great for configuring most Kubernetes clusters, but it has some limitations, and some advanced use cases require extra tools.

With Kustomize you can patch base configurations to obtain configuration variants, which helps to manage some advanced scenarios. For example, you can have a base configuration for your service, then patch it with different limits for each of your dev, test and prod environments.

Now kubeadm integrates with Kustomize. When you pass patches via the --experimental-kustomize flag, kubeadm will first apply those patches to the existing configuration, and then proceed as usual with the patched config.

kubeadm init --experimental-kustomize kubeadm-patches/

The flag will be renamed to just --kustomize when this feature reaches beta. Learn more and check other examples here.
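
As a rough sketch of the kind of patch you could place in that directory, assuming a strategic merge patch targeting the kube-apiserver static pod generated by kubeadm (the file name and annotation are hypothetical):

# kubeadm-patches/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
  annotations:
    example.com/patched-by: kustomize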

Server-side apply

Stage: Graduating to Beta

Feature group: API machinery

This feature aims to move the logic away from kubectl apply to the apiserver, fixing most of the current workflow pitfalls and also making the operation accessible directly from the API (for example using curl), without strictly requiring kubectl or a Golang implementation.
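
A sketch of what a raw, kubectl-less apply could look like, using the apply-patch+yaml content type and a fieldManager name identifying the client (the API server address, object, namespace and authentication details are placeholders or omitted):

curl -X PATCH 'https://<apiserver>/api/v1/namespaces/default/configmaps/my-config?fieldManager=my-tool' \
  -H "Content-Type: application/apply-patch+yaml" \
  --data-binary @my-config.yaml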

Cloud providers

Finalizer protection for service LoadBalancers

Stage: Graduating to Beta

Feature group: network

There are various corner cases where cloud resources are orphaned after the associated Service is deleted. Finalizer Protection for Service LoadBalancers was introduced to prevent this from happening.

Azure availability zones

Stage: Graduating to Stable

Feature group: azure

Nodes in Azure will be added with the label failure-domain.beta.kubernetes.io/zone=<region>-<AZ>, and topology-aware provisioning is added for the Azure managed disks storage class.

[Azure] Cross resource group nodes

Stage: Graduating to Stable

Feature group: azure

Cross resource group (RG) nodes and unmanaged (such as on-prem) nodes are now supported in the Azure cloud provider.

Storage

Support CSI plugins in Windows

Stage: Alpha

Feature group: storage

Container Storage Interface plugins were created to allow the development of third party storage volume systems.

Starting with Kubernetes 1.16, Windows nodes will be able to use the existing CSI plugins.

Extend allowed PVC DataSources

Stage: Graduating to Beta

Feature group: storage

Using this feature, you can “clone” an existing PV. A Clone results in a new, duplicate volume being provisioned from an existing volume.
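
A minimal sketch of a clone: the new PVC declares an existing PVC as its dataSource (the names, storage class and size are placeholders; the requested size must be at least that of the source):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc
spec:
  storageClassName: csi-storage-class
  dataSource:
    name: existing-pvc
    kind: PersistentVolumeClaim
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi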

Add resizing support to CSI volumes

Stage: Graduating to Beta

Feature group: storage

To support resizing of CSI volumes, an external resize controller will monitor all PVCs. If a PVC meets the criteria for resizing, it will be added to the controller’s work queue.
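
Roughly, resizing requires that the StorageClass allows expansion; a sketch, where the provisioner name is a placeholder for a CSI driver that supports expansion:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-expandable
provisioner: csi.example.com
allowVolumeExpansion: true

After that, growing the volume is a matter of editing spec.resources.requests.storage on a bound PVC.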

CSI inline volume support

Stage: Graduating to Beta

Feature group: storage

CSI volumes can only be referenced via PV/PVC today. This works well for remote persistent volumes. This feature introduces the possibility to use CSI volumes as local ephemeral volumes as well.
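
A sketch of an inline CSI volume declared directly in the pod spec; the driver name and the volume attribute are placeholders, and the driver must support ephemeral inline volumes:

apiVersion: v1
kind: Pod
…
spec:
  containers:
  - name: app
    image: busybox
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    csi:
      driver: inline.csi.example.com
      volumeAttributes:
        size: 1Gi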

Kubernetes 1.16 custom resources

CustomResourceDefinitions

Stage: Graduating to Stable

Feature group: API

This feature groups the many modifications and improvements that were performed to graduate CustomResourceDefinitions to Stable in the Kubernetes 1.16 release.

Subresources for custom resources

Stage: Graduating to Stable

Feature group: API

With this feature you can enable the Status and Scale subresources for Custom resources.

By adding the comment // +kubebuilder:subresource:status to your CRD definition, you will be enabling the /status subresource, which exposes the current status of your custom resource in the system.

// MySQL is the Schema for the mysqls API
// +k8s:openapi-gen=true
// +kubebuilder:subresource:status
type MySQL struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec   MySQLSpec   `json:"spec,omitempty"`
    Status MySQLStatus `json:"status,omitempty"`
}

By enabling the Scale subresource, you’ll be able to check how many replicas of your custom resource are deployed vs. the desired amount. You can obtain this information from the exposed /scale subresource or by running kubectl get on your custom resource. You can also use kubectl scale to adjust the number of replicas of your custom resource.

To enable the Scale subresource you need to define the corresponding JSONPaths in the CRD:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
…
spec:
  subresources:
    status: {}
    scale:
      specReplicasPath: .spec.replicas
      statusReplicasPath: .status.replicas
      labelSelectorPath: .status.labelSelector
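
For example (assuming the MySQL custom resource above, with an instance named my-db), scaling then works just like it does for built-in resources:

kubectl scale --replicas=3 mysqls/my-db
kubectl get mysqls my-db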

Defaulting and pruning for custom resources

Stage: Graduating to Stable

Feature group: API

These are two features aiming to facilitate the JSON handling and processing associated with CustomResourceDefinitions.
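
As a sketch of both behaviors: pruning drops any field not declared in the structural schema, and defaulting fills in missing fields with the value declared in the schema. The replicas field below is a hypothetical example:

openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      properties:
        replicas:
          type: integer
          default: 1     # applied when the field is omitted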

Webhook conversion for custom resources

Stage: Graduating to Stable

Feature group: API

Different CRD versions can have different schemas. You can now handle on-the-fly conversion between versions by defining and implementing a conversion webhook.
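
A sketch of how the conversion webhook is wired into the CRD spec, using the apiextensions.k8s.io/v1 layout (the service name, namespace and path are placeholders):

spec:
  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        service:
          namespace: default
          name: my-conversion-webhook
          path: /convert
      conversionReviewVersions: ["v1", "v1beta1"]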

Publish CRD OpenAPI schema

Stage: Graduating to Stable

Feature group: API

CustomResourceDefinition (CRD) allows the CRD author to define an OpenAPI v3 schema to enable server-side validation for CustomResources (CR).

Deprecations

Deprecate and remove SelfLink

Stage: Alpha

Feature group: API

The field SelfLink is present in every Kubernetes object and contains a URL representing the given object.

This field does not provide any new information, and its creation and maintenance has a performance impact, so a decision has been taken to progressively deprecate SelfLink by Kubernetes 1.21.

Building Kubernetes without in-tree cloud providers

Stage: Alpha

Feature group: cloud-provider

Specific code for cloud providers is being moved away from the core Kubernetes repository (in-tree) to their own external repositories (out-of-tree). By doing so, cloud providers will be able to develop and make releases independently of the core Kubernetes release cycle.

At this halfway point, cloud providers are being copied out-of-tree but are still available in-tree, so developers may end up with two versions of the same cloud provider in their builds. How do you know which one of the two versions is active?

With this alpha feature you can disable in-tree cloud providers to ensure your build is only using the external version.

Kubernetes metrics overhaul

Stage: Alpha

Feature group: instrumentation

This feature summarizes several tasks needed to align Kubernetes metrics with their Instrumentation Guidelines. Main tasks are changing the names and units of some metrics to be in line with the rest of the Prometheus ecosystem.

Originally published at https://sysdig.com
