Kubernetes ApiServer Concurrency Safety Mechanism

Problem definition

The question this article explores is: when multiple requests concurrently update the same API resource object, how does the Kubernetes ApiServer ensure that no update is lost, and how does it detect conflicts?

Using kubectl as an entry point

kubectl has many commands. The ones I use most often to modify API resource objects are apply and edit. Taking a Deployment resource object as an example and observing the requests these two commands issue, both turn out to use HTTP PATCH:

PATCH /apis/extensions/v1beta1/namespaces/default/deployments/nginx HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: kubectl/v0.0.0 (linux/amd64) kubernetes/$Format
Content-Length: 246
Accept: application/json
Content-Type: application/strategic-merge-patch+json
Uber-Trace-Id: 5ca3fde0b9f9aaf1:0c0358897c8e0ef8:0f38135280523f87:1
Accept-Encoding: gzip

{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"nginx"}],"containers":[{"$setElementOrder/env":[{"name":"DEMO_GREETING"}],"env":[{"name":"DEMO_GREETING","value":"Hello from the environment#kubectl edit"}],"name":"nginx"}]}}}}

Tracking ApiServer’s processing of PATCH requests

ApiServer’s call stack for processing PATCH requests is as follows:

restfulPatchResource
    r rest.Patcher // registry.Store
    scope handlers.RequestScope
    admit admission.Interface
    supportedTypes []string 
    restful.RouteFunction // go-restful

handlers.PatchResource
    r rest.Patcher // registry.Store
    scope *RequestScope
    admit admission.Interface
    patchTypes []string
    http.HandlerFunc

k8s.io/apiserver/pkg/endpoints/handlers.(*patcher).patchResource
    ctx context.Context // http.Request.Context
    scope *RequestScope
    runtime.Object
    bool
    error

k8s.io/apiserver/pkg/registry/generic/registry.(*Store).Update
    ctx context.Context
    name string
    objInfo rest.UpdatedObjectInfo
    createValidation rest.ValidateObjectFunc
    updateValidation rest.ValidateObjectUpdateFunc
    forceAllowCreate bool
    options *metav1.UpdateOptions
    runtime.Object
    bool
    error

k8s.io/apiserver/pkg/registry/generic/registry.(*DryRunnableStorage).GuaranteedUpdate
    ctx context.Context
    key string
    ptrToType runtime.Object
    ignoreNotFound bool
    preconditions *storage.Preconditions
    tryUpdate storage.UpdateFunc
    dryRun bool
    suggestion ...runtime.Object
    error

k8s.io/apiserver/pkg/storage/cacher.(*Cacher).GuaranteedUpdate
    ctx context.Context
    key string
    ptrToType runtime.Object
    ignoreNotFound bool
    preconditions *storage.Preconditions
    tryUpdate storage.UpdateFunc
    _ ...runtime.Object
    error

k8s.io/apiserver/pkg/storage/etcd3.(*store).GuaranteedUpdate
    ctx context.Context
    key string
    out runtime.Object
    ignoreNotFound bool
    preconditions *storage.Preconditions
    tryUpdate storage.UpdateFunc
    suggestion ...runtime.Object
    error

There are several key points in the entire call stack that need to be analyzed:

storage.Preconditions

Preconditions are checked before tryUpdate is called.

// Preconditions must be fulfilled before an operation (update, delete, etc.) is carried out.
type Preconditions struct {
    // Specifies the target UID.
    // +optional
    UID *types.UID `json:"uid,omitempty"`
    // Specifies the target ResourceVersion
    // +optional
    ResourceVersion *string `json:"resourceVersion,omitempty"`
}

storage.Preconditions has a single method, func (p *Preconditions) Check(key string, obj runtime.Object) error, which checks whether the UID and ResourceVersion of obj match the Preconditions.

In PATCH request processing, the Preconditions are obtained from rest.UpdatedObjectInfo.Preconditions().

rest.UpdatedObjectInfo

rest.UpdatedObjectInfo.UpdatedObject() is called inside the tryUpdate closure to produce the updated resource object.

// UpdatedObjectInfo provides information about an updated object to an Updater.
// It requires access to the old object in order to return the newly updated object.
type UpdatedObjectInfo interface {
    // Returns preconditions built from the updated object, if applicable.
    // May return nil, or a preconditions object containing nil fields,
    // if no preconditions can be determined from the updated object.
    Preconditions() *metav1.Preconditions

    // UpdatedObject returns the updated object, given a context and old object.
    // The only time an empty oldObj should be passed in is if a "create on update" is occurring (there is no oldObj).
    UpdatedObject(ctx context.Context, oldObj runtime.Object) (newObj runtime.Object, err error)
}

In PATCH request processing, the implementation of rest.UpdatedObjectInfo is rest.defaultUpdatedObjectInfo:

// defaultUpdatedObjectInfo implements UpdatedObjectInfo
type defaultUpdatedObjectInfo struct {
    // obj is the updated object
    obj runtime.Object

    // transformers is an optional list of transforming functions that modify or
    // replace obj using information from the context, old object, or other sources.
    transformers []TransformFunc
}

// DefaultUpdatedObjectInfo returns an UpdatedObjectInfo impl based on the specified object.
func DefaultUpdatedObjectInfo(obj runtime.Object, transformers ...TransformFunc) UpdatedObjectInfo {
    return &defaultUpdatedObjectInfo{obj, transformers}
}

// Preconditions satisfies the UpdatedObjectInfo interface.
// Note that only the UID is read here.
func (i *defaultUpdatedObjectInfo) Preconditions() *metav1.Preconditions {
    // Attempt to get the UID out of the object
    accessor, err := meta.Accessor(i.obj)
    if err != nil {
        // If no UID can be read, no preconditions are possible
        return nil
    }

    // If empty, no preconditions needed
    uid := accessor.GetUID()
    if len(uid) == 0 {
        return nil
    }

    return &metav1.Preconditions{UID: &uid}
}

// UpdatedObject satisfies the UpdatedObjectInfo interface.
// It returns a copy of the held obj, passed through any configured transformers.
func (i *defaultUpdatedObjectInfo) UpdatedObject(ctx context.Context, oldObj runtime.Object) (runtime.Object, error) {
    var err error
    // Start with the configured object
    newObj := i.obj

    // If the original is non-nil (might be nil if the first transformer builds the object from the oldObj), make a copy,
    // so we don't return the original. BeforeUpdate can mutate the returned object, doing things like clearing ResourceVersion.
    // If we're re-called, we need to be able to return the pristine version.
    if newObj != nil {
        newObj = newObj.DeepCopyObject()
    }

    // Allow any configured transformers to update the new object
    for _, transformer := range i.transformers {
        newObj, err = transformer(ctx, newObj, oldObj) // if http.method == patch: patcher.applyPatch, patcher.applyAdmission
        if err != nil {
            return nil, err
        }
    }

    return newObj, nil
}

For a PATCH request, defaultUpdatedObjectInfo is created with obj set to nil, so defaultUpdatedObjectInfo.Preconditions() returns nil and no Preconditions check is performed.

// patchResource divides PatchResource for easier unit testing
func (p *patcher) patchResource(ctx context.Context, scope *RequestScope) (runtime.Object, bool, error) {
    ......
    p.updatedObjectInfo = rest.DefaultUpdatedObjectInfo(nil, p.applyPatch, p.applyAdmission)
    ......
}
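The effect of passing nil as obj can be sketched with a toy model. Everything below (resource, transformFunc, updatedObjectInfo, applyPatch) is a simplified stand-in written for this illustration and only imitates the structure of defaultUpdatedObjectInfo. It assumes that the patch transformer builds the new object from the old one, which is what the nil obj implies.

```go
package main

import "fmt"

// resource is a hypothetical, heavily simplified API object.
type resource struct {
	ResourceVersion string
	Env             map[string]string
}

// clone returns a deep copy so that callers never mutate the original.
func (r *resource) clone() *resource {
	out := &resource{ResourceVersion: r.ResourceVersion, Env: make(map[string]string, len(r.Env))}
	for k, v := range r.Env {
		out.Env[k] = v
	}
	return out
}

// transformFunc plays the role of rest.TransformFunc (context omitted for brevity).
type transformFunc func(newObj, oldObj *resource) (*resource, error)

// updatedObjectInfo imitates defaultUpdatedObjectInfo: an optional object plus transformers.
type updatedObjectInfo struct {
	obj          *resource
	transformers []transformFunc
}

// updatedObject copies obj (if any) and passes it through the transformers.
func (i *updatedObjectInfo) updatedObject(oldObj *resource) (*resource, error) {
	newObj := i.obj
	if newObj != nil {
		newObj = newObj.clone()
	}
	var err error
	for _, t := range i.transformers {
		if newObj, err = t(newObj, oldObj); err != nil {
			return nil, err
		}
	}
	return newObj, nil
}

// applyPatch stands in for patcher.applyPatch: it starts from oldObj and overlays the patch.
func applyPatch(patch map[string]string) transformFunc {
	return func(_, oldObj *resource) (*resource, error) {
		out := oldObj.clone()
		for k, v := range patch {
			out.Env[k] = v
		}
		return out, nil
	}
}

func main() {
	stored := &resource{ResourceVersion: "744603", Env: map[string]string{"DEMO_GREETING": "hi"}}

	// PATCH style: obj is nil, so the new object is derived from the stored one.
	patch := &updatedObjectInfo{transformers: []transformFunc{applyPatch(map[string]string{"DEMO_GREETING": "edited"})}}
	patched, _ := patch.updatedObject(stored)
	fmt.Println("patched version equals stored version:", patched.ResourceVersion == stored.ResourceVersion)

	// For contrast: an object supplied by the caller keeps the version the caller sent.
	body := &updatedObjectInfo{obj: &resource{ResourceVersion: "744000", Env: map[string]string{}}}
	replaced, _ := body.updatedObject(stored)
	fmt.Println("caller version equals stored version:", replaced.ResourceVersion == stored.ResourceVersion)
}
```

Because the patched object starts as a copy of the stored object, it inherits the stored resource version unless the patch itself changes it.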

The update inside the tryUpdate closure

// Update performs an atomic update and set of the object. Returns the result of the update
// or an error. If the registry allows create-on-update, the create flow will be executed.
// A bool is returned along with the object and any errors, to indicate object creation.
func (e *Store) Update(ctx context.Context, name string, objInfo rest.UpdatedObjectInfo, createValidation rest.ValidateObjectFunc, updateValidation rest.ValidateObjectUpdateFunc, forceAllowCreate bool, options *metav1.UpdateOptions) (runtime.Object, bool, error) {
    ......
    err = e.Storage.GuaranteedUpdate(ctx, key, out, true, storagePreconditions, func(existing runtime.Object, res storage.ResponseMeta) (runtime.Object, *uint64, error) {
        // Given the existing object, get the new object
        obj, err := objInfo.UpdatedObject(ctx, existing) // defaultUpdatedObjectInfo.UpdatedObject
        if err != nil {
            return nil, nil, err
        }

        // If AllowUnconditionalUpdate() is true and the object specified by
        // the user does not have a resource version, then we populate it with
        // the latest version. Else, we check that the version specified by
        // the user matches the version of latest storage object.
        resourceVersion, err := e.Storage.Versioner().ObjectResourceVersion(obj)
        if err != nil {
            return nil, nil, err
        }
        doUnconditionalUpdate := resourceVersion == 0 && e.UpdateStrategy.AllowUnconditionalUpdate()

        version, err := e.Storage.Versioner().ObjectResourceVersion(existing)
        if err != nil {
            return nil, nil, err
        }
        // version == 0 means that no resource object exists for this key
        if version == 0 {
            if !e.UpdateStrategy.AllowCreateOnUpdate() && !forceAllowCreate {
                return nil, nil, kubeerr.NewNotFound(qualifiedResource, name)
            }
            creating = true
            creatingObj = obj
            if err := rest.BeforeCreate(e.CreateStrategy, ctx, obj); err != nil {
                return nil, nil, err
            }
            // at this point we have a fully formed object.  It is time to call the validators that the apiserver
            // handling chain wants to enforce.
            if createValidation != nil {
                if err := createValidation(obj.DeepCopyObject()); err != nil {
                    return nil, nil, err
                }
            }
            ttl, err := e.calculateTTL(obj, 0, false)
            if err != nil {
                return nil, nil, err
            }

            return obj, &ttl, nil
        }

        creating = false
        creatingObj = nil
        if doUnconditionalUpdate {
            // Update the object's resource version to match the latest
            // storage object's resource version.
            // Unconditional update
            err = e.Storage.Versioner().UpdateObject(obj, res.ResourceVersion)
            if err != nil {
                return nil, nil, err
            }
        } else {
            // Check if the object's resource version matches the latest
            // resource version.
            if resourceVersion == 0 {
                // TODO: The Invalid error should have a field for Resource.
                // After that field is added, we should fill the Resource and
                // leave the Kind field empty. See the discussion in #18526.
                qualifiedKind := schema.GroupKind{Group: qualifiedResource.Group, Kind: qualifiedResource.Resource}
                fieldErrList := field.ErrorList{field.Invalid(field.NewPath("metadata").Child("resourceVersion"), resourceVersion, "must be specified for an update")}
                return nil, nil, kubeerr.NewInvalid(qualifiedKind, name, fieldErrList)
            }
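            // This is the key point: if the resource versions of the new object and the existing object differ, a concurrent update conflict has occurred.
            // In a PATCH request, resourceVersion always equals version!
            // Only a PUT request can produce resourceVersion != version.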
            if resourceVersion != version {
                return nil, nil, kubeerr.NewConflict(qualifiedResource, name, fmt.Errorf(OptimisticLockErrorMsg))
            }
        }
        if err := rest.BeforeUpdate(e.UpdateStrategy, ctx, obj, existing); err != nil {
            return nil, nil, err
        }
        // at this point we have a fully formed object.  It is time to call the validators that the apiserver
        // handling chain wants to enforce.
        if updateValidation != nil {
            if err := updateValidation(obj.DeepCopyObject(), existing.DeepCopyObject()); err != nil {
                return nil, nil, err
            }
        }
        // Check the default delete-during-update conditions, and store-specific conditions if provided
        if ShouldDeleteDuringUpdate(ctx, key, obj, existing) &&
            (e.ShouldDeleteDuringUpdate == nil || e.ShouldDeleteDuringUpdate(ctx, key, obj, existing)) {
            deleteObj = obj
            return nil, nil, errEmptiedFinalizers
        }
        ttl, err := e.calculateTTL(obj, res.TTL, true)
        if err != nil {
            return nil, nil, err
        }
        if int64(ttl) != res.TTL {
            return obj, &ttl, nil
        }
        return obj, nil, nil
    }, dryrun.IsDryRun(options.DryRun))
    ......
}

Persisting to Etcd

// GuaranteedUpdate implements storage.Interface.GuaranteedUpdate.
func (s *store) GuaranteedUpdate(
    ctx context.Context, key string, out runtime.Object, ignoreNotFound bool,
    preconditions *storage.Preconditions, tryUpdate storage.UpdateFunc, suggestion ...runtime.Object) error {
    trace := utiltrace.New(fmt.Sprintf("GuaranteedUpdate etcd3: %s", getTypeName(out)))
    defer trace.LogIfLong(500 * time.Millisecond)

    v, err := conversion.EnforcePtr(out) // out must be a non-nil pointer
    if err != nil {
        panic("unable to convert output object to pointer")
    }
    key = path.Join(s.pathPrefix, key)

    // Fetch the state of the object stored under key from Etcd
    getCurrentState := func() (*objState, error) {
        startTime := time.Now()
        getResp, err := s.client.KV.Get(ctx, key, s.getOps...)
        metrics.RecordEtcdRequestLatency("get", getTypeName(out), startTime)
        if err != nil {
            return nil, err
        }
        return s.getState(getResp, key, v, ignoreNotFound)
    }

    var origState *objState
    var mustCheckData bool // true means that origState may not be the latest state
    // If the caller supplied the object for key (obtained from k8s.io/apiserver/pkg/storage/cacher.Cacher.watchCache.GetByKey(key)),
    // there is no need to access Etcd
    if len(suggestion) == 1 && suggestion[0] != nil {
        span.LogFields(log.Object("suggestion[0]", suggestion[0]))
        origState, err = s.getStateFromObject(suggestion[0])
        span.LogFields(log.Object("origState", origState))
        if err != nil {
            return err
        }
        mustCheckData = true
    } else {
        origState, err = getCurrentState()
        if err != nil {
            return err
        }
    }
    trace.Step("initial value restored")

    transformContext := authenticatedDataString(key)
    for {
        // Check whether the UID and ResourceVersion of origState.obj match preconditions
        // Neither PATCH nor PUT requests are checked here!
        if err := preconditions.Check(key, origState.obj); err != nil {
            // If our data is already up to date, return the error
            if !mustCheckData {
                return err
            }

            // origState may not be the latest state, so fetch the latest state from Etcd
            // It's possible we were working with stale data
            // Actually fetch
            origState, err = getCurrentState()
            if err != nil {
                return err
            }
            mustCheckData = false
            // Retry
            continue
        }

        // Apply the modification on top of origState.obj
        ret, ttl, err := s.updateState(origState, tryUpdate)
        if err != nil {
            // origState may not be the latest state; in that case the latest state is fetched from Etcd and the update is tried once more
            // If our data is already up to date, return the error
            if !mustCheckData {
                return err
            }

            // It's possible we were working with stale data
            // Actually fetch
            origState, err = getCurrentState()
            if err != nil {
                return err
            }
            mustCheckData = false
            // Retry
            continue
        }

        data, err := runtime.Encode(s.codec, ret)
        if err != nil {
            return err
        }
        // Currently origState.stale is always false
        // Check whether the modified resource object actually changed
        if !origState.stale && bytes.Equal(data, origState.data) {
            // if we skipped the original Get in this loop, we must refresh from
            // etcd in order to be sure the data in the store is equivalent to
            // our desired serialization
            // mustCheckData == true means that origState may not be the latest state,
            // so it must be confirmed once more
            if mustCheckData {
                origState, err = getCurrentState()
                if err != nil {
                    return err
                }
                mustCheckData = false
                if !bytes.Equal(data, origState.data) {
                    // original data changed, restart loop
                    continue
                }
            }
            // recheck that the data from etcd is not stale before short-circuiting a write
            if !origState.stale {
                return decode(s.codec, s.versioner, origState.data, out, origState.rev)
            }
        }

        // Turn the serialized object into the binary data that is stored in Etcd
        newData, err := s.transformer.TransformToStorage(data, transformContext)
        if err != nil {
            return storage.NewInternalError(err.Error())
        }

        // Set the expiration time (TTL) of the key
        opts, err := s.ttlOpts(ctx, int64(ttl))
        if err != nil {
            return err
        }
        trace.Step("Transaction prepared")

        span.LogFields(log.Uint64("ttl", ttl), log.Int64("origState.rev", origState.rev))

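        // Etcd transaction
        // Note that this compares the revision of the resource object in Etcd with origState.rev
        // If they match, the update is written
        // Otherwise the update fails and the current latest resource object is fetched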
        startTime := time.Now()
        txnResp, err := s.client.KV.Txn(ctx).If(
            clientv3.Compare(clientv3.ModRevision(key), "=", origState.rev),
        ).Then(
            clientv3.OpPut(key, string(newData), opts...),
        ).Else(
            clientv3.OpGet(key),
        ).Commit()
        metrics.RecordEtcdRequestLatency("update", getTypeName(out), startTime)
        if err != nil {
            return err
        }
        trace.Step("Transaction committed")
        if !txnResp.Succeeded {
            // The transaction failed
            getResp := (*clientv3.GetResponse)(txnResp.Responses[0].GetResponseRange())
            klog.V(4).Infof("GuaranteedUpdate of %s failed because of a conflict, going to retry", key)
            // Take the latest state and retry
            origState, err = s.getState(getResp, key, v, ignoreNotFound)
            if err != nil {
                return err
            }
            trace.Step("Retry value restored")
            mustCheckData = false
            continue
        }
        // The transaction succeeded
        putResp := txnResp.Responses[0].GetResponsePut()

        return decode(s.codec, s.versioner, data, out, putResp.Header.Revision)
    }
}

Analysis summary

How do concurrent updates ensure that updates are not lost?

When the updated object is persisted to Etcd, an Etcd transaction guarantees that the write only succeeds if the stored object has not changed in the meantime. When the transaction fails, the update is retried against the latest state. The pseudo-code logic of the transaction is as follows:

// oldObj = FromMemCache(key) or EtcdGet(key)
if inEtcd(key).rev == inMemory(oldObj).rev:
    EtcdSet(key, newObj)
    transaction = success
else:
    oldObj = EtcdGet(key)  // latest state, used for the retry
    transaction = fail

How do concurrent updates detect conflicts?

The analysis above shows two places where a conflict can be detected:

  1. Preconditions
  2. resourceVersion != version in tryUpdate

For kubectl apply and edit (both send PATCH requests), the Preconditions that are created are nil, so no conflict detection happens through Preconditions. In addition, the newObj.rv obtained by calling objInfo.UpdatedObject(ctx, existing) in tryUpdate is always equal to existing.rv, so no conflict detection happens there either.

When does conflict detection take place? kubectl also has a replace command. Packet capture shows that replace sends a PUT request, and that the request carries a resourceVersion:

PUT /apis/extensions/v1beta1/namespaces/default/deployments/nginx HTTP/1.1
Host: localhost:8080
User-Agent: kubectl/v0.0.0 (linux/amd64) kubernetes/$Format
Content-Length: 866
Accept: application/json
Content-Type: application/json
Uber-Trace-Id: 6e685772cc06fc16:2514dc54a474fe88:4f488c05a7cef9c8:1
Accept-Encoding: gzip

{"apiVersion":"extensions/v1beta1","kind":"Deployment","metadata":{"labels":{"app":"nginx"},"name":"nginx","namespace":"default","resourceVersion":"744603"},"spec":{"progressDeadlineSeconds":600,"replicas":1,"revisionHistoryLimit":10,"selector":{"matchLabels":{"app":"nginx"}},"strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"},"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"nginx"}},"spec":{"containers":[{"env":[{"name":"DEMO_GREETING","value":"Hello from the environment#kubectl replace"}],"image":"nginx","imagePullPolicy":"IfNotPresent","name":"nginx","resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"terminationGracePeriodSeconds":30}}}}

ApiServer processes PUT requests much like PATCH requests. Both call k8s.io/apiserver/pkg/registry/generic/registry.(*Store).Update, and the rest.UpdatedObjectInfo that is created is rest.DefaultUpdatedObjectInfo(obj, transformers...). Note that here the obj argument is the object decoded from the request body, not nil.

In PUT request processing, the Preconditions are also empty (the request body above carries no UID), so no Preconditions check takes place. In tryUpdate, however, the resourceVersion returned by e.Storage.Versioner().ObjectResourceVersion(obj) is the value from the request body, rather than the one inherited from existing as in a PATCH request (see the defaultUpdatedObjectInfo.UpdatedObject method).

Therefore, in PUT request processing, the resourceVersion != version comparison in tryUpdate is what detects whether a concurrent write conflict has occurred.
