Kubernetes

Kubernetes is an open-source platform designed to automate deployment, scaling, and operation of application containers across multiple hosts and/or clouds.
Quan Huynh

1656644125

Books you should read to become a DevOps engineer (for beginners)

Hi guys, in this post I'm going to introduce the books that helped me a lot on the path from Full-stack Developer to Cloud DevOps Engineer. I hope it is useful for everyone; the books I recommend go from basic to advanced.

#devops #aws #kubernetes 

https://medium.com/codex/side-story-books-you-should-read-to-become-devops-for-beginner-b75384ef6774

Books you should read to become a DevOps engineer (for beginners)
Quan Huynh

1656608474

How Kubernetes works with Container Runtime

In the previous post, we learned about the Container Runtime. In this post, we are going to learn about a rather exciting topic: how Kubernetes uses the Container Runtime and the types of Container Runtimes that Kubernetes supports.

#devops #kubernetes #docker #containers 

https://medium.com/@hmquan08011996/kubernetes-story-how-kubernetes-works-with-container-runtime-ce618a306f64

How Kubernetes works with Container Runtime
Quan Huynh

1656585844

Linux Namespaces and Cgroups: What are containers made from?

If we do DevOps, we are probably familiar with Kubernetes, Docker, and containers. But have we ever wondered what Docker actually is? What are containers? Is Docker a container? Docker is not a container, and I will explain what it is in this post.

https://medium.com/@hmquan08011996/kubernetes-story-linux-namespaces-and-cgroups-what-are-containers-made-from-d544ac9bd622

 #devops #containers #kubernetes 

Linux Namespaces and Cgroups: What are containers made from?
Coding Life

1656475393

Deploy Docker Image to Kubernetes using Jenkins | CI CD Pipeline Using Jenkins

This tutorial will help you understand a complete DevOps end-to-end integration: we will build a Docker image using a Jenkins pipeline, push the Docker image to Docker Hub, and then deploy it to a Kubernetes cluster, step by step, using the Jenkins pipeline.

GitHub: https://github.com/Java-Techie-jt/devops-automation 

Subscribe: https://www.youtube.com/c/JavaTechie/featured 

#docker #jenkins #kubernetes 

Deploy Docker Image to Kubernetes using Jenkins | CI CD Pipeline Using Jenkins

Efficient Selenium Protocol Implementation: Run Everything in Kubernetes

Moon

Moon is a commercial closed-source enterprise Selenium implementation using Kubernetes to launch browsers.

Moon Animation

Pricing Model

  • The only limitation that determines the final Moon price is the total number of browser pods running in parallel.
  • You can run up to 4 (four) parallel pods for free. Everything on top of the free limit is paid as a subscription.
  • Detailed pricing information is available in the respective documentation section.
  • To obtain a free evaluation license key, email sales@aerokube.com.

Free Support

Features

The main idea behind Moon is to be easily installable and require zero maintenance.

One-command Installation

With a running Kubernetes cluster and kubectl pointing to it, you can launch a free Moon cluster with this one-liner:

$ git clone https://github.com/aerokube/moon-deploy.git && cd moon-deploy && kubectl apply -f moon.yaml

To obtain the Moon load balancer IP address, use the following command:

$ kubectl get svc -n moon
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                         AGE
moon      LoadBalancer   10.63.242.109   104.154.161.58   4444:31894/TCP,8080:30625/TCP   1m

Now use the following Selenium URL in your code:

http://104.154.161.58:4444/wd/hub
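For example, a minimal Python sketch (an illustration only; it assumes the Selenium 4 Python bindings and reuses the example load balancer IP shown above, which you should replace with your own):

# Point a Selenium Remote WebDriver at the Moon hub URL.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

driver = webdriver.Remote(
    command_executor="http://104.154.161.58:4444/wd/hub",
    options=Options(),  # request a Chrome browser pod
)
try:
    driver.get("https://example.com")
    print(driver.title)
finally:
    driver.quit()  # releases the browser pod in the cluster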

We also provide a detailed installation video.

Automatic Browser Management

Browsers

  • We maintain ready-to-use images for Firefox, Chrome, Opera and Android.
  • New browser versions are automatically accessible right after release.

Scalability and Fault Tolerance

  • Your cluster size is automatically determined by Kubernetes depending on the load.
  • Moon is completely stateless and allows you to run an unlimited number of replicas behind a load balancer.
  • No additional configuration is required to add a new Moon replica.
  • User requests are not lost even in case of an accidental crash or downtime of the replica.

Efficient and Lightning Fast

  • A completely new Selenium protocol implementation written in lightning-fast Golang.
  • One Moon replica consumes at most 0.5 CPU and 512 MB RAM.
  • One Moon replica is able to work with thousands of running sessions.
  • No Selenium Grid components used.

Logs and Video

  • You can access the live browser screen and logs for debugging purposes during a test run.
  • Any browser session can be saved to a video file using the desired codec, frame rate and screen size.
  • Logs and video files can be automatically uploaded to S3-compatible storage.

Complete Guide

The complete reference guide can be found at: http://aerokube.com/moon/latest/

Author: Aerokube
Source Code: https://github.com/aerokube/moon 
License: Apache-2.0 license

#node #selenium #kubernetes #openshift #playwright 

Efficient Selenium Protocol Implementation: Run Everything in Kubernetes
Brooke Giles

1655969735

Build Modern Cloud Native Applications using Spring and Kubernetes

Cloud Native with Spring Boot & Kubernetes

Thomas is a senior software engineer who specializes in building modern, cloud native, robust, and secure enterprise applications, and the author of Cloud Native Spring in Action, published by Manning.

The Spring ecosystem provides you with all you need to build cloud native applications, focusing on productivity, simplicity and speed. It’s ready to leverage cloud environment features and integrates with Kubernetes natively.

In this session, Thomas will cover common patterns and best practices to build cloud native applications using Reactive Spring, which provides better performance, resilience, and efficiency. You’ll then see how to containerize them, configure them through the natively supported ConfigMaps and Secrets, and deploy them to a Kubernetes cluster. Finally, he’ll show how to use Spring Native to build GraalVM native executables.

TIMECODES
00:00 Intro
00:53 Cloud native
03:00 Demo
15:03 Cloud native development
16:08 Containerization
19:50 Demo
23:04 Spring Boot on Kubernetes
23:59 Demo
29:07 Externalized configuration
31:24 Demo
36:21 Health probes
37:54 Demo
46:00 Spring Native
49:51 Demo
54:12 Conclusion
54:50 Outro

#springboot #kubernetes #k8s #cloudnative #cloud #microservices 

Build Modern Cloud Native Applications using Spring and Kubernetes
Oral Brekke

1655954820

Kubectl Completion for Fish Shell

kubectl completion for fish shell

Install

$ mkdir -p ~/.config/fish/completions
$ cd ~/.config/fish
$ git clone https://github.com/evanlucas/fish-kubectl-completions
$ ln -s ../fish-kubectl-completions/completions/kubectl.fish completions/

Install using Fisher

fisher install evanlucas/fish-kubectl-completions

Building

This was tested using go 1.15.7 on macOS 11.1 "Big Sur".

$ make build

Environment Variables

FISH_KUBECTL_COMPLETION_TIMEOUT

This is used to pass the --request-timeout flag to the kubectl command. It defaults to 5s.

Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means requests never time out.

FISH_KUBECTL_COMPLETION_COMPLETE_CRDS

This can be used to prevent completion of CRDs (Custom Resource Definitions), since some users may have limited access to those resources. It defaults to 1. To disable CRD completion, set it to anything other than 1.

Author: Evanlucas
Source Code: https://github.com/evanlucas/fish-kubectl-completions 
License: MIT license

#node #kubernetes #shell 

Kubectl Completion for Fish Shell

Spark on Kubernetes with Azure Resources

Azure - Spark on Kubernetes

Objectives

  • Create infrastructure as code
  • Using a Kubernetes cluster on Azure:
    • Ingest the Enade 2017 data into the Azure data lake with Python
    • Transform the data from the bronze layer to the silver layer using the Delta format
    • Enrich the data from the silver layer to the gold layer using the Delta format
  • Use Azure Synapse Serverless SQL Pool to serve the data

Architecture

(architecture diagram)

Steps

Create the infrastructure

source infra/00-variables

bash infra/01-create-rg.sh

bash infra/02-create-cluster-k8s.sh

bash infra/03-create-lake.sh

bash infra/04-create-synapse.sh

bash infra/05-access-assignments.sh

Prepare k8s

Download the kubeconfig file

bash infra/02-get-kubeconfig.sh

To simplify the commands, use an alias

alias k=kubectl

Create the namespaces

k create namespace processing
k create namespace ingestion

Create the Service Account and Role Binding

k apply -f k8s/crb-spark.yaml

Create the secrets

# The .env file is created by the infra/05-access-assignments.sh script during infrastructure creation
k create secret generic azure-service-account --from-env-file=.env --namespace processing
k create secret generic azure-service-account --from-env-file=.env --namespace ingestion

Install the Spark Operator

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator

helm repo update

helm install spark spark-operator/spark-operator --set image.tag=v1beta2-1.2.3-3.1.1 --namespace processing

Ingestion app

Ingestion Image

docker build ingestion -f ingestion/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4-ingestion --network=host

docker push otaciliopsf/cde-bootcamp:desafio-mod4-ingestion

ConfigMap

k create configmap lake-config \
    --from-literal=storage_account_name=$STG_ACC_NAME \
    --from-literal=file_system_name=$LAKE_NAME \
    --namespace ingestion

Apply ingestion job

k replace --force -f k8s/ingestion-job.yaml

Logs

ING_POD_NAME=$(k get pods --selector=job-name=ingestion-job --output=jsonpath='{.items[*].metadata.name}' -n ingestion)

k logs $ING_POD_NAME -n ingestion --follow

Spark

Create the job image

docker build spark -f spark/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4

docker push otaciliopsf/cde-bootcamp:desafio-mod4

ConfigMap

# The Spark Operator needs a few extra steps to accept a ConfigMap, so we pass the values as a secret instead
k create secret generic lake-config \
    --from-literal=storage_account_name=$STG_ACC_NAME \
    --from-literal=file_system_name=$LAKE_NAME \
    --namespace processing

Apply job

k replace --force -f k8s/spark-job.yaml

logs

k logs spark-job-igti-desafio-driver -n processing --follow

Azure Synapse Serverless SQL Pool

Access the Synapse workspace through the generated link

bash infra/04-get-workspace-url.sh

To start using it, follow the steps:

(steps-synapse screenshot)

Run the contents of the create-synapse-view.sql script in the Synapse workspace to create the view of the table in the lake

Done: Synapse is now ready to receive queries.

Cleaning up the resources

bash infra/99-delete-service-principal.sh

bash infra/99-delete-rg.sh

Conclusion

By following the steps above, it is possible to run queries directly against the gold layer of the delta lake using Synapse.

Author: otacilio-psf
Official Website: https://github.com/otacilio-psf/azure-spark-on-kubernetes 
#python #kubernetes #spark 

vFlow: Enterprise Network Flow Collector (IPFIX, SFlow, Netflow)

High-performance, scalable and reliable IPFIX, sFlow and Netflow collector (written in pure Golang).

Features

  • IPFIX RFC7011 collector
  • sFLow v5 raw header / counters collector
  • Netflow v5 collector
  • Netflow v9 collector
  • Decoding sFlow raw header L2/L3/L4
  • Produce to Apache Kafka, NSQ, NATS
  • Replicate IPFIX and sFlow to 3rd party collector
  • Supports IPv4 and IPv6
  • Prometheus and RESTful APIs monitoring

Decoded IPFIX data

The IPFIX data is decoded to JSON format, and the IDs are IANA IPFIX element IDs:

{"AgentID":"192.168.21.15","Header":{"Version":10,"Length":420,"ExportTime":1483484642,"SequenceNo":1434533677,"DomainID":32771},"DataSets":[[{"I":8,"V":"192.16.28.217"},{"I":12,"V":"180.10.210.240"},{"I":5,"V":2},{"I":4,"V":6},{"I":7,"V":443},{"I":11,"V":64381},{"I":32,"V":0},{"I":10,"V":811},{"I":58,"V":0},{"I":9,"V":24},{"I":13,"V":20},{"I":16,"V":4200000000},{"I":17,"V":27747},{"I":15,"V":"180.105.10.210"},{"I":6,"V":"0x10"},{"I":14,"V":1113},{"I":1,"V":22500},{"I":2,"V":15},{"I":52,"V":63},{"I":53,"V":63},{"I":152,"V":1483484581770},{"I":153,"V":1483484622384},{"I":136,"V":2},{"I":243,"V":0},{"I":245,"V":0}]]}

Decoded sFlow data

{"Version":5,"IPVersion":1,"AgentSubID":5,"SequenceNo":37591,"SysUpTime":3287084017,"SamplesNo":1,"Samples":[{"SequenceNo":1530345639,"SourceID":0,"SamplingRate":4096,"SamplePool":1938456576,"Drops":0,"Input":536,"Output":728,"RecordsNo":3,"Records":{"ExtRouter":{"NextHop":"115.131.251.90","SrcMask":24,"DstMask":14},"ExtSwitch":{"SrcVlan":0,"SrcPriority":0,"DstVlan":0,"DstPriority":0},"RawHeader":{"L2":{"SrcMAC":"58:00:bb:e7:57:6f","DstMAC":"f4:a7:39:44:a8:27","Vlan":0,"EtherType":2048},"L3":{"Version":4,"TOS":0,"TotalLen":1452,"ID":13515,"Flags":0,"FragOff":0,"TTL":62,"Protocol":6,"Checksum":8564,"Src":"10.1.8.5","Dst":"161.140.24.181"},"L4":{"SrcPort":443,"DstPort":56521,"DataOffset":5,"Reserved":0,"Flags":16}}}}],"IPAddress":"192.168.10.0","ColTime": 1646157296}

Decoded Netflow v5 data

{"AgentID":"114.23.3.231","Header":{"Version":5,"Count":3,"SysUpTimeMSecs":51469784,"UNIXSecs":1544476581,"UNIXNSecs":0,"SeqNum":873873830,"EngType":0,"EngID":0,"SmpInt":1000},"Flows":[{"SrcAddr":"125.238.46.48","DstAddr":"114.23.236.96","NextHop":"114.23.3.231","Input":791,"Output":817,"PktCount":4,"L3Octets":1708,"StartTime":51402145,"EndTime":51433264,"SrcPort":49233,"DstPort":443,"Padding1":0,"TCPFlags":16,"ProtType":6,"Tos":0,"SrcAsNum":4771,"DstAsNum":56030,"SrcMask":20,"DstMask":22,"Padding2":0},{"SrcAddr":"125.238.46.48","DstAddr":"114.23.236.96","NextHop":"114.23.3.231","Input":791,"Output":817,"PktCount":1,"L3Octets":441,"StartTime":51425137,"EndTime":51425137,"SrcPort":49233,"DstPort":443,"Padding1":0,"TCPFlags":24,"ProtType":6,"Tos":0,"SrcAsNum":4771,"DstAsNum":56030,"SrcMask":20,"DstMask":22,"Padding2":0},{"SrcAddr":"210.5.53.48","DstAddr":"103.22.200.210","NextHop":"122.56.118.157","Input":564,"Output":802,"PktCount":1,"L3Octets":1500,"StartTime":51420072,"EndTime":51420072,"SrcPort":80,"DstPort":56108,"Padding1":0,"TCPFlags":16,"ProtType":6,"Tos":0,"SrcAsNum":56030,"DstAsNum":13335,"SrcMask":24,"DstMask":23,"Padding2":0}]}

Decoded Netflow v9 data

{"AgentID":"10.81.70.56","Header":{"Version":9,"Count":1,"SysUpTime":357280,"UNIXSecs":1493918653,"SeqNum":14,"SrcID":87},"DataSets":[[{"I":1,"V":"0x00000050"},{"I":2,"V":"0x00000002"},{"I":4,"V":2},{"I":5,"V":192},{"I":6,"V":"0x00"},{"I":7,"V":0},{"I":8,"V":"10.81.70.56"},{"I":9,"V":0},{"I":10,"V":0},{"I":11,"V":0},{"I":12,"V":"224.0.0.22"},{"I":13,"V":0},{"I":14,"V":0},{"I":15,"V":"0.0.0.0"},{"I":16,"V":0},{"I":17,"V":0},{"I":21,"V":300044},{"I":22,"V":299144}]]}

Supported platform

  • Linux
  • Windows

Build

Given that the Go Language compiler (version 1.14.x preferred) is installed, you can build it with:

go get github.com/EdgeCast/vflow/vflow
cd $GOPATH/src/github.com/EdgeCast/vflow

make build
or
cd vflow; go build 

Installation

You can download and install the pre-built Debian package as shown below (an RPM and a Linux binary are also available).

dpkg -i vflow-0.9.0-x86_64.deb

Once installed, you need to configure the files below; for more information, check the configuration guide:

/etc/vflow/vflow.conf
/etc/vflow/mq.conf

You can start the service with:

service vflow start

Kubernetes

kubectl apply -f https://github.com/EdgeCast/vflow/blob/master/kubernetes/deploy.yaml

Docker

docker run -d -p 2181:2181 -p 9092:9092 spotify/kafka
docker run -d -p 4739:4739 -p 4729:4729 -p 6343:6343 -p 8081:8081 -e VFLOW_KAFKA_BROKERS="172.17.0.1:9092" mehrdadrad/vflow

Documentation

Contribute

Any kind of contribution is welcome; please follow these steps:

  • Fork the project on github.com.
  • Create a new branch.
  • Commit changes to the new branch.
  • Send a pull request.

Author: EdgeCast
Source Code: https://github.com/EdgeCast/vflow 
License: Apache-2.0 license

#go #golang #kubernetes #kafka 

vFlow: Enterprise Network Flow Collector (IPFIX, SFlow, Netflow)

Modern TCP tool & Service for Network Performance Observability

TCPProbe is a modern TCP tool and service for network performance observability. It exposes information about the socket's underlying TCP session, TLS, and HTTP (more than 60 metrics). You can run it through the command line or as a service. The request is highly customizable, and you can integrate it with your application through gRPC. It runs in a Kubernetes cluster as a cloud native application, and adding annotations on pods allows fine control of the probing process.

Features

  • TCP socket statistics
  • TCP/IP request customization
  • Prometheus exporter
  • Probing multiple hosts
  • Runs as service
  • Kubernetes native
  • gRPC interface

Command line (download Linux binary)

tcpprobe -json https://www.google.com
{"Target":"https://www.google.com","IP":"142.250.72.196","Timestamp":1607567390,"Seq":0,"State":1,"CaState":0,"Retransmits":0,"Probes":0,"Backoff":0,"Options":7,"Rto":204000,"Ato":40000,"SndMss":1418,"RcvMss":1418,"Unacked":0,"Sacked":0,"Lost":0,"Retrans":0,"Fackets":0,"LastDataSent":56,"LastAckSent":0,"LastDataRecv":0,"LastAckRecv":0,"Pmtu":9001,"RcvSsthresh":56587,"Rtt":1365,"Rttvar":446,"SndSsthresh":2147483647,"SndCwnd":10,"Advmss":8949,"Reordering":3,"RcvRtt":0,"RcvSpace":62727,"TotalRetrans":0,"PacingRate":20765147,"BytesAcked":448,"BytesReceived":10332,"SegsOut":10,"SegsIn":11,"NotsentBytes":0,"MinRtt":1305,"DataSegsIn":8,"DataSegsOut":3,"DeliveryRate":1785894,"BusyTime":4000,"RwndLimited":0,"SndbufLimited":0,"Delivered":4,"DeliveredCe":0,"BytesSent":447,"BytesRetrans":0,"DsackDups":0,"ReordSeen":0,"RcvOoopack":0,"SndWnd":66816,"TCPCongesAlg":"cubic","HTTPStatusCode":200,"HTTPRcvdBytes":14683,"HTTPRequest":113038,"HTTPResponse":293,"DNSResolve":2318,"TCPConnect":1421,"TLSHandshake":57036,"TCPConnectError":0,"DNSResolveError":0}

Docker

docker run --rm mehrdadrad/tcpprobe smtp.gmail.com:587

Docker Compose

TCPProbe and Prometheus

docker-compose up -d

Open your browser and try http://localhost:9090. You can edit the docker-compose.yml to customize the options and target(s).

Helm Chart

Detailed installation instructions for TCPProbe on Kubernetes are found here.

helm install tcpprobe tcpprobe

Documentation

Contribute

Any kind of contribution is welcome; please follow these steps:

  • Fork the project on github.com.
  • Create a new branch.
  • Commit changes to the new branch.
  • Send a pull request.

Author: Mehrdadrad
Source Code: https://github.com/mehrdadrad/tcpprobe 
License: MIT license

#go #golang #docker #kubernetes #http 

Modern TCP tool & Service for Network Performance Observability

A Fast Distributed Storage System for Blobs, for Billions of Files!

SeaweedFS


Quick Start for S3 API on Docker

docker run -p 8333:8333 chrislusf/seaweedfs server -s3
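As an illustration, a minimal boto3 sketch talking to that S3 gateway; the endpoint port comes from the command above, while the bucket name and the dummy credentials are placeholders (this quick-start container has no authentication configured):

import boto3

# Point boto3 at the local SeaweedFS S3 gateway started above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8333",
    aws_access_key_id="any",        # placeholder credentials
    aws_secret_access_key="any",
)

s3.create_bucket(Bucket="test-bucket")
s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"Hello SeaweedFS")
print(s3.get_object(Bucket="test-bucket", Key="hello.txt")["Body"].read())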

Quick Start with Single Binary

  • Download the latest binary from https://github.com/chrislusf/seaweedfs/releases and unzip the single binary file weed (or weed.exe on Windows).
  • Run weed server -dir=/some/data/dir -s3 to start one master, one volume server, one filer, and one S3 gateway.

Also, to increase capacity, just add more volume servers by running weed volume -dir="/some/data/dir2" -mserver="<master_host>:9333" -port=8081 locally, or on a different machine, or on thousands of machines. That is it!

Quick Start SeaweedFS S3 on AWS

Introduction

SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:

  1. to store billions of files!
  2. to serve the files fast!

SeaweedFS started as an Object Store to handle small files efficiently. Instead of managing all file metadata in a central master, the central master only manages volumes on volume servers, and these volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers, allowing faster file access (O(1), usually just one disk read operation).

There are only 40 bytes of disk storage overhead for each file's metadata. It is so simple with O(1) disk reads that you are welcome to challenge the performance with your actual use cases.

SeaweedFS started by implementing Facebook's Haystack design paper. Also, SeaweedFS implements erasure coding with ideas from f4: Facebook’s Warm BLOB Storage System, and has a lot of similarities with Facebook’s Tectonic Filesystem

On top of the object store, optional Filer can support directories and POSIX attributes. Filer is a separate linearly-scalable stateless server with customizable metadata stores, e.g., MySql, Postgres, Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB, Sqlite, MemSql, TiDB, Etcd, CockroachDB, YDB, etc.

For any distributed key-value store, large values can be offloaded to SeaweedFS. With its fast access speed and linearly scalable capacity, SeaweedFS can work as a distributed Key-Large-Value store.

SeaweedFS can transparently integrate with the cloud. With hot data on local cluster, and warm data on the cloud with O(1) access time, SeaweedFS can achieve both fast local access time and elastic cloud storage capacity. What's more, the cloud storage access API cost is minimized. Faster and Cheaper than direct cloud storage!

Additional Features

  • Can choose no replication or different replication levels, rack and data center aware.
  • Automatic master servers failover - no single point of failure (SPOF).
  • Automatic Gzip compression depending on file MIME type.
  • Automatic compaction to reclaim disk space after deletion or update.
  • Automatic entry TTL expiration.
  • Any server with some free disk space can add to the total storage space.
  • Adding/Removing servers does not cause any data re-balancing unless triggered by admin commands.
  • Optional picture resizing.
  • Support ETag, Accept-Range, Last-Modified, etc.
  • Support in-memory/leveldb/readonly mode tuning for memory/performance balance.
  • Support rebalancing the writable and readonly volumes.
  • Customizable Multiple Storage Tiers: Customizable storage disk types to balance performance and cost.
  • Transparent cloud integration: unlimited capacity via tiered cloud storage for warm data.
  • Erasure coding for warm storage: rack-aware 10.4 erasure coding reduces storage cost and increases availability.

Back to TOC

Filer Features

Kubernetes

Example: Using Seaweed Object Store

By default, the master node runs on port 9333, and the volume nodes run on port 8080. Let's start one master node, and two volume nodes on port 8080 and 8081. Ideally, they should be started from different machines. We'll use localhost as an example.

SeaweedFS uses HTTP REST operations to read, write, and delete. The responses are in JSON or JSONP format.

Start Master Server

> ./weed master

Start Volume Servers

> weed volume -dir="/tmp/data1" -max=5  -mserver="localhost:9333" -port=8080 &
> weed volume -dir="/tmp/data2" -max=10 -mserver="localhost:9333" -port=8081 &

Write File

To upload a file: first, send a HTTP POST, PUT, or GET request to /dir/assign to get an fid and a volume server URL:

> curl http://localhost:9333/dir/assign
{"count":1,"fid":"3,01637037d6","url":"127.0.0.1:8080","publicUrl":"localhost:8080"}

Second, to store the file content, send a HTTP multi-part POST request to url + '/' + fid from the response:

> curl -F file=@/home/chris/myphoto.jpg http://127.0.0.1:8080/3,01637037d6
{"name":"myphoto.jpg","size":43234,"eTag":"1cc0118e"}

To update, send another POST request with updated file content.

For deletion, send an HTTP DELETE request to the same url + '/' + fid URL:

> curl -X DELETE http://127.0.0.1:8080/3,01637037d6
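The same assign/upload/delete flow, sketched in Python with the requests library (assuming the local master and volume server from the commands above; the file path is just the example one):

import requests

# 1. Ask the master for a file id and a volume server URL.
assign = requests.get("http://localhost:9333/dir/assign").json()
fid, volume_url = assign["fid"], assign["url"]

# 2. Upload the file content with a multi-part POST to the assigned volume server.
upload_url = f"http://{volume_url}/{fid}"
with open("/home/chris/myphoto.jpg", "rb") as f:
    print(requests.post(upload_url, files={"file": f}).json())

# 3. Delete it again with an HTTP DELETE to the same URL.
requests.delete(upload_url)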

Save File Id

Now, you can save the fid, 3,01637037d6 in this case, to a database field.

The number 3 at the start represents a volume id. After the comma, it's one file key, 01, and a file cookie, 637037d6.

The volume id is an unsigned 32-bit integer. The file key is an unsigned 64-bit integer. The file cookie is an unsigned 32-bit integer, used to prevent URL guessing.

The file key and file cookie are both coded in hex. You can store the <volume id, file key, file cookie> tuple in your own format, or simply store the fid as a string.

If stored as a string, in theory, you would need 8+1+16+8=33 bytes. A char(33) would be enough, if not more than enough, since most uses will not need 2^32 volumes.

If space is really a concern, you can store the file id in your own format. You would need one 4-byte integer for volume id, 8-byte long number for file key, and a 4-byte integer for the file cookie. So 16 bytes are more than enough.
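For illustration, a small Python sketch of packing and unpacking a file id along the lines described above (volume id before the comma; after it, the trailing 8 hex characters are the cookie and the rest is the file key); the exact encoding you store is up to you:

def parse_fid(fid):
    """Split a fid such as "3,01637037d6" into (volume id, file key, file cookie)."""
    volume_part, rest = fid.split(",")
    key_hex, cookie_hex = rest[:-8], rest[-8:]   # cookie is the trailing 32 bits
    return int(volume_part), int(key_hex, 16), int(cookie_hex, 16)

def format_fid(volume_id, file_key, cookie):
    """Rebuild the string form of the file id."""
    return f"{volume_id},{file_key:x}{cookie:08x}"

print(parse_fid("3,01637037d6"))      # (3, 1, 1668298710)
print(format_fid(3, 1, 0x637037d6))   # 3,1637037d6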

Read File

Here is an example of how to render the URL.

First look up the volume server's URLs by the file's volumeId:

> curl http://localhost:9333/dir/lookup?volumeId=3
{"volumeId":"3","locations":[{"publicUrl":"localhost:8080","url":"localhost:8080"}]}

Since (usually) there are not too many volume servers, and volumes don't move often, you can cache the results most of the time. Depending on the replication type, one volume can have multiple replica locations. Just randomly pick one location to read.

Now you can take the public URL, render the URL or directly read from the volume server via URL:

 http://localhost:8080/3,01637037d6.jpg

Notice we add a file extension ".jpg" here. It's optional and just one way for the client to specify the file content type.

If you want a nicer URL, you can use one of these alternative URL formats:

 http://localhost:8080/3/01637037d6/my_preferred_name.jpg
 http://localhost:8080/3/01637037d6.jpg
 http://localhost:8080/3,01637037d6.jpg
 http://localhost:8080/3/01637037d6
 http://localhost:8080/3,01637037d6

If you want to get a scaled version of an image, you can add some params:

http://localhost:8080/3/01637037d6.jpg?height=200&width=200
http://localhost:8080/3/01637037d6.jpg?height=200&width=200&mode=fit
http://localhost:8080/3/01637037d6.jpg?height=200&width=200&mode=fill
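Putting the lookup and read steps together, a rough Python sketch with a simple in-process cache of volume locations (per the caching advice above; the file id is the one from the earlier example):

import random
import requests

location_cache = {}  # volume id -> list of locations; volumes rarely move, so cache them

def volume_locations(volume_id):
    if volume_id not in location_cache:
        resp = requests.get(
            "http://localhost:9333/dir/lookup", params={"volumeId": volume_id}
        ).json()
        location_cache[volume_id] = resp["locations"]
    return location_cache[volume_id]

def read_file(fid):
    volume_id = fid.split(",")[0]
    location = random.choice(volume_locations(volume_id))  # any replica will do
    return requests.get(f"http://{location['publicUrl']}/{fid}").content

print(len(read_file("3,01637037d6")))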

Rack-Aware and Data Center-Aware Replication

SeaweedFS applies the replication strategy at a volume level. So, when you are getting a file id, you can specify the replication strategy. For example:

curl http://localhost:9333/dir/assign?replication=001

The replication parameter options are:

000: no replication
001: replicate once on the same rack
010: replicate once on a different rack, but same data center
100: replicate once on a different data center
200: replicate twice on two different data centers
110: replicate once on a different rack, and once on a different data center

More details about replication can be found on the wiki.

You can also set the default replication strategy when starting the master server.

Allocate File Key on Specific Data Center

Volume servers can be started with a specific data center name:

 weed volume -dir=/tmp/1 -port=8080 -dataCenter=dc1
 weed volume -dir=/tmp/2 -port=8081 -dataCenter=dc2

When requesting a file key, an optional "dataCenter" parameter can limit the assigned volume to the specific data center. For example, this specifies that the assigned volume should be limited to 'dc1':

 http://localhost:9333/dir/assign?dataCenter=dc1

Other Features

Back to TOC

Object Store Architecture

Usually, distributed file systems split each file into chunks; a central master keeps a mapping from filenames and chunk indices to chunk handles, and also tracks which chunks each chunk server has.

The main drawback is that the central master can't handle many small files efficiently, and since all read requests need to go through the chunk master, it might not scale well for many concurrent users.

Instead of managing chunks, SeaweedFS manages data volumes in the master server. Each data volume is 32GB in size, and can hold a lot of files. And each storage node can have many data volumes. So the master node only needs to store the metadata about the volumes, which is a fairly small amount of data and is generally stable.

The actual file metadata is stored in each volume on volume servers. Since each volume server only manages metadata of files on its own disk, with only 16 bytes for each file, all file access can read file metadata just from memory and only needs one disk operation to actually read file data.

For comparison, consider that an xfs inode structure in Linux is 536 bytes.

Master Server and Volume Server

The architecture is fairly simple. The actual data is stored in volumes on storage nodes. One volume server can have multiple volumes, and can both support read and write access with basic authentication.

All volumes are managed by a master server. The master server contains the volume id to volume server mapping. This is fairly static information, and can be easily cached.

On each write request, the master server also generates a file key, which is a growing 64-bit unsigned integer. Since write requests are not generally as frequent as read requests, one master server should be able to handle the concurrency well.

Write and Read files

When a client sends a write request, the master server returns (volume id, file key, file cookie, volume node URL) for the file. The client then contacts the volume node and POSTs the file content.

When a client needs to read a file based on (volume id, file key, file cookie), it asks the master server by the volume id for the (volume node URL, volume node public URL), or retrieves this from a cache. Then the client can GET the content, or just render the URL on web pages and let browsers fetch the content.

Please see the example for details on the write-read process.

Storage Size

In the current implementation, each volume can hold 32 gibibytes (32GiB or 8x2^32 bytes). This is because we align content to 8 bytes. We can easily increase this to 64GiB, or 128GiB, or more, by changing 2 lines of code, at the cost of some wasted padding space due to alignment.

There can be 2^32 (about 4 billion) volumes. So the total system size is 32GiB x 2^32 volumes, which is 128 exbibytes (128EiB or 2^67 bytes).

Each individual file size is limited to the volume size.

Saving memory

All file meta information stored on a volume server is readable from memory without disk access. Each file takes just a 16-byte map entry of <64bit key, 32bit offset, 32bit size>. Of course, each map entry has its own space cost for the map. But usually the disk space runs out before the memory does.

Tiered Storage to the cloud

The local volume servers are much faster, while cloud storages have elastic capacity and are actually more cost-efficient if not accessed often (usually free to upload, but relatively costly to access). With the append-only structure and O(1) access time, SeaweedFS can take advantage of both local and cloud storage by offloading the warm data to the cloud.

Usually hot data are fresh and warm data are old. SeaweedFS puts the newly created volumes on local servers, and optionally uploads the older volumes to the cloud. If the older data are accessed less often, this literally gives you unlimited capacity with limited local servers, while staying fast for new data.

With the O(1) access time, the network latency cost is kept to a minimum.

If the hot/warm data is split as 20/80, with 20 servers, you can achieve storage capacity of 100 servers. That's a cost saving of 80%! Or you can repurpose the 80 servers to store new data also, and get 5X storage throughput.

Back to TOC

Compared to Other File Systems

Most other distributed file systems seem more complicated than necessary.

SeaweedFS is meant to be fast and simple, in both setup and operation. If you do not understand how it works when you reach here, we've failed! Please raise an issue with any questions or update this file with clarifications.

SeaweedFS is constantly moving forward. Same with other systems. These comparisons can be outdated quickly. Please help to keep them updated.

Back to TOC

Compared to HDFS

HDFS uses the chunk approach for each file, and is ideal for storing large files.

SeaweedFS is ideal for serving relatively smaller files quickly and concurrently.

SeaweedFS can also store extra large files by splitting them into manageable data chunks, and store the file ids of the data chunks into a meta chunk. This is managed by "weed upload/download" tool, and the weed master or volume servers are agnostic about it.

Back to TOC

Compared to GlusterFS, Ceph

The architectures are mostly the same. SeaweedFS aims to store and read files fast, with a simple and flat architecture. The main differences are

  • SeaweedFS optimizes for small files, ensuring O(1) disk seek operation, and can also handle large files.
  • SeaweedFS statically assigns a volume id for a file. Locating file content becomes just a lookup of the volume id, which can be easily cached.
  • SeaweedFS Filer metadata store can be any well-known and proven data store, e.g., Redis, Cassandra, HBase, Mongodb, Elastic Search, MySql, Postgres, Sqlite, MemSql, TiDB, CockroachDB, Etcd, YDB etc, and is easy to customize.
  • SeaweedFS Volume server also communicates directly with clients via HTTP, supporting range queries, direct uploads, etc.
System | File Metadata | File Content Read | POSIX | REST API | Optimized for large number of small files
SeaweedFS | lookup volume id, cacheable | O(1) disk seek | | Yes | Yes
SeaweedFS Filer | Linearly Scalable, Customizable | O(1) disk seek | FUSE | Yes | Yes
GlusterFS | hashing | | FUSE, NFS | |
Ceph | hashing + rules | | FUSE | Yes |
MooseFS | in memory | | FUSE | | No
MinIO | separate meta file for each file | | | Yes | No

Back to TOC

Compared to GlusterFS

GlusterFS stores files, both directories and content, in configurable volumes called "bricks".

GlusterFS hashes the path and filename into ids, assigns them to virtual volumes, and then maps these to "bricks".

Back to TOC

Compared to MooseFS

MooseFS chooses to neglect the small file issue. From the moosefs 3.0 manual, "even a small file will occupy 64KiB plus additionally 4KiB of checksums and 1KiB for the header", because it "was initially designed for keeping large amounts (like several thousands) of very big files".

MooseFS Master Server keeps all meta data in memory. Same issue as HDFS namenode.

Back to TOC

Compared to Ceph

Ceph can be set up similarly to SeaweedFS as a key->blob store. It is much more complicated, with the need to support layers on top of it. Here is a more detailed comparison.

SeaweedFS has a centralized master group to look up free volumes, while Ceph uses hashing and metadata servers to locate its objects. Having a centralized master makes it easy to code and manage.

Ceph, like SeaweedFS, is based on the object store RADOS. Ceph is rather complicated with mixed reviews.

Ceph uses CRUSH hashing to automatically manage data placement, which is efficient to locate the data. But the data has to be placed according to the CRUSH algorithm. Any wrong configuration would cause data loss. Topology changes, such as adding new servers to increase capacity, will cause data migration with high IO cost to fit the CRUSH algorithm. SeaweedFS places data by assigning them to any writable volumes. If writes to one volume failed, just pick another volume to write. Adding more volumes is also as simple as it can be.

SeaweedFS is optimized for small files. Small files are stored as one continuous block of content, with at most 8 unused bytes between files. Small file access is O(1) disk read.

SeaweedFS Filer uses off-the-shelf stores, such as MySql, Postgres, Sqlite, Mongodb, Redis, Elastic Search, Cassandra, HBase, MemSql, TiDB, CockroachCB, Etcd, YDB, to manage file directories. These stores are proven, scalable, and easier to manage.

SeaweedFS | comparable to Ceph | advantage
Master | MDS | simpler
Volume | OSD | optimized for small files
Filer | Ceph FS | linearly scalable, customizable, O(1) or O(logN)

Back to TOC

Compared to MinIO

MinIO follows AWS S3 closely and is ideal for testing the S3 API. It has good UI, policies, versionings, etc. SeaweedFS is trying to catch up here. It is also possible to put MinIO as a gateway in front of SeaweedFS later.

MinIO metadata are in simple files. Each file write will incur extra writes to the corresponding meta file.

MinIO does not have optimization for lots of small files. The files are simply stored as-is on local disks. With the extra meta file and shards for erasure coding, this only amplifies the LOSF problem.

MinIO has multiple disk IO to read one file. SeaweedFS has O(1) disk reads, even for erasure coded files.

MinIO has full-time erasure coding. SeaweedFS uses replication on hot data for faster speed and optionally applies erasure coding on warm data.

MinIO does not have POSIX-like API support.

MinIO has specific requirements on storage layout. It is not flexible to adjust capacity. In SeaweedFS, just start one volume server pointing to the master. That's all.

Dev Plan

  • More tools and documentation, on how to manage and scale the system.
  • Read and write stream data.
  • Support structured data.

This is a super exciting project! And we need helpers and support!

Back to TOC

Installation Guide

Installation guide for users who are not familiar with golang

Step 1: install go on your machine and setup the environment by following the instructions at:

https://golang.org/doc/install

make sure to define your $GOPATH

Step 2: checkout this repo:

git clone https://github.com/chrislusf/seaweedfs.git

Step 3: download, compile, and install the project by executing the following command

cd seaweedfs/weed && make install

Once this is done, you will find the executable "weed" in your $GOPATH/bin directory

Back to TOC

Disk Related Topics

Hard Drive Performance

When testing read performance on SeaweedFS, it basically becomes a performance test of your hard drive's random read speed. Hard drives usually get 100MB/s~200MB/s.

Solid State Disk

To modify or delete small files, an SSD must erase a whole block at a time and move content in existing blocks to a new block. SSDs are fast when brand new, but get fragmented over time, and you have to garbage collect by compacting blocks. SeaweedFS is friendly to SSDs since it is append-only. Deletion and compaction are done at the volume level in the background, neither slowing reads nor causing fragmentation.

Back to TOC

Benchmark

My own unscientific single-machine results on a MacBook with a solid state disk, CPU: 1 Intel Core i7 2.6GHz.

Write 1 million 1KB file:

Concurrency Level:      16
Time taken for tests:   66.753 seconds
Completed requests:      1048576
Failed requests:        0
Total transferred:      1106789009 bytes
Requests per second:    15708.23 [#/sec]
Transfer rate:          16191.69 [Kbytes/sec]

Connection Times (ms)
              min      avg        max      std
Total:        0.3      1.0       84.3      0.9

Percentage of the requests served within a certain time (ms)
   50%      0.8 ms
   66%      1.0 ms
   75%      1.1 ms
   80%      1.2 ms
   90%      1.4 ms
   95%      1.7 ms
   98%      2.1 ms
   99%      2.6 ms
  100%     84.3 ms

Randomly read 1 million files:

Concurrency Level:      16
Time taken for tests:   22.301 seconds
Completed requests:      1048576
Failed requests:        0
Total transferred:      1106812873 bytes
Requests per second:    47019.38 [#/sec]
Transfer rate:          48467.57 [Kbytes/sec]

Connection Times (ms)
              min      avg        max      std
Total:        0.0      0.3       54.1      0.2

Percentage of the requests served within a certain time (ms)
   50%      0.3 ms
   90%      0.4 ms
   98%      0.6 ms
   99%      0.7 ms
  100%     54.1 ms

Stargazers over time

Sponsor SeaweedFS via Patreon

SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible entirely thanks to the support of these awesome backers. If you'd like to grow SeaweedFS even stronger, please consider joining our sponsors on Patreon.

Your support will be really appreciated by me and other supporters!

Gold Sponsors

Author: Chrislusf
Source Code: https://github.com/chrislusf/seaweedfs 
License: Apache-2.0 license

#go #golang #kubernetes 

A Fast Distributed Storage System for Blobs, for Billions of Files!

Kubernetes on AWS: Deploy a Kubernetes Cluster with Amazon EKS

Over the last five years, computing has grown exponentially. Applications have moved far beyond the client-server model, and distributed computing has become the norm. Kubernetes on AWS offers one of the most powerful distributed computing platforms.

Thanks to technologies like Amazon Fargate and the vast reach of Amazon's cloud computing infrastructure, Elastic Kubernetes Service (EKS) can offer a truly distributed environment in which your applications can run and scale.

Setting up Kubernetes on AWS via the AWS web console is probably not the easiest way to get started. In the following chapters, we will distill the simplest path to launching Kubernetes on AWS, and then launch a Dockerized application on that cluster.

Prerequisites for Kubernetes on AWS

Before getting started with Kubernetes on AWS, let's go over a few key concepts. These include the following:

  1. An understanding of Docker containers. We will blur the lines between containers and VMs later, so make sure you understand the difference between a container and a traditional virtual machine.
  2. Familiarity with AWS-related terminology, such as IAM roles, VPCs, and so on.

Why Docker? Didn't Kubernetes drop Docker support?

Kubernetes is a container orchestration engine. To deploy your application to a Kubernetes cluster, you need to package it as one or more container images. Docker is by far the most popular way to work with containers. It helps you with the following:

  1. Run containers locally on your Windows, macOS, or Linux workstation
  2. Package your application as a Docker image

Kubernetes has come a long way since its conception. Initially, it worked with container runtimes such as Docker (the most common) and rkt. This meant each Kubernetes node had Docker installed and running on it; the kubelet binary running on a node would then talk to the Docker engine to create pods, the containers managed by Kubernetes.

Recently, the Kubernetes project dropped support for the Docker runtime. Instead, it uses its own Container Runtime Interface (CRI), which removes the extra dependency of having Docker installed on the nodes.

However, this has led to the misconception that Docker is obsolete or incompatible for people who want to learn Kubernetes.

That is not true. You still need Docker to run and test your images locally, and any Docker image that works on your Docker runtime will also work on Kubernetes. Kubernetes has simply added its own lightweight implementation for launching those Docker images.

Kubernetes vs. EKS on Amazon

So why should you care about spinning up an EKS cluster on Amazon? Why not build your own Kubernetes cluster, or use another cloud provider such as GCP or Azure?

There are many reasons, including the following:

Complexity

Bootstrapping your own Kubernetes cluster is not a good idea. Not only are you in charge of securing and managing your application, you are also responsible for configuring the cluster, the network, and storage. On top of that, Kubernetes maintenance involves upgrades of the cluster, the underlying operating system, and much more.

Using EKS, AWS's managed Kubernetes service, ensures that your cluster is configured correctly and receives updates and patches on time.

Integration

EKS on AWS works out of the box with the rest of Amazon's infrastructure. Elastic Load Balancers (ELBs) are used to expose services to the outside world. The cluster uses Elastic Block Storage (EBS) to store persistent data, and Amazon makes sure that data is online and available to your cluster.

True scalability

Amazon EKS provides far better scalability than self-hosted Kubernetes. The control plane makes sure your pods are launched across multiple physical nodes (if you so desire). If any node goes down, your application stays online. If you manage your own cluster, however, you have to make sure that different VMs (EC2 instances) are in different availability zones.

If you cannot guarantee that, running different pods on the same physical server won't give you much fault tolerance.

Fargate and Firecracker

VM instances run on virtualized hardware, that is, software pretending to be hardware. This results in better overall security of the cloud infrastructure, but it comes at the price of slower performance, due to the layer of software that virtualizes hardware resources.

Containers, on the other hand, are lightweight, since they all run on the same operating system and share the same underlying kernel. This results in faster boot times and no performance penalty. Running containers directly on hardware is known as containers on bare metal.

At the time of writing, Amazon is one of the few public clouds offering bare-metal containers. That is, instead of launching EC2 instances and then running your containers inside those VMs, you can use Amazon Fargate and run the containers on bare metal.

Amazon manages this via Firecracker, a very lightweight Linux KVM-based technology that runs Docker containers inside a microVM. This gives you the performance of containers and the security of VMs. This alone is a reason to prefer Amazon's EKS over its competitors.

Structure of an Amazon EKS cluster

An EKS cluster consists of two broad components:

Control plane

This service is fully managed by AWS: no EC2 instances are created in your account where you would expect to see etcd, kube-apiserver, and the other components. Instead, all of that is abstracted away from you, and the control plane is exposed to you simply as a server, the kube-api.

The control plane costs $0.10 per hour. Fortunately, you can use a single cluster to run multiple applications, and this price does not go up as you launch more apps or services.

Nodes

These can be managed EC2 instances, or they can run on AWS Fargate. With the managed EC2 instances option, AWS launches EC2 instances on your behalf and lets the control plane manage them. They show up as EC2 instances in your account, and standard EC2 pricing applies to these nodes.

With AWS Fargate, there are no EC2 instances to manage. Instead, your pods run directly on bare metal, and you pay only for the time your pods run. I strongly recommend AWS Fargate for new clusters; it is also what we will use in the next section, where we create a cluster. You can learn more about AWS Fargate pricing here.

Creating a Kubernetes cluster on AWS

The easiest way to get started with EKS is to use command-line utilities, namely:

  1. AWS CLI to interact with your AWS account
  2. eksctl to create, manage, and delete EKS clusters, and
  3. kubectl to interact with the Kubernetes cluster itself.
  4. docker to create and containerize your application.
  5. A Docker Hub account to host your Docker images (the free tier will work)

Setting up the AWS CLI

AWS provides a command-line tool that lets you provision AWS resources directly from the terminal. It talks directly to the AWS API and provisions resources on your behalf, which removes the need to manually set up an EKS cluster or other resources via the AWS web console. Automating things with the CLI also makes the process less error-prone.

Let's set up the AWS CLI on the local computer.

1. First, get the CLI binary suitable for your system.

2. The AWS CLI lets you quickly and programmatically create resources in AWS's cloud without fiddling around with the dashboard. It also eliminates human error.

3. To create and manage EKS clusters, you need to be the root user or an IAM user with administrator access.

4. For brevity, we will use the root account. Click on your profile in the top-right corner of the AWS Web Console and select "My Security Credentials".

Then go to the "Access keys" tab in the main menu.

Click the "Create New Access Key" button.

Next, click "Show Access Key" in the new pop-up and carefully copy both the access key ID and the secret access key to your local computer. Note that the secret access key is shown only once.

5. Open a terminal and enter the following command; when prompted, enter your access key ID and secret access key:

$ aws configure
AWS Access Key ID [None]:
AWS Secret Access Key [None]:
Default region name [None]:us-east-2
Default output format [None]: text

You will also be asked to pick a default region. We will go with us-east-2, but you can pick the region that serves you best (or is closest to you). The default output format will be text in this case.

The configuration and credentials live in your HOME directory, in a subdirectory called .aws, and both aws and eksctl use them to manage resources. Now we can move on to creating the cluster.

2. Creating and deleting an EKS cluster with Fargate

To create a cluster with Fargate nodes, simply run the following command:

$ eksctl create cluster --name my-fargate-cluster --fargate

That's it! The command can take around 15 to 30 minutes to finish, and while it runs it prints to your terminal all the resources being created to launch your cluster.

A sample output is shown below:

2021-08-01 18:14:41 [ℹ]  eksctl version 0.59.0
2021-08-01 18:14:41 [ℹ]  using region us-east-2
2021-08-01 18:14:42 [ℹ]  setting availability zones to [us-east-2c us-east-2a us-east-2b]
2021-08-01 18:14:42 [ℹ]  subnets for us-east-2c - public:192.168.0.0/19 private:192.168.96.0/19
2021-08-01 18:14:42 [ℹ]  subnets for us-east-2a - public:192.168.32.0/19 private:192.168.128.0/19
2021-08-01 18:14:42 [ℹ]  subnets for us-east-2b - public:192.168.64.0/19 private:192.168.160.0/19
2021-08-01 18:14:42 [ℹ]  nodegroup "ng-5018c8ae" will use "" [AmazonLinux2/1.20]
2021-08-01 18:14:42 [ℹ]  using Kubernetes version 1.20
2021-08-01 18:14:42 [ℹ]  creating EKS cluster "my-fargate-cluster" in "us-east-2" region with Fargate profile and managed nodes
2021-08-01 18:14:42 [ℹ]  will create 2 separate CloudFormation stacks for cluster itself and the initial managed nodegroup
2021-08-01 18:14:42 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-2 --cluster=my-fargate-cluster'
2021-08-01 18:14:42 [ℹ]  CloudWatch logging will not be enabled for cluster "my-fargate-cluster" in "us-east-2"
2021-08-01 18:14:42 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-east-2 --cluster=my-fargate-cluster'
2021-08-01 18:14:42 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "my-fargate-cluster" in "us-east-2"
2021-08-01 18:14:42 [ℹ]  2 sequential tasks: { create cluster control plane "my-fargate-cluster", 3 sequential sub-tasks: { 2 sequential sub-tasks: { wait for control plane to become ready, create fargate profiles }, 1 task: { create addons }, create managed nodegroup "ng-5018c8ae" } }
2021-08-01 18:14:42 [ℹ]  building cluster stack "eksctl-my-fargate-cluster-cluster"
2021-08-01 18:14:44 [ℹ]  deploying stack "eksctl-my-fargate-cluster-cluster"
2021-08-01 18:15:14 [ℹ]  waiting for CloudFormation stack "eksctl-my-fargate-cluster-cluster"
2021-08-01 18:46:17 [✔]  EKS cluster "my-fargate-cluster" in "us-east-2" region is ready

As the output shows, a lot of resources are being spun up: several new private subnets for the chosen availability zones, several IAM roles, and the control plane of the Kubernetes cluster itself. If you don't know what some of these are, don't panic! Those details are eksctl's problem, as you will see when we delete the cluster with the following command:

$ eksctl delete cluster --name my-fargate-cluster
2021-08-01 23:00:35 [ℹ]  eksctl version 0.59.0
## A lot more output here
2021-08-01 23:06:30 [✔]  all cluster resources were deleted

This destroys the cluster, deleting all the node groups, private subnets, and other related resources. It also makes sure no resources such as ELBs are left around costing you extra money.

Dockerizing the application

We now know how to create and destroy a cluster. But how do we launch an application inside it? The first step of the process is to containerize your application.

To containerize your application, you need to know how to write a Dockerfile. A Dockerfile is the blueprint your container orchestration system uses to build your Docker image. To write a Dockerfile, you need to:

  1. Pick a base image. For example, you can use a Docker image for Node or Python, depending on your application's dependencies.
  2. Select a working directory inside the container.
  3. Transfer your project's build artifacts (compiled binaries, scripts, and libraries) to that directory.
  4. Set the command the launched container will run. For example, if you have a Node app whose entry point is app.js, the last line of your Dockerfile would be CMD ["node", "app.js"]. This starts your app.

This Dockerfile lives at the root of your project's git repository, so your CI system can easily run automated builds, tests, and deployments for every incremental update.

Let's use a sample app written in Express.js as a test project to dockerize and deploy to the EKS cluster.

Sample app

Let's write a simple express.js app to deploy on the cluster. Create a directory called example-app:

$ mkdir example-app

Next, create a file called app.js with the following contents:


const express = require('express')
const app = express()
const port = 80
app.get('/', (req, res) => {
    res.send('Hello World!\n')
})
app.listen(port, () => {
    console.log(`This app is listening at http://localhost:${port}`)
})

This app.js application is a simple web server that listens on port 80 and responds with Hello World.

Creating the Dockerfile

To build and run the app as a container, create a file called Dockerfile in the example-app directory with the following contents:

FROM node:latest
WORKDIR /app
RUN npm install express
COPY app.js .
CMD ["node", "app.js"]

This tells the Docker engine to build a container using the node:latest base image from Docker Hub. To build the Docker image, run the following command:

$ docker build -t username/example-app .

Here, you need to replace 'username' in the tag with your actual Docker Hub username. We will explain why in the section on container registries.

Running the application locally

Once the image is built, you can test whether it works locally. Start the container by running the commands below:

$ docker run --rm -d -p 80:80 --name test-container username/example-app
$ curl http://localhost
Hello World!

It seems to be working as intended. Let's look at the container logs. Run the docker logs command with the name of the specific container:

$ docker logs test-container
This app is listening at http://localhost:80

As you can see, anything our express.js app writes to standard output (using console.log) is logged via the container runtime. This helps you debug your application both during testing and in production.

About container registries

Docker images are hosted on Docker Hub by default. Docker Hub is something like GitHub for your Docker images: you can version images, tag them, and host them on Docker Hub.

The node:latest tag means the latest released version of Node.js is used to build this image. Other tags are available for specific versions of Node.js, such as the LTS releases. You can always visit the Docker Hub page of a given image and look at the various options available for download.

Like GitLab or GitHub, there are various container registries you can use. AWS has its Elastic Container Registry (ECR) solution, and GCP has something similar, but we will use Docker Hub in this article because it is the default for a Docker installation.

To push the image you created in the previous step, first log in to your Docker Hub account:

$ docker login

Enter your username and password when prompted.
Once you are logged in, push the image with the following command:

$ docker push username/example-app

The example-app Docker image is now ready to be used with the EKS cluster.

Deploying the application

To deploy the application we just created, we will use kubectl, the command-line tool for interacting with the Kubernetes control plane. If you created the cluster with eksctl, kubectl is already authenticated to talk to your EKS cluster.

Create a directory called eks-configs. It will hold the description of the desired state of the cluster and of the application running on it:

$ mkdir eks-configs

Creating the Deployment

To deploy the app, we will create a Kubernetes workload of type Deployment, which is ideal for stateless applications. Create a file called example-deployment.yaml and add the following to it.

The image used is username/example-app from Docker Hub. Make sure to replace 'username' with your actual Docker Hub username, since the cluster will pull the image from there when it runs this app:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
  labels:
    app: example-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: username/example-app
        ports:
        - containerPort: 80

With kubectl, we can launch this Deployment as follows:

$ kubectl create -f example-deployment.yaml

This creates a Deployment object, which is Kubernetes' way of logically handling an application (or part of an application) that runs in one or more pods.

Next, we added some metadata, including the Deployment's name and labels, which are used by other parts of the cluster, as we will see shortly. We are also asking EKS to launch 6 replicas of the example-app container to run across the cluster.

You can change that number on the fly depending on your needs, which lets you scale your application up or down. Having many replicas also ensures high availability, since the pods are scheduled across multiple nodes.

Use the following commands to get the list of deployments and pods running on your cluster (in the default namespace):

$ kubectl get pods
$ kubectl get deployments

Creating the Service

The Deployment is created, but how do we access it? Since it runs on multiple pods, how does another application (such as a user's browser) reach it? We can't create direct DNS entries pointing at the pods, because they are ephemeral and interchangeable. Also, exposing the internal components of your cluster directly to the outside world is not the safest idea.

Instead, we will create a Kubernetes Service. There are several kinds of Kubernetes Services; you can learn about them here. For now, we will use AWS's Elastic Load Balancer (ELB) service.

Incidentally, this is a general theme with most Kubernetes features: they integrate very well with the underlying cloud infrastructure. Examples include using EC2 instances for the nodes, ELBs for exposing Services to the outside world, AWS's VPC for networking within the cluster, and Elastic Block Storage for highly available persistent storage.

We can create a Kubernetes Service of type LoadBalancer by creating a file example-service.yaml and adding the following content to it:


---
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  type: LoadBalancer
  selector:
    app: example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

Then create it with the following command:

$ kubectl create -f example-service.yaml

The Kubernetes Service will be up and running in a couple of minutes, and we can finally talk to the application. First, we need to get the EXTERNAL-IP of the Service. This is usually not an IP but an FQDN pointing to the Service, as shown below:

$ kubectl get svc
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
example-service   LoadBalancer   10.100.88.189   a78b55c6b3a574e61accfe13b324bd08-1728675516.us-east-2.elb.amazonaws.com   80:31058/TCP   5m33s
kubernetes        ClusterIP      10.100.0.1      <none>          443/TCP        33m

The Service can easily be reached from a web browser, or with command-line tools such as cURL, as shown below:

$ curl a78b55c6b3a574e61accfe13b324bd08-1728675516.us-east-2.elb.amazonaws.com
Hello World!

That's it! Now you can start rolling your application out onto the AWS EKS platform piece by piece. Keep the number of clusters to a minimum to reduce costs, and isolate the different parts of your application into separate Dockerized microservices.

Cleaning up Kubernetes on AWS

To clean up the resources we created, we delete each Kubernetes resource in the reverse order of creation and then delete the cluster itself:

$ kubectl delete -f example-service.yaml
$ kubectl delete -f example-deployment.yaml
$ eksctl delete cluster --name my-fargate-cluster

Visit the web UI to double-check that no resources are left behind. For both the EKS and EC2 services, switch the web UI region to the one configured in ~/.aws/config and make sure no unused ELB/EC2 instances or Kubernetes clusters are lying around.

Conclusion

If you are considering a managed Kubernetes solution, running Kubernetes on AWS is about as good as it gets. The scalability is outstanding and the upgrade process is smooth. The tight integration with other AWS services that you get by running Kubernetes on AWS is also a big bonus.

Running Kubernetes on AWS frees you from the hassle of managing your infrastructure and gives you more time to focus on your core product. It reduces the need for additional IT staff while at the same time letting your product meet the ever-growing demand of your user base.

This story was originally published at https://www.clickittech.com/devops/kubernetes-on-aws/

 #amazon #kubernetes 

Kubernetes on AWS: Deploy a Kubernetes Cluster with Amazon EKS
Saul Alaniz

1655043660

Kubernetes on AWS: Deploy a Kubernetes Cluster with Amazon EKS

La última media década ha visto un aumento exponencial en la informática. Las aplicaciones han superado con creces el modelo cliente-servidor y la informática distribuida se ha convertido en la norma. Kubernetes en AWS ofrece una de las plataformas informáticas distribuidas más potentes.

Gracias a tecnologías como Amazon Fargate y el amplio alcance de la infraestructura informática en la nube de Amazon, Elastic Kubernetes Service, o EKS, puede ofrecer un entorno verdaderamente distribuido donde sus aplicaciones pueden ejecutarse y escalarse.

Configurar Kubernetes con AWS con la ayuda de la consola web de AWS probablemente no sea la forma más rápida de comenzar. En los siguientes capítulos, detallaremos el camino más sencillo que puede tomar para lanzar un Kubernetes en AWS, luego lanzaremos una aplicación Dockerizada en este clúster.

Requisitos previos para Kubernetes en AWS

Antes de comenzar con Kubernetes en AWS, familiaricémonos con algunos conceptos clave. Estos incluyen lo siguiente:

  1. Una comprensión de los contenedores Docker . Más adelante desdibujaremos las líneas entre contenedores y VM, así que asegúrese de saber cuál es la diferencia entre un contenedor y una máquina virtual tradicional.
  2. Familiaridad con terminologías relacionadas con AWS, como roles de IAM, VPC,

Why Docker? Didn't Kubernetes drop Docker support?

Kubernetes is a container orchestration engine. To deploy your application on a Kubernetes cluster, you need to package it as one or more container images. Docker is by far the most popular way of working with containers. It helps you with the following:

  1. Run containers locally on your Windows, macOS, or Linux workstation
  2. Package your application as a Docker image

Kubernetes has come a long way since its conception. Initially, it worked with container runtimes like Docker (which is the most common) and rkt. This meant that each Kubernetes node would have docker installed and running on it. The kubelet binary running on a node would then talk to the Docker engine to create pods, the containers managed by Kubernetes.

Recently, the Kubernetes project dropped support for the Docker runtime. Instead, it uses its own Container Runtime Interface (CRI), which removes the extra dependency of having Docker installed on your nodes.

However, this has led to the misconception that Docker is old or incompatible among people who want to learn Kubernetes.

This is not true. You still need Docker to run and test your images locally, and any Docker image that works on your Docker runtime will also work on Kubernetes. It is just that Kubernetes now has its own lightweight implementation to launch those Docker images.

Kubernetes vs. EKS on Amazon

So why should you care about spinning up an EKS cluster on Amazon? Why not choose to build your own Kubernetes cluster or use some other cloud provider like GCP or Azure?

There are a multitude of reasons, including the following:

Complexity

Bootstrapping your own Kubernetes cluster is a bad idea. Not only will you be in charge of securing and managing your application, you will also be responsible for the cluster setup, networking, and storage. On top of this, maintaining Kubernetes involves upgrades to the cluster, the underlying operating system, and much more.

Using AWS' managed Kubernetes service, EKS, ensures that your cluster is configured correctly and receives updates and patches on time.

Integration

AWS EKS works out of the box with the rest of Amazon's infrastructure. Elastic Load Balancers (ELBs) are used to expose services to the outside world. Your cluster uses Elastic Block Storage (EBS) to store persistent data, and Amazon makes sure that data is online and available to your cluster.

Real scalability

Amazon EKS provides much better scalability than self-hosted Kubernetes. The control plane makes sure your pods are launched across multiple physical nodes (if you so desire). If any of the nodes go down, your application will still be online. But if you manage your own cluster, you will have to make sure that the different VMs (EC2 instances) are in different availability zones.

If you can't guarantee that, running different pods on the same physical server won't give you much fault tolerance.

Fargate and Firecracker

VM instances run on virtualized hardware, that is, software pretending to be hardware. This results in better overall security for the cloud infrastructure, but it comes at the price of slower performance due to the layer of software virtualizing the hardware resources.

Containers, on the other hand, are lightweight, since they all run on the same operating system and share the same underlying kernel. This results in faster boot times and no performance penalty! Running containers directly on hardware is known as containers on bare metal.

At the time of this writing, Amazon is one of the few public clouds that offers bare-metal containers. That is, instead of launching EC2 instances and then running your containers inside those VMs, you can use Amazon Fargate and run the containers on bare metal.

It manages this via Firecracker, a very lightweight Linux KVM-based technology that runs Docker containers inside a microVM. This gives you the performance of containers and the security of VMs. This alone is a reason why EKS on Amazon is preferable to any of its competitors.

Anatomy of an Amazon EKS cluster

An EKS cluster consists of two broad components:

The control plane

This service is completely managed by AWS, in the sense that you won't have EC2 instances created in your account where you might expect etcd, kube-apiserver, and other components to show up. Instead, all of that is abstracted away from you, and the control plane is simply exposed to you as a server, i.e., the kube-api.

The control plane costs $0.10 per hour. Fortunately, you can use a single cluster to run multiple applications, and this price won't go up as you launch more apps or services.

The nodes

These can, in turn, be managed EC2 instances or run on AWS Fargate. The managed EC2 instances option is where AWS spins up EC2 instances on your behalf and gives the control plane control over those instances. These show up as EC2 instances in your account. Standard EC2 pricing applies for these nodes.

In the case of AWS Fargate, there are no EC2 instances to manage; instead, your pods run directly, and you pay only for the time during which the pods are running. I strongly recommend using AWS Fargate for your new clusters, and it is also what we will use in the next section where we create a cluster. AWS Fargate pricing details can be found here.

Creating a Kubernetes cluster on AWS

The easiest way to get started with EKS is to use the command-line utilities, which include:

  1. AWS-CLI to interact with your AWS account
  2. eksctl to create, manage, and delete EKS clusters, and
  3. kubectl to interact with the Kubernetes cluster itself.
  4. docker to build and containerize your application.
  5. A Docker Hub account to host your Docker images (the free tier will work)

Setting up the AWS CLI

AWS provides users with a command-line tool and the ability to provision AWS resources directly from the terminal. It talks directly to the AWS API and provisions resources on your behalf. This removes the need to manually set up the EKS cluster or other resources through the AWS web console. Automating it with the CLI also makes the process less error-prone.

Let's set up the AWS CLI on our local computer.

1. First, get the appropriate CLI binaries for your system.

2. The AWS CLI lets you create resources in the AWS cloud quickly and programmatically without having to fiddle around in the dashboard. This also eliminates human error.

3. To create and manage EKS clusters, you need to be the root user or an IAM user with administrator access.

4. I will be using my root account for the sake of brevity. Click on your profile in the top-right corner of your AWS web console and select "My Security Credentials".

Next, go to the "Access keys" tab in the main menu.

Click the "Create New Access Key" button.

Then click "Show Access Keys" in the new pop-up and carefully copy both the access key ID and the secret access key to your local computer. It is important to note that the secret access key will be shown only once.

5. Open your terminal, type the following command, and when prompted, enter your access key ID and secret access key:

$ aws configure
AWS Access Key ID [None]:
AWS Secret Access Key [None]:
Default region name [None]:us-east-2
Default output format [None]: text

You will also be asked to select a default region. We are going with us-east-2, but you can pick the region that benefits you the most (or is closest to you). The default output format will be text in our case.

Your configuration and credentials live in your HOME directory, in a subdirectory called .aws, and will be used by both the AWS CLI and eksctl to manage resources. Now, we can move on to creating a cluster.
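
Before moving on, it can be worth confirming that the CLI is really talking to the account you expect. A quick, read-only check (using the credentials configured above) is:

$ aws sts get-caller-identity
$ aws configure list

The first command prints the account ID and ARN of the identity whose keys you just configured; the second shows which credentials, region, and output format are currently in effect.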

2. Creating and deleting an EKS cluster with Fargate

To create a cluster with Fargate nodes, simply run the following command:

$ eksctl create cluster --name my-fargate-cluster --fargate

That's it! The command can take somewhere between 15 and 30 minutes to finish, and as it runs, it will print out in your terminal all the resources that are being created to bring the cluster up.

You can see a sample output below:

2021-08-01 18:14:41 [ℹ]  eksctl version 0.59.0
2021-08-01 18:14:41 [ℹ]  using region us-east-2
2021-08-01 18:14:42 [ℹ]  setting availability zones to [us-east-2c us-east-2a us-east-2b]
2021-08-01 18:14:42 [ℹ]  subnets for us-east-2c - public:192.168.0.0/19 private:192.168.96.0/19
2021-08-01 18:14:42 [ℹ]  subnets for us-east-2a - public:192.168.32.0/19 private:192.168.128.0/19
2021-08-01 18:14:42 [ℹ]  subnets for us-east-2b - public:192.168.64.0/19 private:192.168.160.0/19
2021-08-01 18:14:42 [ℹ]  nodegroup "ng-5018c8ae" will use "" [AmazonLinux2/1.20]
2021-08-01 18:14:42 [ℹ]  using Kubernetes version 1.20
2021-08-01 18:14:42 [ℹ]  creating EKS cluster "my-fargate-cluster" in "us-east-2" region with Fargate profile and managed nodes
2021-08-01 18:14:42 [ℹ]  will create 2 separate CloudFormation stacks for cluster itself and the initial managed nodegroup
2021-08-01 18:14:42 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-2 --cluster=my-fargate-cluster'
2021-08-01 18:14:42 [ℹ]  CloudWatch logging will not be enabled for cluster "my-fargate-cluster" in "us-east-2"
2021-08-01 18:14:42 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-east-2 --cluster=my-fargate-cluster'
2021-08-01 18:14:42 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "my-fargate-cluster" in "us-east-2"
2021-08-01 18:14:42 [ℹ]  2 sequential tasks: { create cluster control plane "my-fargate-cluster", 3 sequential sub-tasks: { 2 sequential sub-tasks: { wait for control plane to become ready, create fargate profiles }, 1 task: { create addons }, create managed nodegroup "ng-5018c8ae" } }
2021-08-01 18:14:42 [ℹ]  building cluster stack "eksctl-my-fargate-cluster-cluster"
2021-08-01 18:14:44 [ℹ]  deploying stack "eksctl-my-fargate-cluster-cluster"
2021-08-01 18:15:14 [ℹ]  waiting for CloudFormation stack "eksctl-my-fargate-cluster-cluster"
2021-08-01 18:46:17 [✔]  EKS cluster "my-fargate-cluster" in "us-east-2" region is ready

As the output suggests, a lot of resources are being spun up, including several new private subnets for the chosen availability zones, several IAM roles, and the Kubernetes cluster's control plane itself. If you don't know what these are, don't panic! These details are eksctl's problem, as you will see when we delete the cluster with the following command:

$ eksctl delete cluster --name my-fargate-cluster
2021-08-01 23:00:35 [ℹ]  eksctl version 0.59.0
## A lot more output here
2021-08-01 23:06:30 [✔]  all cluster resources were deleted

This will tear down the cluster, delete all the node groups, private subnets, and other related resources, and make sure you don't have resources such as ELBs lying around costing you extra money.

Dockerizing your application

We now know how to create and destroy a cluster. But how do we launch an application inside it? The first step of the process is to containerize your application.

To containerize your application, you need to know how to write a Dockerfile for it. A Dockerfile is a blueprint for your container orchestration system to build your Docker image. To write a Dockerfile, you need to:

  1. Pick a base image. For example, you can use the Docker images for Node or Python, depending on what your application depends on.
  2. Select a working directory inside the container.
  3. Transfer your project's build artifacts (compiled binaries, scripts, and libraries) to that directory
  4. Set the command for the launched container to run. So, for example, if you have a Node app whose entry point is app.js, you will have a command CMD ["node", "app.js"] as the final line in your Dockerfile. This will start your application.

This Dockerfile lives at the root of your project's git repository, making it easy for CI systems to run automated builds, tests, and deployments for every incremental update.

Let's use a sample application written in Express.js as a test project to dockerize and deploy to our EKS cluster.

A sample application

Let's create a simple express.js application to deploy to our cluster. Create a directory called example-app:

$ mkdir example-app

Next, create a file called app.js, which will have the following contents:


const express = require('express')
const app = express()
const port = 80
app.get('/', (req, res) => {
    res.send('Hello World!\n')
})
app.listen(port, () => {
    console.log(`This app is listening at http://localhost:${port}`)
})

This app.js application is a simple web server that listens on port 80 and responds with Hello World!

Creating a Dockerfile

To build and run the application as a container, we create a file called Dockerfile in the example-app directory with the following contents inside it:

FROM node:latest
WORKDIR /app
RUN npm install express
COPY app.js .
CMD ["node", "app.js"]

This will instruct the Docker engine to build our container using the node:latest base image from Docker Hub. To build the Docker image, run the following command:

$ docker build -t username/example-app .

Here, the username in the tag needs to be replaced with your actual Docker Hub username. We will discuss why this is the case in the section on container registries.

Running the application locally

Once the image is built, we can test whether it works locally. Run the following command to start a container:

$ docker run --rm -d -p 80:80 --name test-container username/example-app
$ curl http://localhost
Hello World!

This seems to be working as intended. Let's see what the container logs say. Run the docker logs command with the specific container name:

$ docker logs test-container
This app is listening at http://localhost:80

As you can see, everything our express.js application writes to standard output (using console.log) gets logged via the container runtime. This helps with debugging your application both during testing and in production environments.

About container registries

Docker images are, by default, hosted on Docker Hub. Docker Hub is like a GitHub for your Docker images. You can version your images, tag them, and host them on Docker Hub.

The node:latest tag implies that the latest published version of Node.js will be used for this image's build process. Other tags are available for specific Node.js versions, such as an LTS release. You can always visit a specific image's Docker Hub page and see the various options that are available for download.
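
Relying on an implicit latest tag for your own image is fine for a quick demo, but in practice you will usually want versioned tags as well. As a hypothetical example (the 0.0.1 tag is purely illustrative), you can add one with docker tag and then push it the same way as shown below:

$ docker tag username/example-app username/example-app:0.0.1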

As with GitLab and GitHub, there are several container registries you can use. AWS has its own solution, Elastic Container Registry or ECR, and GCP has something similar, but we will stick to Docker Hub for this article, since it is the default for your Docker installation anyway.

To push the image that was built in the previous step, we first log in to our Docker Hub account:

$ docker login

Provide your username and password when prompted.
Once you have logged in successfully, push your image with this command:

$ docker push username/example-app

Now the example app's Docker image is ready to be used by an EKS cluster.

Deploying the application

To deploy the application we just built, we will use kubectl, a command-line tool for interacting with a Kubernetes control plane. If you used eksctl to create your cluster, kubectl is already authenticated to talk to your EKS cluster.
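
A quick way to confirm that kubectl is indeed pointed at the new cluster is to list the nodes and the cluster endpoint:

$ kubectl get nodes
$ kubectl cluster-info

On a Fargate-only cluster the node list may be short (Fargate "nodes" only appear as pods get scheduled onto them), but both commands should succeed without any extra configuration.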

We will create a directory called eks-configs, which will store a description of the desired state of our cluster and the application running on it.

$ mkdir eks-configs

Creating the Deployment

To deploy the application, we will create a Kubernetes workload of type Deployment, which is ideal for stateless applications. Create a file called example-deployment.yaml and add the following to it.

The image used is username/example-app from Docker Hub. Make sure to replace 'username' with your actual Docker Hub username. The manifest looks like this:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
  labels:
    app: example-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: username/example-app
        ports:
        - containerPort: 80

Using kubectl, we can launch this deployment as follows:

$ kubectl create -f example-deployment.yaml

This creates a Deployment object, which is how Kubernetes logically handles an application (or part of an application) that runs in one or more pods.

Next, we add some metadata, including the deployment's name and the labels that other parts of our cluster use, as we will see shortly. Also, we ask EKS to launch six replicas of the example-app container to run across our cluster.

Depending on your needs, you can change that number on the fly; it lets you scale your application up or down. Having many replicas also ensures high availability, since your pods are scheduled across multiple nodes.
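
For example, scaling the deployment later does not even require editing the YAML file; a one-off scale such as the following (the replica count here is arbitrary) is enough:

$ kubectl scale deployment example-deployment --replicas=3

To make the change permanent, update the replicas field in example-deployment.yaml as well.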

Use the following commands to get a list of your deployments and of the pods running in your cluster (in the default namespace):

$ kubectl get pods
$ kubectl get deployments
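
If a pod does not come up as expected, kubectl describe and kubectl logs are the usual first stops; for example (the pod name is whatever kubectl get pods printed for you):

$ kubectl describe pod <pod-name>
$ kubectl logs <pod-name>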

Creating a Service

The deployment is created, but how do we access it? Since it runs across multiple pods, how does another application (such as a user's browser) reach it? We can't have DNS entries pointing directly at the pods, since they are ephemeral and fungible. Besides, directly exposing the internals of a cluster to the outside world is not the safest idea.

Instead, we will create a Kubernetes Service. There are several types of Kubernetes services; you can learn about them here. We will use AWS' Elastic Load Balancer, or ELB, service for now.

As a side note, this is a general theme with most Kubernetes features: they integrate really well with the underlying cloud infrastructure. Examples of this include using EC2 instances for nodes, ELBs to expose services to the outside world, AWS' VPC for networking within the cluster, and Elastic Block Storage for highly available persistent storage.

We can create a Kubernetes Service of type LoadBalancer by creating an example-service.yaml file and adding the following contents inside it:


---
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  type: LoadBalancer
  selector:
    app: example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

And then create it using the following command:

$ kubectl create -f example-service.yaml

The Kubernetes service will be up and running in a few minutes, and we can finally talk to our application. To get started, we need the service's "EXTERNAL-IP". This is usually not an IP but an FQDN pointing to our service, as shown below:

$ kubectl get svc
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
example-service   LoadBalancer   10.100.88.189   a78b55c6b3a574e61accfe13b324bd08-1728675516.us-east-2.elb.amazonaws.com   80:31058/TCP   5m33s
kubernetes        ClusterIP      10.100.0.1      <none>          443/TCP        33m

The service can easily be accessed from your web browser or by using command-line tools such as cURL, as shown below.

$ curl a78b55c6b3a574e61accfe13b324bd08-1728675516.us-east-2.elb.amazonaws.com
Hello World!

That's it! You can now start rolling your application out piece by piece onto the AWS EKS platform. Keep the number of clusters to a minimum to reduce costs, and split the different parts of your application into different dockerized microservices.

Cleaning up Kubernetes on AWS

To clean up the resources we created earlier, we will delete each Kubernetes resource in the reverse order of how we created them, and then the entire cluster.

$ kubectl delete -f example-service.yaml
$ kubectl delete -f example-deployment.yaml
$ eksctl delete cluster --name my-fargate-cluster
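
Before heading to the console, a quick command-line sanity check (using the same region as above) is to ask eksctl whether any clusters are still registered; the list should come back empty or report that no clusters were found:

$ eksctl get cluster --region us-east-2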

Double-check that no resources are left over by visiting the web UI. For both the EKS and EC2 services, switch the region in the web UI to the one set in ~/.aws/config and make sure there are no unused ELB/EC2 instances or Kubernetes clusters lying around.

Conclusion

If you are considering a managed Kubernetes solution, running Kubernetes on AWS is about as good as it gets. The scalability is outstanding, and the upgrade process is smooth. The tight integration with other AWS services that you get by running Kubernetes on AWS is also a big bonus.

Running Kubernetes on AWS frees your company from the hassle of managing infrastructure and gives you more time to focus solely on your core product. It reduces the need for additional IT staff while at the same time allowing your product to meet the ever-growing demand from its user base.

This story was originally published at https://www.clickittech.com/devops/kubernetes-on-aws/

  #amazon #kubernetes 

Kubernetes on AWS: Deploy a Kubernetes Cluster with Amazon EKS
Sheldon  Grant

Sheldon Grant

1655026980

Fish-kubectl-completions: Kubectl Completions for Fish Shell

kubectl completion for fish shell

Install

$ mkdir -p ~/.config/fish/completions
$ cd ~/.config/fish
$ git clone https://github.com/evanlucas/fish-kubectl-completions
$ ln -s ../fish-kubectl-completions/completions/kubectl.fish completions/

Install using Fisher

fisher install evanlucas/fish-kubectl-completions

Building

This was tested using go 1.15.7 on macOS 11.1 "Big Sur".

$ make build

Environment Variables

FISH_KUBECTL_COMPLETION_TIMEOUT

This is used to pass the --request-timeout flag to the kubectl command. It defaults to 5s.

Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.

FISH_KUBECTL_COMPLETION_COMPLETE_CRDS

This can be used to prevent completing CRDs. Some users may have limited access to resources. It defaults to 1. To disable, set to anything other than 1.
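
For example, to make both settings persistent you could export them from ~/.config/fish/config.fish (the values below are only illustrative):

set -gx FISH_KUBECTL_COMPLETION_TIMEOUT 2s
set -gx FISH_KUBECTL_COMPLETION_COMPLETE_CRDS 0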

Author: Evanlucas
Source Code: https://github.com/evanlucas/fish-kubectl-completions 
License: MIT license

#node #nodejs #kubernetes 

Fish-kubectl-completions: Kubectl Completions for Fish Shell
Sheldon  Grant

Sheldon Grant

1654996860

Learning-knative: Notes and Examples for Learning Knative

Learning Knative

Knative looks to build on Kubernetes and present a consistent, standard pattern for building and deploying serverless and event-driven applications.

Knative allows services to scale down to zero and scale up from zero.

Background knowledge

Before we start there are a few things that I was not 100% clear about and this section aims to sort this out to allow for better understanding of the underlying technologies.

Container

A container is a process that is isolated from other processes using Linux kernel features like cgroups, namespaces, mounted union fs (chrooted), etc.

When a container is deployed, what happens is that the above-mentioned features are configured, a filesystem is mounted, and a process is started. The metadata and the filesystem are contained in an image (more on this later).

A container image has all the libraries and files it needs to run. It does not have an entire OS but instead uses the underlying host's kernel which saves space compared to a separate VM.

Also it is worth mentioning that a running container is a process (think unix process) which has a separate control group (cgroup), and namespace (mnt, IPC, net, usr, pid, and uts (Unix Time Share system)). It could also include seccomp (Secure Computing mode) which is a way to filter the system calls allowed to be performed, apparmor (prevents access to files the process should not access), and linux capabilities (reducing what a privileged process can do). More on these three security features can be found later in this document.

Namespaces

The namespace API consists of three system calls:

  • clone
  • unshare
  • setns

A namespace can be created using clone:

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

The child_func is a function pointer to the function that the new child process will execute, and arg is the argument passed to that function. Linux also has the fork system call, which also creates a child process, but clone allows control over what gets shared between the parent and the child process: things like whether they should share the virtual address space, the file descriptor table, and the signal handler table. It also allows the new process to be placed in separate namespaces. This is controlled by the flags parameter. This is also how threads are created on Linux, and the kernel has the same internal representation for both, which is the task_struct.

child_stack specifies the location of the stack used by the child process.

There is an example of the clone system call in clone.c which can be compiled and run using the following commands:

$ docker run -ti --privileged -v$PWD:/root/src -w /root/src gcc
$ gcc -o clone clone.c
$ ./clone
parent pid: 81
child hostname: child_host
child pid: 1
child ppid: 0
parent hostname: caa66b227dfe

The goal of this is just to give an example and show the names of the flags that control the namespaces.

cgroups (control groups)

cgroups allows the Linux OS to manage and monitor resources allocated to a process and also set limits for things like CPU, memory, network. This is so that one process is not allowed to hog all the resources and affect others.

Subsystems:

  • blkio (or just io) Block I/O subsystem which limits I/O access to block devices (disk, SSD, USB)
  • cpu
  • cpuacct Automatic reports on cpu resources used by tasks in a cgroup
  • cpuset Assigns processors and memory to tasks in a group.
  • memory Sets limits on memory usage by tasks in a group.
  • devices Allows access to devices to tasks in a group.
  • freezer Allows suspend/resumption of tasks in a group.
  • net_cls Allows the marking of network packets in a group.
  • net_prio Allows a priority for network packets to be set.
  • perf_event Allows access to perf events.
  • hugetlb Activates support for huge pages for a group.
  • pids Sets the limit of allowed processes for a group.
$ cat /proc/cgroups 
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	2	1	1
cpu	7	14	1
cpuacct	7	14	1
blkio	6	14	1
memory	11	170	1
devices	3	72	1
freezer	12	1	1
net_cls	4	1	1
perf_event	9	1	1
net_prio	4	1	1
hugetlb	5	1	1
pids	8	76	1
misc	10	1	1

$ ls -l /sys/fs/cgroup/
total 0
dr-xr-xr-x. 12 root root  0 Sep  9 06:59 blkio
lrwxrwxrwx.  1 root root 11 Sep  9 06:59 cpu -> cpu,cpuacct
lrwxrwxrwx.  1 root root 11 Sep  9 06:59 cpuacct -> cpu,cpuacct
dr-xr-xr-x. 12 root root  0 Sep  9 06:59 cpu,cpuacct
dr-xr-xr-x.  2 root root  0 Sep  9 06:59 cpuset
dr-xr-xr-x. 12 root root  0 Sep  9 06:59 devices
dr-xr-xr-x.  2 root root  0 Sep  9 06:59 freezer
dr-xr-xr-x.  2 root root  0 Sep  9 06:59 hugetlb
dr-xr-xr-x. 12 root root  0 Sep  9 06:59 memory
dr-xr-xr-x.  2 root root  0 Sep  9 06:59 misc
lrwxrwxrwx.  1 root root 16 Sep  9 06:59 net_cls -> net_cls,net_prio
dr-xr-xr-x.  2 root root  0 Sep  9 06:59 net_cls,net_prio
lrwxrwxrwx.  1 root root 16 Sep  9 06:59 net_prio -> net_cls,net_prio
dr-xr-xr-x.  2 root root  0 Sep  9 06:59 perf_event
dr-xr-xr-x. 12 root root  0 Sep  9 06:59 pids
dr-xr-xr-x. 13 root root  0 Sep  9 06:59 systemd
dr-xr-xr-x. 13 root root  0 Sep  9 06:59 unified
$ cd /sys/fs/cgroup/devices/
$ mkdir cgroups_test_group

Notice that after creating this directory there will be a number of files that will have been automatically generated:

$ ls -l
total 0
-rw-r--r--. 1 root root 0 Sep 27 08:48 cgroup.clone_children
-rw-r--r--. 1 root root 0 Sep 27 08:48 cgroup.procs
--w-------. 1 root root 0 Sep 27 08:48 devices.allow
--w-------. 1 root root 0 Sep 27 08:48 devices.deny
-r--r--r--. 1 root root 0 Sep 27 08:48 devices.list
-rw-r--r--. 1 root root 0 Sep 27 08:48 notify_on_release
-rw-r--r--. 1 root root 0 Sep 27 08:48 tasks

Add the following line to devices.deny:

c 5:0 w

In this case we are denying access to the character device /dev/tty:

$ ls -l /dev/tty
crw-rw-rw-. 1 root tty 5, 0 Sep 27 09:08 /dev/tty
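
The print.sh script itself is not included in these notes; a minimal version consistent with the output shown below might look like this:

#!/bin/bash
# Write to the controlling terminal once per second so we can observe what
# happens when the cgroup denies write access to /dev/tty (c 5:0).
while true; do
  echo "bajja" > /dev/tty
  sleep 1
done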

Now, let's start our print task:

$ ./print.sh

And then from another terminal/console:

$ su -
$ echo $(pidof -x print.sh) > /sys/fs/cgroup/devices/cgroups_test_group/tasks

The output should be the following in the terminal that started print.sh:

$ ./print.sh 
bajja
bajja
bajja
bajja
./print.sh: line 5: /dev/tty: Operation not permitted

seccomp (Secure Computing)

Is a Linux kernel feature that restricts the system calls a process can call. So if someone were to gain access, they would not be able to use any system calls other than the ones that were specified.

The prctl (process control) system call is used to enable this. There is an example of using prctl in seccomp.c:

$ docker run -ti --privileged -v$PWD:/root/src -w /root/src gcc
$ gcc -o seccomp seccomp.c
$ ./seccomp
pid: 351
setting restrictions...
running with restrictions. Allowed system calls areread(), write(), exit()
try calling getpid()
Killed

We can run this with strace to see the system calls being made:

$ apt-get update
$ apt-get install strace
root@d978e6c92dca:~/src# strace ./seccomp
execve("./seccomp", ["./seccomp"], 0x7fff025d34c0 /* 10 vars */) = 0
brk(NULL)                               = 0x1f2d000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=37087, ...}) = 0
mmap(NULL, 37087, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f1985142000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260A\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1824496, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1985140000
mmap(NULL, 1837056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1984f7f000
mprotect(0x7f1984fa1000, 1658880, PROT_NONE) = 0
mmap(0x7f1984fa1000, 1343488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f1984fa1000
mmap(0x7f19850e9000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16a000) = 0x7f19850e9000
mmap(0x7f1985136000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f1985136000
mmap(0x7f198513c000, 14336, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f198513c000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7f1985141500) = 0
mprotect(0x7f1985136000, 16384, PROT_READ) = 0
mprotect(0x403000, 4096, PROT_READ)     = 0
mprotect(0x7f1985173000, 4096, PROT_READ) = 0
munmap(0x7f1985142000, 37087)           = 0
getpid()                                = 350
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
brk(NULL)                               = 0x1f2d000
brk(0x1f4e000)                          = 0x1f4e000
write(1, "pid: 350\n", 9pid: 350
)               = 9
write(1, "setting restrictions...\n", 24setting restrictions...
) = 24
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
write(1, "running with restrictions. Allow"..., 75running with restrictions. Allowed system calls areread(), write(), exit()
) = 75
write(1, "try calling getpid()\n", 21try calling getpid()
)  = 21
getpid()                                = ?
+++ killed by SIGKILL +++
Killed

In this case we were not able to specify exactly which system calls are allowed, but this can be done using the Berkeley Packet Filter (BPF). seccomp_bpf.c:

$ apt-get install libseccomp-dev
$ gcc -lseccomp -o seccomp_bpf seccomp_bpf.c

namespaces

Are used to isolate processes from each other. Each container will have its own namespace but it is also possible for multiple containers to be in the same namespace which is what the deployment unit of kubernetes is; the pod.

PID (CLONE_NEWPID)

In a pid namespace your process becomes PID 1. You can only see this process and child processes, all others on the underlying host system are "gone".

UTS (CLONE_NEWUTS)

Isolates domainname and hostname, allowing each container to have its own hostname and NIS domain name. The hostname and domain name are retrieved by the uname system call, and the struct passed into this function is named utsname (UNIX Time-share System).

IPC (CLONE_NEWIPC)

Isolate System V IPC Objects and POSIX message queues. Each namespace will have its own set of these.

Network (CLONE_NEWNET)

A net namespace for isolating network ip/ports, IP routing tables.

The following is an example of creating a network namespace just to get a feel for what is involved.

$ docker run --privileged -ti centos /bin/bash

A network namespace can be created using ip netns:

$ ip netns add something
$ ip netns list
something

With a namespace created we can add virtual ethernet (veth) interfaces to it. These come in pairs and can be thought of as a cable between the namespace and the outside world (which is usually a bridge in the kubernetes case I think). So the other end would be connected to the bridge. Multiple namespaces can be connected to the same bridge.

First we can create a virtual ethernet pair (veth pair) named v0 and v1:

$ ip link add v0 type veth peer name v1
$ ip link list
...
4: v1@v0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 8e:3f:28:e1:e8:d9 brd ff:ff:ff:ff:ff:ff
5: v0@v1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 1e:35:a5:17:76:6d brd ff:ff:ff:ff:ff:ff
...

Next, we add one end of the virtual ethernet pair to the namespace we created:

$ ip link set v1 netns something

We also want to give v1 an ip address and enable it:

$ ip netns exec something ip address add 172.16.0.1 dev v1
$ ip netns exec something ip link set v1 up
$ ip link set dev v0 up
$ ip netns exec something ip address show dev v1
4: v1@if5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
    link/ether 9a:56:bf:12:32:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.0.1/32 scope global v1
       valid_lft forever preferred_lft forever

We can find the ip address of eth0 in the default namespace using:

$ ip address show dev eth0
86: eth0@if87: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

So can we ping that address from our something namespace?

$ ip netns exec something ping 172.17.0.2
connect: Network is unreachable

No, we can't because there is no routing table for the namespace:

$ ip netns exec something ip route list

We should be able to add a default route that sends anything not on host to v1:

$ ip netns exec something ip route add default via 172.16.0.1 dev v1

We also need to add a route for this container in the host so that the return packet can be routed back:

$ ip route add 172.16.0.1/32 dev v0
$ ip netns exec something ip link set lo up

With that in place we should be able to ping:

$ ip netns exec something ping 172.17.0.2
ip netns exec something ping 172.17.0.2
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=0.079 ms
...

Notice that we have only added a namespace and not started a process/container. It is in fact the kernel networking stack that is replying to this ping.

This would look something like the following:

     +--------------------------------------------------------+
     |     Default namespace                                  |
     | +---------------------------------------------------+  |
     | |  something namespace                              |  |
     | | +--------------+  +-----------------------------+ |  |
     | | | v1:172.16.0.1|  | routing table               | |  |
     | + +--------------+  |default via 172.16.0.1 dev v1| |  |
     | |   |               +-----------------------------+ |  |
     | +---|-----------------------------------------------+  |
     |     |                                                  |
     |   +----+                                               |
     |   | v0 |                                               |
     |   +----+                                               |
     |                                                        |
     |   +----+            +------------------------------+   |
     |   |eth0|            | routing table                |   |
     |   +----+            |172.16.0.1 dev v0 scope link  |   |
     |                     +------------------------------+   |
     +--------------------------------------------------------+

So we have seen how we can have a single namespace on a host. If we want to add more namespaces, those namespaces not only have to be able to connect with the host but also with each other.

Let's start by adding a second namespace:

$ ip link add v2 type veth peer name v3
$ ip netns add something2
$ ip link set v3 netns something2
$ ip netns exec something2 ip address add 172.16.0.2 dev v3
$ ip netns exec something2 ip link set v3 up
$ ip netns exec something2 ip link set lo up
$ ip link set dev v2 up
$ ip netns exec something2 ip route add default via 172.16.0.2 dev v3
$ ip link add bridge0 type bridge
$ ip link set dev v0 master bridge0
$ ip link set dev v2 master bridge0
$ ip address add 172.168.0.3/24 dev bridge0
$ ip link set dev bridge0 up

We can verify that we can ping from the something namespace to something2:

$ ip netns exec something ping 172.16.0.2
PING 172.16.0.2 (172.16.0.2) 56(84) bytes of data.
64 bytes from 172.16.0.2: icmp_seq=1 ttl=64 time=0.338 ms

$ ip netns exec something2 ping 172.16.0.1
PING 172.16.0.1 (172.16.0.1) 56(84) bytes of data.
64 bytes from 172.16.0.1: icmp_seq=1 ttl=64 time=0.061 ms

But can we ping the second container from the host?

$ ping 172.16.0.2
PING 172.16.0.2 (172.16.0.2) 56(84) bytes of data.
...

For this to work we need a route in the host:

$ ip route add 172.16.0.0/24 dev bridge0

After having done this our configuration should look something like this:

     +-----------------------------------------------------------------------------------------------------------+
     |     Default namespace                                                                                     |
     | +---------------------------------------------------+ +-------------------------------------------------+ |
     | |  something namespace                              | | something2 namespace                            | |
     | | +--------------+  +-----------------------------+ | | +-------------+ +-----------------------------+ | |
     | | | v1:172.16.0.1|  | routing table               | | | |v3:172.16.0.2| | routing table               | | |
     | + +--------------+  |default via 172.16.0.1 dev v1| | | +-------------+ |default via 172.16.0.2 dev v3| | |
     | |   |               +-----------------------------+ | |    |            +-----------------------------+ | |
     | +---|-----------------------------------------------+ +----|--------------------------------------------+ |
     |     |                                                      |                                              |
     |   +------------------------------------------------------------------------------------------------+      |
     |   | | v0 |             bridge0                           | v2 |                                    |      |
     |   | +----+           72.168.0.3                          +----+                                    |      |
     |   +------------------------------------------------------------------------------------------------+      |
     |                                                                                                           |
     |   +----+            +-------------------------------+                                                     |
     |   |eth0|            | routing table                 |                                                     |
     |   +----+            |172.16.0.0/24 dev bridge0 scope|                                                     |
     |                     +-------------------------------+                                                     |
     +-----------------------------------------------------------------------------------------------------------+

If you have docker deployed the bridge would be named docker0. For example:

$ docker run -it --rm --privileged --pid=host justincormack/nsenter1
$ ip link list
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:50:00:00:00:01 brd ff:ff:ff:ff:ff:ff
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:88:76:10:4c brd ff:ff:ff:ff:ff:ff
87: veth2a62021@if86: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether 2e:70:0c:6d:61:aa brd ff:ff:ff:ff:ff:ff link-netnsid 0
89: veth24a8a43@if88: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether d2:69:da:8b:99:b6 brd ff:ff:ff:ff:ff:ff link-netnsid 1

User (CLONE_NEWUSER)

Isolates user and group IDs.

capabilities

A process in Linux can be either privileged or unprivileged. Capabilities allow limiting the privileges of the superuser, so that if the program is compromised it will not have all privileges and hopefully will not be able to do as much harm. As an example, say you have a web server and you want it to listen on port 80, which requires root permission. But giving the web server root permission would allow it to do much more. Instead, the binary can be given the CAP_NET_BIND_SERVICE capability.
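
As a hypothetical illustration (/usr/local/bin/mywebserver is just a placeholder binary), granting only that capability instead of full root looks like this, and getcap can be used to verify it:

$ setcap cap_net_bind_service=+ep /usr/local/bin/mywebserver
$ getcap /usr/local/bin/mywebserver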

Capabilities are privileges that can be enabled per process (thread/task). The root user, effective user id 0 (EUID 0), has all capabilities enabled. The Linux kernel always checks the capabilities and does not check that the user is root (EUID 0).

You can use the following command to list the capabilities:

$ capsh --print
$ cat /proc/1/task/1/status
...
CapInh:	0000003fffffffff
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000
...

Example of using capabilities:

$ docker run -ti --privileged -v$PWD:/root/src -w /root/src gcc
$ chmod u-s /bin/ping 
$ adduser danbev
$ ping localhost
ping: socket: Operation not permitted

We first removed the setuid bit from ping, then added a new user and verified that that user cannot use ping, getting the error above.

Next, let's add the CAP_NET_RAW capability:

$ setcap cap_net_raw+p /bin/ping
$ su - danbev
$ ping -c 1 localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.053 ms

--- localhost ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.053/0.053/0.053/0.000 ms

We can specify capabilities when we start docker instead of using --privileged, like this:

$ docker run -ti --cap-add=NET_RAW -v$PWD:/root/src -w /root/src gcc
$ su danbev
$ ping -c 1 localhost

Apparmor

Is a mandatory access control framework which uses whitelists/blacklists for access to objects, like files, paths, etc. This can limit which files a process can access, for example.
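
For example, on a host with AppArmor enabled you can list the loaded profiles and explicitly choose the profile a container should be confined by (docker-default is the profile Docker normally applies on such hosts):

$ aa-status
$ docker run --rm --security-opt apparmor=docker-default alpine ls /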

Setting the limits for cgroups, configuring the namespaces, mounting the filesystem, and starting the process are all the responsibility of the container runtime.

Image format

What about a Docker image, what does it look like?

We can use tools named skopeo and umoci to inspect and find out more about images.

$ brew install skopeo

The image I'm using is the following:

$ skopeo inspect docker://dbevenius/faas-js-example
{
    "Name": "docker.io/dbevenius/faas-js-example",
    "Digest": "sha256:69cc8b6087f355b7e4b2344587ae665c61a067ee05876acc3a5b15ca2b15e763",
    "RepoTags": [
        "0.0.3",
        "latest"
    ],
    "Created": "2019-11-25T08:37:50.894674023Z",
    "DockerVersion": "19.03.3",
    "Labels": null,
    "Architecture": "amd64",
    "Os": "linux",
    "Layers": [
        "sha256:e7c96db7181be991f19a9fb6975cdbbd73c65f4a2681348e63a141a2192a5f10",
        "sha256:95b3c812425e243848db3a3eb63e1e461f24a63fb2ec9aa61bcf5a553e280c07",
        "sha256:778b81d0468fbe956db39aca7059653428a7a15031c9483b63cb33798fcdadfa",
        "sha256:28549a15ba3eb287d204a7c67fdb84e9d7992c7af1ca3809b6d8c9e37ebc9877",
        "sha256:0bcb2f6e53a714f0095f58973932760648f1138f240c99f1750be308befd9436",
        "sha256:5a4ed7db773aa044d8c7d54860c6eff0f22aee8ee56d4badf4f890a3c82e6070",
        "sha256:aaf35efcb95f6c74dc6d2c489268bdc592ce101c990729280980da140647e63f",
        "sha256:c79d77af46518dfd4e94d3eb3a989a43f06c08f481ab3a709bc5cd5570bb0fe2"
    ],
    "Env": [
        "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "NODE_VERSION=12.10.0",
        "YARN_VERSION=1.17.3",
        "HOME=/home/node"
    ]
}
$ skopeo --insecure-policy copy docker://dbevenius/faas-js-example oci:faas-js-example-oci:latest
Getting image source signatures
Copying blob e7c96db7181b done
Copying blob 95b3c812425e done
Copying blob 778b81d0468f done
Copying blob 28549a15ba3e done
Copying blob 0bcb2f6e53a7 done
Copying blob 5a4ed7db773a done
Copying blob aaf35efcb95f done
Copying blob c79d77af4651 done
Copying config c5b8673f93 done
Writing manifest to image destination
Storing signatures

We can take a look at the directory layout:

$ ls faas-js-example-oci/
blobs		index.json	oci-layout

Let's take a look at index.json:

$ cat index.json | python3 -m json.tool
{
    "schemaVersion": 2,
    "manifests": [
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:be5c2a500a597f725e633753796f1d06d3388cee84f9b66ffd6ede3e61544077",
            "size": 1440,
            "annotations": {
                "org.opencontainers.image.ref.name": "latest"
            }
        }
    ]
}

I'm on a mac so I'm going to use docker to run a container and mount the directory containing our example:

$ docker run --privileged -ti -v $PWD/faas-js-example-oci:/root/faas-js-example-oci fedora /bin/bash
$ cd /root/faas-js-example-oci
$ dnf install -y runc
$ dnf install dnf-plugins-core
$ dnf copr enable ganto/umoci
$ dnf install umoci

We can now use umoci to unpack the image into an OCI bundle:

$ umoci unpack --image faas-js-example-oci:latest faas-js-example-bundle
[root@2a3b333ff24b ~]# ls faas-js-example-bundle/
config.json  rootfs  sha256_be5c2a500a597f725e633753796f1d06d3388cee84f9b66ffd6ede3e61544077.mtree  umoci.json

rootfs will be the filesystem to be mounted and the configuration of the process can be found in config.json.
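
For example, we can peek at the process definition inside config.json with a tool like jq (assuming jq is installed); it should show the same entrypoint that ctr reports for this container later on:

$ jq '.process.args' faas-js-example-bundle/config.json
[
  "docker-entrypoint.sh",
  "/home/node/src/run.sh"
]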

So we now have an idea of what a container is, a process, but what creates these processes? This is the responsibility of a container runtime.

Container runtime

Docker contributed a runtime that they extracted, named runC. There are others as well, which I might expand upon later, but for now just know that this is not the only possible runtime.

Something worth noting though is that these runtimes follow a specification that describes what is to be run. These runtimes operate on a filesystem bundle.

We can run this bundle using runC:

$ runc create --bundle faas-js-example-bundle faas-js-example-container
$ runc list
ID                          PID         STATUS      BUNDLE                         CREATED                        OWNER
faas-js-example-container   31          created     /root/faas-js-example-bundle   2019-12-09T12:55:40.8534462Z   root

runC does not deal with any image registries and only runs applications that are packaged in the OCI format. So whatever executes runC would have to somehow get the images into this format (bundle) and execute runC with that bundle.

So what calls runC?
This is done by a component named containerd, which is a container supervisor (process monitor). It does not run the containers itself; that is done by runC. Instead it deals with the container lifecycle operations of containers run by runC. Actually, there is a runtime shim API allowing other runtimes to be used instead of runC.

Containerd contains a Container Runtime Interface (CRI) API, which is a gRPC API. The API implementation uses the containerd Go client to call into containerd. Other clients that use the containerd Go client are Docker, Pouch, and ctr.

$ wget https://github.com/containerd/containerd/archive/v1.3.0.zip
$ unzip v1.3.0.zip

Building a docker image to play around with containerd and runc:

$ docker build -t containerd-dev .
$ docker run -it --privileged \
    -v /var/lib/containerd \
    -v ${GOPATH}/src/github.com/opencontainers/runc:/go/src/github.com/opencontainers/runc \
    -v ${GOPATH}/src/github.com/containerd/containerd:/go/src/github.com/containerd/containerd \
    -e GOPATH=/go \
    -w /go/src/github.com/containerd/containerd containerd-dev sh

$ make && make install
$ cd /go/src/github.com/opencontainers/runc
$ make BUILDTAGS='seccomp apparmor' && make install

$ containerd --config config.toml

You can now attach to the same container and we can try out ctr and other commands:

$ docker ps
$ docker exec -ti <CONTAINER_ID> sh

So let's try pulling an image:

$ ctr image pull docker.io/library/alpine:latest
docker.io/library/alpine:latest:                                                  resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:e4355b66995c96b4b468159fc5c7e3540fcef961189ca13fee877798649f531a: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:89d9c30c1d48bac627e5c6cb0d1ed1eec28e7dbdfbcc04712e4c79c0f83faf17:    done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:965ea09ff2ebd2b9eeec88cd822ce156f6674c7e99be082c7efac3c62f3ff652:   done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 2.5 s                                                                    total:  1.9 Mi (772.0 KiB/s)
unpacking linux/amd64 sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a...
done

That looks good. Next, let's see if we can run it:

# ctr run docker.io/library/alpine:latest some_container_id echo "bajja"
bajja

So containerd is the daemon (long-running background process) which exposes a gRPC API over a local Unix socket (so there is no network traffic involved). containerd supports the OCI Image Specification, so it can use any image that exists in upstream repositories. OCI Runtime Specification support allows any container runtime that supports that spec to be run, like runC or rkt. It supports image pull and push. A Task is a live running process on the system.

ctr is a command line tool for interacting with containerd.

So, how could we run our above container using containerd?

$ docker exec -ti 78e22cb726b9 /bin/bash
$ cd /root/go/src/github.com/containerd/containerd/bin
$ ctr --debug images pull --user dbevenius:xxxx docker.io/dbevenius/faas-js-example:latest

The first thing that happens is that containerd will fetch the data from the remote, in this case Docker Hub, and store it in the content store:

$ ctr content ls

Fetch will update the metadata store and add a record for the image. The second stage is the Unpack stage, which reads the layers from the content store and unpacks them into the snapshotter.

$ ctr images ls
REF                                        TYPE                                                 DIGEST                                                                  SIZE     PLATFORMS  LABELS
docker.io/dbevenius/faas-js-example:latest application/vnd.docker.distribution.manifest.v2+json sha256:69cc8b6087f355b7e4b2344587ae665c61a067ee05876acc3a5b15ca2b15e763 28.9 MiB linux/amd64 -


$ ctr content ls | grep sha256:69cc8b6087f355b7e4b2344587ae665c61a067ee05876acc3a5b15ca2b15e763
DIGEST									SIZE	AGE		LABELS
sha256:69cc8b6087f355b7e4b2344587ae665c61a067ee05876acc3a5b15ca2b15e763	1.99kB	About an hour	containerd.io/gc.ref.content.2=sha256:95b3c812425e243848db3a3eb63e1e461f24a63fb2ec9aa61bcf5a553e280c07,containerd.io/gc.ref.content.4=sha256:28549a15ba3eb287d204a7c67fdb84e9d7992c7af1ca3809b6d8c9e37ebc9877,containerd.io/gc.ref.content.6=sha256:5a4ed7db773aa044d8c7d54860c6eff0f22aee8ee56d4badf4f890a3c82e6070,containerd.io/gc.ref.content.1=sha256:e7c96db7181be991f19a9fb6975cdbbd73c65f4a2681348e63a141a2192a5f10,containerd.io/gc.ref.content.7=sha256:aaf35efcb95f6c74dc6d2c489268bdc592ce101c990729280980da140647e63f,containerd.io/gc.ref.content.8=sha256:c79d77af46518dfd4e94d3eb3a989a43f06c08f481ab3a709bc5cd5570bb0fe2,containerd.io/gc.ref.content.3=sha256:778b81d0468fbe956db39aca7059653428a7a15031c9483b63cb33798fcdadfa,containerd.io/gc.ref.content.0=sha256:3e98616b38fe8a6943029ed434345adc3f01fd63dce3bec54600eb0c9e03bdff,containerd.io/distribution.source.docker.io=dbevenius/faas-js-example,containerd.io/gc.ref.content.5=sha256:0bcb2f6e53a714f0095f58973932760648f1138f240c99f1750be308befd943

$ ctr snapshots info faas-js-example-container
{
    "Kind": "Active",
    "Name": "faas-js-example-container",
    "Parent": "sha256:0e7d0af5a24eb910a700e2b293e4ae3b6a4b0ed5c277233ae7a62810cfe9c831",
    "Created": "2019-12-11T09:58:39.8149122Z",
    "Updated": "2019-12-11T09:58:39.8149122Z"
}
$ ctr snapshots tree
 sha256:f1b5933fe4b5f49bbe8258745cf396afe07e625bdab3168e364daf7c956b6b81
  \_ sha256:0a57385ee1dd96a86f16bfc33e7e8f3b03ba5054d663e4249e9798f15def762d
    \_ sha256:ebd0af597629452dee5e09da6b0bbecc93288d4910d49cef417097e1319e8e5f
      \_ sha256:fae0635457a678fa17ba41dc06cffc00c339c3c760515d8fd95f4c54d111ce4d
        \_ sha256:8e7ae562c333ef89a5ce0a5a49236ada5c7241e7788adbf5fe20fd3f6e2eb97d
          \_ sha256:323ec4a838fe67b66e8fa8e4fb649f569be22c9a7119bb59664c106c1af8e5b1
            \_ sha256:f4238a21a85c3d721b54f2304a671aa56cc593a436e2fe554f88369c527672f0
              \_ sha256:0e7d0af5a24eb910a700e2b293e4ae3b6a4b0ed5c277233ae7a62810cfe9c831
                \_ faas-js-example-container

So this is the information that will be available after a pull.

So we should now be able to run this image using ctr:

$ ctr run docker.io/dbevenius/faas-js-example:latest faas-js-example-container
+ umask 000
+ cd /home/node/usr
+ '[' -f package.json ]
+ cd ../src
+ node .
{"level":30,"time":1576058320396,"pid":9,"hostname":"d51fc5895172","msg":"Server listening at http://0.0.0.0:8080","v":1}
FaaS framework initialized

Run reads the image we want to run and creates the OCI specification from it. It will create a new read/write layer in the snapshotter. Then it will set up the container, which will have a new rootfs. When the runtime shim is asked to start the process, it will take the OCI specification and create a bundle directory:

$ ls /run/containerd/io.containerd.runtime.v2.task/default/faas-js-example-container
address  config.json  init.pid	log  log.json  rootfs  runtime	work

We could use this directory to start a container with just runc if we wanted to:

$ runc create --bundle /run/containerd/io.containerd.runtime.v2.task/default/faas-js-example-container/ faas-js-example-container2
$ runc list
ID                           PID         STATUS      BUNDLE                                                                            CREATED                        OWNER
faas-js-example-container2   5732        created     /run/containerd/io.containerd.runtime.v2.task/default/faas-js-example-container   2019-12-11T11:37:08.5385797Z   root

So we have launched a container using ctr, which uses the containerd go-client, and the container runtime used is runc.
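
ctr is a thin CLI on top of the containerd Go client (github.com/containerd/containerd). Below is a minimal sketch of doing roughly the same pull and run with the client library directly; the socket path and the default namespace match what ctr used above, error handling is kept short, and this is an illustration rather than code from this repository.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	// Connect to the containerd daemon over its unix socket.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// ctr uses the "default" namespace unless told otherwise.
	ctx := namespaces.WithNamespace(context.Background(), "default")

	// Pull the image and unpack it into the snapshotter (overlayfs).
	image, err := client.Pull(ctx, "docker.io/dbevenius/faas-js-example:latest",
		containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}

	// Create the container: a new read/write snapshot plus an OCI spec
	// generated from the image config (entrypoint, env, user, ...).
	container, err := client.NewContainer(ctx, "faas-js-example-container",
		containerd.WithNewSnapshot("faas-js-example-container", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	// The task is the running process; it is handed to the runtime shim
	// (containerd-shim-runc-v2) which in turn drives runc.
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		log.Fatal(err)
	}
	defer task.Delete(ctx)

	if err := task.Start(ctx); err != nil {
		log.Fatal(err)
	}
}
```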

We can attach another process (shell) and then inspect things:

List all the containers:

$ ctr containers ls
CONTAINER                    IMAGE                                         RUNTIME
faas-js-example-container    docker.io/dbevenius/faas-js-example:latest    io.containerd.runc.v2

Get info about a specific container:

$ ctr container info faas-js-example-container
{
    "ID": "faas-js-example-container",
    "Labels": {
        "io.containerd.image.config.stop-signal": "SIGTERM"
    },
    "Image": "docker.io/dbevenius/faas-js-example:latest",
    "Runtime": {
        "Name": "io.containerd.runc.v2",
        "Options": {
            "type_url": "containerd.runc.v1.Options"
        }
    },
    "SnapshotKey": "faas-js-example-container",
    "Snapshotter": "overlayfs",
    "CreatedAt": "2019-12-11T09:58:39.8637501Z",
    "UpdatedAt": "2019-12-11T09:58:39.8637501Z",
    "Extensions": null,
    "Spec": {
        "ociVersion": "1.0.1-dev",
        "process": {
            "user": {
                "uid": 1001,
                "gid": 0
            },
            "args": [
                "docker-entrypoint.sh",
                "/home/node/src/run.sh"
            ],
    ...

Notice the SnapshotKey which is faas-js-example-container and that it matches the output from when we used the ctr snapshots command above.

Get information about running processes (tasks):

# ctr tasks ls
TASK                         PID     STATUS
faas-js-example-container    5299    RUNNING

Lets take a look at that pid:

# ps 5299
  PID TTY      STAT   TIME COMMAND
 5299 ?        Ss     0:00 /bin/sh /home/node/src/run.sh

So this is the actual container/process.

$ ps aux | grep faas-js
root      5254  0.0  0.4 943588 26996 pts/1    Sl+  09:58   0:00 ctr run docker.io/dbevenius/faas-js-example:latest faas-js-example-container
root      5276  0.0  0.1 111996  6568 pts/0    Sl   09:58   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace default -id faas-js-example-container -address /run/containerd/containerd.sock

So process 5254 is the process we used to start the container. Notice the second process, which is containerd-shim-runc-v2.

# /usr/local/bin/containerd-shim-runc-v2 --help
Usage of /usr/local/bin/containerd-shim-runc-v2:
  -address string
    	grpc address back to main containerd
  -bundle string
    	path to the bundle if not workdir
  -debug
    	enable debug output in logs
  -id string
    	id of the task
  -namespace string
    	namespace that owns the shim
  -publish-binary string
    	path to publish binary (used for publishing events) (default "containerd")
  -socket string
    	abstract socket path to serve

This binary can be found in cmd/containerd-shim-runc-v2/main.go. TODO: Take a closer look at how this is implemented.

So, we now have an idea of what is involved when running containerd and runc, and which process on the system we can inspect. We will now turn our attention to kubernetes and kubelet to see how it uses containerd.

Kubelet

In a kubernetes cluster, a worker node will have a kubelet daemon running which processes pod specs and uses the information in the pod specs to start containers.

It originally did so by using Docker as the container runtime. There are other container runtimes, for example rkt, and to be able to switch out the container runtime an interface needed to be provided. This interface is called the Kubernetes Container Runtime Interface (CRI).

+------------+                 +--------------+      +------------+
| Kubelet    |                 |  CRI Shim    |      |  Container |<---> Container_0
|            |  CRI protobuf   |              |<---->|  Runtime   |<---> Container_1
| gRPC Client| --------------->| gRPC Server  |      |(containerd)|<---> Container_n
+------------+                 +--------------+      +------------+

The CRI shim, I think, is a plugin in containerd enabling it to access lower-level services in containerd without having to go through the "normal" client API. This might be useful for making a single API call that performs multiple containerd operations instead of having to go via the client API, which might require multiple calls.
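
To get a feel for what the CRI protobuf/gRPC interface looks like from the kubelet's side, here is a small sketch that talks gRPC directly to a CRI implementation over the containerd socket. The k8s.io/cri-api import path and the v1alpha2 API version are assumptions matching this era of Kubernetes; the kubelet does essentially this, just with many more calls.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	// Dial the CRI endpoint (containerd's CRI plugin listens on the same
	// socket as containerd itself). A custom dialer is used for the unix socket.
	conn, err := grpc.Dial("/run/containerd/containerd.sock",
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	// Version is the simplest CRI call: it returns the runtime name and version.
	version, err := client.Version(context.Background(), &runtimeapi.VersionRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(version.RuntimeName, version.RuntimeVersion)

	// Pod sandboxes are what the kubelet asks the runtime to create for each pod.
	sandboxes, err := client.ListPodSandbox(context.Background(), &runtimeapi.ListPodSandboxRequest{})
	if err != nil {
		log.Fatal(err)
	}
	for _, sb := range sandboxes.Items {
		fmt.Println(sb.Metadata.Namespace, sb.Metadata.Name, sb.State)
	}
}
```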

$ docker run --privileged -ti fedora /bin/bash
$ dnf install -y kubernetes-node
$ dockerd

Next, connect to the same container (remember, this is just another process in the same namespaces):

$ docker run -ti --privileged fedora /bin/bash
$ kubelet --fail-swap-on=false

Contents of an image

$ docker build -t dbevenius/faas-js-example .

This filesystem is tarred (.tar) and metadata is added.

So, lets save an image to a tar:

$ docker save dbevenius/faas-js-example -o faas-js-example.tar

If you extract this tar file somewhere you can see all the files that are included.

$ ls -l
total 197856
drwxr-xr-x  5 danielbevenius  staff       160 Nov 25 09:37 33f42e9c3b8312f301e51b6c2575dbf1943afe5bfde441a81959b67e17bd30fd
drwxr-xr-x  5 danielbevenius  staff       160 Nov 25 09:37 354bdf12df143f7bb58e23b66faebb6532e477bb85127dfecf206edf718f6afa
-rw-r--r--  1 danielbevenius  staff      7184 Nov 25 09:37 3e98616b38fe8a6943029ed434345adc3f01fd63dce3bec54600eb0c9e03bdff.json
drwxr-xr-x  5 danielbevenius  staff       160 Nov 25 09:37 4ce67bc3be70a3ca0cebb5c0c8cfd4a939788fd413ef7b33169fdde4ddae10c9
drwxr-xr-x  5 danielbevenius  staff       160 Nov 25 09:37 835da67a1a2d95f623ad4caa96d78e7ecbc7a8371855fc53ce8b58a380e35bb1
drwxr-xr-x  5 danielbevenius  staff       160 Nov 25 09:37 86b808b018888bf2253eae9e25231b02bce7264801dba3a72865af2a9b4f6ba9
drwxr-xr-x  5 danielbevenius  staff       160 Nov 25 09:37 91859611b06cec642fce8f8da29eb8e18433e8e895787772d509ec39aadd41f9
drwxr-xr-x  5 danielbevenius  staff       160 Nov 25 09:37 b7e513f1782880dddf7b47963f82673b3dbd5c2eeb337d0c96e1ab6d9f3b76bd
drwxr-xr-x  5 danielbevenius  staff       160 Nov 25 09:37 f3d9c7465c1b1752e5cdbe4642d98b895476998d41e21bb2bfb129620ab2aff9
-rw-r--r--  1 danielbevenius  staff       794 Jan  1  1970 manifest.json
-rw-r--r--  1 danielbevenius  staff       183 Jan  1  1970 repositories

manifest.json:

`[
  {"Config":"3e98616b38fe8a6943029ed434345adc3f01fd63dce3bec54600eb0c9e03bdff.json",
   "RepoTags":["dbevenius/faas-js-example:0.0.3","dbevenius/faas-js-example:latest"],
    "Layers":["b7e513f1782880dddf7b47963f82673b3dbd5c2eeb337d0c96e1ab6d9f3b76bd/layer.tar",
              "86b808b018888bf2253eae9e25231b02bce7264801dba3a72865af2a9b4f6ba9/layer.tar",
              "354bdf12df143f7bb58e23b66faebb6532e477bb85127dfecf206edf718f6afa/layer.tar",
              "4ce67bc3be70a3ca0cebb5c0c8cfd4a939788fd413ef7b33169fdde4ddae10c9/layer.tar",
              "91859611b06cec642fce8f8da29eb8e18433e8e895787772d509ec39aadd41f9/layer.tar",
              "835da67a1a2d95f623ad4caa96d78e7ecbc7a8371855fc53ce8b58a380e35bb1/layer.tar",
              "f3d9c7465c1b1752e5cdbe4642d98b895476998d41e21bb2bfb129620ab2aff9/layer.tar",
              "33f42e9c3b8312f301e51b6c2575dbf1943afe5bfde441a81959b67e17bd30fd/layer.tar"]}]

repositories:

{
  "dbevenius/faas-js-example": {
     "0.0.3":"33f42e9c3b8312f301e51b6c2575dbf1943afe5bfde441a81959b67e17bd30fd",
     "latest":"33f42e9c3b8312f301e51b6c2575dbf1943afe5bfde441a81959b67e17bd30fd"
  }
}

3e98616b38fe8a6943029ed434345adc3f01fd63dce3bec54600eb0c9e03bdff.json: This file contains the configuration of the container.

When we build a Docker image we specify a base image and that is usually a specific operating system. This is not a full OS but instead all the libraries and utilities expected to be found by the application. The kernel used is the host's.

Pods

Is a group of one or more containers with shared storage and network. Pods are the unit of scaling.

A pod consists of Linux namespaces which are shared by all the containers in the pod, which gives them access to each other. So while a container is used for isolation, you can join containers together using namespaces, and that is how a pod is created. This is how a pod can share the one IP address, as the containers are in the same networking namespace. And remember that a container is just a process, so these are multiple processes that can share some resources with each other.

Kubernetes Custom Resources

Are extensions of the Kubernetes API. A resource is simply an endpoint in the kubernetes API that stores a collection of API objects (think pods or deployments and things like that). You can add your own resources, just like them, using custom resources. After a custom resource is installed, kubectl can be used with it just like with any other object.

So a custom resource just allows for storing and retrieving structured data; to add functionality you write custom controllers.
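
Since a custom resource is just another REST endpoint, any Kubernetes client can read it. As an illustration, here is a minimal sketch using client-go's dynamic client to list the members custom resource used later in this document; the v1 version string, the namespace, and the kubeconfig path are assumptions (check the CRD for the real version), and a recent client-go is assumed for the context-taking List call.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client config from the local kubeconfig (assumed path).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}

	// The dynamic client works with unstructured objects, so it can talk to
	// any resource, including custom resources, without generated types.
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Group and resource come from the CRD used later in this document; the
	// version here is an assumption, check the CRD for the real one.
	gvr := schema.GroupVersionResource{
		Group:    "example.nodeshift.com",
		Version:  "v1",
		Resource: "members",
	}

	list, err := dyn.Resource(gvr).Namespace("my-controller").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, item := range list.Items {
		fmt.Println(item.GetName())
	}
}
```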

Controllers

Each controller is responsible for a particular resource.

Controller components:

Informer/SharedInformer

A resource can be watched; watch is a verb in the exposed REST API. When this is used there will be a long-running connection, an http/2 stream, of event changes to the resource (create, update, delete, etc).

The informer watches the current state of resource instances and sends events to the Workqueue. When the informer needs information about an object it sends a request to the API server. Rather than each informer caching the objects it is interested in, multiple controllers might be interested in the same resource object, so instead of each of them caching the data/state they can share the cache among themselves; this is what a SharedInformer does.

The informers also contain error handling: if the long-running connection breaks, they will take care of reconnecting.

The Resource Event Handler handles the notifications when changes occur:

type ResourceEventHandlerFuncs struct {
	AddFunc    func(obj interface{})
	UpdateFunc func(oldObj, newObj interface{})
	DeleteFunc func(obj interface{})
}
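
To see how these handler functions are typically wired up, here is a rough sketch (using k8s.io/client-go's informers, cache, and workqueue packages) that watches Pods with a SharedInformer and pushes object keys onto a workqueue, which is described in the next section. The clientset is assumed to be constructed as in the other client-go sketches in this document.

```go
// Assumes imports of time, k8s.io/client-go/informers, k8s.io/client-go/tools/cache
// and k8s.io/client-go/util/workqueue, plus an already constructed clientset.
factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
podInformer := factory.Core().V1().Pods().Informer()

queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		// Only the key (namespace/name) is queued; the worker looks the
		// object up again in the informer's cache when processing it.
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	},
	UpdateFunc: func(oldObj, newObj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
			queue.Add(key)
		}
	},
	DeleteFunc: func(obj interface{}) {
		// Deleted objects may arrive wrapped in a tombstone, hence this helper.
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	},
})

stop := make(chan struct{})
factory.Start(stop)
factory.WaitForCacheSync(stop)
```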

Workqueue

Items in this queue are taken by workers to perform work.
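
Continuing the sketch above, the worker side of the queue could look roughly like this; the queue type comes from k8s.io/client-go/util/workqueue and the actual reconcile logic is left as a placeholder.

```go
// runWorker drains the workqueue that the informer event handlers fill.
// Each item is a namespace/name key produced by MetaNamespaceKeyFunc.
func runWorker(queue workqueue.RateLimitingInterface) {
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}

		err := func(key interface{}) error {
			// Always mark the item as done, otherwise the same key can
			// never be processed again.
			defer queue.Done(key)

			// This is where a controller would fetch the object from the
			// informer's cache and reconcile actual state with desired state.
			fmt.Println("processing", key)
			return nil
		}(key)

		if err != nil {
			// Requeue with a rate limit so a failing object is retried with backoff.
			queue.AddRateLimited(key)
			continue
		}
		// Processing succeeded: clear any rate-limit history for this key.
		queue.Forget(key)
	}
}
```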

Custom Resource Def/Controller example

rust-controller is an example of a custom resource controller written in Rust. The goal is to understand how these work, so that it becomes easier to understand how other controllers are written, installed, and run.

I'm using CodeReady Containers (crc) so I'll be using some non-kubernetes commands:

$ oc login -u kubeadmin -p e4FEb-9dxdF-9N2wH-Dj7B8 https://api.crc.testing:6443
$ oc new-project my-controller
$ kubectl create  -f k8s-controller/docs/crd.yaml
customresourcedefinition.apiextensions.k8s.io/members.example.nodeshift.com created

We can try to access the new resource using:

$ kubectl get member -o yaml

But there will not be anything in the list yet. We have to create something using:

$ kubectl apply -f k8s-controller/docs/member.yaml
member.example.nodeshift.com/dan created

Now if we again try to list the resources we will see an entry in the items list.

$ kubectl get members -o yaml -v=7

The extra -v=7 flag gives verbose output and might be useful to know about.

And we can get all Members using:

$ kubectl get Member
$ kubectl describe Member
$ kubectl describe Member/dan

You can see all the available short names using api-resources

$ kubectl api-resources
$ kubectl config current-context
default/api-crc-testing:6443/kube:admin

go-controller

This is a controller written in go. The motivation for having two is that most controllers I've seen are written in go and having an understanding of the code and directory structure of one will help understand others.

First to get all the dependencies onto our system we are going to use sample-controller from the kubernetes project:

$ go get k8s.io/sample-controller

We should now be able to build our go-controller:

$ unset CC CXX
$ cd go-controller 
$ go mod vendor
$ go build -o go-controller .
$ ./go-controller -kubeconfig=$HOME/.kube/config

Building/Running:

$ cargo run

Deleting a resource should trigger our controller:

$ kubectl delete -f docs/member.yaml

Keep this in mind when we are looking at Knative and Istio: this is mainly how one extends kubernetes, using custom resource definitions with controllers.

Installation

Knative runs on kubernetes, and Knative depends on Istio so we need to install these. Knative depends on Istio for setting up the internal network routing and the ingress (data originating from outside the local network).

Installing Knative on top of minikube:

$ minikube start -p example --memory=8192 --cpus=6 \
  --kubernetes-version=v1.15.0 \
  --vm-driver=kvm2 \
  --disk-size=30g \
  --extra-config=apiserver.enable-admission-plugins="LimitRanger,NamespaceExists,NamespaceLifecycle,ResourceQuota,ServiceAccount,DefaultStorageClass,MutatingAdmissionWebhook"

Notice that we are using a profile which is specified with the -p option. We can later stop and start this profile by using minikube start -p example.

Sometimes I've gotten the following error when trying to start up the cluster:

Error starting cluster: addon phase cmd:"/bin/bash -c \"sudo env PATH=/var/lib/minikube/binaries/v1.15.0:$PATH kubeadm init phase addon all --config /var/tmp/minikube/kubeadm.yaml\"": /bin/bash -c "sudo env PATH=/var/lib/minikube/binaries/v1.15.0:$PATH kubeadm init phase addon all --config /var/tmp/minikube/kubeadm.yaml": Process exited with status 1
stdout:

stderr:
error execution phase addon/coredns: unable to create deployment: Internal error occurred: failed calling webhook "legacysinkbindings.webhook.sources.knative.dev": Post https://eventing-webhook.knative-eventing.svc:443/legacysinkbindings?timeout=30s: dial tcp 10.104.102.208:443: connect: connection refused


😿  minikube is exiting due to an error. If the above message is not useful, open an issue:
👉  https://github.com/kubernetes/minikube/issues/new/choose
❌  Problems detected in kube-addon-manager ["8f8d6c1fca73"]:
    error: no objects pasIedFtO :p pLye
    error: No Obj c=t=  pKasbeed te tapsl y
    error: no objeatsdpoans d to appey cerror: no abjectstp a2s0se0d-t0o -p2p3ly

But trying to restart it again worked. Perhaps there is some sort of race condition between services upon cluster startup.

$ curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.15.0/bin/linux/amd64/kubectl
$ chmod +x ./kubectl
$ sudo mv ./kubectl /usr/local/bin/kubectl

$ export ISTIO_VERSION=1.3.6
$ curl -L https://git.io/getLatestIstio | sh -
$ cd istio-${ISTIO_VERSION}
$ for i in install/kubernetes/helm/istio-init/files/crd*yaml; do kubectl apply -f $i; done
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: istio-system
  labels:
    istio-injection: disabled
EOF
namespace/istio-system created

Install helm:

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
$ helm template --namespace=istio-system \
  --set prometheus.enabled=false \
  --set mixer.enabled=false \
  --set mixer.policy.enabled=false \
  --set mixer.telemetry.enabled=false \
  `# Pilot doesn't need a sidecar.` \
  --set pilot.sidecar=false \
  --set pilot.resources.requests.memory=128Mi \
  `# Disable galley (and things requiring galley).` \
  --set galley.enabled=false \
  --set global.useMCP=false \
  `# Disable security / policy.` \
  --set security.enabled=false \
  --set global.disablePolicyChecks=true \
  `# Disable sidecar injection.` \
  --set sidecarInjectorWebhook.enabled=false \
  --set global.proxy.autoInject=disabled \
  --set global.omitSidecarInjectorConfigMap=true \
  --set gateways.istio-ingressgateway.autoscaleMin=1 \
  --set gateways.istio-ingressgateway.autoscaleMax=2 \
  `# Set pilot trace sampling to 100%` \
  --set pilot.traceSampling=100 \
  --set global.mtls.auto=false \
  install/kubernetes/helm/istio \
  > ./istio-lean.yaml
$ kubectl apply -f istio-lean.yaml

Verify that istio is installed:

$ kubectl get pods --namespace istio-system -w
NAME                                   READY   STATUS    RESTARTS   AGE
istio-ingressgateway-5d9bc67ff-cgfcp   0/1     Running   0          29s
istio-pilot-54c8644bc5-8jh47           0/1     Running   0          29s
istio-pilot-54c8644bc5-8jh47           1/1     Running   0          61s

Next, we install Knative itself:

$ kubectl apply --selector knative.dev/crd-install=true --filename https://github.com/knative/serving/releases/download/v0.12.0/serving.yaml --filename https://github.com/knative/eventing/releases/download/v0.12.0/eventing.yaml --filename https://github.com/knative/serving/releases/download/v0.12.0/monitoring.yaml

$ kubectl apply --filename https://github.com/knative/serving/releases/download/v0.12.0/serving.yaml --filename https://github.com/knative/eventing/releases/download/v0.12.0/eventing.yaml --filename https://github.com/knative/serving/releases/download/v0.12.0/monitoring.yaml 

$ kubectl get pods --namespace knative-serving -w
NAME                               READY   STATUS    RESTARTS   AGE
activator-6b49796b46-lww55         1/1     Running   0          12m
autoscaler-7b46fcb475-lclgc        1/1     Running   0          12m
autoscaler-hpa-797c8c8647-zmrkc    1/1     Running   0          12m
controller-65f4f4bcb4-8gq7r        1/1     Running   0          12m
networking-istio-87d7c6686-tzvsk   1/1     Running   0          12m
webhook-59585cb6-vrmx8             1/1     Running   0          12m

To verify that Knative has been installed we can check the pods:

$ kubectl get pods --namespace knative-serving --namespace=knative-eventing

So, after this we should be good to go and deploying a Knative service should be possible:

$ kubectl apply -f service.yaml
$ kubectl get pods --watch
$ kubectl describe svc js-example

Get the port:

$ kubectl get svc istio-ingressgateway --namespace istio-system --output 'jsonpath={.spec.ports[?(@.port==80)].nodePort}'
31380

And the ip:

$ minikube ip
192.168.64.21

We can now use this information to invoke our service:

$ curl -v -H 'Host: js-example.default.example.com' 192.168.64.21:31380/

You'll notice that it might take a while for the first call if the service has been scaled down to zero. You can check this by first seeing if there are any pods before you run the curl command and then checking again afterwards.

If you have stopped and restarted the cluster (perhaps because the noise of your computer fan was driving you crazy) you might get the following error message:

UNAVAILABLE:no healthy upstream* Closing connection 0

The service will eventually become available; I think my machine (from 2013) is just exceptionally slow for this type of load.

So we have a service and we want events to be delivered to it. For this we need something that sends events. This is called an event source in knative.

$ kubectl apply -f source.yaml

Such an event source can send events directly to a service but that means that the source will have to take care of things like retries and handle situations when the service is not available. Instead the event source can use a channel which it can send the events to.

$ kubectl apply -f channel.yaml

Something can subscribe to this channel, enabling the events to get delivered to the service; these things are called subscriptions.

$ kubectl apply -f subscription.yaml

So we have our service deployed, we have a source for generating events which sends events to a channel, and we have a subscription that connects the channel to our service. Let's see if this works with our js-example.

Sometimes when reading examples online you might copy one and it fails to deploy saying that the resource does not exist. For example:

error: unable to recognize "channel.yaml": no matches for kind "Channel" in version "eventing.knative.dev/v1alpha1"

If this happens and you have installed a channel resource you can use the following command to find the correct apiVersion to use:

$ kubectl api-resources | grep channel
channels                          ch              messaging.knative.dev              true         Channel
inmemorychannels                  imc             messaging.knative.dev              true         InMemoryChannel

Next we will create a source for events:

```console
$ kubectl describe sources
```

Knative focuses on three key categories:

* building your application
* serving traffic to it 
* enabling applications to easily consume and produce events.

Serving

Automatically scale based on load, including scaling to zero when there is no load. You deploy a prebuilt image to the underlying kubernetes cluster.

Serving contains a number of components/objects which are described below:

Configuration

This will contain a name reference to the container image to deploy. This ref is called a Revision. Example configuration (configuration.yaml):

apiVersion: serving.knative.dev/v1alpha1
kind: Configuration
metadata:
  name: js-example
  namespace: js-event-example
spec:
  revisionTemplate:
    spec:
      container:
        image: docker.io/dbevenius/faas-js-example

This can be applied to the cluster using:

$ kubectl apply -f configuration.yaml
configuration.serving.knative.dev/js-example created
$ kubectl get configurations js-example -oyaml
$ kubectl get ksvc js-example  --output=custom-columns=NAME:.metadata.name,URL:.status.url
NAME               URL
js-example   http://js-example.default.example.com

Revision

Immutable snapshots of code and configuration. Refs a specific container image to run. Knative creates Revisions for us when we modify the Configuration. Since Revisions are immutable and multiple versions can be running at once, it’s possible to bring up a new Revision while serving traffic to the old version. Then, once you are ready to direct traffic to the new Revision, update the Route to instantly switch over. This is sometimes referred to as a blue-green deployment, with blue and green representing the different versions.

$ kubectl get revisions 

Route

Routes to a specific revision.

Service

This is our function's code.

The serving namespace is knative-serving. The Serving system has four primary components:

1) Controller
Is responsible for updating the state of the cluster. It will create kubernetes
and istio resources for the knative-serving resource being created.

2) Webhook
Handles validation of the objects and actions performed

3) Activator
Brings back scaled-to-zero pods and forwards requests.

4) Autoscaler
Scales pods as requests come in.

We can see these pods by running:

$ kubectl -n knative-serving get pods

So, let's take a look at the controller. The configuration files for it are located in controller.yaml:

$ kubectl describe deployment/controller -n knative-serving

I'm currently using OpenShift so the details compared to the controller.yaml will probably differ, but the interesting part for me is that these are "just" objects that are deployed into the kubernetes cluster.

So what happens when we run the following command?

$ kubectl apply -f service.yaml

This will make a request to the API Server which will take the actions appropriate for the description of the state specified in service.yaml. For this to work there must have been something registered that can handle the apiversion:

apiVersion: serving.knative.dev/v1alpha1

I'm assuming this is done as part of installing Knative.

Eventing

Makes it easy to produce and consume events. Abstracts away from event sources and allows operators to run their messaging layer of choice.

Knative is installed as a set of Custom Resource Definitions (CRDs) for Kubernetes.

Sources

The source of the events. Examples:

  • GCP PubSub
  • Kubernetes Events
  • Github
  • Container Sources

Channels

While you can send events straight to a Service, this means it’s up to you to handle retry logic and queuing. And what happens when an event is sent to your Service and it happens to be down? What if you want to send the same events to multiple Services? To answer all of these questions, Knative introduces the concept of Channels. Channels handle buffering and persistence, helping ensure that events are delivered to their intended Services, even if that service is down. Additionally, Channels are an abstraction between your code and the underlying messaging solution. This means we could swap this between something like Kafka and RabbitMQ.

Each channel is a custom resource.

Subscriptions

Subscriptions are the glue between Channels and Services, instructing Knative how our events should be piped through the entire system.

Istio

Istio is a service mesh that provides many useful features on top of Kubernetes including traffic management, network policy enforcement, and observability. We don’t consider Istio to be a component of Knative, but instead one of its dependencies, just as Kubernetes is. Knative ultimately runs on a Kubernetes cluster with Istio.

Service mesh

A service mesh is a way to control how different parts of an application share data with one another. So you have your app that communicates with various other systems, like backend database applications or other services. They are all moving parts and their availability might change over time. To avoid one system getting swamped with requests and overloaded, a service mesh is used, which routes requests from one service to the next. This indirection allows for optimizations and re-routing where needed.

Another reason for having a service mesh like this is that a microservice architecture might be implemented in various different languages. These languages have different ways of doing things like providing stats, tracing, logging, retry, circuit breaking, rate limiting, authentication and authorization. This can make it difficult to debug latency and failures.

In a service mesh, requests are routed between microservices through proxies in their own infrastructure layer. For this reason, individual proxies that make up a service mesh are sometimes called “sidecars,” since they run alongside each service, rather than within them. These sidecars are just containers in the pod. Taken together, these “sidecar” proxies—decoupled from each service—form a mesh network.

So each service has a proxy attached to it which is called a sidecar. These sidecars route network requests to other sidecars, which front the services that the current service uses. The network of these sidecars is the service mesh.

These sidecars also allow for collecting metrics about communication, so that other services can be added to monitor or take actions based on changes to the network. The sidecar will do things like service discovery, load balancing, rate limiting, circuit breaking, retry, etc. So if serviceA wants to call serviceB, serviceA will talk to its local sidecar proxy, which will take care of calling serviceB, wherever serviceB might be at the current time. So the services themselves are decoupled from each other and also don't have to be concerned with networking; they just communicate with the local sidecar proxy.

Note that we have only been talking about communication between services and not communication with the outside world (outside of the service network/mesh). To expose a service to the outside world and allow it to be accessed through the service mesh, so that it can take advantage of all the features of the service mesh instead of being called directly, we have to enable ingress traffic.

So we have dynamic request routing in the proxies. To manage the routing and other features of the service mesh, a control plane is used for centralized management. In Istio this control plane has three components:

1) Pilot
2) Mixer
3) Istio-Auth

Sidecar

Is a utility container that supports the main container in a pod. Remember that a pod is a collection of one or more containers.

All of these instances form a mesh and share routing information with each other.

So to use Knative we need Istio for the service mesh (communication between services), and we also need to be able to access the target service externally, which is what we use an ingress service for.

So I need to install Istio (or another service mesh) and an ingress into the kubernetes cluster, and then Knative itself.

Istio

Is a service mesh implementation and also a platform, including APIs that let it integrate into any logging platform, or telemetry or policy system

You add Istio support to services by deploying a special sidecar proxy throughout your environment that intercepts all network communication between microservices, then configure and manage Istio using its control plane functionality

Istio’s traffic management model relies on the Envoy proxies that are deployed along with your services. All traffic that your mesh services send and receive (data plane traffic) is proxied through Envoy, making it easy to direct and control traffic around your mesh without making any changes to your services.

Envoy

The goal of Envoy is to make the network transparent to applications. When issues occur it should be easy to figure out where the problem is.

Envoy has an out-of-process architecture, which is great when you have services written in multiple languages. If you opt for a library approach you have to have implementations in all the languages that you use (Hystrix is an example). Envoy is a layer 3/layer 4 filter architecture (so network layer (IP) and transport layer (TCP/UDP)). There is also a layer 7 (application layer) filter that can operate on/filter http headers.

It provides service discovery and active (ping the service)/passive (monitor the traffic) health checking. It has various load-balancing algorithms, provides observability via stats, logging, and tracing, and handles authentication and authorization.

Envoy is used as both an Edge proxy and a service proxy.

  1. Edge proxy: this gives a single point of ingress (external traffic; not internal to the service mesh).
  2. Service proxy: this is a separate process that keeps an eye on the services.

CloudEvent spec 1.0

This spec describes data in a common way to provide interoperability among serverless providers, so that events can be generated and consumed by different cloud providers/languages.

The spec consists of a base which contains the attributes for a CloudEvent. Then there is an extension to this which defines additional attributes that can be used by certain providers/consumers; one example given is tracing. Then there is event format encoding, which defines how the base and extension information is mapped to headers and payload of an application protocol. Finally we have protocol bindings, which define how a CloudEvent is bound to an application protocol transport frame.

There is a concept of Event Formats that specify how a CloudEvent is serialized into various encoding formats (for example JSON).

Mandatory:

id              string identifier
source          url that identifies the context in which the event happened
                (type of event source, etc). The source+id must be unique for each event.

specversion
type

Optional:

datacontenttype
dataschema
data
subject
time

Example:

{
    "specversion" : "1.0",
    "type" : "com.github.pull.create",
    "source" : "https://github.com/cloudevents/spec/pull",
    "subject" : "123",
    "id" : "A234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "comexampleextension1" : "value",
    "comexampleothervalue" : 5,
    "datacontenttype" : "text/xml",
    "data" : "<much wow=\"xml\"/>"
}

There is a http-protocol-binding specification: https://github.com/cloudevents/spec/blob/v1.0/http-protocol-binding.md

This spec defines three content modes for transferring events:

1) binary
2) structured
3) batched

Binary content mode

In the binary content mode, the value of the event data is placed into the HTTP request/response body as-is, with the datacontenttype attribute value declaring its media type in the HTTP Content-Type header; all other event attributes are mapped to HTTP headers.
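
To make the binary mode concrete, here is a small sketch using only Go's standard library to POST the example event from above in binary mode; the target URL is just a placeholder.

```go
package main

import (
	"bytes"
	"log"
	"net/http"
)

func main() {
	// In binary mode the event data goes into the body as-is, and every other
	// CloudEvent attribute becomes a ce- prefixed HTTP header.
	body := bytes.NewBufferString(`<much wow="xml"/>`)

	req, err := http.NewRequest(http.MethodPost, "http://localhost:8080/", body)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "text/xml") // datacontenttype
	req.Header.Set("ce-specversion", "1.0")
	req.Header.Set("ce-type", "com.github.pull.create")
	req.Header.Set("ce-source", "https://github.com/cloudevents/spec/pull")
	req.Header.Set("ce-id", "A234-1234-1234")
	req.Header.Set("ce-time", "2018-04-05T17:31:00Z")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```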

Structured content mode

In the structured content mode, event metadata attributes and event data are placed into the HTTP request or response body using an event format. So if the event format is JSON the complete cloud event will be in the http request/response body.

Event formats

These formats are used with structured content mode.

This format can be one of different specs, for example there is one spec for a json format (https://github.com/cloudevents/spec/blob/v1.0/json-format.md).

Content-Type: application/cloudevents+json; charset=UTF-8

datacontenttype is expected to contain a media-type expression, for example application/json;charset=utf-8.

data is encoded using the above media-type.

The content mode is chosen by the sender of the event. The receiver of the event can distinguish between the three modes by inspecting the Content-Type header value. If the value of this header is application/cloudevents it is structured mode, if it is application/cloudevents-batch then it is batched, otherwise the mode is binary.
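
A receiver can make that decision by looking at the Content-Type header. Below is a rough sketch of such a handler, again using only the standard library; note that the batched check has to come before the structured one since both media types share the application/cloudevents prefix.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

func handler(w http.ResponseWriter, r *http.Request) {
	ct := r.Header.Get("Content-Type")
	switch {
	case strings.HasPrefix(ct, "application/cloudevents-batch"):
		// Batched mode: the body is an array of events in the event format.
		fmt.Fprintln(w, "batched mode")
	case strings.HasPrefix(ct, "application/cloudevents"):
		// Structured mode: attributes and data are all in the body.
		fmt.Fprintln(w, "structured mode")
	default:
		// Binary mode: attributes are in ce- headers, data is the body.
		fmt.Fprintln(w, "binary mode, type:", r.Header.Get("ce-type"))
	}
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```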

HTTP Header names

These headers all have a ce- prefix.

Helm

Is a package manager for Kubernetes (think npm). Helm calls its packaging format charts; a chart is a collection of files related to a set of Kubernetes resources. A chart must follow a naming and directory convention: the name of the directory must be the name of the chart:

chartname/
         Chart.yaml
         values.yaml
         charts
         crds (Custom Resource Definitions)
         templates

kubectl

Remove all the objects defined in a yaml file:

kubectl delete -f service.yaml

Update the object defined in a yaml file:

kubectl replace -f service.yaml

The Operator pattern combines custom resources and custom controllers.

Kubernetes

Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It gives you service discovery and load balancing, storage orchestration (mounting storage systems), automated rollouts/rollbacks, self healing, etc. There are various components to kubernetes:

Master node

Acts as a controller and is where decisions are made about scheduling, detecting and responding to cluster events. The master consists of the following components:

API Server

Exposes REST API that users/tools can interact with.

Cluster data store

This is etcd which is a key-value data store and is the persistent store that kubernetes uses.

Controller Manager

Runs all the controllers that handle all the routine tasks in the cluster. Examples are Node Controller, Replication Controller, Endpoint Controller, Service Account, and Token Controllers.

Scheduler

Watches for new pods and assigns them to nodes.

Dashboard (optional)

Web UI.

Resource

A resource is an object that is stored in etcd and can be accessed through the api server. By itself that is all it does: it contains information about the resource. A controller is what performs actions, as far as I understand it.

Worker nodes

These run the containers and provide the runtime. A worker node is comprised of a kubelet, which watches the API Server for pods that have been assigned to it. Inside each pod there are containers; the kubelet runs these via Docker by pulling images, stopping, starting, etc. Another part of a worker node is the kube-proxy, which maintains networking rules on the host and performs connection forwarding. It also takes care of load balancing.

Kubernetes nodes

Each node has a kubelet process and a kube-proxy process.

Kubelet

Makes sure that the containers are running in the pod. The information it uses is the PodSpecs.

Kube-proxy

Just like the kubelet is responsible for starting containers, kube-proxy is responsible for making services work. It watches for service events and creates, updates, or deletes the corresponding forwarding rules on the worker node. It maintains networking rules on nodes and uses the OS packet filtering layer if available.

Container runtime

Is the software responsible for running containers (docker, containerd, cri-o, rktlet).

Addons

Cluster DNS addon

Is a DNS server which serves DNS records for kubernetes services. Containers started by Kubernetes automatically include this DNS server in their DNS searches

API Groups

Kubernetes uses a versioned API which is categorized into API groups. For example, you might find:

apiVersion: serving.knative.dev/v1alpha1

in a yaml file which is specifying the API group, serving.knative.dev and the version.

Pods

Is a group of one or more containers with shared storage and network. Pods are the unit of scaling. A pod consists of Linux namespaces which are shared by all the containers in the pod, which gives them access to each other. So while a container is used for isolation, you can join containers using namespaces, and that is how a pod is created. This is how a pod can share the one IP address, as the containers are in the same networking namespace.

ReplicaSet

The goal of a replicaset is to maintain a stable set of replica Pods. When a ReplicaSet needs to create new Pods, it uses its Pod template. Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features. It is recommended to use Deployments instead of ReplicaSets directly.

ReplicationController

ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available. If there are too many pods this controller will delete them, and if there are not enough it will create more. Pods maintained by this controller are automatically replaced if deleted, which is not the case for manually created pods.

Container networking

Docker networking uses the kernel's networking stack as low level primitives to create higher level network drivers.

We have seen how a linux bridge can be used to connect containers on the same host. But what if we have containers on different hosts that need to communicate?
There are multiple solutions to this; one is using VXLAN, another is Macvlan, and there are others. As these are new to me I'm going to go through what they are to help me understand networking in kubernetes better.

Virtual Local Area Network (VLAN)

Originally, to separate two networks they had to be physically separated. For example, you might have a guest network which should not be allowed to connect to the internal network. These two should not be able to communicate with each other, so there was simply no connection between the hosts on one network and hosts on the other. The hosts would be connected to separate switches.

VLANs provide logical separation/segmentation, so we can have all hosts connected to one switch but still separated into separate logical networks. This also allows hosts to be located at different locations as they don't have to be connected to the same physical switch (which was not possible pre-VLAN). With VLANs it does not matter where the hosts are: different floors/buildings/locations.

Virtual Extended Local Area Network (VXLAN)

VLANs are limited to 4094 VLAN IDs, but VXLAN allows for more than 4094, which might be required in a cloud environment. VXLAN is supported by the Linux kernel and is a network tunnel: it tunnels layer 2 frames inside of UDP datagrams. This means that containers that are part of the same virtual local area network appear to be on the same L2 network, when in fact they are separated.

Macvlan

Macvlan provides MAC hardware addresses to each container allowing them to become part of the traditional network and use IPAM or VLAN trunking.

Overlay network

The physical network is called the underlay, and an overlay abstracts this to create a virtual network. Much like the bridge in the networking example, in this case a virtual tunnel endpoint (VTEP) is added to the bridge. This will then encapsulate the packet in a UDP datagram with a few additional headers. VTEPs get their own MAC and IP addresses and show up as network interfaces.

Kubernetes Networking

So, we understand that a pod is a group of containers in the same namespaces. They will share the same network namespace, hence have the same ip address, and share the same iptables and ip routing rules. In the namespaces section above we also saw how multiple namespaces can communicate with each other.

So, a pod will have an ip address (shared by all the processes in the same namespace), and the worker node that the pod is running on will also have an ip:

+--------------------------+
|     worker node0         |
|  +------------+          |
|  |     pod    |          |
|  | 172.16.0.2 |          |
|  +------------+          |
|                          |
|      ip: 10.0.1.3        |
|pod cidr: 10.255.0.0/24   |
+--------------------------+
+-------------------------------------+
|     service                         |
|selector:                            |
|port:80:7777                         |
|port:8080:7777                       |
|type:ClusterIP|NodePort|LoadBalancer |
+-------------------------------------+

The ClusterIP is assigned by a controller manager for this service. This will be unique across the whole cluster. This can also be a dns name. So you can have applications point to the ClusterIP, and even if the underlying target pods are moved/scaled they will still continue to work. There is really nothing behind the ClusterIP, like there is no container or anything like that. Instead the cluster ip is a target for iptables. So when a packet is destined for the cluster ip address it will get routed by iptables to the actual pods that implement that service.

Kubeproxy will watch for services and endpoints and update iptables on that worker node. So if an endpoint is removed iptables can be updated to remove that entry.

The NodePort type deals with getting traffic from outside of the cluster.

+-------------------------------------+
|     service                         |
|selector:                            |
|port:32599:80:7777                   |
|type:NodePort                        |
+-------------------------------------+

The port 32599 will be an entry in iptables for each node. So we can now use the nodeip:32599 to get to the service.

The LoadBalancer type is cloud specific and allows for a nicer way to access services from outside the cluster without having to use the nodeip:port. The load balancer will still point to the NodePort, so it builds on top of it.

IPTables

Is a firewall tool that interfaces with the linux kernel netfilter subsystem. Kube-proxy attaches rules to the PREROUTING chain for services.

$ iptables -t nat -A PREROUTING -m conntrack --ctstate NEW -j KUBE_SERVICES

Above, we are adding a rule to the nat table by appending to the PREROUTING chain. The -m option specifies an iptables extension, conntrack, which allows access to the connection tracking state for this packet/connection. By specifying this extension we can also match on ctstate NEW. Finally, the target is specified as KUBE_SERVICES.

$ iptables -A KUBE_SERVICES ! -s src_cidr -d dst_cidr -p tcp -m tcp --dport 80 -j KUBE_MARQ

This appends a rule to the KUBE_SERVICES chain. TODO: add this from a real example.

When a cluster grows to many services, using iptables can be a cause of performance issues as the rules are applied sequentially: a packet will be checked against the rules one by one until a match is found, O(n). For this reason IP Virtual Server (IPVS), which is also a Linux kernel feature, can be used. I think this was also a reason for looking into other ways to avoid this overhead, and one such way is to use eBPF.

Container Network Interface (CNI)

The following is from the cni-plugin section:

A CNI plugin is responsible for inserting a network interface into the container
network namespace (e.g. one end of a veth pair) and making any necessary changes
on the host (e.g. attaching the other end of the veth into a bridge). It should
then assign the IP to the interface and setup the routes consistent with the IP
Address Management section by invoking appropriate IPAM plugin.

This should hopefully sound familiar and is very similar to what we did in the network namespace section. The kubelet specifies the CNI to be used as a command line option.

tun/tap

Are software-only network interfaces.

$ docker run --privileged -ti -v$PWD:/root/learning-knative -w/root/learning-knative gcc /bin/bash
$ mkdir /dev/net
$ mknod /dev/net/tun c 10 200
$ ip tuntap add mode tun dev tun0
$ ip addr add 10.0.0.0/24 dev tun0
$ ip link set dev tun0 up
$ ip route get 10.0.0.2

$ gcc -o tun tun.c
$ ./tun
Device tun0 opened

Building Knative Eventing

I've added a few environment variables to .bashrc and I also need to login to docker:

$ source ~/.bashrc
$ docker login
$ mkdir -p ${GOPATH}/src/knative.dev

Running the unit tests:

$ go test -v ./pkg/...
# runtime/cgo
ccache: invalid option -- E
Usage:
    ccache [options]
    ccache compiler [compiler options]
    compiler [compiler options]          (via symbolic link)

I had to unset CC and CXX for this to work:

$ unset CC
$ unset CXX

go-client

Is a web service client library written in go (k8s.io/client-go).
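
As a minimal sketch of using client-go from outside the cluster, the following lists pods in the default namespace; the kubeconfig path is an assumption and a recent client-go is assumed (the List call takes a context).

```go
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load the same config that kubectl uses (path is an assumption).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}

	// The typed clientset exposes one client per API group/version.
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pod := range pods.Items {
		fmt.Println(pod.Name, pod.Status.Phase)
	}
}
```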

Kubernetes on AWS

You can set up an Amazon Elastic Compute Cloud (EC2) node on their free tier, which only has one CPU. It is possible to run minikube on that with a flag to avoid a CPU check, as shown below:

$ ssh -i ~/.ssh/aws_key.pem ec2-user@ec2-18-225-37-245.us-east-2.compute.amazonaws.com
$ sudo yum update
$ sudo yum install -y docker
$ curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
$ sudo -i
$ /usr/local/bin/minikube start --vm-driver=none --extra-config=kubeadm.ignore-preflight-errors=NumCPU --force --cpus 1
$ sudo mv /root/.kube /root/.minikube $HOME
$ sudo chown -R $USER $HOME/.kube $HOME/.minikube

After this it should be possible to list the resources available on the cluster:

$ kubectl api-resources

Now, we want to be able to interact with this cluster from our local machine. To enable this we need to add a configuration for this cluster to ~/.kube/config: TODO: figure out how to set this up. In the meantime I'm just checking out this repository on the same ec2 instance and using a github personal access token to be able to work and commit.

Operators

In OpenShift Operators are the preferred method of packaging, deploying, and managing services on the control plane.

API Gateway

An API Gateway is focused on offering a single entry point for external clients, whereas a service mesh is for service-to-service communication, but there are a lot of features that both have in common. There would be more overhead having an API gateway between all internal services (like latency for example). Ambassador is an example of an API gateway. An API gateway can be used as the entry point to a service mesh.

gRPC

A service is created by defining the functions it exposes and this is what the server application implements. The server side runs a gRPC server to handle calls to the service by decoding the incoming request, executing the method, and encoding the response. The service is defined using an interface definition language (IDL). gRPC can use protocol buffers (see Protocol Buffers for more details) as its IDL and as its message format.

The clients then generate a stub and can call functions on it locally. The nice thing is that clients can be generated in different languages, so the service could be written in one and then multiple client stubs generated for languages that support gRPC.

To give some context about where gRPC is coming from: it is being used instead of RESTful APIs in places. RESTful APIs don't have a formal machine-readable API contract, the clients need to be written by hand, streaming is difficult, the information sent over the wire is not efficient for networks, and many RESTful endpoints are not actually resources (they get created with put/post, retrieved using get, etc).

gRPC is a protocol built on top of HTTP/2. There are three implementations:

  • C core - which is used by Ruby, Python, Node.js, PHP, C#, Objective-C, C++
  • Java (Netty + BoringSSL)
  • Go

gRPC was originally developed at Google and the internal name was Stubby.

Protocol Buffers (protobuf)

Is a mechanism for serializing structured data. You create a file that defines the structure, with a .proto extension:

message Something { string name = 1; int32 age = 2; }

With this file created we can use the compiler protoc to generate data access classes in the language you choose.

A gRPC service can then be created and it will use message types as the types of parameters and return values. The service methods themselves are specified using rpc:

syntax = "proto3";

package lkn;

service Something {
  rpc doit (InputMsg) returns (OutputMsg);
}

message InputMsg{
  string input = 1;
}

message OutputMsg{
  string output = 1;
}

The grpc directory contains a node.js example of a gRPC server and client.
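
For comparison with the node.js example, here is a rough sketch of what a Go server for the Something service above could look like. It assumes protoc has generated Go code into a local lkn package; the import path, the Unimplemented embedding, and the exact generated identifiers depend on the protoc-gen-go version, so treat them as assumptions.

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"

	// Assumed location of the protoc-generated code for the lkn package.
	pb "example.com/lkn"
)

// server implements the Something service defined in the proto file.
type server struct {
	// Embedding the generated Unimplemented type is required by newer
	// protoc-gen-go-grpc versions; older generators do not produce it.
	pb.UnimplementedSomethingServer
}

// Doit corresponds to the "rpc doit" method in the proto definition.
func (s *server) Doit(ctx context.Context, in *pb.InputMsg) (*pb.OutputMsg, error) {
	return &pb.OutputMsg{Output: "got: " + in.GetInput()}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	pb.RegisterSomethingServer(s, &server{})
	if err := s.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```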

Minikube on RHEL 8 issues

💣  Unable to start VM. Please investigate and run 'minikube delete' if possible: create: Error creating machine: Error in driver during machine creation: ensuring active networks: starting network minikube-net: virError(Code=89, Domain=47, Message='The name org.fedoraproject.FirewallD1 was not provided by any .service files')

😿  minikube is exiting due to an error. If the above message is not useful, open an issue:
👉  https://github.com/kubernetes/minikube/issues/new/choose

I was able to work around this by running:

$  sudo systemctl restart libvirtd

This was still not enough to get minikube to start though, I needed to delete minikube and start again:

$ minikube delete
$ minikube start

Docker/Moby-engine on Fedora

The default cgroups implementation on Fedora 31 and above is v2 which is not supported by the docker versions currently available for Fedora. You might see the following error:

$ docker run --rm hello-world:latest
docker: Error response from daemon: OCI runtime create failed: this version of runc doesn't work on cgroups v2: unknown.

One option is to revert to using v1 by running the following command and then rebooting:

sudo dnf install -y grubby && \
  sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"

Author: Danbev
Source Code: https://github.com/danbev/learning-knative 
License: 

#node #nodejs #kubernetes 

Learning-knative: Notes and Examples for Learning Knative