TL;DR

Kube-Scheduler: the main process responsible for scheduling. It works in two steps: Filtering removes the nodes that are not suitable, and Scoring ranks the nodes that passed filtering to find the best node for the pod
Predicates and Priorities: the stable scheduling policy of Kubernetes
Extension Points and Plugins: the newer scheduling policy of Kubernetes, which just became beta in v1.18
Pod Specification: we can declare the special capabilities and requirements of a Pod in its pod specification to steer Kubernetes toward scheduling it onto a suitable node
Specifying the nodeName: specifies the hostname of the node where we want the Pod to run
Specifying the nodeSelector: specifies the node labels the pod requires, so that the scheduler sends the pod to a node that has them
Pod Affinity Rules: specify that a set of pods should be placed in the same location
Node Affinity Rules: like nodeSelector, with the added flexibility of affinity
Taints: mark restrictions on a node, for example a node reserved for a specific group or one with special hardware installed
Tolerations: declare which taints a pod tolerates. If a pod tolerates the taints of a node, it can be scheduled onto that node
Custom Scheduler: if the default scheduler does not suit us, we can run a second instance of it with our own configuration as a secondary scheduler, or even write a new one from scratch. The pod specification must then name the scheduler we want to use, otherwise the default scheduler is used

Kube-Scheduler

Kube-Scheduler finds a suitable worker node for each pod to run on, using a topology-aware algorithm. Users can set the priority of a pod, and a pod with a higher priority can jump ahead of lower-priority pods in the scheduling queue.
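
The queue-jumping behaviour can be pictured with a small priority queue. This is only a sketch: `SchedulingQueue`, the pod names and the priority values are made up for illustration, and the real scheduler's queue does far more (back-off, preemption and so on).

```python
import heapq
import itertools


class SchedulingQueue:
    """Toy scheduling queue: higher priority first, then first come, first served."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that keeps arrival order

    def add(self, pod_name, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), pod_name))

    def pop(self):
        return heapq.heappop(self._heap)[2]


queue = SchedulingQueue()
queue.add("batch-job", priority=0)
queue.add("web-frontend", priority=100)  # arrives later but has a higher priority
queue.add("cron-job", priority=0)

order = [queue.pop() for _ in range(3)]
print(order)
```

The high-priority pod is popped first even though it arrived second, while the two pods with equal priority keep their arrival order.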

To schedule a pod, Kube-Scheduler performs 2 steps:

Filtering: keeps only the nodes whose properties are suitable for the pod
Scoring: ranks the remaining nodes with various algorithms to find the most suitable node to run the pod
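
The two steps can be sketched as a filter pass followed by a weighted scoring pass. Everything here is hypothetical (the node records, `fits_resources`, `least_allocated` and the weights); it illustrates the idea and is not how kube-scheduler is implemented.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Pod = Dict[str, Any]
Node = Dict[str, Any]


def fits_resources(pod: Pod, node: Node) -> bool:
    """Filter: the node must have enough free CPU and memory for the pod."""
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]


def least_allocated(pod: Pod, node: Node) -> float:
    """Score: prefer the node that keeps the most CPU free after placement."""
    return node["free_cpu"] - pod["cpu"]


def schedule(
    pod: Pod,
    nodes: List[Node],
    filters: List[Callable[[Pod, Node], bool]],
    scorers: List[Tuple[Callable[[Pod, Node], float], int]],
) -> Optional[str]:
    # Step 1, Filtering: drop every node that fails at least one filter.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible node: a real pod would stay Pending
    # Step 2, Scoring: weighted sum of all scorers, highest total wins.
    best = max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in scorers))
    return best["name"]


nodes = [
    {"name": "node-a", "free_cpu": 2, "free_mem": 4},
    {"name": "node-b", "free_cpu": 8, "free_mem": 16},
    {"name": "node-c", "free_cpu": 1, "free_mem": 2},
]
pod = {"cpu": 2, "mem": 4}
print(schedule(pod, nodes, [fits_resources], [(least_allocated, 1)]))  # node-b
```

Here node-c is filtered out for lack of resources, and node-b beats node-a because it keeps more CPU free.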

There are 2 kinds of scheduling policy to choose from:

Predicates and Priorities: predicates do the filtering by looking at state and behaviour, while priorities are functions that compute a score to find the most suitable node
Extension points and Plugins: uses a profile built by combining stages, called "extension points", such as QueueSort, Filter, Score, Bind, Reserve and Permit, to find the most suitable node

We can steer Kube-Scheduler's decisions by attaching special properties (labels) or restrictions (taints) to existing nodes or pods, and by declaring the corresponding capabilities on the pods we are about to schedule. Examples of steering the default scheduler:

affinity: the pod must run on a node where pods with the given labels are already running
antiaffinity: the pod must run on a node where no pods with the given labels are running
taint: a restriction that we place on a node
toleration: a pod property saying that the pod tolerates a taint on a node (tolerating means the pod can be scheduled there)

On top of that, we can write our own program and run it as a custom scheduler in the Kubernetes cluster.

More details are in the sections below 😁

Predicates and Priorities

Predicates (Deprecated in v1.18)

A set of filters used to check whether nodes are ready. Any node that fails a check is filtered out. The list of predicates is as follows
- CheckNodeConditionPred
- CheckNodeUnschedulablePred
- GeneralPred
- HostNamePred
- PodFitsHostPortsPred
- MatchNodeSelectorPred
- PodFitsResourcesPred
- NoDiskConflictPred
- PodToleratesNodeTaintsPred
- PodToleratesNodeNoExecuteTaintsPred
- CheckNodeLabelPresencePred
- CheckServiceAffinityPred
- MaxEBSVolumeCountPred
- MaxGCEPDVolumeCountPred
- MaxCSIVolumeCountPred
- MaxAzureDiskVolumeCountPred
- MaxCinderVolumeCountPred
- CheckVolumeBindingPred
- NoVolumeZoneConflictPred
- CheckNodeMemoryPressurePred
- CheckNodePIDPressurePred
- CheckNodeDiskPressurePred
- EvenPodsSpreadPred
- MatchInterPodAffinityPred
Priorities (Deprecated in v1.18)

Priorities are functions used to weight resources. Unless Pod and Node Affinity have been configured to use SelectorSpreadPriority (which ranks nodes by the number of pods already running on them), Kube-Scheduler picks the node with the fewest pods scheduled on it. The list of priorities is as follows
- EqualPriority
- MostRequestedPriority
- RequestedToCapacityRatioPriority
- SelectorSpreadPriority
- ServiceSpreadingPriority
- InterPodAffinityPriority
- LeastRequestedPriority
- BalancedResourceAllocation
- NodePreferAvoidPodsPriority
- NodeAffinityPriority
- TaintTolerationPriority
- ImageLocalityPriority
- ResourceLimitsPriority
- EvenPodsSpreadPriority

Scheduling Policies

We can customize Predicates and Priorities to build a custom scheduler by creating a policy file like this

{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
    ],
"priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
    ],
"hardPodAffinitySymmetricWeight" : 10,
"alwaysCheckAllPredicates" : false
}

Then run a second scheduler with the options --policy-config-file=<file>, --use-legacy-policy-config=true and --scheduler-name=<name>, and run a pod that names that scheduler as follows

apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: <name>
  containers:
  - name: pod-with-second-annotation-container
    image: k8s.gcr.io/pause:2.0

Extension Points and Plugins

This is a brand-new feature in v1.18 and is still beta.

A scheduling profile is built by combining extension points. We can supply profiles by passing a configuration file to kube-scheduler: kube-scheduler --config <filename>

Several profiles can be defined, and a pod selects one by referencing its scheduler name in the pod specification.

Extension points
- QueueSort: sorts the pods in the scheduling queue
- PreFilter: pre-processes or checks information about the Pod and the cluster before filtering
- Filter: like predicates, filters out the nodes that cannot run the Pod
- PreScore: does preparatory work before scoring
- Score: scores each node that passed filtering; the node with the highest score is selected
- Reserve: notifies plugins that resources are being reserved for the pod
- Permit: can prevent or delay the binding of a pod to a node
- PreBind: performs any work required before the pod is bound
- Bind: binds the pod to the node where resources were reserved
- PostBind: runs after the pod has been bound to the node successfully
- UnReserve: notifies plugins when a pod that was reserved is rejected in a later stage, so that the state kept for the reservation can be cleaned up
Scheduling plugins
- Enabled by default
  - DefaultTopologySpread (PreScore, Score): spreads pods across nodes according to the Services, ReplicaSets and StatefulSets they belong to
  - ImageLocality (Score): favors nodes that already have the pod's container images
  - TaintToleration (Filter, PreScore, Score): implements taints and tolerations
  - NodeName (Filter): checks that the node name matches the one given in the pod specification
  - NodePorts (PreFilter, Filter): checks that the node has the requested ports free
  - NodePreferAvoidPods (Score): scores nodes according to the annotation scheduler.alpha.kubernetes.io/preferAvoidPods
  - NodeAffinity (Filter, Score): implements nodeSelector and nodeAffinity
  - PodTopologySpread (Filter, PreScore, Score): implements Pod topology spread
  - NodeUnschedulable (Filter): filters out nodes that have .spec.unschedulable=true
  - NodeResourcesFit (PreFilter, Filter): filters out nodes that lack the resources the pod requests
  - NodeResourcesBalancedAllocation (Score): favors nodes whose resource usage would become more balanced if the pod were scheduled there
  - NodeResourcesLeastAllocated (Score): favors nodes with the most resources left
  - VolumeBinding (Filter): checks that the node has, or can bind, the volumes the pod requests
  - VolumeRestrictions (Filter): checks that the volumes mounted on the node satisfy restrictions specific to the volume provider
  - VolumeZone (Filter): checks that the requested volumes satisfy any zone requirements they have
  - NodeVolumeLimits (Filter): checks that the node satisfies CSI volume limits
  - EBSLimits (Filter): checks that the node satisfies AWS EBS volume limits
  - GCEPDLimits (Filter): checks that the node satisfies GCP-PD volume limits
  - AzureDiskLimits (Filter): checks that the node satisfies Azure disk volume limits
  - InterPodAffinity (PreFilter, Filter, PreScore, Score): implements inter-Pod affinity and anti-affinity
  - PrioritySort (QueueSort): provides the default priority-based sorting of the scheduling queue
  - DefaultBinder (Bind): provides the default binding mechanism
- Not enabled by default; must be enabled manually through the API
  - NodeResourcesMostAllocated (Score): favors nodes with the least resources left
  - RequestedToCapacityRatio (Score): favors nodes according to a configured function of the allocated resources
  - NodeResourceLimits (PreScore, Score): favors nodes that satisfy the pod's resource limits
  - CinderVolume (Filter): checks that the node satisfies OpenStack Cinder volume limits
  - NodeLabel (Filter, Score): filters and scores nodes according to configured labels
  - ServiceAffinity (PreFilter, Filter, Score): checks that pods belonging to a Service fit in a set of nodes defined by configured labels, and favors spreading those pods across nodes

Pod Specification

Besides the readiness of the nodes, the other important factor in scheduling a pod onto a suitable node is the Pod Specification. The parameters that affect scheduling include

nodeName: names the node where the scheduler should run the Pod
nodeSelector: lists the node labels that make a node suitable for this pod
affinity: covers both affinity, which says which pods this pod must be placed with, and anti-affinity, which says which pods it must be kept away from
schedulerName: names the scheduler that should schedule the pod; if omitted, the default scheduler is used
tolerations: says which restrictions (taints) of a node this Pod tolerates

Specifying the nodeName

This is the simplest way to place a pod, but it is not recommended because it comes with several limitations:

If the node name we specify does not exist, the pod will not run
If the node we specify does not have enough resources, the pod will not run either
In a cloud environment node names are not predictable, so we have to edit our configuration every time we move to another cluster or the node we named dies

Example of a pod specification that uses nodeName

$ kubectl get nodes
NAME                  STATUS   ROLES    AGE   VERSION
kube-0001.novalocal   Ready    master   88d   v1.17.1
kube-0002.novalocal   Ready    <none>   88d   v1.17.1
kube-0003.novalocal   Ready    <none>   88d   v1.17.1
$ cat > nodename.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-0003.novalocal
EOF
$ kubectl create -f nodename.yaml
pod/nginx created
$ kubectl get pods -o wide 
NAME    READY   STATUS    RESTARTS   AGE   IP              NODE                  NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          3m    192.168.32.60   kube-0003.novalocal   <none>           <none>

Specifying the nodeSelector

This is the simplest recommended way. Because the same labels can be put on several nodes, if one of them dies or becomes unavailable for any reason there are still other nodes to schedule onto.

We first have to label a node, like this

$ kubectl label nodes kube-0003.novalocal disktype=ssd

Then create a pod with a specification like the one below

$ kubectl get nodes --show-labels
NAME                  STATUS   ROLES    AGE   VERSION   LABELS
kube-0001.novalocal   Ready    master   88d   v1.17.1   (...),node-role.kubernetes.io/master=
kube-0002.novalocal   Ready    <none>   88d   v1.17.1   (...)
kube-0003.novalocal   Ready    <none>   88d   v1.17.1   (...)
$ kubectl label nodes kube-0003.novalocal disktype=ssd
node/kube-0003.novalocal labeled
$ kubectl get nodes --show-labels
NAME                  STATUS   ROLES    AGE   VERSION   LABELS
kube-0001.novalocal   Ready    master   88d   v1.17.1   (...),node-role.kubernetes.io/master=
kube-0002.novalocal   Ready    <none>   88d   v1.17.1   (...)
kube-0003.novalocal   Ready    <none>   88d   v1.17.1   (...),disktype=ssd,(...)
$ cat > nodeselector.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
EOF
$ kubectl create -f nodeselector.yaml
pod/nginx created
$ kubectl get pods -o wide 
NAME    READY   STATUS    RESTARTS   AGE   IP              NODE                  NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          19s   192.168.32.61   kube-0003.novalocal   <none>           <none>

The pod is scheduled onto a node that has every label listed in the pod specification. If no node matches the specification, the pod stays in the Pending state.
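
The "every label must match" rule can be sketched in a few lines. `matches_node_selector` is a hypothetical helper written for this post, not a Kubernetes API, and the `gpu` label is invented for the example.

```python
from typing import Dict


def matches_node_selector(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """True when the node carries every key/value pair listed in nodeSelector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


node_labels = {"disktype": "ssd", "kubernetes.io/os": "linux"}

print(matches_node_selector(node_labels, {"disktype": "ssd"}))                 # True
print(matches_node_selector(node_labels, {"disktype": "ssd", "gpu": "true"}))  # False
```

The second selector fails because the node has no `gpu` label, so a pod using it would stay Pending until some node gets that label.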

We can also use nodeAffinity instead of nodeSelector.

Nodes in a Kubernetes cluster normally come with built-in labels that we can use without defining them ourselves:

kubernetes.io/hostname
failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region
topology.kubernetes.io/zone
topology.kubernetes.io/region
beta.kubernetes.io/instance-type
node.kubernetes.io/instance-type
kubernetes.io/os
kubernetes.io/arch

Pod Affinity Rules

Pod Affinity schedules a pod relative to pods that already exist. Because the scheduler has to inspect every node, using it in a large cluster with hundreds of nodes affects the performance of the cluster.

There are 2 kinds of affinity rules:

Affinity: the pods must be in the same location (co-location); typically pods that share a lot of data
Anti-affinity: the pods must be in different locations; suited to pods that need fault tolerance

The operators that can be used to select labels include

In: selects pods whose label has one of the listed values
NotIn: selects pods whose label does not have any of the listed values
Exists: selects pods that have the given label
DoesNotExist: selects pods that do not have the given label
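
The four operators can be sketched as a small evaluator over a label dictionary. `matches_expression` and `matches_all` are hypothetical helpers for illustration; the sketch assumes, as Kubernetes label selectors do, that NotIn also matches when the key is absent.

```python
from typing import Dict, Iterable, List, Tuple


def matches_expression(labels: Dict[str, str], key: str, operator: str,
                       values: Iterable[str] = ()) -> bool:
    """Evaluate one matchExpressions entry against a set of labels."""
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        # A missing key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unknown operator: {operator}")


def matches_all(labels: Dict[str, str],
                expressions: List[Tuple[str, str, Tuple[str, ...]]]) -> bool:
    """All expressions of one selector are ANDed together."""
    return all(matches_expression(labels, k, op, vals) for k, op, vals in expressions)


pod_labels = {"app": "store"}
print(matches_expression(pod_labels, "app", "In", ["store"]))      # True
print(matches_expression(pod_labels, "app", "NotIn", ["store"]))   # False
print(matches_expression(pod_labels, "tier", "DoesNotExist"))      # True
```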

In addition, 2 parameters control how strictly the rule is applied when scheduling pods:

requiredDuringSchedulingIgnoredDuringExecution: the labels must satisfy the given conditions; if they do not, the pod is not scheduled (hard affinity)
preferredDuringSchedulingIgnoredDuringExecution: it is fine if nothing satisfies the given conditions, but if something does, the pod is scheduled according to them (soft affinity)

Furthermore, these rules are used together with topologyKey, as follows:

For Affinity, and for Anti-affinity that uses requiredDuringSchedulingIgnoredDuringExecution, topologyKey must be specified
For Anti-affinity that uses requiredDuringSchedulingIgnoredDuringExecution, topologyKey is limited to kubernetes.io/hostname when the LimitPodHardAntiAffinityTopology admission controller is enabled
For Anti-affinity that uses preferredDuringSchedulingIgnoredDuringExecution, topologyKey must be specified as well
In all other cases, topologyKey can be any legal label key, including the built-in labels

Example

Suppose we want to deploy a Redis cluster with 3 replicas, each on a different node to improve fault tolerance. The deployment looks like this

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine

Every pod in the Deployment has the label app: store, and podAntiAffinity says that the pod must not share a node with pods labeled app: store. In other words, no two of these Redis pods may run on the same node.
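
How topologyKey turns this rule into "one replica per node" can be sketched as follows. `violates_anti_affinity` is a hypothetical helper, the zone values are invented, and the sketch assumes every node carries the topology label and that the selector is a plain equality match.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def violates_anti_affinity(candidate: str, nodes: Dict[str, Labels],
                           running_pods: List[Tuple[str, Labels]],
                           selector: Labels, topology_key: str) -> bool:
    """True if a pod matching `selector` already runs in the candidate's topology domain."""
    domain = nodes[candidate][topology_key]
    for node_name, pod_labels in running_pods:
        same_domain = nodes[node_name][topology_key] == domain
        selected = all(pod_labels.get(k) == v for k, v in selector.items())
        if same_domain and selected:
            return True
    return False


HOST = "kubernetes.io/hostname"
ZONE = "topology.kubernetes.io/zone"
nodes = {
    "kube-0001.novalocal": {HOST: "kube-0001.novalocal", ZONE: "zone-a"},
    "kube-0002.novalocal": {HOST: "kube-0002.novalocal", ZONE: "zone-a"},
    "kube-0003.novalocal": {HOST: "kube-0003.novalocal", ZONE: "zone-b"},
}
# One Redis replica is already running on the first node.
running = [("kube-0001.novalocal", {"app": "store"})]
selector = {"app": "store"}

print(violates_anti_affinity("kube-0001.novalocal", nodes, running, selector, HOST))  # True
print(violates_anti_affinity("kube-0002.novalocal", nodes, running, selector, HOST))  # False
print(violates_anti_affinity("kube-0002.novalocal", nodes, running, selector, ZONE))  # True
```

With the hostname key every node is its own domain, so the next replica can go to any other node. With a zone key the second node would be rejected too, because it shares a zone with the first.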

# This is a 3-node cluster, so the master must be untainted first
$ kubectl taint nodes --all node-role.kubernetes.io/master-
node/kube-0001.novalocal untainted
taint "node-role.kubernetes.io/master" not found
taint "node-role.kubernetes.io/master" not found

$ cat > redis.yaml << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
EOF

$ kubectl create -f redis.yaml
deployment.apps/redis-cache created
$ kubectl get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP               NODE                  NOMINATED NODE   READINESS GATES
redis-cache-6bc7d5b59d-ngnsf   1/1     Running   0          33s   192.168.93.152   kube-0001.novalocal   <none>           <none>
redis-cache-6bc7d5b59d-swzjm   1/1     Running   0          33s   192.168.32.63    kube-0003.novalocal   <none>           <none>
redis-cache-6bc7d5b59d-vzs7h   1/1     Running   0          33s   192.168.52.66    kube-0002.novalocal   <none>           <none>

Suppose we want to deploy the frontend of an e-commerce website where each frontend pod must share a node with a redis pod, and each pair must be on a different node to improve fault tolerance. The deployment looks like this

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine

We give the frontend pods the label app: web-store, and

set podAntiAffinity on the label app: web-store, which means frontend pods must not share a node with each other
set podAffinity on the label app: store, which means a frontend pod must be on the same node as a Redis pod

$ cat > web-store.yaml << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine
EOF

$ kubectl create -f web-store.yaml 
deployment.apps/web-server created
$ kubectl get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE    IP               NODE                  NOMINATED NODE   READINESS GATES
redis-cache-6bc7d5b59d-brsp5   1/1     Running   0          118s   192.168.52.67    kube-0002.novalocal   <none>           <none>
redis-cache-6bc7d5b59d-crgrb   1/1     Running   0          118s   192.168.93.153   kube-0001.novalocal   <none>           <none>
redis-cache-6bc7d5b59d-kp4kq   1/1     Running   0          118s   192.168.32.1     kube-0003.novalocal   <none>           <none>
web-server-75b987f474-4qggk    1/1     Running   0          10s    192.168.32.2     kube-0003.novalocal   <none>           <none>
web-server-75b987f474-jswff    1/1     Running   0          10s    192.168.52.68    kube-0002.novalocal   <none>           <none>
web-server-75b987f474-kkddh    1/1     Running   0          10s    192.168.93.154   kube-0001.novalocal   <none>           <none>

Node Affinity Rules

Node affinity is like nodeSelector, with the extra capabilities of affinity: operators (In, NotIn, Exists, DoesNotExist) and requiredDuringSchedulingIgnoredDuringExecution/preferredDuringSchedulingIgnoredDuringExecution.

There is also a plan to add requiredDuringSchedulingRequiredDuringExecution, under which a change to the labels of a node would also affect the pods already running on it.

Example

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0

The pod can only be scheduled onto a node labeled kubernetes.io/e2e-az-name: e2e-az1 or kubernetes.io/e2e-az-name: e2e-az2. Among those nodes, the scheduler prefers one that also has the label another-node-label-key: another-node-label-value, and falls back to the others if none has it.
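
The interplay of the required and preferred rules can be sketched as a hard filter followed by a weighted preference. `term_matches` and `pick_node` are hypothetical helpers that only handle the In operator, and the labels on kube-0003 are invented for the example. The real scheduler also adds the scores of its other plugins.

```python
from typing import Dict, List, Optional, Tuple

Labels = Dict[str, str]
Term = List[Tuple[str, List[str]]]  # [(key, allowed values)], all ANDed, operator In only


def term_matches(labels: Labels, term: Term) -> bool:
    return all(labels.get(key) in values for key, values in term)


def pick_node(nodes: Dict[str, Labels], required_terms: List[Term],
              preferred: List[Tuple[int, Term]]) -> Optional[str]:
    # Hard rule: a node must match at least one of the required terms.
    feasible = [name for name in sorted(nodes)
                if any(term_matches(nodes[name], t) for t in required_terms)]
    if not feasible:
        return None  # the pod would stay Pending

    # Soft rule: add up the weights of every preference the node satisfies.
    def score(name: str) -> int:
        return sum(weight for weight, term in preferred if term_matches(nodes[name], term))

    return max(feasible, key=score)


AZ = "kubernetes.io/e2e-az-name"
required = [[(AZ, ["e2e-az1", "e2e-az2"])]]
preferred = [(1, [("another-node-label-key", ["another-node-label-value"])])]

nodes = {
    "kube-0001.novalocal": {"another-node-label-key": "another-node-label-value"},
    "kube-0002.novalocal": {AZ: "e2e-az2"},
    "kube-0003.novalocal": {AZ: "e2e-az1", "another-node-label-key": "another-node-label-value"},
}
print(pick_node(nodes, required, preferred))  # kube-0003.novalocal
```

kube-0001 has the preferred label but is still excluded because it fails the required rule. If kube-0003 is removed, the pod lands on kube-0002, which matches only the required rule.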

$ kubectl label nodes kube-0002.novalocal kubernetes.io/e2e-az-name=e2e-az2
node/kube-0002.novalocal labeled
$ kubectl get nodes --show-labels
NAME                  STATUS   ROLES    AGE   VERSION   LABELS
kube-0001.novalocal   Ready    master   88d   v1.17.1   (...)
kube-0002.novalocal   Ready    <none>   88d   v1.17.1   (...),kubernetes.io/e2e-az-name=e2e-az2,(...)
kube-0003.novalocal   Ready    <none>   88d   v1.17.1   (...)
$ cat > node-affinity.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
EOF
$ kubectl create -f node-affinity.yaml 
pod/with-node-affinity created
$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP              NODE                  NOMINATED NODE   READINESS GATES
with-node-affinity   1/1     Running   0          23s   192.168.52.69   kube-0002.novalocal   <none>           <none>

Taints

Taints let a node keep unsuitable pods from running on it. A pod that has no toleration for the taints of a node cannot be scheduled onto that node.

A taint has the form key=value:effect, where key and value are arbitrary values chosen by the administrator, and effect is one of 3 options:

NoSchedule: pods are not scheduled onto the node; pods that were already running before the taint was assigned are not affected
PreferNoSchedule: if no other suitable node can be found, pods may still be scheduled onto this node
NoExecute: new pods are not scheduled onto the node, and pods already running are evicted. If a pod has tolerationSeconds set, it stays until tolerationSeconds has elapsed and is then evicted

If a node has several taints, only pods that tolerate all of them can run on that node.

# Add a taint to a node
$ kubectl taint nodes node1 key=value:NoSchedule

# Remove a taint from a node
$ kubectl taint nodes node1 key:NoSchedule-

Tolerations

Tolerations let a pod tolerate the taints of a node. A toleration must match the key, value and effect of the taint.

For the key and value, a toleration can use one of 2 operators:

Exists: the toleration matches as soon as its key equals the key of the node's taint
```
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"
```
Equal: both the key and the value must be equal to match (the default operator)
```
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
```

If effect is omitted, the toleration matches every effect. This lets us write catch-all tolerations that cover many taints at once:

Only operator: Exists -> tolerates every taint
```
tolerations:
- operator: "Exists"
```
Only operator: Exists and key: "key" -> tolerates every taint with a matching key
```
tolerations:
- key: "key"
  operator: "Exists"
```

If a node has several taints, a pod must tolerate all of them to be scheduled there. If it does not, the pod is subject to the effects of the taints it does not tolerate, and the strongest of those effects decides what happens.
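
The matching rules above can be sketched like this. `tolerates`, `untolerated_effects` and `can_schedule` are hypothetical helpers, taints and tolerations are plain dictionaries shaped like the YAML fields, and tolerationSeconds is ignored.

```python
from typing import Dict, List, Set

Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """True if this toleration matches this taint."""
    # An omitted effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists without a key matches every taint; with a key, only that key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal (the default operator): key and value must both be equal.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def untolerated_effects(taints: List[Taint], tolerations: List[Toleration]) -> Set[str]:
    """Effects of all taints that no toleration matches."""
    return {t["effect"] for t in taints
            if not any(tolerates(tol, t) for tol in tolerations)}


def can_schedule(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    # PreferNoSchedule only lowers preference; the other two effects block scheduling.
    return not ({"NoSchedule", "NoExecute"} & untolerated_effects(taints, tolerations))


taints = [
    {"key": "key", "value": "value", "effect": "NoSchedule"},
    {"key": "gpu", "value": "true", "effect": "PreferNoSchedule"},
]
equal = {"key": "key", "operator": "Equal", "value": "value", "effect": "NoSchedule"}

print(can_schedule(taints, []))                        # False
print(can_schedule(taints, [equal]))                   # True
print(can_schedule(taints, [{"operator": "Exists"}]))  # True
```

In the second call the gpu taint is still untolerated, but its effect is only PreferNoSchedule, so the pod can be scheduled there if nothing better is available.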

Use cases for taints and tolerations include

Dedicated Nodes that we reserve for certain groups of users only
Nodes with Special Hardware

Custom Scheduler

If affinity, taints and policies are still not enough for our needs, we can build our own scheduler; see GitHub for how.

To use our new scheduler, the scheduler name must be set in the pod specification. If it is omitted the default scheduler is used, and if we name a scheduler that does not exist the pod stays in the Pending state.

We can inspect what the scheduler did, along with other information, using kubectl get events

$ kubectl get events
LAST SEEN   TYPE     REASON      OBJECT      MESSAGE
41s         Normal   Killing     pod/nginx   Stopping container nginx
2s          Normal   Scheduled   pod/nginx   Successfully assigned default/nginx to kube-0003.novalocal
1s          Normal   Pulled      pod/nginx   Container image "nginx" already present on machine
1s          Normal   Created     pod/nginx   Created container nginx
1s          Normal   Started     pod/nginx   Started container nginx