Uneven Resource Allocation

Overview

The resource-related scoring plugins LeastRequestedPriority and MostRequestedPriority both score nodes based on resource requests, not on a node's current resource utilization (and before a monitoring component such as Prometheus or Metrics Server is installed, kube-scheduler has no way of knowing a node's real-time resource usage anyway).

In short, when scheduling a Pod, Kubernetes only looks at the requests values; it does not care what you set for limits. So as long as the accumulated requests have not reached the node's capacity, in theory Pods will keep being scheduled onto that node.
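As a minimal illustration (a hypothetical manifest), the Pod below requests only 100m of CPU, so the scheduler fits it onto a node as a 100m workload even though at runtime it may burst up to its 2-core limit:

apiVersion: v1
kind: Pod
metadata:
  name: burst-demo              # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m               # the only value the scheduler's fit calculation looks at
        memory: 128Mi
      limits:
        cpu: "2"                # runtime cap enforced by the kubelet, ignored when scheduling
        memory: 512Mi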

As a result, the following situations commonly come up in practice:

  1. Workloads are often deployed to the cluster without CPU requests (which "looks like" it lets each node hold more Pods). When the business gets busy, the node's CPUs run at full load, service latency rises noticeably, and the machine sometimes even slips into a CPU soft lockup or similar "frozen" state.
  2. Cluster load is not spread evenly across nodes; memory imbalance is usually the most visible, with some nodes showing much higher memory utilization than others.
  3. A node with low CPU load and plenty of CPU, memory and disk should in theory pass filtering, score highly and be scheduled onto first, but in practice it is not, so scheduling ends up unbalanced.
  4. If this happens at peak time, node resources can be saturated to the point where machines hang and cannot even be reached over SSH, and the only option left to the cluster administrator is often a restart (the imbalance is even more pronounced when nodes have heterogeneous resources).

Solutions

Reserve some system resources to keep the cluster stable

The kubelet has the following default hard eviction thresholds:

  • memory.available<100Mi
  • nodefs.available<10%
  • imagefs.available<15%
  • nodefs.inodesFree<5% (Linux nodes)

Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/

  • As described in the official documentation on configuring the kubelet via a configuration file, reserve a portion of system resources so that the node hosting the kubelet stays stable when available compute resources run low. This matters most for incompressible resources such as memory and disk.

  • When configuring an eviction policy for the kubelet, make sure the scheduler does not place Pods onto the node that would immediately push it into memory pressure and trigger eviction.

    Consider the following scenario:

    • Node memory capacity: 10Gi
    • The operator wants to reserve 10% of memory capacity for system daemons (kernel, kubelet, etc.)
    • The operator wants to evict Pods once node memory utilization exceeds 95%, to reduce the likelihood of a system OOM.

    To achieve this, the kubelet is started with:

    --eviction-hard=memory.available<500Mi
    --system-reserved=memory=1.5Gi

    In this configuration, the --system-reserved flag reserves 1.5Gi of memory for the system, i.e. 10% of total memory plus the eviction threshold amount.

    If a Pod uses more memory than it requested, or the system daemons use more than 1Gi, the memory.available signal drops below 500Mi and the hard eviction threshold is triggered.

    eviction-hard memory.available ≈ 10 * 1024Mi - 10 * 1024Mi * 95% ≈ 500Mi

    system-reserved memory ≈ (10 * 1024Mi - 10 * 1024Mi * 95%) + 10 * 1024Mi * 10% ≈ 1.5Gi
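    The same reservation can also be expressed in a kubelet configuration file instead of command-line flags; the sketch below assumes the file-based KubeletConfiguration API (systemReserved and evictionHard are the corresponding fields) and simply mirrors the flag values above:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    systemReserved:
      memory: "1.5Gi"             # reserved for the kernel, kubelet and other system daemons
    evictionHard:
      memory.available: "500Mi"   # hard-evict Pods once available memory drops below this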

Limit and allocate service resources

  • Measure actual service resource usage

    Dynamically sample each Pod's resource usage over a past period and use that data to set the Pod's requests; only then do the values fit kube-scheduler's default scoring algorithm and make Pod placement more balanced.

  • Set requests and limits for every Pod

    • Pods of different QoS classes get different OOM scores: when resources run short, the cluster kills BestEffort Pods first, then Burstable Pods, and Guaranteed Pods last.
    • Therefore, if resources allow, all Pods can be given the Guaranteed QoS class, trading compute resources for business performance and stability and reducing troubleshooting time and cost. To improve resource utilization further, core services can be set to Guaranteed while other services, depending on importance, are set to Burstable or BestEffort; a minimal Guaranteed Pod sketch follows below.
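A Pod gets the Guaranteed QoS class when every container sets limits equal to requests for both CPU and memory; a minimal sketch with hypothetical names and values:

apiVersion: v1
kind: Pod
metadata:
  name: core-service            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:alpine
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 500m               # limits == requests for every resource => QoS class Guaranteed
        memory: 512Mi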

Set anti-affinity for resource-heavy Pods

  • For Pods with high resource usage, configure anti-affinity so that these workloads are not scheduled onto the same node and drive its load up sharply; see the sketch below.
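A minimal sketch of such a rule, assuming the heavy workloads carry a label like resource-intensive: "true" (both the label and the topology key here are illustrative); it goes under the Pod template's spec:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          resource-intensive: "true"          # hypothetical label applied to the heavy Pods
      topologyKey: kubernetes.io/hostname     # never place two such Pods on the same node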

Practice - limiting and allocating service resources

Analysis

For key business containers, traffic and load are usually higher than for other Pods, and their requests and limits need to be analysed case by case.

The analysis has several dimensions: whether the container is CPU-bound or IO-bound, whether the service is a single instance or highly available, what its upstream and downstream services are, and so on.

On the other hand, viewed over a long enough window, the load of such business containers in production tends to be periodic.

Historical monitoring data for these containers is therefore a valuable reference when choosing the numbers.

The data should cover CPU, memory, network and storage. In general, requests can be set to the historical average, while limits should be above that average; the final numbers still need small adjustments for the concrete situation.

Estimate the expected resource usage of Pods in the cluster

Background: for example, our services enter their traffic peaks roughly at 10:00-11:00 and 13:00-15:00.

Plan:

  1. Sample each service's resource usage during this window.
  2. Compute its maximum CPU/memory and average CPU/memory.
  3. Use as wide a sampling span as possible; in the example below a sample is taken every 5 minutes over more than two weeks.
  4. Use crontab to take a sample every 5 minutes; all resulting files are later collected under D:\limit-resource:
*/5 10-11 * * * kubectl top pods -n {NAMESPACE}|grep -v NAME|sort -nrk 3 > /root/limit-resource/resource-$(date +\%Y\%m\%d\%H\%M\%S).txt
  5. Compute the values.

Non-core services: set limits to the average of the historical peaks, and set requests to 0.6 - 0.9 times limits.

Core services: set both requests and limits to the average of the historical peaks.

import lombok.Getter;
import lombok.Setter;

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Aggregates the kubectl top snapshot files and prints, per service,
 * the average and peak CPU/memory usage.
 *
 * @author xiaowu
 */
public class ComputeResource {

    // example input line: xxx-xxx-5d8f74c494-9s2zq 29m 172Mi
    static final Pattern p = Pattern.compile("^([a-z-]+-)[0-9a-z-]+\\s+([0-9]+)[a-z]\\s+([0-9]+)");

    public static void main(String[] args) {
        Map<String, Resource> resourceMap = new HashMap<>();
        Map<String, MaxResource> maxResourceMap = new HashMap<>();
        // running record of the per-service maximum resource values
        List<MaxResource> maxResources = new ArrayList<>();
        try {
            // directory holding the kubectl top snapshot files
            File folder = new File("D:\\limit-resource");
            File[] files = folder.listFiles();
            assert files != null;
            for (File file : files) {
                if (file.isFile() && file.exists()) {
                    InputStreamReader read = new InputStreamReader(Files.newInputStream(file.toPath()), Charset.defaultCharset());
                    BufferedReader bufferedReader = new BufferedReader(read);
                    String lineTxt = null;
                    while ((lineTxt = bufferedReader.readLine()) != null) {
                        // match the line against the pattern
                        MaxResource maxResource = beginMatch(resourceMap, maxResourceMap, lineTxt);
                        if (maxResource == null) continue;
                        maxResources.add(maxResource);
                    }
                    read.close();
                    getMaxResource(maxResources, resourceMap);
                } else {
                    System.err.println("specified file not found");
                }
            }
            // compute the averages
            computeResource(resourceMap, files.length);
        } catch (Exception e) {
            System.err.println("failed to read file contents");
            e.printStackTrace();
        }
        resourceMap.forEach((s, resource) -> {
            System.out.println(resource);
        });
    }

    private static void getMaxResource(List<MaxResource> maxResources, Map<String, Resource> resourceMap) {
        Map<String, List<MaxResource>> maxResourceMap = maxResources.stream().collect(Collectors.groupingBy(MaxResource::getModule));
        // find each service's maximum cpu and maximum memory
        for (Map.Entry<String, List<MaxResource>> entry : maxResourceMap.entrySet()) {
            String module = entry.getKey();
            List<MaxResource> resourceList = entry.getValue();
            Integer maxCpu = resourceList.stream().max(Comparator.comparingInt(MaxResource::getMaxCpu)).get().getMaxCpu();
            Integer maxMemory = resourceList.stream().max(Comparator.comparingInt(MaxResource::getMaxMemory)).get().getMaxMemory();
            Resource resource = resourceMap.get(module);
            if (resource.getMaxCpu() == null) {
                resource.setMaxCpu(maxCpu);
            } else {
                resource.setMaxCpu(maxCpu + resource.getMaxCpu());
            }
            if (resource.getMaxMemory() == null) {
                resource.setMaxMemory(maxMemory);
            } else {
                resource.setMaxMemory(maxMemory + resource.getMaxMemory());
            }
        }
    }

    private static void computeResource(Map<String, Resource> map, int filesSize) {
        for (Map.Entry<String, Resource> entry : map.entrySet()) {
            Resource resource = entry.getValue();
            int count = resource.getCount();
            resource.setCpu(resource.getCpu() / count);
            resource.setMemory(resource.getMemory() / count);

            resource.setMaxCpu(resource.getMaxCpu() / filesSize);
            resource.setMaxMemory(resource.getMaxMemory() / filesSize);
        }
    }

    private static MaxResource beginMatch(Map<String, Resource> models, Map<String, MaxResource> maxResourceMap, String lineTxt) {
        Matcher m = p.matcher(lineTxt);
        boolean result = m.find();
        if (result) {
            Resource resource = new Resource();
            MaxResource maxResource = new MaxResource();
            // strip the trailing hyphen from the captured deployment name
            String module = m.group(1).replaceFirst("-$", "");
            int cpu = Integer.parseInt(m.group(2));
            int memory = Integer.parseInt(m.group(3));
            if (models.containsKey(module)) {
                // accumulate
                resource = models.get(module);
                maxResource = maxResourceMap.get(module);
                resource.setCount(resource.getCount() + 1);
                resource.setCpu(resource.getCpu() + cpu);
                resource.setMemory(resource.getMemory() + memory);

                if (cpu > maxResource.getMaxCpu()) {
                    maxResource.setMaxCpu(cpu);
                }
                if (memory > maxResource.getMaxMemory()) {
                    maxResource.setMaxMemory(memory);
                }
            } else {
                resource.setModule(module);
                resource.setCount(1);
                resource.setCpu(cpu);
                resource.setMemory(memory);

                maxResource.setModule(module);
                maxResource.setMaxCpu(cpu);
                maxResource.setMaxMemory(memory);
            }
            models.put(module, resource);
            maxResourceMap.put(module, maxResource);
            return maxResource;
        }
        return null;
    }

    @Setter
    @Getter
    static class Resource {
        private String module;

        /* replica count */
        private int count;

        /* averages */
        private Integer cpu;
        private Integer memory;

        /* averages of the peaks */
        private Integer maxCpu;
        private Integer maxMemory;

        @Override
        public String toString() {
            return "service = " + module +
                    ", replicas = " + count +
                    ", avg cpu = " + cpu +
                    "m, avg memory = " + memory +
                    "Mi, avg maxCpu = " + maxCpu +
                    "m, avg maxMemory = " + maxMemory +
                    "Mi";
        }
    }

    @Setter
    @Getter
    static class MaxResource {
        private String module;

        /* peak values */
        private Integer maxCpu;
        private Integer maxMemory;
    }
}
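Using the averages the program prints, a non-core service's resources block could then be filled in along the lines of the sketch below (numbers are purely illustrative, applying the "requests = 0.6 - 0.9 x limits" rule above):

resources:
  requests:
    cpu: 300m          # roughly 0.75 x the cpu limit
    memory: 384Mi      # roughly 0.75 x the memory limit
  limits:
    cpu: 400m          # average of the historical CPU peaks reported by the tool
    memory: 512Mi      # average of the historical memory peaks reported by the tool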

Problems

After resources are configured this way, a new problem can appear: the resources reserved in the cluster are completely out of proportion to actual usage, which wastes resources, as shown in the figure below.

There are several reasons for this:

  1. Some services may consume 400m of CPU at startup but peak at only about 200m once running; in short, set the value too low and the service cannot start, set it too high and resources are wasted.
  2. Some CPU-intensive services may burst to 1000m of CPU at a particular peak but do not consume that much continuously (perhaps only a few times a week), which skews the computed value upwards.

New approaches - extensions

In real projects it is not always possible to estimate Pod resource usage accurately, so relying on request settings alone to keep Pod scheduling balanced is not particularly precise.

Extension 1 - the real-time load-aware scoring plugin Trimaran

Overview

Trimaran project: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/trimaran

Trimaran is a set of real-time load-aware scheduling plugins. It uses load-watcher to obtain resource utilization data, so it can score and schedule against a node's current, real resource usage. load-watcher currently supports three metrics providers: Metrics Server, Prometheus, and SignalFx.

During kube-scheduler's scoring phase, Trimaran queries load-watcher for the node's real-time resource utilization and scores the node accordingly, thereby influencing the scheduling result.

Trimaran scoring design: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/kep/61-Trimaran-real-load-aware-scheduling

Prepare a cluster

k8s version: v1.24.16+k3s1 (installation steps omitted)

scheduler-plugins/trimaran/load-watcher

scheduler-plugins can be used in two ways:

  1. Install the scheduler-plugins binary; the installation guide itself offers two options, see https://github.com/kubernetes-sigs/scheduler-plugins/blob/release-1.24/doc/install.md

    • Run it as a second scheduler alongside the default kube-scheduler (installed with the Helm chart)

      If your network cannot reach the upstream registry, you can pull through a third-party mirror as follows:

      # pull the images through the k8s.m.daocloud.io mirror
      [root@k3s-node1 charts]# crictl pull k8s.m.daocloud.io/scheduler-plugins/kube-scheduler:v0.24.9
      [root@k3s-node1 charts]# crictl pull k8s.m.daocloud.io/scheduler-plugins/controller:v0.24.9
      # retag
      [root@k3s-node1 charts]# ctr images tag k8s.m.daocloud.io/scheduler-plugins/kube-scheduler:v0.24.9 registry.k8s.io/scheduler-plugins/kube-scheduler:v0.24.9
      [root@k3s-node1 charts]# ctr images tag k8s.m.daocloud.io/scheduler-plugins/controller:v0.24.9 registry.k8s.io/scheduler-plugins/controller:v0.24.9
      # verify
      [root@k3s-node1 charts]# crictl images
      IMAGE TAG IMAGE ID SIZE
      docker.io/rancher/klipper-helm v0.8.0-build20230510 6f42df210d7fa 95MB
      docker.io/rancher/klipper-lb v0.4.4 af74bd845c4a8 4.92MB
      docker.io/rancher/local-path-provisioner v0.0.24 b29384aeb4b13 14.9MB
      docker.io/rancher/mirrored-coredns-coredns 1.10.1 ead0a4a53df89 16.2MB
      docker.io/rancher/mirrored-library-traefik 2.9.10 d1e26b5f8193d 39.6MB
      docker.io/rancher/mirrored-metrics-server v0.6.3 817bbe3f2e517 29.9MB
      docker.io/rancher/mirrored-pause 3.6 6270bb605e12e 301kB
      docker.io/wangxiaowu950330/load-watcher 0.2.3 53b2340fed4ce 21.9MB
      docker.io/wangxiaowu950330/trimaran 1.24 3fc0109c9b9cb 377MB
      k8s.m.daocloud.io/scheduler-plugins/controller v0.24.9 b7e8f1d464e7c 15.7MB
      registry.k8s.io/scheduler-plugins/controller v0.24.9 b7e8f1d464e7c 15.7MB
      k8s.m.daocloud.io/scheduler-plugins/kube-scheduler v0.24.9 b8fa20c9c006d 19MB
      registry.k8s.io/scheduler-plugins/kube-scheduler v0.24.9 b8fa20c9c006d 19MB
      [root@k3s-node1 charts]# kubectl get pods -n scheduler-plugins -w
      NAME READY STATUS RESTARTS AGE
      scheduler-plugins-controller-859bfc6f78-ljkq6 1/1 Running 0 46s
      scheduler-plugins-scheduler-6555ff78d7-p4bz4 1/1 Running 0 38s
    • Replace the default kube-scheduler

  2. Alternatively, deploy load-watcher as a standalone service and build the kube-scheduler/trimaran/load-watcher components yourself, see https://github.com/kubernetes-sigs/scheduler-plugins/blob/release-1.24/pkg/trimaran/README.md. The resulting architecture is shown in the diagram.

The second approach is used here.

Build the load-watcher image (wangxiaowu950330/load-watcher)

Reference: https://github.com/paypal/load-watcher

  • Download the source
git clone https://github.com/paypal/load-watcher.git \
&& cd load-watcher \
&& git checkout 0.2.3
  • Build and push the image
docker build -t load-watcher:<version> .
docker tag load-watcher:<version> <your-docker-repo>:<version>
docker push <your-docker-repo>

Build the kube-scheduler image (wangxiaowu950330/trimaran)

  • Download the source
git clone https://github.com/kubernetes-sigs/scheduler-plugins \
&& cd scheduler-plugins \
&& git checkout release-1.24
  • Makefile.tm
COMMONENVVAR=GOOS=$(shell uname -s | tr A-Z a-z) GOARCH=$(subst x86_64,amd64,$(patsubst i%86,386,$(shell uname -m)))
BUILDENVVAR=CGO_ENABLED=0

.PHONY: all
all: build
	chmod +x bin/kube-scheduler

.PHONY: build
build:
	$(COMMONENVVAR) $(BUILDENVVAR) go build -o bin/kube-scheduler cmd/scheduler/main.go

.PHONY: clean
clean:
	rm -rf ./bin
  • Dockerfile
FROM golang:1.17.3
WORKDIR /go/src/github.com/kubernetes-sigs/scheduler-plugins
COPY . .
RUN make --file=Makefile.tm build
FROM golang:1.17.3
COPY --from=0 /go/src/github.com/kubernetes-sigs/scheduler-plugins/bin/kube-scheduler /bin/kube-scheduler
CMD ["/bin/kube-scheduler"]
  • Build and push the image
docker build -t trimaran .
docker tag trimaran:latest <your-docker-repo>:latest
docker push <your-docker-repo>

Deployment

serviceaccount

# an admin ServiceAccount is used here for the test deployment
kubectl apply -f - <<EOF
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: admin
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: admin
  namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
EOF

trimaran-cm.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: trimaran
  namespace: kube-system
data:
  # KUBECONFIG file
  k3s.yaml: |-
    apiVersion: v1
    clusters:
    - cluster:
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkekNDQVIyZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWpNU0V3SHdZRFZRUUREQmhyTTNNdGMyVnkKZG1WeUxXTmhRREUyT1RFek9EY3lNVGt3SGhjTk1qTXdPREEzTURVME5qVTVXaGNOTXpNd09EQTBNRFUwTmpVNQpXakFqTVNFd0h3WURWUVFEREJock0zTXRjMlZ5ZG1WeUxXTmhRREUyT1RFek9EY3lNVGt3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFRMWpCZEtWVXFhTFI0OFdwdGg3RWp6cFBWRHV3ekJMWG9MTTJ6OEpLaFMKUkltbHZjaVZBejJWdjRPR085SlRiQzVtM3l0ZyszUFpKMmRBOGxWNldIQUxvMEl3UURBT0JnTlZIUThCQWY4RQpCQU1DQXFRd0R3WURWUjBUQVFIL0JBVXdBd0VCL3pBZEJnTlZIUTRFRmdRVTVvS0ExZDlVRTllT2UyRE1aTEV0CkpPeHJwcjB3Q2dZSUtvWkl6ajBFQXdJRFNBQXdSUUloQUxRMitvK25BNkFYeG5rQ0ljVjJ4Rk9lY1pSZGpNczgKUitMelIwZ2htTXUzQWlBOWZuZ29TSkkwM2FsaGpOaEo3QmU5dTdZL1FnT0RjU2hpYjZxL1kwNEdiUT09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
        server: https://10.0.2.15:6443
      name: default
    contexts:
    - context:
        cluster: default
        user: default
      name: default
    current-context: default
    kind: Config
    preferences: {}
    users:
    - name: default
      user:
client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJrakNDQVRlZ0F3SUJBZ0lJYURlU3llbWw4ZDR3Q2dZSUtvWkl6ajBFQXdJd0l6RWhNQjhHQTFVRUF3d1kKYXpOekxXTnNhV1Z1ZEMxallVQXhOamt4TXpnM01qRTVNQjRYRFRJek1EZ3dOekExTkRZMU9Wb1hEVEkwTURndwpOakExTkRZMU9Wb3dNREVYTUJVR0ExVUVDaE1PYzNsemRHVnRPbTFoYzNSbGNuTXhGVEFUQmdOVkJBTVRESE41CmMzUmxiVHBoWkcxcGJqQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJBU0c1THphQy9TRVNvU0gKK3BOWDlaR1lmVE0xV1NHRVh4V3U1S2FrMzBCY2toRFpCenFOZWZJbzNEbXhaNjFTSkxZOGMrQXVSQksxVjM1LwpqUzAzWjltalNEQkdNQTRHQTFVZER3RUIvd1FFQXdJRm9EQVRCZ05WSFNVRUREQUtCZ2dyQmdFRkJRY0RBakFmCkJnTlZIU01FR0RBV2dCU29SNDdySE9CaXZPeTFmczlYL3cxdWs5cGo5akFLQmdncWhrak9QUVFEQWdOSkFEQkcKQWlFQXJLV2VsNERLemZLTTdXQmJMOEJ1V01LUlNRSzd6SUZ0cnAwUmFuSWc5b29DSVFDZ1gxWDR5cGR2bnMyZgpGaWRVa1VpdVBEejhsdEdabUZlUnBkS1QrY3RVaFE9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCi0tLS0tQkVHSU4gQ0VSVElGSUNBVEUtLS0tLQpNSUlCZHpDQ0FSMmdBd0lCQWdJQkFEQUtCZ2dxaGtqT1BRUURBakFqTVNFd0h3WURWUVFEREJock0zTXRZMnhwClpXNTBMV05oUURFMk9URXpPRGN5TVRrd0hoY05Nak13T0RBM01EVTBOalU1V2hjTk16TXdPREEwTURVME5qVTUKV2pBak1TRXdId1lEVlFRRERCaHJNM010WTJ4cFpXNTBMV05oUURFMk9URXpPRGN5TVRrd1dUQVRCZ2NxaGtqTwpQUUlCQmdncWhrak9QUU1CQndOQ0FBU0JnalNUWnhOTVJ1VURUc05Qc0VQdnM0K3lNdkVyT24vYnhSTUY4L1V3ClAva3QrMlZpN1Z6WVUzS09FYytSWGVibUJaeTlSdXJxNXdEREJkS3hPM0t2bzBJd1FEQU9CZ05WSFE4QkFmOEUKQkFNQ0FxUXdEd1lEVlIwVEFRSC9CQVV3QXdFQi96QWRCZ05WSFE0RUZnUVVxRWVPNnh6Z1lyenN0WDdQVi84TgpicFBhWS9Zd0NnWUlLb1pJemowRUF3SURTQUF3UlFJZ0JHKzZUbjhBUWxYYUZxRFJOakpCemFUMjNhR2VqVjVNCkNMaFN6cmhSZmZvQ0lRQ0NMT25KeG9FUmpUQmV1clRFMk4zS1FsQ2M1T2xqU1BqOXdHSWNKcHZSaXc9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
client-key-data: LS0tLS1CRUdJTiBFQyBQUklWQVRFIEtFWS0tLS0tCk1IY0NBUUVFSUQyL0JCN1krQnR4V3FEeFZTckIvcytjaG9ZWElyazU4U2N0UTVZcDdNZWZvQW9HQ0NxR1NNNDkKQXdFSG9VUURRZ0FFQklia3ZOb0w5SVJLaElmNmsxZjFrWmg5TXpWWklZUmZGYTdrcHFUZlFGeVNFTmtIT28xNQo4aWpjT2JGbnJWSWt0anh6NEM1RUVyVlhmbitOTFRkbjJRPT0KLS0tLS1FTkQgRUMgUFJJVkFURSBLRVktLS0tLQo=
  scheduler-config.yaml: |-
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    clientConnection:
      kubeconfig: "/etc/rancher/k3s/k3s.yaml"
    profiles:
    - schedulerName: trimaran
      plugins:
        score:
          disabled:
          - name: NodeResourcesBalancedAllocation
          - name: NodeResourcesLeastAllocated
          enabled:
          - name: TargetLoadPacking
          - name: LoadVariationRiskBalancing
      pluginConfig:
      - name: TargetLoadPacking
        # README: https://github.com/kubernetes-sigs/scheduler-plugins/blob/release-1.24/pkg/trimaran/targetloadpacking/README.md
        args:
          # CPU request assumed for containers that set no requests or limits (QoS: BestEffort); defaults to 1 core
          defaultRequests:
            cpu: "1000m"
          # multiplier applied to requests of containers without limits (burstable QoS); defaults to 1.5
          defaultRequestsMultiplier: "1.5"
          # target CPU utilization (%) to aim for when bin packing; recommended to keep it about 10 below the desired value; defaults to 40 if unspecified
          targetUtilization: 70
          # load-watcher service
          watcherAddress: http://load-watcher.kube-system.svc.cluster.local:2020
      - name: LoadVariationRiskBalancing
        # README: https://github.com/kubernetes-sigs/scheduler-plugins/blob/release-1.24/pkg/trimaran/loadvariationriskbalancing/README.md
        args:
          # multiplier of the standard deviation (non-negative float), default 1
          safeVarianceMargin: 1
          # root power of the standard deviation (non-negative float), default 1
          safeVarianceSensitivity: 1
          # load-watcher service
          watcherAddress: http://load-watcher.kube-system.svc.cluster.local:2020

load-watcher.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-watcher-deployment
  namespace: kube-system
  labels:
    app: load-watcher
spec:
  replicas: 1
  selector:
    matchLabels:
      app: load-watcher
  template:
    metadata:
      labels:
        app: load-watcher
    spec:
      serviceAccountName: admin
      containers:
      - name: load-watcher
        image: wangxiaowu950330/load-watcher:0.2.3
        imagePullPolicy: IfNotPresent
        env:
        - name: KUBE_CONFIG
          value: /etc/rancher/k3s/k3s.yaml
        ports:
        - containerPort: 2020
        volumeMounts:
        - mountPath: /etc/rancher/k3s/k3s.yaml
          name: kube-config
          subPath: k3s.yaml
          readOnly: true
      volumes:
      - name: kube-config
        configMap:
          name: trimaran
          defaultMode: 0644
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: load-watcher
  labels:
    app: load-watcher
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 2020
    targetPort: 2020
    protocol: TCP
  selector:
    app: load-watcher

trimaran-scheduler.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: trimaran
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: admin
      hostNetwork: false
      hostPID: false
      containers:
      - name: trimaran
        command:
        - /bin/kube-scheduler
        - --leader-elect=false
        - --config=/home/scheduler-config.yaml
        - -v=9
        image: wangxiaowu950330/trimaran:1.24
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /shared
          name: shared
        - mountPath: /etc/rancher/k3s
          name: kube-config
          readOnly: true
        - mountPath: /home
          name: kube-config
          readOnly: true
      volumes:
      - name: shared
        hostPath:
          path: /tmp
          type: Directory
      - name: kube-config
        configMap:
          name: trimaran
          items:
          - key: k3s.yaml
            path: k3s.yaml
          - key: scheduler-config.yaml
            path: scheduler-config.yaml
          defaultMode: 0644

Testing

Create a Pod that specifies trimaran as its scheduler

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: test
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      # use trimaran as the scheduler
      schedulerName: trimaran
      containers:
      - name: nginx
        image: nginx:alpine
        imagePullPolicy: IfNotPresent
      restartPolicy: Always
EOF

Verification

- Create/delete some resources

- Watch the trimaran scheduler logs
kubectl logs trimaran-6485868988-7jrx8 -n kube-system --tail 10 -f

Extension 2 - the rebalancing tool descheduler

From kube-scheduler's point of view, the scheduler makes the best placement decision based on its view of the cluster's resources at that moment, but scheduling is static: once a Pod is bound to a node it is never rescheduled. Scoring plugins can effectively address imbalance at scheduling time, but the resources each Pod occupies change over its lifetime (memory usage usually grows). If an application occupies 2G of memory at startup but grows to 4G after running for a while, and there are many such applications, the cluster can drift into an unbalanced state over time, so it needs to be rebalanced.
Beyond that, there are other scenarios that call for rebalancing:

  • New nodes are added to the cluster, leaving some nodes under- or over-utilized;
  • Some nodes fail and their Pods move to other nodes;
  • The original scheduling decision no longer holds, because taints or labels were added to or removed from nodes and pod/node affinity requirements are no longer satisfied.

Of course the cluster can be rebalanced by hand, for example by deleting certain Pods and letting them be rescheduled, but that is obviously tedious and not a real solution. To address under-utilization or waste of cluster resources at runtime, the descheduler component can be used to optimize Pod placement. descheduler rebalances the cluster according to rules and policies: at its core it finds Pods that can be removed according to its policy configuration and evicts them. It does not schedule the evicted Pods itself; it relies on the default scheduler for that. The rebalancing principle is described on the project site.

Strategies

descheduler currently provides the following strategies:

  • RemoveDuplicates: evicts duplicate Pods so that at most one Pod associated with the same ReplicaSet, ReplicationController, StatefulSet, or Job runs on a given node.
  • LowNodeUtilization: evicts Pods from nodes with a high requests ratio.
  • HighNodeUtilization: evicts Pods from nodes with a low requests ratio.
  • RemovePodsViolatingInterPodAntiAffinity: evicts Pods that violate inter-pod anti-affinity.
  • RemovePodsViolatingNodeAffinity: evicts Pods that no longer satisfy node affinity.
  • RemovePodsViolatingNodeTaints: evicts Pods that violate node taints.
  • RemovePodsViolatingTopologySpreadConstraint: evicts Pods that violate topology spread constraints.
  • RemovePodsHavingTooManyRestarts: evicts Pods that have restarted too many times.
  • PodLifeTime: evicts Pods that have been running for a long time.
  • RemoveFailedPods: evicts Pods in a failed state.

The two most important strategies are described below.

LowNodeUtilization

# nodes whose utilization falls between thresholds and targetThresholds are considered appropriately utilized and are not considered for eviction
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # thresholds determines whether a node is under-utilized
        # if all of a node's usage figures (CPU, memory, number of Pods) are below thresholds, the node is considered under-utilized, which triggers rebalancing
        # node resource utilization is computed from Pod resource requests
        # cpu and memory are percentages
        thresholds:
          "cpu" : 20
          "memory": 20
          "pods": 20
        # targetThresholds is used to find candidate nodes that Pods can be evicted from
        # if any of a node's usage figures (CPU, memory, number of Pods) is above targetThresholds, the node is considered over-utilized, which triggers rebalancing
        # cpu and memory are percentages
        targetThresholds:
          "cpu" : 50
          "memory": 50
          "pods": 50

HighNodeUtilization

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"HighNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
# 节点利用率是否充足由 thresholds 确定
# 如果节点的所有使用率(CPU、内存、Pod 数量)均低于thresholds,则该节点被视为未充分利用。即触发重平衡调度
# 计算节点资源利用率计算的是Pod的request resource。
# cpu和memory的单位为百分比
thresholds:
"cpu" : 20
"memory": 20
"pods": 20

Pod eviction - rules the descheduler follows

When descheduler decides to evict a Pod, it follows these rules:

  • Critical Pods (with the annotation scheduler.alpha.kubernetes.io/critical-pod) are never evicted.
  • Pods (static or mirror Pods, or standalone Pods) not managed by an RC, RS, Deployment, or Job are never evicted, because such Pods would not be recreated.
  • Pods associated with DaemonSets are never evicted.
  • Pods with local storage are never evicted.
  • BestEffort Pods are evicted before Burstable and Guaranteed Pods.

Deployment

To avoid evicting itself, descheduler runs as a critical Pod, so it can only be created in the kube-system namespace. For background on critical Pods, see Guaranteed Scheduling For Critical Add-On Pods.

Manifests

If the image registry is unreachable, the image can be changed to lank8s.cn/descheduler/descheduler:xxx. A rough sketch of the Job manifest follows below.
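The upstream project ships base manifests (RBAC, a policy ConfigMap and a Job or CronJob) under kubernetes/ in its repository; the fragment below is only a rough sketch of the Job part under those assumptions, with the image tag and ServiceAccount name illustrative rather than taken from this cluster:

apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system                          # must live in kube-system, see above
spec:
  template:
    spec:
      priorityClassName: system-cluster-critical  # run as a critical add-on Pod
      serviceAccountName: descheduler-sa          # assumed to be created by the project's RBAC manifests
      restartPolicy: Never
      containers:
      - name: descheduler
        image: lank8s.cn/descheduler/descheduler:v0.21.0   # hypothetical tag; use the mirror if needed
        command:
        - /bin/descheduler
        - --policy-config-file=/policy-dir/policy.yaml
        volumeMounts:
        - mountPath: /policy-dir
          name: policy-volume
      volumes:
      - name: policy-volume
        configMap:
          name: descheduler-policy-configmap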

Test and verify

After it starts, verify that descheduler is running:

# kubectl get pod -n kube-system | grep descheduler
descheduler-job-6qtk2 1/1 Running 0 158m

Then check whether the Pods are evenly distributed.

As we can see, the node02 node currently runs 20 Pods, still a few fewer than the other nodes. To rebalance on Pod count only, the cpu and memory entries can be commented out of the policy (they are enabled by default):

# cat kubernetes/base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    ......
          thresholds:
            # "cpu" : 20
            # "memory": 20
            "pods": 20
          # target thresholds
          targetThresholds:
            # "cpu" : 50
            # "memory": 50
            "pods": 50
    ......

After the change, apply it and restart descheduler:

kubectl apply -f kubernetes/base/configmap.yaml
kubectl rollout restart deployment -n kube-system descheduler

Then check descheduler's logs:

# kubectl logs -n kube-system descheduler-job-9rc9h
I0729 08:48:45.361655 1 lownodeutilization.go:151] Node "k8s-node02" is under utilized with usage: api.ResourceThresholds{"cpu":44.375, "memory":24.682000160690105, "pods":22.727272727272727}
I0729 08:48:45.361772 1 lownodeutilization.go:154] Node "k8s-node03" is over utilized with usage: api.ResourceThresholds{"cpu":49.375, "memory":27.064916842870552, "pods":24.545454545454547}
I0729 08:48:45.361807 1 lownodeutilization.go:151] Node "k8s-master01" is under utilized with usage: api.ResourceThresholds{"cpu":50, "memory":3.6347778465158265, "pods":8.181818181818182}
I0729 08:48:45.361828 1 lownodeutilization.go:151] Node "k8s-master02" is under utilized with usage: api.ResourceThresholds{"cpu":40, "memory":0, "pods":5.454545454545454}
I0729 08:48:45.361863 1 lownodeutilization.go:151] Node "k8s-master03" is under utilized with usage: api.ResourceThresholds{"cpu":40, "memory":0, "pods":5.454545454545454}
I0729 08:48:45.361977 1 lownodeutilization.go:154] Node "k8s-node01" is over utilized with usage: api.ResourceThresholds{"cpu":46.875, "memory":32.25716687667426, "pods":27.272727272727273}
I0729 08:48:45.361994 1 lownodeutilization.go:66] Criteria for a node under utilization: CPU: 0, Mem: 0, Pods: 23
I0729 08:48:45.362016 1 lownodeutilization.go:73] Total number of underutilized nodes: 4
I0729 08:48:45.362025 1 lownodeutilization.go:90] Criteria for a node above target utilization: CPU: 0, Mem: 0, Pods: 23
I0729 08:48:45.362033 1 lownodeutilization.go:92] Total number of nodes above target utilization: 2
I0729 08:48:45.362051 1 lownodeutilization.go:202] Total capacity to be moved: CPU:0, Mem:0, Pods:55.2
I0729 08:48:45.362059 1 lownodeutilization.go:203] ********Number of pods evicted from each node:***********
I0729 08:48:45.362066 1 lownodeutilization.go:210] evicting pods from node "k8s-node01" with usage: api.ResourceThresholds{"cpu":46.875, "memory":32.25716687667426, "pods":27.272727272727273}
I0729 08:48:45.362236 1 lownodeutilization.go:213] allPods:30, nonRemovablePods:3, bestEffortPods:2, burstablePods:25, guaranteedPods:0
I0729 08:48:45.362246 1 lownodeutilization.go:217] All pods have priority associated with them. Evicting pods based on priority
I0729 08:48:45.381931 1 evictions.go:102] Evicted pod: "flink-taskmanager-7c7557d6bc-ntnp2" in namespace "default"
I0729 08:48:45.381967 1 lownodeutilization.go:270] Evicted pod: "flink-taskmanager-7c7557d6bc-ntnp2"
I0729 08:48:45.381980 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":46.875, "memory":32.25716687667426, "pods":26.363636363636363}
I0729 08:48:45.382268 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"flink-taskmanager-7c7557d6bc-ntnp2", UID:"6a5374de-a204-4d2c-a302-ff09c054a43b", APIVersion:"v1", ResourceVersion:"4945574", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.399567 1 evictions.go:102] Evicted pod: "flink-taskmanager-7c7557d6bc-t2htk" in namespace "default"
I0729 08:48:45.399613 1 lownodeutilization.go:270] Evicted pod: "flink-taskmanager-7c7557d6bc-t2htk"
I0729 08:48:45.399626 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":46.875, "memory":32.25716687667426, "pods":25.454545454545453}
I0729 08:48:45.400503 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"flink-taskmanager-7c7557d6bc-t2htk", UID:"bd255dbc-bb05-4258-ac0b-e5be3dc4efe8", APIVersion:"v1", ResourceVersion:"4705479", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.450568 1 evictions.go:102] Evicted pod: "oauth-center-tools-api-645d477bcf-hnb8g" in namespace "default"
I0729 08:48:45.450603 1 lownodeutilization.go:270] Evicted pod: "oauth-center-tools-api-645d477bcf-hnb8g"
I0729 08:48:45.450619 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":45.625, "memory":31.4545002047819, "pods":24.545454545454543}
I0729 08:48:45.451240 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"oauth-center-tools-api-645d477bcf-hnb8g", UID:"caba0aa8-76de-4e23-b163-c660df0ba54d", APIVersion:"v1", ResourceVersion:"3800151", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.477605 1 evictions.go:102] Evicted pod: "dazzle-core-api-5d4c899b84-xhlkl" in namespace "default"
I0729 08:48:45.477636 1 lownodeutilization.go:270] Evicted pod: "dazzle-core-api-5d4c899b84-xhlkl"
I0729 08:48:45.477649 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":44.375, "memory":30.65183353288954, "pods":23.636363636363633}
I0729 08:48:45.477992 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"dazzle-core-api-5d4c899b84-xhlkl", UID:"ce216892-6c50-4c31-b30a-cbe5c708285e", APIVersion:"v1", ResourceVersion:"3800074", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.523774 1 request.go:557] Throttling request took 141.499557ms, request: POST:https://10.96.0.1:443/api/v1/namespaces/default/events
I0729 08:48:45.569073 1 evictions.go:102] Evicted pod: "live-foreignapi-api-7bc679b789-z8jnr" in namespace "default"
I0729 08:48:45.569105 1 lownodeutilization.go:270] Evicted pod: "live-foreignapi-api-7bc679b789-z8jnr"
I0729 08:48:45.569119 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":43.125, "memory":29.84916686099718, "pods":22.727272727272723}
I0729 08:48:45.569151 1 lownodeutilization.go:236] 6 pods evicted from node "k8s-node01" with usage map[cpu:43.125 memory:29.84916686099718 pods:22.727272727272723]
I0729 08:48:45.569172 1 lownodeutilization.go:210] evicting pods from node "k8s-node03" with usage: api.ResourceThresholds{"cpu":49.375, "memory":27.064916842870552, "pods":24.545454545454547}
I0729 08:48:45.569418 1 lownodeutilization.go:213] allPods:27, nonRemovablePods:2, bestEffortPods:0, burstablePods:25, guaranteedPods:0
I0729 08:48:45.569430 1 lownodeutilization.go:217] All pods have priority associated with them. Evicting pods based on priority
I0729 08:48:45.603962 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"live-foreignapi-api-7bc679b789-z8jnr", UID:"37c698e3-b63e-4ef1-917b-ac6bc1be05e0", APIVersion:"v1", ResourceVersion:"3800113", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.639483 1 evictions.go:102] Evicted pod: "dazzle-contentlib-api-575f599994-khdn5" in namespace "default"
I0729 08:48:45.639512 1 lownodeutilization.go:270] Evicted pod: "dazzle-contentlib-api-575f599994-khdn5"
I0729 08:48:45.639525 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":48.125, "memory":26.26225017097819, "pods":23.636363636363637}
I0729 08:48:45.645446 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"dazzle-contentlib-api-575f599994-khdn5", UID:"068aa2ad-f160-4aaa-b25b-f0a9603f9011", APIVersion:"v1", ResourceVersion:"3674763", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.780324 1 evictions.go:102] Evicted pod: "dazzle-datasync-task-577c46668-lltg4" in namespace "default"
I0729 08:48:45.780544 1 lownodeutilization.go:270] Evicted pod: "dazzle-datasync-task-577c46668-lltg4"
I0729 08:48:45.780565 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":46.875, "memory":25.45958349908583, "pods":22.727272727272727}
I0729 08:48:45.780600 1 lownodeutilization.go:236] 4 pods evicted from node "k8s-node03" with usage map[cpu:46.875 memory:25.45958349908583 pods:22.727272727272727]
I0729 08:48:45.780620 1 lownodeutilization.go:102] Total number of pods evicted: 11

In this log we can see Node "k8s-node01" is over utilized followed by evicting pods from node "k8s-node01", which shows that descheduler is already rescheduling; the final placement is shown below:

PDB

Because descheduler evicts Pods so that they can be rescheduled, a service can become unavailable if all of its replicas are evicted. If the service is a single point of failure, eviction will certainly make it unavailable; in that case we strongly recommend anti-affinity and multiple replicas to remove the single point. But even when a service is already spread across several nodes, evicting all of its Pods at once would still make it unavailable. In that case a PDB (PodDisruptionBudget) object can be configured to prevent all replicas from being deleted at the same time; for example, to allow at most one unavailable replica of an application during eviction, create a manifest like the following:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-demo
spec:
  # maximum number of unavailable replicas (minAvailable can be used instead); an integer or a percentage
  maxUnavailable: 1
  selector:
    # match the Pod's labels
    matchLabels:
      app: demo

For more details on PDB, see the official documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/

So if descheduler is used to rebalance the cluster, it is strongly recommended to create a corresponding PodDisruptionBudget object for each application to protect it.

Problems