kOps: InPlacePodVerticalScaling e2e Tests Fail on containerd v2.1.5

Hey guys! We've got a situation where the InPlacePodVerticalScaling end-to-end (e2e) tests are failing after upgrading containerd to version 2.1.5 on kOps. Let's dive into the details, figure out what's going on, and see how we can fix it. This article breaks down the issue, the failing tests, and potential reasons for the failure, offering a comprehensive overview for anyone encountering similar problems.

The Problem: Failing e2e Tests

So, what's the main issue? After a recent upgrade of containerd to version 2.1.5 within a kOps environment, the InPlacePodVerticalScaling e2e tests started to fail. This is a pretty big deal because these tests are crucial for ensuring that our pod scaling mechanisms are working correctly. When these tests fail, it indicates a potential problem with how resources are being managed and allocated to pods, which can lead to performance issues or even application downtime. It's like your car's engine light coming on – you know something's not quite right, and you need to investigate before things get worse.

Which Jobs Are Failing?

Specifically, the failing jobs can be found here: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kops/17743/pull-kops-e2e-cni-calico/1986440803127398400. This link takes you to the Prow job execution details, giving you a direct look at the logs and error messages associated with the failing tests. This is super helpful because it allows us to see exactly what went wrong during the test runs, making it easier to pinpoint the root cause of the problem. Think of it as having a detective's magnifying glass to examine the crime scene.

Specific Failing Tests

Digging deeper, let's look at the specific tests that are throwing errors. Here’s a summary of the failures:

Summarizing 41 Failures:
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, mixed containers - scale up cpu and memory [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, one container with cpu & memory requests + limits - increase memory requests and limits [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, one container with cpu & memory requests + limits - increase memory requests only [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, one container with cpu & memory requests + limits - decrease CPU requests and increase CPU limits [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, one container with cpu & memory requests + limits - increase memory limits only [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, one container - decrease memory request (RestartContainer memory resize policy) [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Guaranteed QoS pod, three containers (c1, c2, c3) - increase: CPU (c1,c3), memory (c2, c3) ; decrease: CPU (c2) [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, one container - increase memory request (NoRestart memory resize policy) [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, three containers - decrease c1 resources, increase c2 resources, no change for c3 (net increase for pod) [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Burstable QoS pod, one container with cpu & memory requests + limits - increase CPU requests and limits [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  [FAIL] [sig-node] Pod InPlace Resize Container [FeatureGate:InPlacePodVerticalScaling] [Beta] [It] Guaranteed QoS pod, one container - increase CPU & memory with an extended resource [sig-node, FeatureGate:InPlacePodVerticalScaling, Beta]
  k8s.io/kubernetes/test/e2e/common/node/pod_resize.go:1072
  ...

These failures cover a wide range of scenarios, including:

  • Burstable QoS pods: These are pods whose containers have requests set below their limits (or only some resources specified), so they can burst above their requests, up to their limits, when the node has spare capacity. The tests fail when scaling CPU and memory in various ways (increasing, decreasing, or mixing requests and limits); a minimal example spec follows this list.
  • Guaranteed QoS pods: These pods set requests equal to limits for every container, giving them strict resource guarantees, and they are failing in scenarios where CPU and memory are increased or decreased.
  • Mixed containers: Tests involving multiple containers within a pod are also failing, indicating a potential issue with how resources are being distributed and managed across containers.
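
To make these scenarios concrete, here is a minimal sketch (using the upstream k8s.io/api Go types; names, image, and values are illustrative, not taken from the actual e2e fixtures) of the kind of Burstable pod the resize tests exercise: requests below limits, plus a per-resource resize policy that decides whether the container must restart when that resource changes.

  // Minimal sketch, assuming the k8s.io/api and k8s.io/apimachinery modules are
  // available; names, image, and values are illustrative only.
  package main

  import (
      "fmt"

      corev1 "k8s.io/api/core/v1"
      "k8s.io/apimachinery/pkg/api/resource"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  func main() {
      pod := &corev1.Pod{
          ObjectMeta: metav1.ObjectMeta{Name: "resize-demo"},
          Spec: corev1.PodSpec{
              Containers: []corev1.Container{{
                  Name:  "app",
                  Image: "registry.k8s.io/pause:3.9",
                  // Requests below limits => Burstable QoS. Setting requests equal
                  // to limits for every resource would make the pod Guaranteed.
                  Resources: corev1.ResourceRequirements{
                      Requests: corev1.ResourceList{
                          corev1.ResourceCPU:    resource.MustParse("100m"),
                          corev1.ResourceMemory: resource.MustParse("128Mi"),
                      },
                      Limits: corev1.ResourceList{
                          corev1.ResourceCPU:    resource.MustParse("200m"),
                          corev1.ResourceMemory: resource.MustParse("256Mi"),
                      },
                  },
                  // Per-resource resize policy: resize CPU in place without a
                  // restart, restart the container when memory changes.
                  ResizePolicy: []corev1.ContainerResizePolicy{
                      {ResourceName: corev1.ResourceCPU, RestartPolicy: corev1.NotRequired},
                      {ResourceName: corev1.ResourceMemory, RestartPolicy: corev1.RestartContainer},
                  },
              }},
          },
      }
      fmt.Printf("requests=%v limits=%v\n",
          pod.Spec.Containers[0].Resources.Requests,
          pod.Spec.Containers[0].Resources.Limits)
  }

The real test fixtures live in test/e2e/common/node/pod_resize.go; this sketch only shows the shape of a spec that exercises in-place resize.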

It's like a bunch of dominoes falling – one problem seems to be triggering a cascade of failures across different test cases. This suggests that the underlying issue might be pretty fundamental to how resource resizing is handled.

When Did the Failures Start?

These failures popped up as part of a pull request (PR) that updates containerd to v2.1.5 and its corresponding runc version (https://github.com/kubernetes/kops/pull/17743). This is a crucial clue! It strongly suggests that the upgrade to containerd v2.1.5 (or the associated runc version) is the culprit. When you see failures immediately after a specific change, it's a good bet that the change is the source of the problem. It’s like changing a tire on your car and then hearing a weird noise – you'd immediately suspect the new tire or the way it was installed.

Testgrid Link

For a more detailed view, you can check out the Testgrid link: https://testgrid.k8s.io/kops-presubmits#e2e-gce-cni-calico. Testgrid provides a visual representation of test results over time, making it easier to spot trends and patterns. This link allows you to see the history of these tests and confirm that the failures started occurring after the containerd upgrade.

Potential Reasons for Failure

Okay, so we know the tests are failing after the containerd upgrade. But why? Let's brainstorm some potential reasons.

runc v1.3.3 Changes

One potential cause highlighted in the issue is the set of changes in runc v1.3.3. runc is the lightweight OCI container runtime that containerd uses to actually run containers. The runc release notes (https://github.com/opencontainers/runc/releases/tag/v1.3.2) mention an improvement in the conversion from cgroup v1 CPU shares to cgroup v2 CPU weight.

Specifically, these changes aim to better align the behavior of CPU shares in cgroup v1 with CPU weights in cgroup v2. Cgroups (control groups) are a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, and so on) of a collection of processes. They are a critical component of containerization, ensuring that containers don't consume more resources than they're allowed to. If the conversion between cgroup versions isn't accurate, or simply changes between releases, it can lead to resource imbalances and to mismatches with anything that still expects the old mapping.

Why might this be a problem? If the new conversion logic in runc v1.3.3 isn't correctly handling resource allocation in the context of InPlacePodVerticalScaling, it could explain the test failures. For example, if CPU weights are not being set correctly when a pod's resources are resized, it could lead to some containers being starved of CPU time while others get more than their fair share. This imbalance can cause tests to fail, especially those that rely on precise resource allocation.
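
To make the suspected mechanism concrete, the sketch below shows the classic linear mapping from cgroup v1 CPU shares to cgroup v2 CPU weight that Kubernetes components have historically used (for example, 1024 shares maps to a weight of 39). This is not the new runc formula; the point is that if runc now writes a different cpu.weight than the value derived from the old mapping, any check that compares the two will fail even when no container is actually misbehaving.

  // Minimal sketch of the classic linear conversion from cgroup v1 CPU shares
  // (range 2..262144) to cgroup v2 CPU weight (range 1..10000). This is NOT the
  // new runc mapping; it only illustrates the kind of value older tooling
  // expects to see in cpu.weight.
  package main

  import "fmt"

  // sharesToWeight mirrors the historical linear mapping:
  //   weight = 1 + ((shares - 2) * 9999) / 262142
  func sharesToWeight(shares uint64) uint64 {
      if shares < 2 {
          shares = 2
      }
      if shares > 262144 {
          shares = 262144
      }
      return 1 + ((shares-2)*9999)/262142
  }

  func main() {
      // 1024 shares corresponds to a 1-CPU request; it maps to weight 39 here.
      for _, shares := range []uint64{2, 256, 1024, 2048, 262144} {
          fmt.Printf("cpu.shares=%6d -> cpu.weight=%d\n", shares, sharesToWeight(shares))
      }
  }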

Broader Impact of containerd v2.1.5

It's also possible that the issue isn't solely due to runc but involves other changes or interactions within containerd v2.1.5. Containerd is a complex system, and upgrades can sometimes introduce subtle changes in behavior that are not immediately obvious. These changes can affect various aspects of container management, including networking, storage, and resource allocation.

For instance, there might be changes in how containerd interacts with the Kubernetes API, how it manages container lifecycles, or how it handles resource requests. Any of these changes could potentially impact the InPlacePodVerticalScaling feature, especially if they introduce inconsistencies or race conditions in resource management.
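
For orientation, here is a hedged sketch of where containerd sits in that path: during an in-place resize, the kubelet sends an UpdateContainerResources call over the CRI, and containerd then drives runc to rewrite the container's cgroup limits. The socket path, container ID, and resource values below are placeholders; in a real cluster this call is issued by the kubelet, not by hand.

  // Hedged sketch of the CRI call the kubelet makes during an in-place resize.
  // Requires google.golang.org/grpc (v1.63+ for grpc.NewClient) and
  // k8s.io/cri-api; the socket path, container ID, and values are placeholders.
  package main

  import (
      "context"
      "log"
      "time"

      "google.golang.org/grpc"
      "google.golang.org/grpc/credentials/insecure"
      runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
  )

  func main() {
      conn, err := grpc.NewClient("unix:///run/containerd/containerd.sock",
          grpc.WithTransportCredentials(insecure.NewCredentials()))
      if err != nil {
          log.Fatal(err)
      }
      defer conn.Close()

      client := runtimeapi.NewRuntimeServiceClient(conn)
      ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
      defer cancel()

      // containerd translates these into an OCI update that runc applies to the
      // container's cgroup (cpu.weight, cpu.max, memory.max on cgroup v2).
      _, err = client.UpdateContainerResources(ctx, &runtimeapi.UpdateContainerResourcesRequest{
          ContainerId: "<container-id>", // placeholder
          Linux: &runtimeapi.LinuxContainerResources{
              CpuShares:          512,       // derived from the new CPU request
              CpuQuota:           100000,    // derived from the new CPU limit
              CpuPeriod:          100000,
              MemoryLimitInBytes: 256 << 20, // 256Mi
          },
      })
      if err != nil {
          log.Fatalf("UpdateContainerResources failed: %v", err)
      }
  }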

FeatureGate: InPlacePodVerticalScaling

The tests that are failing are specifically related to the InPlacePodVerticalScaling feature, which is currently in Beta and controlled by a FeatureGate. FeatureGates are a mechanism in Kubernetes to enable or disable certain features. This allows new features to be tested and rolled out gradually without affecting the stability of the entire system. However, because these features are still under development, they can sometimes be more prone to bugs and unexpected behavior.
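
For reference, explicitly enabling the gate on a node's kubelet looks roughly like the sketch below, expressed with the upstream KubeletConfiguration Go types. Beta gates such as this one are typically on by default, and in a kOps cluster the kubelet configuration is normally generated from the cluster spec, so this is purely illustrative.

  // Purely illustrative: the gate expressed with the upstream KubeletConfiguration
  // Go types (k8s.io/kubelet/config/v1beta1). In a kOps cluster this configuration
  // is normally generated from the cluster spec rather than written by hand.
  package main

  import (
      "encoding/json"
      "fmt"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      kubeletconfig "k8s.io/kubelet/config/v1beta1"
  )

  func main() {
      cfg := kubeletconfig.KubeletConfiguration{
          TypeMeta: metav1.TypeMeta{
              APIVersion: "kubelet.config.k8s.io/v1beta1",
              Kind:       "KubeletConfiguration",
          },
          FeatureGates: map[string]bool{
              "InPlacePodVerticalScaling": true,
          },
      }
      out, err := json.MarshalIndent(cfg, "", "  ")
      if err != nil {
          panic(err)
      }
      fmt.Println(string(out))
  }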

Given that InPlacePodVerticalScaling is a relatively new feature, it's possible that the upgrade to containerd v2.1.5 has uncovered a bug or incompatibility that wasn't previously apparent. This could be due to changes in how containerd handles resource resizing or how it interacts with the underlying cgroup system.

What Else Do We Need to Know?

An important observation is that this behavior shows up in every e2e run from the PR that didn't fail for other reasons. This reinforces the idea that the issue is widespread and not specific to a single test case. It's like seeing the same symptom across different patients – it points to a systemic problem rather than an isolated incident.

Additionally, the same (or very similar) tests were run periodically with the previous containerd version, and there were no issues (https://testgrid.k8s.io/kops-network-plugins#kops-aws-cni-calico). This further strengthens the link between the containerd v2.1.5 upgrade and the test failures. It's a classic case of a regression introduced by a dependency bump rather than by the tests themselves: green on the old containerd/runc pair, failing as soon as the new versions landed.