fix(multi-slb): fix race condition for backendpoolUpdater #10107
hebo4096 wants to merge 1 commit into kubernetes-sigs:master
What type of PR is this?
/kind bug
What this PR does / why we need it:
In multi-SLB clusters with ExternalTrafficPolicy: Local, loadBalancerBackendPoolUpdater.process() and the main service reconciliation loop (i.e., reconcileLoadBalancer → CreateOrUpdateLB) can issue concurrent ARM calls targeting the same load balancer. Because these are a parent PUT (LB-level) and a child PUT (backend pool-level) on the same resource, ARM rejects the operation with HTTP 412 (ETag mismatch / PreconditionFailed) when they race.
This PR fixes the race by splitting process() into two phases:
1. Dequeue: drains the operation queue under the existing queue lock (updater.lock), then releases it.
2. ARM operations: performs Get/CreateOrUpdate under serviceReconcileLock, serializing with the main reconciliation loop.
This ensures that backend pool PUTs and LB PUTs never overlap, eliminating the ETag conflict. The queue lock is released before acquiring serviceReconcileLock to avoid nested locks and to keep addOperation() non-blocking during ARM calls.
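The two phases described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the operation struct, the armClient interface, fakeARM, newUpdater, and applyOp are hypothetical stand-ins, and only the names process(), dequeue(), addOperation(), updater.lock, and serviceReconcileLock come from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// operation is a hypothetical stand-in for a queued backend pool change.
type operation struct {
	poolID string
	nodeIP string
	add    bool
}

// armClient is a hypothetical interface standing in for the ARM backend pool client.
type armClient interface {
	Get(poolID string) []string
	CreateOrUpdate(poolID string, ips []string)
}

// fakeARM is an in-memory client used only for this sketch.
type fakeARM struct {
	pools map[string][]string
	puts  int
}

func newFakeARM() *fakeARM { return &fakeARM{pools: map[string][]string{}} }

func (f *fakeARM) Get(poolID string) []string {
	return append([]string(nil), f.pools[poolID]...) // return a copy
}

func (f *fakeARM) CreateOrUpdate(poolID string, ips []string) {
	f.pools[poolID] = ips
	f.puts++
}

type updater struct {
	lock                 sync.Mutex // queue lock: guards operations only
	operations           []operation
	serviceReconcileLock *sync.Mutex // assumed to be shared with the main reconciliation loop
	client               armClient
}

func newUpdater(client armClient) *updater {
	return &updater{serviceReconcileLock: &sync.Mutex{}, client: client}
}

// addOperation only takes the queue lock, so it never waits on ARM calls.
func (u *updater) addOperation(op operation) {
	u.lock.Lock()
	defer u.lock.Unlock()
	u.operations = append(u.operations, op)
}

// dequeue is phase 1: drain the queue under the queue lock, grouped by pool.
func (u *updater) dequeue() map[string][]operation {
	u.lock.Lock()
	defer u.lock.Unlock()
	grouped := make(map[string][]operation)
	for _, op := range u.operations {
		grouped[op.poolID] = append(grouped[op.poolID], op)
	}
	u.operations = nil
	return grouped
}

// applyOp adds or removes one IP and returns a new slice.
func applyOp(ips []string, op operation) []string {
	out := make([]string, 0, len(ips)+1)
	found := false
	for _, ip := range ips {
		if ip == op.nodeIP {
			found = true
			if !op.add {
				continue // drop the IP on removal
			}
		}
		out = append(out, ip)
	}
	if op.add && !found {
		out = append(out, op.nodeIP)
	}
	return out
}

// process runs phase 1, then phase 2 under serviceReconcileLock.
// The queue lock is already released when the reconcile lock is taken.
func (u *updater) process() {
	grouped := u.dequeue()
	if len(grouped) == 0 {
		return
	}
	u.serviceReconcileLock.Lock()
	defer u.serviceReconcileLock.Unlock()
	for poolID, ops := range grouped {
		ips := u.client.Get(poolID)
		for _, op := range ops {
			ips = applyOp(ips, op)
		}
		u.client.CreateOrUpdate(poolID, ips)
	}
}

func main() {
	arm := newFakeARM()
	u := newUpdater(arm)
	u.addOperation(operation{poolID: "pool-a", nodeIP: "10.0.0.4", add: true})
	u.process()
	fmt.Println(arm.pools["pool-a"], arm.puts)
}
```

In this sketch the two locks are never held at the same time, which is the property the description relies on: a caller holding serviceReconcileLock delays the ARM phase but does not delay addOperation().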
Which issue(s) this PR fixes:
Fixes #9839
Special notes for your reviewer:
The core fix is in azure_local_services.go: process() is split into dequeue() and ARM operations under serviceReconcileLock.
Two new unit tests verify the lock behavior:
- TestLoadBalancerBackendPoolUpdaterSerialization: ARM calls are blocked while serviceReconcileLock is held.
- TestLoadBalancerBackendPoolUpdaterDequeueUnderQueueLock: the queue lock is released before ARM calls, so addOperation() for backendPoolUpdater is not blocked.
Does this PR introduce a user-facing change?
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: