WIP: Minimal refactoring to have a single shard#2658

Open
shmuelk wants to merge 1 commit intokubernetes-sigs:mainfrom
shmuelk:fc-refactor-1

@shmuelk
Contributor

@shmuelk shmuelk commented Mar 22, 2026

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
The Flow Control component is a critical part of the Endpoint Picker (EPP): it lets the EPP throttle workloads and thereby avoid overcommitting Model Server resources.

Issue #2628 was created to describe a set of simplifications to the Flow Control layer. This PR is the first in a series to implement issue #2628.

In particular, this PR changes the Flow Control layer to have only a single shard.

Which issue(s) this PR fixes:
Refs #2628

Does this PR introduce a user-facing change?:

NONE

Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Mar 22, 2026
@netlify

netlify bot commented Mar 22, 2026

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 73ea409
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69bfc8cda3bfbf00099688ab
😎 Deploy Preview: https://deploy-preview-2658--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 22, 2026
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 22, 2026
@shmuelk shmuelk changed the title WIP: Minimal refacctoring to have a single shard WIP: Minimal refactoring to have a single shard Mar 22, 2026
@Gregory-Pereira
Member

cc @RishabhSaini

@nirrozenbaum
Contributor

/cc @LukeAVanDrie

Contributor

@LukeAVanDrie LukeAVanDrie left a comment

Thanks @shmuelk! This LGTM. I noticed one place where we can simplify the new createShard method; otherwise I only have a few nits.

allShards []*registryShard // Cached, sorted combination of Active and Draining shards
nextShardID uint64
mu sync.RWMutex
shard *registryShard
Contributor

nit: This is currently read in ActiveShards() and ShardStats() without acquiring fr.mu.RLock(). With dynamic sharding removed, fr.shard is initialized once and never mutated, making these lock-free reads completely safe from data races.

Could we move the shard *registryShard field out of the "Administrative state (protected by mu)" block and up to the "Immutable dependencies" block?
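To illustrate the distinction the comment draws, here is a minimal sketch of a struct whose shard pointer is set once at construction and read without a lock, while mutable state stays behind the mutex. The types `registry` and `shard`, the `flowCount` field, and all method names are hypothetical stand-ins, not the project's actual `FlowRegistry` code.

```go
package main

import (
	"fmt"
	"sync"
)

// shard is a hypothetical stand-in for registryShard.
type shard struct{ id string }

// registry is a hypothetical stand-in for FlowRegistry.
type registry struct {
	// Immutable dependencies: set once in newRegistry, never reassigned.
	shard *shard

	// Administrative state (protected by mu).
	mu        sync.RWMutex
	flowCount int
}

func newRegistry() *registry {
	return &registry{shard: &shard{id: "shard-0"}}
}

// activeShardID reads the immutable field without taking the lock.
// This is only safe because nothing writes r.shard after construction.
func (r *registry) activeShardID() string { return r.shard.id }

// addFlow mutates guarded state, so it takes the write lock.
func (r *registry) addFlow() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.flowCount++
}

// flows reads guarded state under the read lock.
func (r *registry) flows() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.flowCount
}

func main() {
	r := newRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = r.activeShardID() // lock-free read of immutable state
			r.addFlow()           // locked write of mutable state
		}()
	}
	wg.Wait()
	fmt.Println(r.activeShardID(), r.flows())
}
```

Grouping the field under the immutable block documents the invariant, so a later change that reassigns it stands out in review.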

Contributor Author

I will do this in the next PR

}

// createShard creates the shard.
func (fr *FlowRegistry) createShard() error {
Contributor

The entire block of code in here that iterates over fr.flowStates.Range(...) to build allComponents and synchronizeFlow is actually dead code.

Because createShard() is exclusively called during NewFlowRegistry() before the EPP accepts any connections, fr.flowStates is guaranteed to be empty.

You can simplify this initialization method to just:

func (fr *FlowRegistry) createShard() error {
	fr.mu.Lock()
	defer fr.mu.Unlock()
	partitionedConfig := fr.config.partition(0, 1)
	fr.shard = newShard("shard-0", partitionedConfig, fr.logger, fr.propagateStatsDelta)
	return nil
}


// repartitionShardConfigsLocked updates the configuration for all active shards.
// Expects the registry's write lock to be held.
func (fr *FlowRegistry) repartitionShardConfigsLocked() {
Contributor

Note for other reviewers... This looks weird considering a single-shard view, but we must preserve it for now. When ensurePriorityBand dynamically creates a new band, this partition path acts as a deep-copy mechanism to push the mutated registry config down to the isolated shard state.

In a follow-up PR (if/when we eliminate the boundary between registryShard and FlowRegistry entirely), we can have everything reference a single unified Config, allowing us to drop this path entirely.
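As a rough illustration of why a deep copy is needed while the shard keeps its own isolated state, the sketch below copies a config so that later mutations on the registry side do not reach the shard until the config is pushed again. The `config` and `bandConfig` types and the `deepCopy` method are assumptions made for this example; they are not the project's `partition` implementation.

```go
package main

import "fmt"

// bandConfig and config are hypothetical stand-ins, not the project's types.
type bandConfig struct {
	Name     string
	MaxBytes uint64
}

type config struct {
	Bands map[int]*bandConfig // keyed by priority level
}

// deepCopy returns a copy that shares no maps or pointers with c.
func (c *config) deepCopy() *config {
	out := &config{Bands: make(map[int]*bandConfig, len(c.Bands))}
	for prio, b := range c.Bands {
		cp := *b // copy the struct value, not the pointer
		out.Bands[prio] = &cp
	}
	return out
}

func main() {
	registryCfg := &config{Bands: map[int]*bandConfig{
		0: {Name: "default", MaxBytes: 1 << 20},
	}}
	shardCfg := registryCfg.deepCopy()

	// The registry adds a band dynamically; the shard's snapshot is unaffected.
	registryCfg.Bands[1] = &bandConfig{Name: "high", MaxBytes: 1 << 20}
	fmt.Println(len(registryCfg.Bands), len(shardCfg.Bands))

	// Pushing the change down means taking a fresh copy for the shard.
	shardCfg = registryCfg.deepCopy()
	fmt.Println(len(shardCfg.Bands))
}
```

With a single unified config, as the comment proposes for a follow-up, neither the copy nor the re-push step would be needed.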

defer fr.mu.RUnlock()

- components, err := fr.buildFlowComponents(key, len(fr.allShards))
+ components, err := fr.buildFlowComponents(key, 1)
Contributor

nit: Consider updating buildFlowComponents to drop the numInstances arg and just return a single (flowComponents, error) tuple. Fine if we want to defer this to a different PR though.
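A sketch of what that signature change could look like, using invented bodies: `flowComponents` here is a hypothetical one-field struct, and `buildFlowComponentsN` is a made-up name for the current slice-returning shape. Only the change in return type reflects the suggestion.

```go
package main

import (
	"errors"
	"fmt"
)

// flowComponents is a hypothetical stand-in for the project's type.
type flowComponents struct{ queueName string }

// buildFlowComponentsN mirrors the current shape: one entry per instance.
func buildFlowComponentsN(key string, numInstances int) ([]flowComponents, error) {
	if key == "" {
		return nil, errors.New("empty flow key")
	}
	out := make([]flowComponents, numInstances)
	for i := range out {
		out[i] = flowComponents{queueName: fmt.Sprintf("%s-%d", key, i)}
	}
	return out, nil
}

// buildFlowComponents is the simplified shape: no count, a single value back.
func buildFlowComponents(key string) (flowComponents, error) {
	if key == "" {
		return flowComponents{}, errors.New("empty flow key")
	}
	return flowComponents{queueName: key + "-0"}, nil
}

func main() {
	before, _ := buildFlowComponentsN("flow-a", 1)
	after, _ := buildFlowComponents("flow-a")
	// With one instance, both shapes carry the same information.
	fmt.Println(before[0] == after)
}
```

Callers no longer need to index into a one-element slice, which also removes a possible out-of-range mistake.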

Contributor Author

I will do this in the next PR

@LukeAVanDrie
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 23, 2026
for i, s := range c.registry.activeShards {
shardsCopy[i] = s
}
shardsCopy := make([]contracts.RegistryShard, 1)
Collaborator

Are we going to remove the concept of a shard in a later PR?

Contributor

Yes, there will be no "data-parallel" concept anymore. See the attached issue for more context.

For these PRs though, as long as the FC layer has 0 regressions between revisions, I want to get these in even if there are some minor stylistic/semantic improvements that could be made. This code should look significantly different after these are all in, so I think it is most expedient to focus on polish at the end of the refactoring effort rather than at each intermediary step.

Collaborator

That's totally fine with me, just making sure we are all pointed in the same direction

@kfswain
Collaborator

kfswain commented Mar 23, 2026

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, LukeAVanDrie, shmuelk

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 23, 2026
@shmuelk
Copy link
Copy Markdown
Contributor Author

shmuelk commented Mar 24, 2026

@LukeAVanDrie and @kfswain thank you for the reviews.

I would remove the WIP (i.e. hold) except that in some attempts to compare performance, I see some strange results. I'm trying to get to the bottom of it.

6 participants