WIP: Minimal refactoring to have a single shard #2658

shmuelk wants to merge 1 commit into kubernetes-sigs:main
Conversation
Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
/cc @LukeAVanDrie
LukeAVanDrie
left a comment
Thanks @shmuelk! This LGTM. I noticed one place where we can simplify the new createShard method; otherwise I only have a few nits.
```diff
-	allShards   []*registryShard // Cached, sorted combination of Active and Draining shards
-	nextShardID uint64
 	mu          sync.RWMutex
+	shard       *registryShard
```
nit: This is currently read in ActiveShards() and ShardStats() without acquiring fr.mu.RLock(). With dynamic sharding removed, fr.shard is initialized once and never mutated, making these lock-free reads completely safe from data races.
Could we move the shard *registryShard field out of the "Administrative state (protected by mu)" block and up to the "Immutable dependencies" block?
I will do this in the next PR
```diff
 }

+// createShard creates the shard.
+func (fr *FlowRegistry) createShard() error {
```
The entire block of code in here that iterates over fr.flowStates.Range(...) to build allComponents and synchronizeFlow is actually dead code.
Because createShard() is exclusively called during NewFlowRegistry() before the EPP accepts any connections, fr.flowStates is guaranteed to be empty.
You can simplify this initialization method to just:
```go
func (fr *FlowRegistry) createShard() error {
	fr.mu.Lock()
	defer fr.mu.Unlock()

	partitionedConfig := fr.config.partition(0, 1)
	fr.shard = newShard("shard-0", partitionedConfig, fr.logger, fr.propagateStatsDelta)
	return nil
}
```
```go
// repartitionShardConfigsLocked updates the configuration for all active shards.
// Expects the registry's write lock to be held.
func (fr *FlowRegistry) repartitionShardConfigsLocked() {
```
Note for other reviewers... This looks weird considering a single-shard view, but we must preserve it for now. When ensurePriorityBand dynamically creates a new band, this partition path acts as a deep-copy mechanism to push the mutated registry config down to the isolated shard state.
In a follow-up PR (if/when we eliminate the boundary between registryShard and FlowRegistry entirely), we can have everything reference a single unified Config, allowing us to drop this path entirely.
```diff
 	defer fr.mu.RUnlock()

-	components, err := fr.buildFlowComponents(key, len(fr.allShards))
+	components, err := fr.buildFlowComponents(key, 1)
```
nit: Consider updating buildFlowComponents to drop the numInstances arg and just return a single (flowComponents, error) tuple. Fine if we want to defer this to a different PR though.
I will do this in the next PR
/lgtm
```diff
-	for i, s := range c.registry.activeShards {
-		shardsCopy[i] = s
-	}
+	shardsCopy := make([]contracts.RegistryShard, 1)
```
Are we going to remove the concept of a shard in a later PR?
Yes, there will be no "data-parallel" concept anymore. See the attached issue for more context.
For these PRs though, as long as the FC layer has 0 regressions between revisions, I want to get these in even if there are some minor stylistic/semantic improvements that could be made. This code should look significantly different after these are all in, so I think it is most expedient to focus on polish at the end of the refactoring effort rather than at each intermediary step.
That's totally fine with me, just making sure we are all pointed in the same direction
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, LukeAVanDrie, shmuelk

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@LukeAVanDrie and @kfswain, thank you for the reviews. I would remove the WIP (i.e. the hold), except that in some attempts to compare performance I am seeing some strange results. I'm trying to get to the bottom of it.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The Flow Control component is a critical component of the Endpoint Picker (EPP), enabling it to throttle workloads and thus prevent over-commitment of Model Server resources.
Issue #2628 was created to describe a set of simplifications to the Flow Control layer. This PR is the first in a series to implement issue #2628.
In particular, this PR changes the Flow Control layer to have only a single shard.
Which issue(s) this PR fixes:
Refs #2628
Does this PR introduce a user-facing change?: