Add Poison Message Handling to the Dispatchers#1331
Conversation
| var failureAction = new OrchestrationCompleteOrchestratorAction | ||
| { | ||
| Id = runtimeState.PastEvents.Count, | ||
| FailureDetails = new FailureDetails( |
There was a problem hiding this comment.
Given that we perform this logic before any other orchestration processing, we will potentially override the failure of an already-failing-orchestration with this failure. Are we okay with this? The problem of course is that if we attempt to process the poison messages that correspond to the already-failing-orchestration, we could run into issues or exceptions (they are "poisoned" for a reason).
There was a problem hiding this comment.
What's an example of an "already-failing orchestration"?
There was a problem hiding this comment.
Pull request overview
This PR introduces “poison message” handling across the orchestration, entity, and activity dispatchers by tracking per-event dispatch attempts and failing/dropping work once a configured maximum dispatch count is exceeded.
Changes:
- Adds
DispatchCounttoHistoryEvent(and propagates it into entity request messages) and addsMaxDispatchCounttoIOrchestrationService. - Adds poison detection logic in
TaskOrchestrationDispatcher,TaskEntityDispatcher, andTaskActivityDispatcherto fail/drop over-dispatched messages. - Adds structured logging support for poison message detection.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/DurableTask.Core/TaskOrchestrationDispatcher.cs | Detects over-dispatched orchestration events and fails the orchestration with non-retriable FailureDetails. |
| src/DurableTask.Core/TaskEntityDispatcher.cs | Propagates dispatch counts into entity requests, filters/handles poison operations, and emits poison logs/failures. |
| src/DurableTask.Core/TaskActivityDispatcher.cs | Detects poison activity events and either discards or fails activities based on dispatch count. |
| src/DurableTask.Core/Logging/LogHelper.cs | Adds PoisonMessageDetected structured logging helpers. |
| src/DurableTask.Core/Logging/LogEvents.cs | Adds a new structured log event type for poison message detection. |
| src/DurableTask.Core/Logging/EventIds.cs | Introduces a new event id for poison detection logs. |
| src/DurableTask.Core/IOrchestrationService.cs | Adds MaxDispatchCount configuration knob for providers. |
| src/DurableTask.Core/History/HistoryEvent.cs | Adds DispatchCount to all history events for serialization/transport. |
| src/DurableTask.Core/Entities/OrchestrationEntityContext.cs | Adds AbandonAcquire() to reset critical section lock acquisition state. |
| src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs | Adds DispatchCount field to entity request messages. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Currently the poison-message-processing occurs after the call to ReconcileMessagesWithState. I wasn't sure if we wanted to alter the behavior of this method.
As far as I understand, currently what is done is if we get a work item with an "invalid state", we just skip processing it entirely and call CompleteTaskOrchestrationWorkItemAsync on it immediately, which will ultimately just delete the messages for this invalid orchestration from the queue without processing them in any way. There is currently a TODO to mark the orchestration as "failed" in this case, but this seems non-trivial - just as one example, if we somehow don't have an ExecutionStartedEvent in the orchestration's history then I am not sure we want to commit an ExecutionCompletedEvent. This edge case might be complicated to handle depending on each backend's implementation.
Additionally, this would itself be an edge case in terms of poison message handling. If we have a corrupted orchestration we probably don't want to continuously attempt dequeuing the messages until we hit the max dispatch count since presumably nothing will change across these execution attempts. We would have to add logic to instead, on the first dispatch, immediately mark these messages as "poisoned", add them to the poison storage, and delete them from the "normal" storage. (This isn't hard, just thought it worth mentioning).
So, all that to say - do we want to change the behavior of this method or no?
There was a problem hiding this comment.
If we don't have an ExecutionStartedEvent in the history than I'm not sure we have enough information to reliably mark the orchestration as failed. The orchestration data might have been purged. Dropping the message is probably the best option since this is almost certainly a non-recoverable problem.
Are there any other cases you're concerned about? I'm not sure how to answer the question about whether we want to change the behavior. I think the simple answer is that it's fine to change the behavior - just be thoughtful about behavior changes. One example change might be to have it return a WorkItemDecision enum instead of a bool, with values such as .Process, .Drop, or .Poison if you want to add poison message handling logic.
| $"exceeds the maximum dispatch count of {this.maxDispatchCount}. Stopping processing of remaining requests."); | ||
| break; | ||
| } | ||
| throw new EntitySchedulerException("Failed to deserialize incoming request message - may be corrupted or wrong version.", exception); |
There was a problem hiding this comment.
What do we want to do if we encounter this deserialization error for the ExecutionStartedEvent? Currently I just left it as throwing an exception rather than trying to add poison message handling. The problem is if we can't deserialize this initial event, we can't process any operations afterwards since it contains the entity state. Maybe a reasonable choice would be if we encounter a deserialization exception for the ExecutionStartedEvent, we don't attempt to process any other messages, we move the event to poison storage, and delete the entire message batch from the backend storage. (If we don't want to delete the entire message we will need to add special logic to handle this specific case in the backend implementations).
There was a problem hiding this comment.
Taking a step back, I worry that having case-by-case checks for DispatchCount scattered throughout the dispatcher code will lead to a whack-a-mole situation that's hard for customers and for us to reason about. My expectation is that we'd check for DispatchCount in a single place and apply a simple policy. Would it be helpful to pause work on this PR to have a higher-level design discussion about poison message handling?
| } | ||
| catch (Exception exception) | ||
| { | ||
| if (eventRaisedEvent.DispatchCount > this.maxDispatchCount) |
There was a problem hiding this comment.
Does this check make sense? Presumably we will throw this exception on all other execution attempts as well. Maybe we should just mark this message as "poisoned" immediately in this case regardless of the dispatch count.
There was a problem hiding this comment.
It helps to think through the situations where this may happen. I don't have a lot of context here, but the existing exception message gives us a hint that it could be caused by corruption (which is fatal) or it could be caused by a "wrong version", which I assume means that some upgrade caused an issue. In the latter case, this could be non-fatal if the app owner rolls back the change. If we believe there are non-fatal cases, then app owners could potentially benefit from retries because they can quickly roll back (before the retry policy expires) and have the system recover automatically.
| this.logHelper.TaskActivityDispatcherError( | ||
| workItem, | ||
| $"The activity worker received a message that does not have any OrchestrationInstance information."); | ||
| if (taskMessage.Event.DispatchCount > this.maxDispatchCount) |
There was a problem hiding this comment.
Does this dispatch count check make sense in this and the following cases? Presumably nothing will change across execution attempts. Should we immediately mark the message as "poisoned"?
And is my choice to call CompleteTaskActivityWorkItemAsync(workItem, responseMessage: null); reasonable, and just special-casing the null response message in the backends?
There was a problem hiding this comment.
This looks like an error case where we've received an invalid message. Rather than throwing an InvalidOperationException, I think we should complete with a TaskFailed response message. Then we don't have to think about poison messages here because we're treating this case as non-retriable.
There was a problem hiding this comment.
I'll comment on the other cases individually.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This PR adds poison message handling to the activity, entity, and orchestration dispatchers. There are a lot of open questions that I need input on which I included as comments below.