Skip to content

Add Poison Message Handling to the Dispatchers#1331

Open
sophiatev wants to merge 6 commits intomainfrom
stevosyan/add-poison-message-handling
Open

Add Poison Message Handling to the Dispatchers#1331
sophiatev wants to merge 6 commits intomainfrom
stevosyan/add-poison-message-handling

Conversation

@sophiatev
Copy link
Copy Markdown
Contributor

This PR adds poison message handling to the activity, entity, and orchestration dispatchers. There are a lot of open questions that I need input on which I included as comments below.

Copilot AI review requested due to automatic review settings March 24, 2026 19:21
var failureAction = new OrchestrationCompleteOrchestratorAction
{
Id = runtimeState.PastEvents.Count,
FailureDetails = new FailureDetails(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we perform this logic before any other orchestration processing, we will potentially override the failure of an already-failing-orchestration with this failure. Are we okay with this? The problem of course is that if we attempt to process the poison messages that correspond to the already-failing-orchestration, we could run into issues or exceptions (they are "poisoned" for a reason).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's an example of an "already-failing orchestration"?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces “poison message” handling across the orchestration, entity, and activity dispatchers by tracking per-event dispatch attempts and failing/dropping work once a configured maximum dispatch count is exceeded.

Changes:

  • Adds DispatchCount to HistoryEvent (and propagates it into entity request messages) and adds MaxDispatchCount to IOrchestrationService.
  • Adds poison detection logic in TaskOrchestrationDispatcher, TaskEntityDispatcher, and TaskActivityDispatcher to fail/drop over-dispatched messages.
  • Adds structured logging support for poison message detection.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
src/DurableTask.Core/TaskOrchestrationDispatcher.cs Detects over-dispatched orchestration events and fails the orchestration with non-retriable FailureDetails.
src/DurableTask.Core/TaskEntityDispatcher.cs Propagates dispatch counts into entity requests, filters/handles poison operations, and emits poison logs/failures.
src/DurableTask.Core/TaskActivityDispatcher.cs Detects poison activity events and either discards or fails activities based on dispatch count.
src/DurableTask.Core/Logging/LogHelper.cs Adds PoisonMessageDetected structured logging helpers.
src/DurableTask.Core/Logging/LogEvents.cs Adds a new structured log event type for poison message detection.
src/DurableTask.Core/Logging/EventIds.cs Introduces a new event id for poison detection logs.
src/DurableTask.Core/IOrchestrationService.cs Adds MaxDispatchCount configuration knob for providers.
src/DurableTask.Core/History/HistoryEvent.cs Adds DispatchCount to all history events for serialization/transport.
src/DurableTask.Core/Entities/OrchestrationEntityContext.cs Adds AbandonAcquire() to reset critical section lock acquisition state.
src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs Adds DispatchCount field to entity request messages.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the poison-message-processing occurs after the call to ReconcileMessagesWithState. I wasn't sure if we wanted to alter the behavior of this method.

As far as I understand, currently what is done is if we get a work item with an "invalid state", we just skip processing it entirely and call CompleteTaskOrchestrationWorkItemAsync on it immediately, which will ultimately just delete the messages for this invalid orchestration from the queue without processing them in any way. There is currently a TODO to mark the orchestration as "failed" in this case, but this seems non-trivial - just as one example, if we somehow don't have an ExecutionStartedEvent in the orchestration's history then I am not sure we want to commit an ExecutionCompletedEvent. This edge case might be complicated to handle depending on each backend's implementation.

Additionally, this would itself be an edge case in terms of poison message handling. If we have a corrupted orchestration we probably don't want to continuously attempt dequeuing the messages until we hit the max dispatch count since presumably nothing will change across these execution attempts. We would have to add logic to instead, on the first dispatch, immediately mark these messages as "poisoned", add them to the poison storage, and delete them from the "normal" storage. (This isn't hard, just thought it worth mentioning).

So, all that to say - do we want to change the behavior of this method or no?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have an ExecutionStartedEvent in the history than I'm not sure we have enough information to reliably mark the orchestration as failed. The orchestration data might have been purged. Dropping the message is probably the best option since this is almost certainly a non-recoverable problem.

Are there any other cases you're concerned about? I'm not sure how to answer the question about whether we want to change the behavior. I think the simple answer is that it's fine to change the behavior - just be thoughtful about behavior changes. One example change might be to have it return a WorkItemDecision enum instead of a bool, with values such as .Process, .Drop, or .Poison if you want to add poison message handling logic.

$"exceeds the maximum dispatch count of {this.maxDispatchCount}. Stopping processing of remaining requests.");
break;
}
throw new EntitySchedulerException("Failed to deserialize incoming request message - may be corrupted or wrong version.", exception);
Copy link
Copy Markdown
Contributor Author

@sophiatev sophiatev Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we want to do if we encounter this deserialization error for the ExecutionStartedEvent? Currently I just left it as throwing an exception rather than trying to add poison message handling. The problem is if we can't deserialize this initial event, we can't process any operations afterwards since it contains the entity state. Maybe a reasonable choice would be if we encounter a deserialization exception for the ExecutionStartedEvent, we don't attempt to process any other messages, we move the event to poison storage, and delete the entire message batch from the backend storage. (If we don't want to delete the entire message we will need to add special logic to handle this specific case in the backend implementations).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a step back, I worry that having case-by-case checks for DispatchCount scattered throughout the dispatcher code will lead to a whack-a-mole situation that's hard for customers and for us to reason about. My expectation is that we'd check for DispatchCount in a single place and apply a simple policy. Would it be helpful to pause work on this PR to have a higher-level design discussion about poison message handling?

}
catch (Exception exception)
{
if (eventRaisedEvent.DispatchCount > this.maxDispatchCount)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this check make sense? Presumably we will throw this exception on all other execution attempts as well. Maybe we should just mark this message as "poisoned" immediately in this case regardless of the dispatch count.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It helps to think through the situations where this may happen. I don't have a lot of context here, but the existing exception message gives us a hint that it could be caused by corruption (which is fatal) or it could be caused by a "wrong version", which I assume means that some upgrade caused an issue. In the latter case, this could be non-fatal if the app owner rolls back the change. If we believe there are non-fatal cases, then app owners could potentially benefit from retries because they can quickly roll back (before the retry policy expires) and have the system recover automatically.

this.logHelper.TaskActivityDispatcherError(
workItem,
$"The activity worker received a message that does not have any OrchestrationInstance information.");
if (taskMessage.Event.DispatchCount > this.maxDispatchCount)
Copy link
Copy Markdown
Contributor Author

@sophiatev sophiatev Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this dispatch count check make sense in this and the following cases? Presumably nothing will change across execution attempts. Should we immediately mark the message as "poisoned"?

And is my choice to call CompleteTaskActivityWorkItemAsync(workItem, responseMessage: null); reasonable, and just special-casing the null response message in the backends?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an error case where we've received an invalid message. Rather than throwing an InvalidOperationException, I think we should complete with a TaskFailed response message. Then we don't have to think about poison messages here because we're treating this case as non-retriable.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll comment on the other cases individually.

Copilot AI review requested due to automatic review settings March 24, 2026 20:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings March 25, 2026 19:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants