Add Poison Message Handling to the Dispatchers by sophiatev · Pull Request #1331 · Azure/durabletask

sophiatev · 2026-03-24T19:21:32Z

This PR adds poison message handling to the activity, entity, and orchestration dispatchers. There are a lot of open questions that I need input on which I included as comments below.

src/DurableTask.Core/IOrchestrationService.cs

sophiatev · 2026-03-24T19:26:51Z

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

+                            var failureAction = new OrchestrationCompleteOrchestratorAction
+                            {
+                                Id = runtimeState.PastEvents.Count,
+                                FailureDetails = new FailureDetails(


Given that we perform this logic before any other orchestration processing, we will potentially override the failure of an already-failing-orchestration with this failure. Are we okay with this? The problem of course is that if we attempt to process the poison messages that correspond to the already-failing-orchestration, we could run into issues or exceptions (they are "poisoned" for a reason).

What's an example of an "already-failing orchestration"?

Copilot

Pull request overview

This PR introduces “poison message” handling across the orchestration, entity, and activity dispatchers by tracking per-event dispatch attempts and failing/dropping work once a configured maximum dispatch count is exceeded.

Changes:

Adds DispatchCount to HistoryEvent (and propagates it into entity request messages) and adds MaxDispatchCount to IOrchestrationService.
Adds poison detection logic in TaskOrchestrationDispatcher, TaskEntityDispatcher, and TaskActivityDispatcher to fail/drop over-dispatched messages.
Adds structured logging support for poison message detection.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
src/DurableTask.Core/TaskOrchestrationDispatcher.cs	Detects over-dispatched orchestration events and fails the orchestration with non-retriable `FailureDetails`.
src/DurableTask.Core/TaskEntityDispatcher.cs	Propagates dispatch counts into entity requests, filters/handles poison operations, and emits poison logs/failures.
src/DurableTask.Core/TaskActivityDispatcher.cs	Detects poison activity events and either discards or fails activities based on dispatch count.
src/DurableTask.Core/Logging/LogHelper.cs	Adds `PoisonMessageDetected` structured logging helpers.
src/DurableTask.Core/Logging/LogEvents.cs	Adds a new structured log event type for poison message detection.
src/DurableTask.Core/Logging/EventIds.cs	Introduces a new event id for poison detection logs.
src/DurableTask.Core/IOrchestrationService.cs	Adds `MaxDispatchCount` configuration knob for providers.
src/DurableTask.Core/History/HistoryEvent.cs	Adds `DispatchCount` to all history events for serialization/transport.
src/DurableTask.Core/Entities/OrchestrationEntityContext.cs	Adds `AbandonAcquire()` to reset critical section lock acquisition state.
src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs	Adds `DispatchCount` field to entity request messages.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

src/DurableTask.Core/Logging/LogEvents.cs

src/DurableTask.Core/TaskActivityDispatcher.cs

src/DurableTask.Core/TaskEntityDispatcher.cs

src/DurableTask.Core/Logging/LogHelper.cs

src/DurableTask.Core/TaskActivityDispatcher.cs

src/DurableTask.Core/IOrchestrationService.cs

sophiatev · 2026-03-24T19:38:34Z

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

Currently the poison-message-processing occurs after the call to ReconcileMessagesWithState. I wasn't sure if we wanted to alter the behavior of this method.

As far as I understand, currently what is done is if we get a work item with an "invalid state", we just skip processing it entirely and call CompleteTaskOrchestrationWorkItemAsync on it immediately, which will ultimately just delete the messages for this invalid orchestration from the queue without processing them in any way. There is currently a TODO to mark the orchestration as "failed" in this case, but this seems non-trivial - just as one example, if we somehow don't have an ExecutionStartedEvent in the orchestration's history then I am not sure we want to commit an ExecutionCompletedEvent. This edge case might be complicated to handle depending on each backend's implementation.

Additionally, this would itself be an edge case in terms of poison message handling. If we have a corrupted orchestration we probably don't want to continuously attempt dequeuing the messages until we hit the max dispatch count since presumably nothing will change across these execution attempts. We would have to add logic to instead, on the first dispatch, immediately mark these messages as "poisoned", add them to the poison storage, and delete them from the "normal" storage. (This isn't hard, just thought it worth mentioning).

So, all that to say - do we want to change the behavior of this method or no?

If we don't have an ExecutionStartedEvent in the history than I'm not sure we have enough information to reliably mark the orchestration as failed. The orchestration data might have been purged. Dropping the message is probably the best option since this is almost certainly a non-recoverable problem.

Are there any other cases you're concerned about? I'm not sure how to answer the question about whether we want to change the behavior. I think the simple answer is that it's fine to change the behavior - just be thoughtful about behavior changes. One example change might be to have it return a WorkItemDecision enum instead of a bool, with values such as .Process, .Drop, or .Poison if you want to add poison message handling logic.

sophiatev · 2026-03-24T20:06:17Z

src/DurableTask.Core/TaskEntityDispatcher.cs

+                                        $"exceeds the maximum dispatch count of {this.maxDispatchCount}. Stopping processing of remaining requests.");
+                                    break;
+                                }
                                throw new EntitySchedulerException("Failed to deserialize incoming request message - may be corrupted or wrong version.", exception);


What do we want to do if we encounter this deserialization error for the ExecutionStartedEvent? Currently I just left it as throwing an exception rather than trying to add poison message handling. The problem is if we can't deserialize this initial event, we can't process any operations afterwards since it contains the entity state. Maybe a reasonable choice would be if we encounter a deserialization exception for the ExecutionStartedEvent, we don't attempt to process any other messages, we move the event to poison storage, and delete the entire message batch from the backend storage. (If we don't want to delete the entire message we will need to add special logic to handle this specific case in the backend implementations).

Taking a step back, I worry that having case-by-case checks for DispatchCount scattered throughout the dispatcher code will lead to a whack-a-mole situation that's hard for customers and for us to reason about. My expectation is that we'd check for DispatchCount in a single place and apply a simple policy. Would it be helpful to pause work on this PR to have a higher-level design discussion about poison message handling?

sophiatev · 2026-03-24T20:06:59Z

src/DurableTask.Core/TaskEntityDispatcher.cs

                            }
                            catch (Exception exception)
                            {
+                                if (eventRaisedEvent.DispatchCount > this.maxDispatchCount)


Does this check make sense? Presumably we will throw this exception on all other execution attempts as well. Maybe we should just mark this message as "poisoned" immediately in this case regardless of the dispatch count.

It helps to think through the situations where this may happen. I don't have a lot of context here, but the existing exception message gives us a hint that it could be caused by corruption (which is fatal) or it could be caused by a "wrong version", which I assume means that some upgrade caused an issue. In the latter case, this could be non-fatal if the app owner rolls back the change. If we believe there are non-fatal cases, then app owners could potentially benefit from retries because they can quickly roll back (before the retry policy expires) and have the system recover automatically.

sophiatev · 2026-03-24T20:11:59Z

src/DurableTask.Core/TaskActivityDispatcher.cs

                    this.logHelper.TaskActivityDispatcherError(
                        workItem,
                        $"The activity worker received a message that does not have any OrchestrationInstance information.");
+                    if (taskMessage.Event.DispatchCount > this.maxDispatchCount)


Does this dispatch count check make sense in this and the following cases? Presumably nothing will change across execution attempts. Should we immediately mark the message as "poisoned"?

And is my choice to call CompleteTaskActivityWorkItemAsync(workItem, responseMessage: null); reasonable, and just special-casing the null response message in the backends?

This looks like an error case where we've received an invalid message. Rather than throwing an InvalidOperationException, I think we should complete with a TaskFailed response message. Then we don't have to think about poison messages here because we're treating this case as non-retriable.

I'll comment on the other cases individually.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/DurableTask.Core/TaskActivityDispatcher.cs

src/DurableTask.Core/History/HistoryEvent.cs

src/DurableTask.Core/TaskActivityDispatcher.cs

src/DurableTask.ServiceBus/ServiceBusOrchestrationService.cs

src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs

src/DurableTask.Core/Logging/LogEvents.cs

src/DurableTask.Core/TaskActivityDispatcher.cs

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

src/DurableTask.Core/TaskEntityDispatcher.cs

…dingMessage

…arameters

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

src/DurableTask.Core/TaskActivityDispatcher.cs

src/DurableTask.Core/TaskEntityDispatcher.cs

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

src/DurableTask.Core/TaskActivityDispatcher.cs

src/DurableTask.Core/TaskEntityDispatcher.cs

src/DurableTask.Core/Logging/LogHelper.cs

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

initial implementation

de244ff

Copilot AI review requested due to automatic review settings March 24, 2026 19:21

Copilot started reviewing on behalf of sophiatev March 24, 2026 19:22 View session

sophiatev commented Mar 24, 2026

View reviewed changes

src/DurableTask.Core/IOrchestrationService.cs Outdated Show resolved Hide resolved

sophiatev commented Mar 24, 2026

View reviewed changes

Copilot AI reviewed Mar 24, 2026

View reviewed changes

sophiatev commented Mar 24, 2026

View reviewed changes

fixing the compilation errors

8f6bf9d

sophiatev commented Mar 24, 2026

View reviewed changes

addressing copilot comments

ea5a36f

Copilot AI review requested due to automatic review settings March 24, 2026 20:16

Copilot started reviewing on behalf of sophiatev March 24, 2026 20:18 View session

Merge branch 'main' into stevosyan/add-poison-message-handling

3ab0859

Copilot AI reviewed Mar 24, 2026

View reviewed changes

Sophia Tevosyan added 2 commits March 24, 2026 15:18

fixing the error in the logger where I was incorrectly calling Discar…

4c7665d

…dingMessage

moved the max dispatch count from IOrchestrationService to dispatch p…

3bd1dc9

…arameters

Copilot AI review requested due to automatic review settings March 25, 2026 19:37

Copilot started reviewing on behalf of sophiatev March 25, 2026 19:38 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

Conversation

sophiatev commented Mar 24, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sophiatev Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sophiatev Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sophiatev Mar 24, 2026 •

edited

Loading

sophiatev Mar 24, 2026 •

edited

Loading