Error Recovery Patterns

Comprehensive patterns and strategies for recovering from errors in BPMN processes.

Overview

This guide covers proven recovery patterns for:

  • Retry strategies - Automatic retry with backoff
  • Fallback mechanisms - Alternative data sources or services
  • Circuit breaker - Protect against cascading failures
  • Compensating actions - Undo partial work
  • Human intervention - Escalate complex failures

Pattern 1: Exponential Backoff Retry

Automatically retry failed operations with increasing delays between attempts.

Implementation

<bpmn:serviceTask id="CallExternalAPI" name="Call External API">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="CallPartnerAPICommand"/>
<custom:property name="ResultVariable" value="apiResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>

<!-- Non-interrupting for retryable errors -->
<!-- The event id differs from the bpmn:error id it references, because XML ids must be unique -->
<bpmn:boundaryEvent id="RetryableErrorEvent"
attachedToRef="CallExternalAPI"
cancelActivity="false">
<bpmn:errorEventDefinition errorRef="RetryableError" />
</bpmn:boundaryEvent>

<!-- Interrupting for exhausted retries -->
<!-- Attached to CalculateRetry, the script task that throws API_RETRY_EXHAUSTED -->
<bpmn:boundaryEvent id="RetriesExhausted"
attachedToRef="CalculateRetry"
cancelActivity="true">
<bpmn:errorEventDefinition errorRef="ExhaustedError" />
</bpmn:boundaryEvent>

<!-- Calculate retry delay -->
<bpmn:scriptTask id="CalculateRetry" name="Calculate Retry">
<bpmn:script>
var retryCount = (context.apiRetryCount || 0) + 1;
var maxRetries = 5;

if (retryCount > maxRetries) {
// Max retries reached - convert to permanent error
throw new BpmnError('API_RETRY_EXHAUSTED',
'Failed after ' + maxRetries + ' retry attempts');
}

context.apiRetryCount = retryCount;

// Exponential backoff: 2^retryCount * 1000ms
// Retry 1: 2s, Retry 2: 4s, Retry 3: 8s, Retry 4: 16s, Retry 5: 32s
var delayMs = Math.pow(2, retryCount) * 1000;

// Add jitter (random 0-500ms) to prevent thundering herd
delayMs += Math.random() * 500;

context.retryDelayMs = delayMs;

logger.info('API retry attempt ' + retryCount + ' of ' + maxRetries +
' after ' + Math.round(delayMs) + 'ms delay');

return {
retryCount: retryCount,
delayMs: Math.round(delayMs)
};
</bpmn:script>
</bpmn:scriptTask>

<!-- Wait before retry -->
<bpmn:intermediateCatchEvent id="WaitBeforeRetry" name="Wait">
<bpmn:timerEventDefinition>
<bpmn:timeDuration xsi:type="bpmn:tFormalExpression">
PT${Math.round(context.retryDelayMs / 1000)}S
</bpmn:timeDuration>
</bpmn:timerEventDefinition>
</bpmn:intermediateCatchEvent>

<!-- Loop back to retry -->
<bpmn:sequenceFlow sourceRef="WaitBeforeRetry" targetRef="CallExternalAPI" />

<!-- Handle permanent failure -->
<bpmn:scriptTask id="HandlePermanentFailure" name="Handle Permanent Failure">
<bpmn:script>
var error = context._lastError;

logger.error('API permanently failed after ' + context.apiRetryCount +
' retries: ' + error.errorMessage);

// Send alert
BankLingo.ExecuteCommand('SendAlert', {
severity: 'HIGH',
message: 'External API permanently unavailable',
context: {
errorCode: error.errorCode,
errorMessage: error.errorMessage,
retries: context.apiRetryCount
}
});

context.apiCallFailed = true;
context.useAlternativeMethod = true;
</bpmn:script>
</bpmn:scriptTask>

<bpmn:error id="RetryableError" errorCode="GATEWAY_TIMEOUT" />
<bpmn:error id="ExhaustedError" errorCode="API_RETRY_EXHAUSTED" />
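
The delay calculation in the CalculateRetry script can be exercised outside the engine. The following is a minimal sketch, not the engine's implementation: the function names (computeRetryDelayMs, toTimerDuration, buildSchedule) and the injected random parameter are assumptions introduced here so the schedule can be checked deterministically.

```javascript
// Standalone version of the backoff calculation: 2^retryCount * 1000 ms plus jitter.
// `random` is a hypothetical injection point; it defaults to Math.random.
function computeRetryDelayMs(retryCount, maxJitterMs, random) {
  var baseMs = Math.pow(2, retryCount) * 1000; // 2s, 4s, 8s, ...
  var jitterMs = (random || Math.random)() * maxJitterMs;
  return baseMs + jitterMs;
}

// Convert milliseconds to a whole-second ISO 8601 duration, as the timer expression does.
function toTimerDuration(delayMs) {
  return 'PT' + Math.round(delayMs / 1000) + 'S';
}

// Build the timer durations for retries 1..maxRetries.
function buildSchedule(maxRetries, maxJitterMs, random) {
  var schedule = [];
  for (var retry = 1; retry <= maxRetries; retry++) {
    schedule.push(toTimerDuration(computeRetryDelayMs(retry, maxJitterMs, random)));
  }
  return schedule;
}

// With jitter disabled the schedule is PT2S, PT4S, PT8S, PT16S, PT32S.
var schedule = buildSchedule(5, 500, function () { return 0; });
```

Because the timer expression rounds to whole seconds, jitter below 500 ms never changes the resulting duration (any delay from 2.0 s up to just under 2.5 s rounds to PT2S). A wider jitter window, or a timer with sub-second resolution if the engine supports one, would be needed for the jitter to actually spread retries.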

Use Cases

  • ✅ Network timeouts
  • ✅ Temporary service unavailability
  • ✅ Rate limit exceeded (with longer backoff)
  • ✅ Database deadlocks

Pattern 2: Fallback Cascade

Try multiple data sources or services in order of preference.

Implementation

<bpmn:process id="FallbackCascade">

<!-- Try primary service -->
<bpmn:serviceTask id="PrimaryService" name="Primary Service">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="PrimaryServiceCommand"/>
<custom:property name="ResultVariable" value="serviceResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>

<!-- Primary failed -->
<bpmn:boundaryEvent id="PrimaryFailed"
attachedToRef="PrimaryService"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>

<!-- Log and try fallback -->
<bpmn:scriptTask id="LogPrimaryFailure" name="Log Primary Failure">
<bpmn:script>
logger.warn('Primary service failed, trying fallback service');
context.primaryServiceFailed = true;
</bpmn:script>
</bpmn:scriptTask>

<!-- Try fallback service -->
<bpmn:serviceTask id="FallbackService" name="Fallback Service">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="FallbackServiceCommand"/>
<custom:property name="ResultVariable" value="serviceResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>

<!-- Fallback failed -->
<bpmn:boundaryEvent id="FallbackFailed"
attachedToRef="FallbackService"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>

<!-- Log and try cache -->
<bpmn:scriptTask id="LogFallbackFailure" name="Log Fallback Failure">
<bpmn:script>
logger.warn('Fallback service failed, trying cached data');
context.fallbackServiceFailed = true;
</bpmn:script>
</bpmn:scriptTask>

<!-- Try cached data -->
<bpmn:scriptTask id="UseCachedData" name="Use Cached Data">
<bpmn:script>
if (!context.cachedServiceData) {
throw new BpmnError('NO_DATA_AVAILABLE',
'All data sources failed and no cache available');
}

var cacheAge = Date.now() - new Date(context.cacheTimestamp).getTime();
var cacheAgeHours = Math.floor(cacheAge / 1000 / 60 / 60); // whole hours elapsed, so 23.6h is not rounded up to 24

logger.info('Using cached data (age: ' + cacheAgeHours + ' hours)');

context.serviceResult = context.cachedServiceData;
context.dataSource = 'CACHE';
context.dataFreshness = cacheAgeHours < 24 ? 'RECENT' : 'STALE';
context.cacheAgeHours = cacheAgeHours;

return {
usedCache: true,
cacheAgeHours: cacheAgeHours
};
</bpmn:script>
</bpmn:scriptTask>

<!-- Cache also failed -->
<bpmn:boundaryEvent id="CacheFailed"
attachedToRef="UseCachedData"
cancelActivity="true">
<bpmn:errorEventDefinition errorRef="NoDataError" />
</bpmn:boundaryEvent>

<!-- All sources failed - final error -->
<bpmn:scriptTask id="HandleCompleteFailure" name="Handle Complete Failure">
<bpmn:script>
logger.error('All data sources failed (primary, fallback, cache)');

// Send critical alert
BankLingo.ExecuteCommand('SendCriticalAlert', {
message: 'All data sources unavailable',
impact: 'CRITICAL'
});

context.dataUnavailable = true;
</bpmn:script>
</bpmn:scriptTask>

<bpmn:error id="NoDataError" errorCode="NO_DATA_AVAILABLE" />

</bpmn:process>
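
The cascade above (primary, then fallback, then cache) follows a general shape: try each source in order and record where the data came from. The sketch below illustrates that shape in plain JavaScript. The names fetchWithFallback and classifyFreshness, and the { name, fetch } source objects, are hypothetical and not part of the process definition.

```javascript
// Try each source in order and return the first successful result,
// tagged with the source it came from. fetch() is expected to throw on failure.
function fetchWithFallback(sources) {
  var failures = [];
  for (var i = 0; i < sources.length; i++) {
    try {
      var data = sources[i].fetch();
      return { data: data, dataSource: sources[i].name, failures: failures };
    } catch (e) {
      failures.push(sources[i].name + ': ' + e.message);
    }
  }
  var error = new Error('All data sources failed: ' + failures.join('; '));
  error.code = 'NO_DATA_AVAILABLE';
  throw error;
}

// Classify cache age against the 24-hour threshold used above (no rounding first).
function classifyFreshness(cacheTimestamp, nowMs) {
  var ageHours = (nowMs - new Date(cacheTimestamp).getTime()) / 1000 / 60 / 60;
  return ageHours < 24 ? 'RECENT' : 'STALE';
}

var result = fetchWithFallback([
  { name: 'PRIMARY', fetch: function () { throw new Error('timeout'); } },
  { name: 'FALLBACK', fetch: function () { throw new Error('unavailable'); } },
  { name: 'CACHE', fetch: function () { return { rate: 1.25 }; } }
]);
// result.dataSource is 'CACHE' and result.failures holds the two earlier errors.
```

Returning the source name alongside the data mirrors context.dataSource in the script above, so downstream steps can tell fresh data from cached data.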

Use Cases

  • ✅ External API unavailability
  • ✅ Database failover
  • ✅ Content delivery networks (CDN fallback)
  • ✅ Multi-region redundancy

Pattern 3: Circuit Breaker

Prevent cascading failures by "opening the circuit" after repeated failures.

Implementation

<bpmn:scriptTask id="CheckCircuitBreaker" name="Check Circuit Breaker">
<bpmn:script>
var serviceName = 'PaymentGateway';
var circuitKey = 'circuit_' + serviceName;

// Circuit state: CLOSED, OPEN, HALF_OPEN
var circuitState = context[circuitKey + '_state'] || 'CLOSED';
var failureCount = context[circuitKey + '_failures'] || 0;
var lastFailureTime = context[circuitKey + '_lastFailure'];

// Configuration
var failureThreshold = 10; // Open after 10 failures
var resetTimeout = 300000; // Try half-open after 5 minutes
var successThreshold = 3; // Close after 3 successes in half-open

context[circuitKey + '_config'] = {
failureThreshold: failureThreshold,
resetTimeout: resetTimeout,
successThreshold: successThreshold
};

if (circuitState === 'OPEN') {
// Circuit is open - check if we should try half-open
var now = Date.now();
var timeSinceFailure = now - new Date(lastFailureTime).getTime();

if (timeSinceFailure > resetTimeout) {
// Try half-open
context[circuitKey + '_state'] = 'HALF_OPEN';
context[circuitKey + '_halfOpenSuccesses'] = 0;
logger.info('Circuit breaker HALF_OPEN for ' + serviceName + ' (testing)');
} else {
// Still open - fail fast
var remainingMs = resetTimeout - timeSinceFailure;
var remainingMin = Math.ceil(remainingMs / 1000 / 60);

throw new BpmnError('CIRCUIT_BREAKER_OPEN',
serviceName + ' circuit breaker is OPEN. ' +
'Service unavailable for ' + remainingMin + ' more minutes.');
}
}

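return {
// Fall back to CLOSED: the state key is not written until the circuit first changes state
circuitState: context[circuitKey + '_state'] || 'CLOSED',
failureCount: failureCount
};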
</bpmn:script>
</bpmn:scriptTask>

<!-- Circuit is closed or half-open - proceed with call -->
<bpmn:serviceTask id="CallProtectedService" name="Call Service">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="PaymentGatewayCommand"/>
<custom:property name="ResultVariable" value="paymentResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>

<!-- Handle success -->
<bpmn:scriptTask id="HandleServiceSuccess" name="Handle Success">
<bpmn:script>
var serviceName = 'PaymentGateway';
var circuitKey = 'circuit_' + serviceName;
var circuitState = context[circuitKey + '_state'];

if (circuitState === 'HALF_OPEN') {
// Increment success count in half-open
var successCount = (context[circuitKey + '_halfOpenSuccesses'] || 0) + 1;
context[circuitKey + '_halfOpenSuccesses'] = successCount;

var successThreshold = context[circuitKey + '_config'].successThreshold;

if (successCount >= successThreshold) {
// Enough successes - close circuit
context[circuitKey + '_state'] = 'CLOSED';
context[circuitKey + '_failures'] = 0;
logger.info('Circuit breaker CLOSED for ' + serviceName + ' (recovered)');
} else {
logger.info('Circuit breaker HALF_OPEN: ' + successCount + ' of ' +
successThreshold + ' successes');
}
}

return { success: true };
</bpmn:script>
</bpmn:scriptTask>

<!-- Handle failure -->
<bpmn:boundaryEvent id="ServiceFailed"
attachedToRef="CallProtectedService"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>

<bpmn:scriptTask id="HandleServiceFailure" name="Handle Failure">
<bpmn:script>
var serviceName = 'PaymentGateway';
var circuitKey = 'circuit_' + serviceName;
var circuitState = context[circuitKey + '_state'];
var error = context._lastError;

logger.error('Service call failed: ' + error.errorMessage);

if (circuitState === 'HALF_OPEN') {
// Failure in half-open - reopen circuit
context[circuitKey + '_state'] = 'OPEN';
context[circuitKey + '_lastFailure'] = new Date().toISOString();
logger.warn('Circuit breaker OPEN for ' + serviceName + ' (half-open test failed)');
} else {
// Increment failure count
var failureCount = (context[circuitKey + '_failures'] || 0) + 1;
context[circuitKey + '_failures'] = failureCount;
context[circuitKey + '_lastFailure'] = new Date().toISOString();

var failureThreshold = context[circuitKey + '_config'].failureThreshold;

if (failureCount >= failureThreshold) {
// Open circuit
context[circuitKey + '_state'] = 'OPEN';
logger.error('Circuit breaker OPEN for ' + serviceName +
' (threshold reached: ' + failureCount + ' failures)');

// Send alert
BankLingo.ExecuteCommand('SendAlert', {
severity: 'HIGH',
message: 'Circuit breaker opened for ' + serviceName,
failureCount: failureCount
});
} else {
logger.warn('Circuit breaker: ' + failureCount + ' of ' +
failureThreshold + ' failures');
}
}

// Re-throw error for process handling
throw new BpmnError(error.errorCode, error.errorMessage);
</bpmn:script>
</bpmn:scriptTask>

Circuit States

State      Description        Behavior
CLOSED     Normal operation   Calls proceed normally
OPEN       Too many failures  Fail fast without calling service
HALF_OPEN  Testing recovery   Allow limited calls to test if service recovered
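
The three states and their transitions can be expressed as a small state machine that is independent of the process context. This is a sketch under assumptions: the function names (newBreaker, allowRequest, recordSuccess, recordFailure) and the breaker object shape are introduced here for illustration, and the thresholds copy the configuration values from the script above.

```javascript
// Thresholds taken from the CheckCircuitBreaker configuration.
var CONFIG = { failureThreshold: 10, resetTimeoutMs: 300000, successThreshold: 3 };

function newBreaker() {
  return { state: 'CLOSED', failures: 0, halfOpenSuccesses: 0, lastFailureAt: null };
}

// Called before each request; returns true if the call may proceed.
function allowRequest(breaker, nowMs) {
  if (breaker.state !== 'OPEN') return true;
  if (nowMs - breaker.lastFailureAt > CONFIG.resetTimeoutMs) {
    breaker.state = 'HALF_OPEN'; // reset timeout elapsed: allow a trial call
    breaker.halfOpenSuccesses = 0;
    return true;
  }
  return false; // still open: fail fast
}

// Called after a successful request.
function recordSuccess(breaker) {
  if (breaker.state !== 'HALF_OPEN') return;
  breaker.halfOpenSuccesses += 1;
  if (breaker.halfOpenSuccesses >= CONFIG.successThreshold) {
    breaker.state = 'CLOSED';
    breaker.failures = 0;
  }
}

// Called after a failed request.
function recordFailure(breaker, nowMs) {
  breaker.lastFailureAt = nowMs;
  if (breaker.state === 'HALF_OPEN') {
    breaker.state = 'OPEN'; // trial call failed: reopen immediately
    return;
  }
  breaker.failures += 1;
  if (breaker.failures >= CONFIG.failureThreshold) breaker.state = 'OPEN';
}
```

As in the scripts above, the failure count is cumulative rather than consecutive: a success while CLOSED does not reset it. Whether that is the desired behavior depends on the service being protected.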

Use Cases

  • ✅ External payment gateways
  • ✅ Third-party APIs
  • ✅ Microservice dependencies
  • ✅ Database connections

Pattern 4: Compensating Transactions

Undo partially completed work when an error occurs.

Implementation

<bpmn:process id="BookingWithCompensation">

<!-- Step 1: Reserve flight -->
<bpmn:serviceTask id="ReserveFlight" name="Reserve Flight">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="ReserveFlightCommand"/>
<custom:property name="ResultVariable" value="flightReservation"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>

<!-- Step 2: Reserve hotel -->
<bpmn:serviceTask id="ReserveHotel" name="Reserve Hotel">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="ReserveHotelCommand"/>
<custom:property name="ResultVariable" value="hotelReservation"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>

<!-- Hotel reservation failed -->
<bpmn:boundaryEvent id="HotelReservationFailed"
attachedToRef="ReserveHotel"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>

<!-- Compensate: Cancel flight -->
<bpmn:scriptTask id="CancelFlight" name="Cancel Flight Reservation">
<bpmn:script>
var error = context._lastError;

logger.warn('Hotel reservation failed: ' + error.errorMessage);
logger.info('Compensating: Canceling flight reservation');

// Cancel the flight reservation
BankLingo.ExecuteCommand('CancelFlightReservation', {
reservationId: context.flightReservation.reservationId
});

context.compensationPerformed = true;
context.compensationReason = 'Hotel reservation failed';

return {
flightCanceled: true,
reason: error.errorMessage
};
</bpmn:script>
</bpmn:scriptTask>

<!-- Step 3: Charge payment -->
<bpmn:serviceTask id="ChargePayment" name="Charge Payment">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="ChargePaymentCommand"/>
<custom:property name="ResultVariable" value="paymentResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>

<!-- Payment failed -->
<bpmn:boundaryEvent id="PaymentFailed"
attachedToRef="ChargePayment"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>

<!-- Compensate: Cancel both flight and hotel -->
<bpmn:scriptTask id="CancelAllReservations" name="Cancel All Reservations">
<bpmn:script>
var error = context._lastError;

logger.warn('Payment failed: ' + error.errorMessage);
logger.info('Compensating: Canceling all reservations');

// Cancel flight
BankLingo.ExecuteCommand('CancelFlightReservation', {
reservationId: context.flightReservation.reservationId
});

// Cancel hotel
BankLingo.ExecuteCommand('CancelHotelReservation', {
reservationId: context.hotelReservation.reservationId
});

// Notify customer
BankLingo.ExecuteCommand('SendEmail', {
to: context.customerEmail,
subject: 'Booking Failed - Payment Issue',
body: 'Your booking could not be completed due to payment failure. ' +
'All reservations have been canceled. ' + error.errorMessage
});

context.compensationPerformed = true;
context.compensationReason = 'Payment failed';

return {
allCanceled: true,
customerNotified: true
};
</bpmn:script>
</bpmn:scriptTask>

</bpmn:process>
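
The booking process above hard-codes which reservations to cancel at each failure point. A more general form of the same idea keeps a list of completed steps and runs their compensations in reverse order. The sketch below shows this outside the engine; runWithCompensation and the { name, execute, compensate } step shape are hypothetical names, not part of the platform.

```javascript
// Run steps in order. If one fails, compensate the steps that already
// completed, in reverse order, then rethrow the original error.
// A step without a compensate function is simply skipped during rollback.
function runWithCompensation(steps, ctx) {
  var completed = [];
  for (var i = 0; i < steps.length; i++) {
    try {
      steps[i].execute(ctx);
      completed.push(steps[i]);
    } catch (error) {
      ctx.compensated = [];
      for (var j = completed.length - 1; j >= 0; j--) {
        if (completed[j].compensate) {
          completed[j].compensate(ctx);
          ctx.compensated.push(completed[j].name);
        }
      }
      ctx.compensationReason = steps[i].name + ' failed';
      throw error;
    }
  }
  return ctx;
}

var log = [];
var bookingSteps = [
  { name: 'ReserveFlight',
    execute: function () { log.push('flight reserved'); },
    compensate: function () { log.push('flight canceled'); } },
  { name: 'ReserveHotel',
    execute: function () { log.push('hotel reserved'); },
    compensate: function () { log.push('hotel canceled'); } },
  { name: 'ChargePayment',
    execute: function () { throw new Error('card declined'); } }
];
var bookingCtx = {};
try {
  runWithCompensation(bookingSteps, bookingCtx);
} catch (e) {
  // bookingCtx.compensated lists ReserveHotel, then ReserveFlight
}
```

This sketch does not handle a compensation that itself fails. In practice that case usually needs its own alert or a manual review step.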

Use Cases

  • ✅ Multi-step bookings/reservations
  • ✅ Distributed transactions
  • ✅ Saga pattern implementation
  • ✅ Financial transactions with multiple steps

Pattern 5: Manual Intervention

Escalate complex errors to human operators.

Implementation

<bpmn:serviceTask id="AutomatedProcessing" name="Automated Processing">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="ProcessDataCommand"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>

<!-- Catch all errors -->
<bpmn:boundaryEvent id="ProcessingError"
attachedToRef="AutomatedProcessing"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>

<!-- Determine if manual intervention needed -->
<bpmn:scriptTask id="AnalyzeError" name="Analyze Error">
<bpmn:script>
var error = context._lastError;

// Categorize error
var errorCategory = 'UNKNOWN';
var requiresManualIntervention = false;

if (error.errorCode === 'DATA_INCONSISTENCY' ||
error.errorCode === 'MISSING_DATA' ||
error.errorCode === 'INVALID_DATA') {
errorCategory = 'DATA_QUALITY';
requiresManualIntervention = true;
} else if (error.errorCode === 'BUSINESS_RULE_VIOLATION') {
errorCategory = 'BUSINESS_RULE';
requiresManualIntervention = true;
} else if (error.errorCode === 'GATEWAY_TIMEOUT') {
errorCategory = 'TRANSIENT';
requiresManualIntervention = false; // Can retry
}

context.errorCategory = errorCategory;
context.requiresManualIntervention = requiresManualIntervention;

logger.info('Error categorized as: ' + errorCategory +
', Manual intervention: ' + requiresManualIntervention);

return {
category: errorCategory,
manual: requiresManualIntervention
};
</bpmn:script>
</bpmn:scriptTask>

<!-- Gateway: Manual intervention needed? -->
<bpmn:exclusiveGateway id="NeedsManualReview" name="Needs Manual Review?"/>

<bpmn:sequenceFlow sourceRef="NeedsManualReview" targetRef="CreateManualTask">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">
${context.requiresManualIntervention === true}
</bpmn:conditionExpression>
</bpmn:sequenceFlow>

<!-- Create manual review task -->
<bpmn:userTask id="CreateManualTask" name="Manual Error Review">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="FormKey" value="error-review-form"/>
<custom:property name="ResponsibleTeams" value="Operations,Support"/>
<custom:property name="Priority" value="HIGH"/>
<custom:property name="Description"><![CDATA[
Automated processing failed with error that requires manual review.

Error Code: {{context._lastError.errorCode}}
Error Message: {{context._lastError.errorMessage}}
Error Category: {{context.errorCategory}}

Please review and take appropriate action.
]]></custom:property>
<custom:property name="PreScript"><![CDATA[
// Prepare context for form
formContext.error = context._lastError;
formContext.errorCategory = context.errorCategory;
formContext.processData = {
customerId: context.customerId,
transactionId: context.transactionId,
amount: context.amount
};
]]></custom:property>
<custom:property name="ServerScript"><![CDATA[
// Process operator's decision
context.operatorDecision = formData.decision; // 'RETRY', 'SKIP', 'CANCEL'
context.operatorComments = formData.comments;
context.reviewedBy = formData.userId;
context.reviewedAt = new Date().toISOString();

return {
decision: formData.decision,
reviewedBy: formData.userId
};
]]></custom:property>
</custom:properties>
</bpmn:extensionElements>
</bpmn:userTask>

<!-- Process operator decision -->
<bpmn:exclusiveGateway id="OperatorDecision" name="Operator Decision"/>

<!-- Retry -->
<bpmn:sequenceFlow sourceRef="OperatorDecision" targetRef="AutomatedProcessing">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">
${context.operatorDecision === 'RETRY'}
</bpmn:conditionExpression>
</bpmn:sequenceFlow>

<!-- Skip -->
<bpmn:sequenceFlow sourceRef="OperatorDecision" targetRef="SkipAndContinue">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">
${context.operatorDecision === 'SKIP'}
</bpmn:conditionExpression>
</bpmn:sequenceFlow>

<!-- Cancel -->
<bpmn:sequenceFlow sourceRef="OperatorDecision" targetRef="CancelProcess">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">
${context.operatorDecision === 'CANCEL'}
</bpmn:conditionExpression>
</bpmn:sequenceFlow>
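
The if/else chain in the AnalyzeError script can also be written as a lookup table, which tends to be easier to extend as new error codes appear. The following is a sketch only: ERROR_CATEGORIES and categorizeError are hypothetical names, and the mappings are copied from the script above.

```javascript
// Error code -> category and whether a human needs to review it.
var ERROR_CATEGORIES = {
  DATA_INCONSISTENCY: { category: 'DATA_QUALITY', manual: true },
  MISSING_DATA: { category: 'DATA_QUALITY', manual: true },
  INVALID_DATA: { category: 'DATA_QUALITY', manual: true },
  BUSINESS_RULE_VIOLATION: { category: 'BUSINESS_RULE', manual: true },
  GATEWAY_TIMEOUT: { category: 'TRANSIENT', manual: false } // can retry
};

// Look up a code; unknown codes keep the original default (UNKNOWN, no manual review).
function categorizeError(errorCode) {
  if (Object.prototype.hasOwnProperty.call(ERROR_CATEGORIES, errorCode)) {
    return ERROR_CATEGORIES[errorCode];
  }
  return { category: 'UNKNOWN', manual: false };
}

var decision = categorizeError('MISSING_DATA');
// decision.category is 'DATA_QUALITY' and decision.manual is true
```

Unknown codes are not routed to a human here because that matches the script's default. Some teams may prefer to send uncategorized errors to manual review instead.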

Use Cases

  • ✅ Complex data quality issues
  • ✅ Business rule exceptions
  • ✅ Ambiguous situations requiring judgment
  • ✅ High-value transactions with errors

Best Practices

✅ Do This

// ✅ Track retry attempts
var retryCount = (context.retryCount || 0) + 1;
context.retryCount = retryCount;

// ✅ Use exponential backoff
var delayMs = Math.pow(2, retryCount) * 1000;

// ✅ Add jitter to prevent thundering herd
delayMs += Math.random() * 500;

// ✅ Set max retries
var maxRetries = 5;
if (retryCount > maxRetries) {
throw new BpmnError('RETRY_EXHAUSTED', 'Failed after ' + maxRetries + ' attempts');
}

// ✅ Log all recovery attempts
logger.info('Recovery attempt ' + retryCount + ' of ' + maxRetries + ' (exponential backoff)');

// ✅ Send alerts on critical failures
BankLingo.ExecuteCommand('SendAlert', { severity: 'HIGH', message: '...' });

// ✅ Track circuit breaker state
context.circuitState = 'OPEN';
context.circuitLastFailure = new Date().toISOString();

❌ Don't Do This

// ❌ Infinite retries
while (true) {
try { /* ... */ } catch(e) { /* retry forever */ }
}

// ❌ No retry delays (thundering herd)
for (var i = 0; i < 10; i++) {
try { callAPI(); break; } catch(e) { /* immediate retry */ }
}

// ❌ No error logging
try { callAPI(); } catch (error) {
// No logging - impossible to debug
}

// ❌ No alerting on critical failures
if (allSourcesFailed) {
return null; // Silent failure
}

// ❌ No compensation for partial work
// Transaction 1: Success
// Transaction 2: Success
// Transaction 3: Failed
// -> No rollback of 1 and 2

Error Recovery Decision Tree

Error Occurred
├─ Is it transient? (timeout, network, etc.)
│   ├─ YES → Retry with exponential backoff
│   │   └─ Max retries reached?
│   │       ├─ YES → Try fallback
│   │       └─ NO → Continue retrying
│   └─ NO → Continue
│
├─ Is alternative available? (fallback service, cache)
│   ├─ YES → Try fallback
│   │   └─ Fallback failed?
│   │       ├─ YES → Circuit breaker or manual intervention
│   │       └─ NO → Use fallback data (mark as such)
│   └─ NO → Continue
│
├─ Is partial work done?
│   ├─ YES → Compensate (undo partial work)
│   └─ NO → Continue
│
├─ Is circuit breaker applicable?
│   ├─ YES → Open circuit, fail fast for duration
│   └─ NO → Continue
│
└─ Requires human judgment?
    ├─ YES → Create manual review task
    └─ NO → Log and fail process
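
One way to read the tree is as an ordered list of checks where the first applicable branch wins. The sketch below encodes that reading. It is an interpretation, not a platform feature: chooseRecoveryAction, the boolean fields on the situation object, and the returned action names are all hypothetical.

```javascript
// Walk the decision tree top to bottom and return the first applicable action.
// Each field on `s` answers one question from the tree; missing fields count as "no".
function chooseRecoveryAction(s) {
  if (s.isTransient && !s.retriesExhausted) return 'RETRY_WITH_BACKOFF';
  if (s.fallbackAvailable && !s.fallbackFailed) return 'TRY_FALLBACK';
  if (s.partialWorkDone && !s.compensated) return 'COMPENSATE';
  if (s.circuitBreakerApplicable && !s.circuitOpen) return 'OPEN_CIRCUIT';
  if (s.requiresHumanJudgment) return 'MANUAL_REVIEW';
  return 'LOG_AND_FAIL';
}

// A timeout whose retries are exhausted, with a fallback service available:
var action = chooseRecoveryAction({
  isTransient: true,
  retriesExhausted: true,
  fallbackAvailable: true
});
// action is 'TRY_FALLBACK'
```

Calling the function again after each recovery step, with the updated flags, moves the process down the tree one branch at a time.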

Features Used:

  • Phase 5: Error Handling
  • Phase 4: Timer Events (retry delays)
  • Phase 2: Async Boundaries (background compensation)

Status: ✅ Production Ready
Version: 2.0
Last Updated: January 2026