The Transaction Recovery Service provides automatic re-registration of finality listeners for pending transactions that may have lost their listeners due to node restarts, network interruptions, or other failures. This ensures that transactions eventually reach finality even after system disruptions.
The recovery system consists of three main components:
The Manager runs in the background and periodically scans for pending transactions that are eligible for recovery. It uses distributed locking (PostgreSQL advisory locks) to ensure only one replica in a multi-instance deployment performs recovery at a time.
Key features:
The Handler interface defines how individual transactions are recovered. The TTX service provides a concrete implementation (TTXRecoveryHandler) that:
The Storage interface abstracts database operations needed for recovery:
AcquireRecoveryLeadership: Obtains distributed lock for leader election
ClaimPendingTransactions: Atomically claims a batch of pending transactions, returning a lightweight RecoveryClaim (TxID + StoredAt) for each row. The recovery loop only needs these two fields, so the SQL projection is kept narrow
ReleaseRecoveryClaim: Releases claim after processing
SetStatus: Promotes a transaction to a terminal status. Used by the recovery loop to mark NotFound-past-grace-period rows as Orphan so they exit the eligible scan range without being conflated with ledger-rejected transactions (Deleted)

PostgreSQL is the recommended database for production multi-instance deployments:
UPDATE...RETURNING ensures no duplicate claims

SQLite is supported for single-node deployments and development:
Recovery behavior is controlled via configuration (see Configuration):
recovery:
  enabled: true             # Enable/disable recovery
  ttl: 30s                  # Minimum age before recovery
  scanInterval: 5s          # How often to scan
  batchSize: 100            # Max transactions per scan
  workerCount: 4            # Parallel workers
  leaseDuration: 30s        # Claim lease duration
  advisoryLockID: 8389...   # PostgreSQL lock ID
  instanceID: ""            # Instance identifier
  notFoundGracePeriod: 30m  # Promote NotFound rows to Orphan after this age (0 disables)
Creating a recovery manager:
config := recovery.Config{
	Enabled:             true,
	TTL:                 30 * time.Second,
	ScanInterval:        5 * time.Second,
	BatchSize:           100,
	WorkerCount:         4,
	LeaseDuration:       30 * time.Second,
	AdvisoryLockID:      8389190333894887286,
	NotFoundGracePeriod: 30 * time.Minute,
}

manager := recovery.NewManager(
	logger,
	storage, // Implements Storage interface
	handler, // Implements Handler interface
	config,
)

// Start recovery
if err := manager.Start(); err != nil {
	return err
}
defer manager.Stop()
To implement a custom recovery handler:
type MyHandler struct {
	// your dependencies
}

func (h *MyHandler) Recover(ctx context.Context, txID string) error {
	// 1. Query transaction status from your backend
	// 2. Apply finality logic based on status
	// 3. Update local database state
	// 4. Return nil on success, error on failure
	return nil
}
The manager claims a batch of pending transactions, each returned as a RecoveryClaim (TxID + StoredAt)
Each worker calls Handler.Recover() for its transactions
If recovery reports NotFound and the row was stored more than notFoundGracePeriod ago, the manager promotes the row to Orphan via SetStatus so it exits the eligible scan range

A token request transitions through the following statuses as the recovery loop interacts with it:
Pending: Eligible for ClaimPendingTransactions; the claim query and its supporting partial index filter on status = Pending.
Confirmed: The transaction reached finality successfully. Terminal.
Deleted: Rejected by the ledger (network.Invalid) or by local validation (token request hash mismatch via the finality listener). Terminal.
Orphan: Promoted after NotFound from the network past notFoundGracePeriod. Terminal in this version, and intentionally distinct from Deleted so operators (and future replay tooling) can identify broadcast failures separately from ledger-rejected transactions.

All three terminal statuses (Confirmed, Deleted, Orphan) are excluded from subsequent recovery sweeps because the claim query filters on status = Pending.
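The status rules above are small enough to express as a type. The sketch below is one hypothetical Go representation; the `Status` type and its method names are assumptions, while the four status names and the Pending-only claim filter come from the text.

```go
package main

// Status is a hypothetical Go representation of the statuses listed above.
type Status string

const (
	Pending   Status = "Pending"
	Confirmed Status = "Confirmed"
	Deleted   Status = "Deleted"
	Orphan    Status = "Orphan"
)

// IsTerminal reports whether the recovery loop will never claim the row again.
func (s Status) IsTerminal() bool {
	switch s {
	case Confirmed, Deleted, Orphan:
		return true
	}
	return false
}

// EligibleForClaim mirrors the status = Pending filter on the claim query.
func (s Status) EligibleForClaim() bool {
	return s == Pending
}
```

Keeping Orphan as its own constant rather than folding it into Deleted preserves the distinction between broadcast failures and ledger rejections.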
Rejected transactions: Marked as Deleted in the database
Orphan (NotFound past notFoundGracePeriod): Marked as Orphan to indicate the transaction never reached the ledger; distinct from Deleted so operators can distinguish broadcast failures from ledger-rejected transactions

Tuning, relative to the values shown in the configuration above:
To recover more transactions per unit of time: raise batchSize (200-500), raise workerCount (8-16), and lower scanInterval (2-3s)
To reduce load: lower batchSize (50), lower workerCount (2), and raise scanInterval (10-15s)
Raise ttl (60s or more) to give transactions longer before recovery considers them
Keep leaseDuration > expected processing time

The Manager is thread-safe and can be started and stopped from multiple goroutines. The Handler implementation must also be thread-safe, because multiple workers call it concurrently.