High-availability Namespaces
Temporal Cloud offers "Reliability-as-a-Service" to support mission-critical deployments that must be highly available. Data loss and disruptions to Workflows can severely impact your business. Replication is the critical component of highly available Namespaces. It protects applications against outages and downtime. High availability creates a fallback "replica" that can take over during service incidents. This keeps your Workflows running and your data available.
Replication and failovers
Temporal Cloud’s replicated Namespaces provide disaster-tolerant availability for critical workloads. When you enable replication, Temporal Cloud syncs your data and Workflows between an active and a replica Namespace. When an incident occurs, Temporal automatically fails over your Namespace.
Your Workflow Executions and Schedules seamlessly transition from the active Namespace to a standby domain. This standby domain is called a replica, as it replicates the Workflows and data of the active Namespace. Once the incident resolves, the Namespaces reconcile and control returns to the original Namespace.
A high availability Namespace creates a single logical Namespace that operates across two domains: one active and one standby. Replicated Namespaces combine access to both domains behind a unified Namespace endpoint. As Workflows progress in the active Namespace, Temporal Cloud replicates History events to the standby domain, ensuring continuity and data integrity.
During an incident or outage in the active domain, Temporal Cloud seamlessly fails over to your replica. Failovers allow existing Workflow Executions to continue running and new Workflow Executions to be started. Once failover occurs, the replica becomes active. After the issue is resolved, the active replica "fails back" and the original Namespace resumes being "active". Temporal resumes replication from the original active Namespace to the replica.
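The role swap described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the names `Domain`, `ReplicatedNamespace`, `record`, and `fail_over` are hypothetical and not part of any Temporal API, and replication is modeled as an immediate copy, so the replication lag discussed elsewhere on this page is not represented.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """One isolation domain holding a copy of the Namespace's History events."""
    name: str
    history: list[str] = field(default_factory=list)


class ReplicatedNamespace:
    """Toy model: one logical Namespace backed by an active and a standby domain."""

    def __init__(self, active: Domain, replica: Domain) -> None:
        self.active = active
        self.replica = replica

    def record(self, event: str) -> None:
        # Events are written to the active domain and copied to the standby.
        # Assumption: the copy is immediate (no replication lag is modeled).
        self.active.history.append(event)
        self.replica.history.append(event)

    def fail_over(self) -> None:
        # Swap roles: the replica becomes active and the former active
        # becomes the standby. Failing back is the same swap in reverse.
        self.active, self.replica = self.replica, self.active


# Hypothetical domain names, used only for illustration.
ns = ReplicatedNamespace(Domain("domain-a"), Domain("domain-b"))
ns.record("WorkflowExecutionStarted")
ns.fail_over()                       # incident in domain-a: domain-b is now active
ns.record("ActivityTaskScheduled")   # work continues on the new active domain
ns.fail_over()                       # incident resolved: fail back to domain-a
print(ns.active.name, ns.active.history)
```

Because callers only ever talk to the `ReplicatedNamespace` object, they do not need to know which domain is currently active, which mirrors the unified Namespace endpoint described above.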
Under the hood
An isolation domain is a physically isolated data center within a deployment region for a given cloud provider. Regions consist of multiple isolation domains. Isolation domains provide redundancy and fault tolerance.
A replicated Namespace consists of an active Namespace and a passive, fallback replica. Depending on your setup, your replica may reside in the same region as your active Namespace (standard replication), or it may be located in an entirely different region (multi-region replication).
After a failover, the replica takes on the active role until the incident is resolved. Afterward, the replica fails back and the original Namespace resumes the active role.
Temporal Cloud’s high availability features:
- No manual deployment or configuration needed, just simple push-button operations.
- Existing Workflows resume seamlessly in the replica with minimal interruption and data loss.
- No changes needed for Worker and Workflow code during setup or failover.
- 99.99% contractual SLA.
Types of high availability
Temporal currently offers the following high availability features. Configure these from your Namespace:
- Replication: Workflows are seamlessly replicated to a different isolation domain within the same region as the Namespace, such as "us-east-1". Choose this option for applications architected for a single region. Your Namespaces fail over to an isolation domain within the same region.
- Multi-region replication: Workflows are seamlessly replicated to a different region that you choose. Choose this option when your business requires multi-regional availability and the higher level of resilience that separated locations offer. You will fail over from one region to a separate region.
Replication charges apply when you enable high availability. For pricing details, visit the Temporal Cloud Pricing page.
Should you choose high availability?
Should you be using high availability Namespaces? It depends on your availability requirements:
- High availability Namespaces offer a 99.99% Service Level Agreement (SLA) for workloads with strict high availability needs. They use two Namespaces in two isolation domains to support standby recovery. In the event of an incident, Temporal Cloud automatically fails over the Namespace to the standby replica. The 99.99% availability of high availability Namespaces is enforced by Temporal Cloud's service error rates SLA.
- Namespaces without high availability include a 99.9% contractual Service Level Agreement (SLA). In this configuration, Temporal clients connect to a single Namespace in one deployment domain. For many applications, this offers sufficient availability.
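To get a feel for the difference between the two SLA levels, it can help to convert each percentage into an equivalent time budget. The sketch below is for intuition only: the SLA described above is based on service error rates rather than minutes of downtime, and the function name `downtime_budget_minutes` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(sla_percent: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the unavailability a given availability percentage allows over a period."""
    return period_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.99):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.1f} minutes per year")
```

Under this simplified reading, 99.9% corresponds to roughly 525.6 minutes (about 8.8 hours) per year, while 99.99% corresponds to roughly 52.6 minutes per year, a tenfold difference.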
Temporal Cloud provides 99.99% service availability for all Namespaces, both single-region and high availability. Our system is designed to limit data loss after recovery when the incident triggering the failover is resolved.
- Our recovery point objective (RPO) is near-zero. There may be a short period of time during an incident or forced failover when some data is unavailable in the replica. Some Workflow History data won't arrive until network issues are fixed, enabling the History to finish replicating and the divergent History branches to reconcile.
- Temporal Cloud proactively responds to incidents by triggering failovers. Our recovery time objective (RTO) is 20 minutes or less per incident.
During a disaster scenario in which the data on the hard drives in the active Namespace cannot be recovered, the duration of data loss may be as high as the replication lag at the time of disaster.
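The relationship between replication lag and potential data loss can be sketched as follows. This is a simplified model, not a description of Temporal Cloud internals: it assumes every event committed at or before `disaster_time - replication_lag` has already reached the replica, and the function name `events_at_risk` and the sample timestamps are hypothetical.

```python
def events_at_risk(commit_times: list[float],
                   disaster_time: float,
                   replication_lag: float) -> list[float]:
    """Return commit times of events not yet replicated when the disaster hits.

    Assumption: anything committed at or before (disaster_time - replication_lag)
    has reached the replica; later events exist only in the active domain.
    """
    replicated_up_to = disaster_time - replication_lag
    return [t for t in commit_times if replicated_up_to < t <= disaster_time]


# Hypothetical commit times in seconds; disaster at t=13.0 with 1.0 s of lag.
at_risk = events_at_risk([10.0, 11.5, 12.2, 12.9],
                         disaster_time=13.0,
                         replication_lag=1.0)
print(at_risk)  # events committed in the final second before the disaster
```

In this example, only the two events committed within the last second (at 12.2 and 12.9) fall inside the loss window. A smaller replication lag shrinks that window accordingly.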