# Technical Architecture & Infrastructure

CAT-BM is a Kubernetes-native solution designed to manage Canton's distributed ledger infrastructure with high availability, enterprise-grade security, and operational automation. It uses a custom Kubernetes Operator and CRDs to manage Canton components across environments. It is fully portable to AWS EKS, Microsoft Azure (AKS), Google Cloud (GKE), or on-premises Kubernetes/OpenShift.

Beyond infrastructure management, CAT-BM provides operational capabilities at the Canton/Daml application level, including managing parties and users, deploying DARs and client applications, configuring identity providers, performing backup and restore operations, pruning, monitoring, integrating with wallet providers, and handling upgrades.

## Overview


* [CAT-BM Key Technical Features](#id-1.-cat-bm-key-technical-features)
* [System Context](#id-2.-system-context)
* [Container Diagram](#id-3.-container-diagram)
* [Canton Network Topology](#id-4.-canton-network-topology)
* [Cloud Infrastructure](#id-5.-cloud-infrastructure)
* [High Availability](#id-6.-high-availability)
* [Disaster Recovery](#id-7.-disaster-recovery)

***

## 1. CAT-BM Key Technical Features

### High Availability (HA)

* All critical **Canton components** are deployed with high availability in mind, ensuring continuous operation, resiliency, and fault tolerance.
* **Multi-AZ Deployment**: All Canton services and dependencies can be deployed across multiple Availability Zones within a region. This prevents a single AZ failure from affecting system availability.
* **Multi-Region Redundancy (optional)**: Where required, deployments can be extended across multiple geographic regions, enabling disaster recovery and regional failover capabilities. Cross-region data replication and failover mechanisms ensure continuity in case of regional outages.
* **Self-Healing and Auto-Recovery**: The Kubernetes control plane monitors all workloads and will automatically reschedule or restart failed pods based on probe feedback or node availability, maintaining system integrity.
* **Rolling Updates and Zero Downtime Deployments**: Leveraging Kubernetes' rolling upgrade strategy, component updates and configuration changes are rolled out gradually with no service interruption. Traffic is only directed to healthy and ready pods during deployment.
* **Load Balancing and Traffic Routing**: Service traffic is distributed using internal Kubernetes load balancers, ensuring even distribution and automatic rerouting in the event of pod or node failures.
* **Pod Anti-Affinity Rules**: Canton components are scheduled with anti-affinity rules to distribute replicas across nodes and AZs, maximizing failure domain isolation.
* **Persistent Storage with Replication**: For components that require storage (e.g., Canton participants backed by PostgreSQL), persistent volumes are provisioned with multi-AZ replication support.
* **Kubernetes-Native Health Checks**:
  * Liveness probes are configured for all Canton pods to detect and restart non-responsive components automatically.
  * Readiness probes ensure that only fully initialized and healthy pods receive traffic.
  * Startup probes help manage services with longer initialization periods, preventing premature restarts during bootstrapping.
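The health-check and anti-affinity items in this list map onto standard Kubernetes pod-spec fields. As a rough sketch, the fragment below builds such a pod-spec excerpt as a Python dictionary and prints it as JSON. The pod label, container name, gRPC health port, and every threshold are illustrative assumptions, not CAT-BM defaults, and the manifests the CAT-BM operator actually generates may look different.

```python
import json

# Hypothetical values for illustration only; none are taken from CAT-BM.
APP_LABEL = {"app": "canton-participant"}  # assumed pod label
HEALTH_PORT = 5861                         # assumed gRPC health-check port


def grpc_probe(period_seconds: int, failure_threshold: int) -> dict:
    """Build a Kubernetes gRPC probe definition."""
    return {
        "grpc": {"port": HEALTH_PORT},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


def startup_budget_seconds(probe: dict) -> int:
    """Approximate time a container may take to start before it is restarted."""
    return probe["periodSeconds"] * probe["failureThreshold"]


pod_spec_fragment = {
    "affinity": {
        "podAntiAffinity": {
            # Do not schedule two replicas with the same label on one node.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": APP_LABEL},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    },
    # Spread replicas evenly across availability zones.
    "topologySpreadConstraints": [
        {
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": APP_LABEL},
        }
    ],
    "containers": [
        {
            "name": "canton",  # assumed container name
            "startupProbe": grpc_probe(period_seconds=10, failure_threshold=30),
            "livenessProbe": grpc_probe(period_seconds=10, failure_threshold=3),
            "readinessProbe": grpc_probe(period_seconds=5, failure_threshold=2),
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(pod_spec_fragment, indent=2))
```

With these example numbers the startup probe tolerates roughly 30 × 10 = 300 seconds of initialization before Kubernetes restarts the container, which is the mechanism behind "preventing premature restarts during bootstrapping".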

### Enterprise-Grade Security

* Encrypted persistent storage (**EBS**) and encrypted communication channels (**TLS/mTLS**)
* **Secure secret management** via Azure Key Vault, AWS Secrets Manager, GCP Secret Manager, or Kubernetes Secrets
* Integration with enterprise **Key Management Services (KMS)** such as AWS KMS, GCP KMS, or HashiCorp Vault for encryption key lifecycle management
* Support for integration with **external Wallet-as-a-Service providers** for secure signing and transaction management
* Support for various **OIDC-compliant Identity Providers** (e.g., Keycloak, Okta, Azure AD, Auth0, Ping Identity) for authentication and authorization
* Fine-grained **access control** using scopes/claims mapping from OIDC or SAML identity providers
* **Role-based access control (RBAC)**
* Compliance with **enterprise security standards** (SOC2, ISO 27001, GDPR readiness, etc.)
* **Network-level security**: IP allowlisting, VPN, and firewall/WAF integration options
* **Security audit logging** with integration into SIEM platforms (Splunk, ELK, Datadog)
* **Regular security patching and vulnerability scanning** (container images, dependencies)

### GitOps-Driven Operations

* All infrastructure and deployment configurations are defined declaratively using **Helm charts** and managed through **GitOps practices**.
* **ArgoCD** is used to synchronize desired state from Git repositories to the Kubernetes cluster, enabling traceable, auditable, and consistent deployment pipelines across dev, test, staging, and production environments.
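To make the GitOps flow concrete, the sketch below generates one Argo CD `Application` manifest per environment from a single template. The field names follow the Argo CD `Application` resource; the repository URL, chart path, values-file names, and namespaces are hypothetical placeholders, since the actual CAT-BM repository layout is not described here.

```python
import json

ENVIRONMENTS = ["dev", "test", "staging", "production"]


def argocd_application(env: str, repo_url: str, revision: str) -> dict:
    """Return an Argo CD Application manifest for one environment."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"canton-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                "path": "charts/canton",  # hypothetical Helm chart path
                "helm": {"valueFiles": [f"values-{env}.yaml"]},  # hypothetical file names
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": f"canton-{env}",  # hypothetical namespace scheme
            },
            # Keep the cluster converged on what Git declares.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    # The repository URL is a placeholder, not a real CAT-BM repository.
    repo = "https://git.example.com/platform/canton-deployments.git"
    for env in ENVIRONMENTS:
        print(json.dumps(argocd_application(env, repo, "main"), indent=2))
```

Generating every environment from one template is one way to keep dev, test, staging, and production structurally identical, so that only the per-environment values file differs.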

### Infrastructure-as-Code (IaC) Approach

* All cloud infrastructure is defined declaratively using **Infrastructure-as-Code tools** (e.g., Terraform), enabling reproducible and version-controlled provisioning.
* IaC configurations are stored in Git repositories, supporting **GitOps-style workflows** for infrastructure changes.
* **Automated pipelines** apply changes, ensuring traceable, auditable, and consistent deployments across development, test, staging, and production environments.
* Supports **modular and reusable configurations** to standardize resources, promote best practices, and reduce configuration drift.

### Automated Lifecycle Management

* CatalyX handles the **full automation** of node provisioning, dependency management, certificate distribution, and topology orchestration.
* **Built-in support** for scaling, patching, and configuration updates through CRDs.
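Driving scaling and patching through CRDs means an operator compares the desired state declared in a custom resource with what is actually running, then acts on the difference. The sketch below illustrates that comparison only. The `ParticipantState` fields and the action strings are hypothetical; the real CAT-BM CRD schema and operator logic are not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParticipantState:
    """Hypothetical subset of fields a Canton participant resource might carry."""
    replicas: int
    image_tag: str


def plan_changes(desired: ParticipantState, observed: ParticipantState) -> list[str]:
    """List the actions needed to move the observed state to the desired state."""
    actions = []
    if desired.image_tag != observed.image_tag:
        actions.append(f"rolling update to {desired.image_tag}")
    if desired.replicas != observed.replicas:
        actions.append(f"scale from {observed.replicas} to {desired.replicas}")
    return actions


if __name__ == "__main__":
    desired = ParticipantState(replicas=3, image_tag="v2")
    observed = ParticipantState(replicas=2, image_tag="v1")
    for action in plan_changes(desired, observed):
        print(action)
```

An empty plan means the cluster already matches the declaration, so running the comparison repeatedly is harmless. That idempotence is what makes a declarative, CRD-driven workflow safe to retry.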

### Monitoring and Observability

* Integrated with **Prometheus and Grafana** for real-time metrics collection, visualization, and alerting.
* Logs are aggregated using **Fluent Bit** and forwarded to a centralized log store (e.g., Loki).
* Health checks, liveness/readiness probes, and custom metrics are exposed for **proactive monitoring and incident response**.
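As an illustration of the alerting side, the sketch below assembles a Prometheus alerting-rule file as a Python dictionary and prints it as JSON, which Prometheus's YAML parser can generally read as-is. The rule-file structure (`groups`, `rules`, `alert`, `expr`, `for`) is standard Prometheus; the alert name, `job` label, duration, and severity are assumptions made for the example rather than rules shipped with CAT-BM.

```python
import json


def alert_rule(name: str, expr: str, hold_for: str, severity: str, summary: str) -> dict:
    """Build one Prometheus alerting rule."""
    return {
        "alert": name,
        "expr": expr,
        "for": hold_for,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


# `up` is Prometheus's built-in scrape-health metric (1 = scrape succeeded,
# 0 = scrape failed). The job label and thresholds below are hypothetical.
rule_file = {
    "groups": [
        {
            "name": "canton-availability",
            "rules": [
                alert_rule(
                    name="CantonNodeDown",
                    expr='up{job="canton-participant"} == 0',
                    hold_for="2m",
                    severity="critical",
                    summary="Canton participant {{ $labels.instance }} is not responding to scrapes",
                )
            ],
        }
    ]
}

if __name__ == "__main__":
    print(json.dumps(rule_file, indent=2))
```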

### Scalability and Extensibility

* Horizontal and vertical **pod autoscaling** based on resource usage
* Support for **multi-tenant deployments** and workload isolation via Kubernetes namespaces and network policies
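For horizontal autoscaling, Kubernetes derives the replica count from the ratio of observed to target utilization. The sketch below reproduces that core formula in simplified form (it ignores the tolerance band and stabilization windows that the real controller applies) and builds a matching `autoscaling/v2` manifest. Which workloads CAT-BM autoscales is not stated here, so the target name, namespace, and thresholds are placeholders.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified HPA rule: ceil(current * observed / target), clamped to bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


def hpa_manifest(name: str, namespace: str, min_replicas: int,
                 max_replicas: int, cpu_percent: int) -> dict:
    """Build a HorizontalPodAutoscaler that scales a Deployment on CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization", "averageUtilization": cpu_percent},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # "example-api" and "example-namespace" are placeholder names.
    manifest = hpa_manifest("example-api", "example-namespace", 2, 6, 60)
    print(manifest["spec"]["metrics"][0]["resource"]["target"])
    print(desired_replicas(2, 90, 60, 2, 6))
```

For example, two replicas running at 90% CPU against a 60% target yield ceil(2 × 90 / 60) = 3 replicas, while a lightly loaded workload is never scaled below `minReplicas`.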

### Supported Versions

* Canton Protocol v3.40 or higher
* Daml v3.40 or higher

## 2. System Context

CAT-BM sits between human operators and the Canton DLT, orchestrating deployment, operations, and monitoring of Canton Nodes, Canton Applications, and their supporting Identity Providers. It runs on Kubernetes (AWS EKS as reference; AKS, GKE, on-prem K8s, or OpenShift all supported) atop the customer's chosen cloud or on-premises infrastructure. CAT-BM's interactions with the Canton DLT span **Deployment**, **Operations**, and **Monitoring**.

| Software System             | Functional Responsibilities                                                     | Interfaces                             |
| --------------------------- | ------------------------------------------------------------------------------- | -------------------------------------- |
| Catalyst Blockchain Manager | Canton/Daml infrastructure provisioning, application deployment and operations. | Web UI, HTTP API, K8s Custom Resources |
| Identity Provider           | User management, RBAC.                                                          | Web UI, API (OAuth/OIDC)               |
| Canton Nodes                | Distributed ledger                                                              | REST API, gRPC API, TCP                |
| Canton Applications         | End-user Daml applications                                                      | Web UI, HTTP API                       |

Catalyst Blockchain Manager for Canton: System Context

## 3. Container Diagram

| Component                  | Functional Responsibilities                                          | Interfaces                                                              |
| -------------------------- | -------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| CAT-BM Canton UI           | User interface for Canton infrastructure and application operators   | Web UI                                                                  |
| CAT-BM Canton API          | Backend API for the user interface and third-party integrations      | REST API                                                                |
| CAT-BM Canton Operator     | Kubernetes operator for Canton deployment operations                 | Kubernetes custom resource definitions (CRDs) for Canton infrastructure |
| Canton Nodes               | Canton DLT infrastructure                                            | gRPC API, REST API                                                      |
| Identity Provider UI       | User management UI                                                   | Web UI                                                                  |
| Identity Provider API      | API for OIDC-based authentication                                    | REST API                                                                |
| Identity Provider Database | Stores users and RBAC configuration                                  | ODBC                                                                    |

**Interaction flow**

1. The Canton operator (a human user, not to be confused with the CAT-BM Canton Operator component) uses the CAT-BM UI to operate the Canton network.
2. The UI and the human operator request access tokens from the Identity Provider.
3. The UI calls the CAT-BM API over REST for operations and subscribes to events via SSE (Server-Sent Events).
4. The API performs CRUD operations on Custom Resources (CRs) via the Kubernetes API and watches events from CRs.
5. The CAT-BM Canton Operator watches events from CRs and their dependencies, reconciles cluster state, and reconciles external state with retries and delays on failure.
6. The Kubernetes API schedules provisioning of Canton Nodes based on the CRs. Nodes run with their supporting resources (services, deployments, PVCs, ingresses).

## 4. Canton Network Topology

A typical CAT-BM deployment consists of a **Private Canton Subnet** that interoperates selectively with the **Global Canton Network** through a single **Bridge Validator**.

### Private Canton Subnet

The private subnet is an isolated Canton environment operated by the organisation. It contains a **Private Sync Domain** (a dedicated sync domain that coordinates transaction sequencing for participants within the subnet) and one or more **Participant Nodes**, each representing an entity within the consortium or organisation. Participants can transact with each other privately without exposing transaction data externally.

### Bridge Validator

One validator node serves a dual role: it is connected to both the private sync domain and the global sync domain, acting as a controlled gateway between the two. This bridge validator allows specific assets, contracts, or parties to be visible and interoperable across domains while keeping all other private subnet data isolated.

### Global Canton Network Integration

The global sync domain is a publicly reachable sync domain that enables interoperability between participants worldwide.
The bridge validator is granted membership to the global sync domain through a **Super Validator**, an authoritative trusted validator that handles permissioning and trust establishment for new validators joining the global network.

| Flow                       | Description                                                                                                                                            |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Private Transactions**   | Participants in the private sync domain transact securely and privately.                                                                               |
| **Selective Exposure**     | The bridge validator can selectively expose certain contracts or parties to the global Canton network.                                                 |
| **Trust Establishment**    | The Super Validator ensures the bridge validator is recognised and trusted in the global sync domain.                                                  |
| **Cross-Domain Workflows** | Once onboarded, workflows can span both the private subnet and the global Canton network, enabling collaboration without compromising confidentiality. |

## 5. Cloud Infrastructure

The platform is deployed on **AWS EKS** (reference setup) and is fully portable to Microsoft Azure (AKS), Google Cloud (GKE), or on-premises Kubernetes/OpenShift. The architecture supports flexible database options, automated provisioning, comprehensive observability, and efficient traffic management to ensure high availability and smooth operations.

{% tabs %}
{% tab title="Infrastructure" %}
Canton nodes are hosted on managed Kubernetes (EKS), providing a scalable and managed environment that simplifies cluster operations and enhances reliability.
{% endtab %}

{% tab title="Database" %}
PostgreSQL can be deployed either within the Kubernetes cluster for tight coupling and easier management or externally via Azure Database for PostgreSQL with multi-AZ replication, automated backups, and managed maintenance.
{% endtab %}

{% tab title="Deployment" %}
The EKS infrastructure is provisioned and managed using Terraform, integrated into an automated deployment pipeline to ensure consistent, repeatable, and auditable infrastructure changes.
{% endtab %}

{% tab title="Ingress" %}
Traefik serves as the ingress controller, efficiently managing incoming traffic with support for dynamic routing, load balancing, and secure TLS termination, enhancing application availability and security.
{% endtab %}

{% tab title="Observability" %}
Observability in the Canton environment is implemented using Grafana, Prometheus, and Loki, providing a full-stack view of system health, performance, and events. This setup enables operators to monitor, analyze, and troubleshoot the system effectively.

**Key aspects include:**

* **Metrics Collection and Visualization:**\
  Prometheus collects detailed metrics from Canton nodes, databases, and supporting infrastructure. Grafana dashboards provide real-time visualization, enabling quick insight into transaction throughput, node health, latency, and resource utilization. Canton-specific dashboards include views for participant nodes, domain nodes, transaction processing rates, and ledger states.
* **Centralized Logging:**\
  Loki aggregates logs from all Canton components, including participants, domains, and connectors. Centralized logging ensures that errors, warnings, and system events can be traced quickly, supporting faster root cause analysis. Logs are searchable by node, timestamp, or component, providing deep insights into system behavior.
* **Alerting and Notifications:**\
  Prometheus Alertmanager integrates with Canton observability to provide automated alerts based on defined thresholds or anomalies. Typical alerts cover node failures, high transaction latencies, resource exhaustion (CPU, memory, disk), or replication issues. Alerts can be routed to email, Slack, PagerDuty, or other incident management systems, enabling rapid response.
* **Custom Canton Dashboards:**\
  Grafana dashboards are customized for Canton deployments to provide operators with domain-specific views, such as participant node performance, consensus progress, transaction conflict rates, and network latency between nodes. This ensures that teams can monitor critical business operations in real time.
* **Historical Analysis and Reporting:**\
  Collected metrics and logs are retained for historical analysis, enabling trend detection, capacity planning, and post-incident review. Operators can correlate metrics and logs to understand performance bottlenecks and optimize node configurations.
* **Extensibility:**\
  The observability stack can be extended to include additional monitoring tools or custom metrics specific to business logic implemented on top of Canton nodes.
{% endtab %}
{% endtabs %}

### 5.1 Cloud Infrastructure Diagram

| Service                       | Type                       | Backup                                                |
| ----------------------------- | -------------------------- | ----------------------------------------------------- |
| EKS                           | Managed Kubernetes service | Cluster configuration stored in a Terraform repository |
| Azure VM                      | Compute engine             | Azure Backup                                          |
| Azure Key Vault               | Secrets management         | Secrets replicated across multiple Azure regions      |
| Azure Database for PostgreSQL | Managed database service   | Multi-AZ HA, automated backups                        |
| Azure Managed Disks           | Storage                    | Azure Backup                                          |
| Azure DNS                     | DNS                        | -                                                     |
| Azure Load Balancer           | Load balancer              | -                                                     |

## 6. High Availability

IntellectEU offers node hosting services backed by a 99.9% availability SLA, leveraging resilient cloud infrastructure designed for high availability.

### 6.1 Multi-Zone and Multi-Region Redundancy

The system is designed with redundancy at both the availability zone and regional levels to ensure service continuity. In lower environments, deployment is limited to a single region with multiple availability zones for zone-level fault tolerance. In the production environment, high availability is extended to a multi-region active-passive (or standby) setup, with each region containing two availability zones and nodes arranged in an active-passive pattern for intra-region failover.

Kubernetes health checks and liveness probes trigger automatic restarts. In case of node failure, workloads are rescheduled to healthy nodes automatically.

Depending on requirements, the secondary region can operate in active-passive mode (passive backup with minimal or no running workloads until failover) or standby mode (warm standby with ongoing data replication for faster recovery).

Failover between regions can be configured to occur automatically, which minimizes downtime, or handled manually, which gives operators full control over when and how the switchover happens. The choice depends on business, compliance, and operational requirements.

### 6.2 HA Implementation for Canton Nodes

Canton nodes are deployed in high availability (HA) mode using Kubernetes Custom Resources (CRs) and a dedicated Operator. The Custom Resources define the desired state and configuration of Canton components, such as nodes and applications, allowing declarative management within Kubernetes. The Canton Operator continuously monitors these CRs and automates lifecycle management tasks, including deployment, scaling, upgrades, and failover handling. This approach simplifies complex HA deployments by encapsulating operational logic, ensuring consistency across the cluster, and improving resilience by automatically responding to node or zone failures.

## 7. Disaster Recovery

### 7.1 Disaster Recovery Strategy

We maintain a comprehensive Disaster Recovery (DR) plan and procedures covering Canton nodes, node identity keys, Kubernetes infrastructure, and supporting Azure services, to ensure rapid restoration of Canton node services in the event of catastrophic failure, data corruption, or infrastructure outage.

The disaster recovery strategy for Canton ensures continuity of operations through multiple recovery mechanisms:

* **Database Backup Restoration**: In the event of a failure, Canton nodes can be restored from regular database backups to recover state and resume operation with minimal data loss.
* **Regional Failover**: Depending on requirements, workloads can be failed over to a secondary region operating in either passive mode (cold standby with minimal resources until activated) or standby mode (warm standby with ongoing data replication for faster recovery).
* **Fallback from Node Identity Dumps**: If database backups are unavailable or corrupted, Canton nodes can be reinitialized from node identity dumps, ensuring the network can be reconstructed and operations resumed.

The DR plan is reviewed and tested at least annually.

### 7.2 Recovery Objectives

1. **Availability Zone failure**: RPO = 0, RTO = 1 min.
2. **Region failure**: RPO < 2 min, RTO = 2 hours. The RPO depends on asynchronous replica lag and cannot be strictly guaranteed; the lag may range from seconds to minutes. Promotion/failover to the replica is a DR procedure.
3. **Data corruption**: average RPO < 2 min (not strictly guaranteed), RTO = 2 hours. For a bad deployment or operator error, recover via point-in-time recovery (PITR) to just before the incident.
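The recovery objectives in section 7.2 can be expressed as data and checked against measurements. The sketch below encodes the three scenarios and tests whether an observed replication lag stays within the stated RPO. It also shows one way to pick a PITR target just before an incident. The function names and the 30-second safety margin are illustrative assumptions, not part of the documented DR procedure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RecoveryObjective:
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated time to restore service


# Values copied from the recovery objectives in section 7.2.
OBJECTIVES = {
    "availability_zone_failure": RecoveryObjective(timedelta(0), timedelta(minutes=1)),
    "region_failure": RecoveryObjective(timedelta(minutes=2), timedelta(hours=2)),
    "data_corruption": RecoveryObjective(timedelta(minutes=2), timedelta(hours=2)),
}


def within_rpo(scenario: str, replica_lag: timedelta) -> bool:
    """Return True if the observed replication lag keeps data loss within the RPO."""
    rpo = OBJECTIVES[scenario].rpo
    if rpo == timedelta(0):
        # RPO = 0 tolerates no lag at all.
        return replica_lag == timedelta(0)
    # The documented objectives are strict inequalities (RPO < 2 min).
    return replica_lag < rpo


def pitr_target(incident_at: datetime,
                margin: timedelta = timedelta(seconds=30)) -> datetime:
    """Pick a recovery point shortly before the incident (the margin is an assumption)."""
    return incident_at - margin


if __name__ == "__main__":
    print(within_rpo("region_failure", timedelta(seconds=45)))
    incident = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(pitr_target(incident).isoformat())
```

Because the region-failure RPO depends on asynchronous replica lag, a check like `within_rpo` is most useful when run continuously against the measured lag, so that a growing lag is noticed before a failover is needed.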