Building a Resilient Database for Fastly's CA

Staff Site Reliability Engineer, Fastly

September 27, 2023

This is the latest in a series of posts exploring how we built Certainly, Fastly’s new publicly trusted Certification Authority. Previously we’ve described some of the architectural decisions behind Certainly. Today we’ll explore one of the major challenges stemming from the decision to create a cloud-like ephemeral environment in which systems are regularly and automatically destroyed and rebuilt.

Like every other CA, Certainly has a robust, reliable, and resilient database at the heart of its operations. Its database architecture allows for efficient storage and retrieval of certificate data and can scale to meet growing business needs. We currently use MariaDB with the InnoDB storage engine, with replication enabled across our data centers. Replication gives us data redundancy and the confidence that we can recover easily in case of disaster. But this wasn’t always the case. Before adopting MariaDB replication, we relied on a MariaDB Galera Cluster setup, which proved to be painful, operationally burdensome, and unstable. Every day we would find ourselves facing new problems, which eventually necessitated the transition to MariaDB replication. Here are some of the benefits of this decision:

1. Reduced Complexity

Galera Cluster’s multi-primary architecture required all the database nodes across all of our facilities to be in constant communication to ensure data consistency and synchronization. MariaDB replication, by contrast, uses a single primary with multiple replicas. Having a single database node act as the write primary significantly simplifies setup and maintenance. On top of that, the configuration is easier and carries less operational overhead. With the decreased complexity, we have seen a significant drop in the number of alerts and issues, which in turn has led to a substantial improvement in the effectiveness and morale of the Certainly SRE team.

2. Dynamic Scalability

As we were getting ready to launch, we had to ensure that scalability would not become an issue as we grew our user base. Although Galera Cluster provided us with synchronous multi-primary replication, it came with limitations. Adding new nodes to the existing cluster was a complex and time-consuming process. In contrast, MariaDB replication gives us the flexibility to add new read replicas as needed without impacting the existing primary’s operations. This lets us scale our read operations horizontally without placing a burden on the team.
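To make the read-scaling idea concrete, here is a minimal sketch of how a client-side router might send writes to the single primary and spread reads across replicas. It is an illustration only, not Certainly's actual tooling: the `Router` class, the round-robin policy, and the host names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Router:
    """Hypothetical router: writes go to the primary, reads rotate over replicas."""
    primary: str
    replicas: List[str] = field(default_factory=list)
    _next: int = 0  # index of the next replica to use for a read

    def add_replica(self, host: str) -> None:
        # Adding a read replica changes nothing about how writes are routed.
        self.replicas.append(host)

    def remove_replica(self, host: str) -> None:
        # For example, when a replica is taken down for a rebuild.
        self.replicas.remove(host)

    def route(self, is_write: bool) -> str:
        # Writes always go to the primary; so do reads when no replica is available.
        if is_write or not self.replicas:
            return self.primary
        host = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return host


router = Router(primary="db-primary", replicas=["db-replica-1"])
router.add_replica("db-replica-2")
print(router.route(is_write=True))   # db-primary
print(router.route(is_write=False))  # db-replica-1
print(router.route(is_write=False))  # db-replica-2
```

In this sketch, adding or removing a replica only changes where reads land; the write path to the primary is untouched, which mirrors the property described above.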

3. Use Case-based Replicas

Thanks to the new database design, we can create additional replicas for specific use cases such as read operations, backups, reporting and analytics, and more. Unlike with Galera, adding or removing nodes no longer affects operations, so the primary continues operating as efficiently as before.

4. Ephemerality

Certainly embraces the concept of ephemerality wholeheartedly. What that means for our database nodes is that they are rebuilt from scratch on a regular cadence. The general notion is that ephemerality applies to stateless components, whereas databases are designed to store and persist data. We set out to challenge that notion with an infrastructure design that gives us more agility, scalability, and resilience. As stated earlier, Galera needs its nodes to be in constant communication, which did not sit well with the ephemeral nature of our infrastructure. During rebuilds, nodes were frequently exiting and rejoining the cluster, triggering Galera to recalculate quorum. This, coupled with known factors such as backups and network connectivity, along with some causes we never identified, resulted in Galera losing quorum and shutting down more often than expected. If the database is down, Certainly is down. With the new design, we can take down the replicas for rebuilds without worrying about an adverse effect on the primary node, which makes the system considerably more robust.
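The quorum problem can be illustrated with a little arithmetic. The sketch below assumes the simple majority rule, in which the surviving nodes must make up strictly more than half of the last known membership. It deliberately ignores node weights and nodes that leave gracefully, and the helper name `has_quorum` is hypothetical rather than part of any Galera API.

```python
def has_quorum(reachable_nodes: int, last_known_size: int) -> bool:
    """Simplified majority rule: strictly more than half of the last known membership."""
    if last_known_size <= 0:
        raise ValueError("last_known_size must be positive")
    return 2 * reachable_nodes > last_known_size


# A three-node cluster loses one node abruptly during a rebuild: 2 of 3 remain.
print(has_quorum(2, 3))  # True

# The membership is now 2. A network blip isolates one more node: 1 of 2 remains.
print(has_quorum(1, 2))  # False
```

Under these assumptions, a cluster that has already shrunk because of a rebuild has very little margin left, so an otherwise minor event such as a backup stall or a brief network interruption can be enough to lose quorum.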

5. Effortless Failovers

To manage failovers, we employ MariaDB Orchestrator, an open-source solution. Around Orchestrator, we have built specialized tools that perform health checks, detect failures and performance degradation, and trigger failovers as necessary. Orchestrator maintains an up-to-date and accurate view of the database cluster's topology, including the role of each node (primary or replica) and the relationships between nodes. This information helps us manage failovers and keep the database in good shape. It also helps reduce the possibility of data loss and divergence during a failover.
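As a rough illustration of why an accurate topology view matters during failover, the sketch below picks a promotion candidate from a list of replicas by choosing the healthy replica that has applied the most changes from the primary. This is not Orchestrator's actual algorithm or API; the `Replica` type, the `applied_seq` field, and the selection rule are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool
    applied_seq: int  # hypothetical counter of changes applied from the primary


def pick_promotion_candidate(replicas: Iterable[Replica]) -> Optional[Replica]:
    """Return the healthy replica that is furthest along, or None if none is healthy."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None
    # Promoting the most up-to-date replica minimizes the changes that could be lost.
    return max(healthy, key=lambda r: r.applied_seq)


topology = [
    Replica("db-replica-1", healthy=True, applied_seq=1040),
    Replica("db-replica-2", healthy=True, applied_seq=1042),
    Replica("db-replica-3", healthy=False, applied_seq=1050),
]
candidate = pick_promotion_candidate(topology)
print(candidate.name if candidate else "no candidate")  # db-replica-2
```

The unhealthy replica is skipped even though it reports the highest position, which is the kind of decision that depends on having both health and replication state available in one place.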

6. Future Flexibility

With our current DB design, we have the flexibility to expand and change along with the PKI industry's constantly evolving landscape.

To conclude, the transition from Galera to MariaDB replication was a pivotal decision that has given us enhanced scalability, improved performance, and simplified operations. It was an informed choice that has positioned us well to face new challenges. We feel confident that we are equipped with the tools necessary to evolve and expand our reach to global customers.