There are three primary components that we back up for each of our Managed Environments:
- Application Server configurations
- Database Server configuration and data
- Content (like SCORM and xAPI content)
Application Server Backups and DR:
- Application servers live in multiple, geographically separate data centers. We can lose an entire data center and your application servers will keep running.
- Application servers are monitored by their load balancers. The load balancer regularly connects to our health check APIs to verify that all of the system’s major components are online. If a health check fails, the instance is marked as unhealthy, terminated, and replaced by a fresh instance, and the support team is notified.
- Our continuous integration (CI) system builds your hosts from scratch every time we deploy an update. If we need to rebuild your application servers in the event of a disaster, we just rerun a CI job and new hosts are created from scratch.
- Config data for application servers is stored in Git repositories and region-specific Consul clusters. Consul Key/Value and service data is backed up prior to all changes.
- If we experience continuous failures in a particular environment, the Dev/Ops team will be alerted via Slack message, SMS, and an automated phone call to the on-call engineers.
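The health check and replacement behavior described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the actual load balancer or alerting code: the class `HealthMonitor`, the `check` callback, the `alert_threshold` of three consecutive failing rounds, and the notification strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HealthMonitor:
    """Toy model of a load balancer health check loop (hypothetical, not the real system)."""

    # Returns True when an instance's health check API reports all components online.
    check: Callable[[str], bool]
    # Assumption: number of consecutive failing rounds that counts as "continuous failures".
    alert_threshold: int = 3
    consecutive_failures: int = 0
    notifications: List[str] = field(default_factory=list)
    _next_id: int = 0

    def run_round(self, instances: List[str]) -> List[str]:
        """Check every instance once, replace unhealthy ones, and return the new fleet."""
        fleet: List[str] = []
        any_failed = False
        for name in instances:
            if self.check(name):
                fleet.append(name)
                continue
            # Unhealthy: terminate and replace with a fresh instance, then notify support.
            any_failed = True
            replacement = f"replacement-{self._next_id}"
            self._next_id += 1
            fleet.append(replacement)
            self.notifications.append(f"support: replaced {name} with {replacement}")
        # Track how many rounds in a row contained at least one failure.
        self.consecutive_failures = self.consecutive_failures + 1 if any_failed else 0
        if self.consecutive_failures >= self.alert_threshold:
            self.notifications.append("on-call: continuous failures (Slack, SMS, phone)")
        return fleet


if __name__ == "__main__":
    monitor = HealthMonitor(check=lambda name: name != "app-2")
    print(monitor.run_round(["app-1", "app-2"]))
    print(monitor.notifications)
```

The model keeps the two notification paths separate on purpose: a single failed check only notifies support, while repeated failing rounds escalate to the on-call engineers.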
Database Servers:
- Database servers live in multiple, geographically separate data centers in a hot spare configuration. If the primary master server fails its health checks, the system automatically fails over to the secondary master server in the alternate data center. The failed primary server marks itself as failed and alerts the Dev/Ops team. A Dev/Ops team member must then determine the root cause of the issue before we initiate the resync of the failed master to the new primary.
- Transaction logs are spooled from the primary master to the secondary master.
- Nightly disk-level snapshots are taken and stored in S3.
- Nightly MySQL dump backups are taken and stored in S3. These backups are kept in a separate location from the disk-level snapshot backups. We maintain these separate backups as a precaution in the event that a snapshot backup is corrupted or otherwise unrecoverable.
- If we experience a database failover event, the Dev/Ops team will be alerted via Slack message, SMS, and an automated phone call to the on-call engineers. The Dev/Ops team will then verify that the environment is stable on the secondary server before troubleshooting the issue further.
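The failover sequence above, including the rule that a failed master is only resynced after someone has determined the root cause, can be sketched as a small state model. This is an illustration under assumptions, not the real failover tooling: `HotSparePair`, its method names, and the alert text are hypothetical.

```python
from typing import List, Optional


class HotSparePair:
    """Toy model of a primary/secondary master pair (hypothetical, for illustration only)."""

    def __init__(self, primary: str, secondary: str) -> None:
        self.primary: str = primary
        self.secondary: Optional[str] = secondary
        self.failed_node: Optional[str] = None
        self.root_cause_confirmed: bool = False
        self.alerts: List[str] = []

    def on_health_check(self, primary_healthy: bool) -> str:
        """Fail over to the secondary if the primary is unhealthy; return the active primary."""
        if not primary_healthy and self.secondary is not None:
            # The failed primary marks itself as failed and the team is alerted.
            self.failed_node = self.primary
            self.primary, self.secondary = self.secondary, None
            self.alerts.append(
                f"failover: {self.failed_node} failed, {self.primary} is now primary"
            )
        # With no spare left, this model simply keeps reporting the current primary.
        return self.primary

    def confirm_root_cause(self) -> None:
        """Record that a team member has determined why the old primary failed."""
        self.root_cause_confirmed = True

    def resync_failed_node(self) -> None:
        """Bring the failed node back as the new secondary, but only after root cause."""
        if self.failed_node is None:
            raise RuntimeError("no failed node to resync")
        if not self.root_cause_confirmed:
            raise RuntimeError("root cause must be determined before resync")
        self.secondary = self.failed_node
        self.failed_node = None
        self.root_cause_confirmed = False


if __name__ == "__main__":
    pair = HotSparePair("db-a", "db-b")
    print(pair.on_health_check(primary_healthy=False))
    pair.confirm_root_cause()
    pair.resync_failed_node()
    print(pair.primary, pair.secondary)
```

Raising an error when the root cause is not yet confirmed mirrors the manual gate in the process: failover is automatic, but resync is not.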
Content Filestores:
- Per Amazon, S3 is designed for 99.999999999% (11 nines) durability. Yay for durability! By its nature, S3 is geographically distributed across multiple data centers.
- We store all course content in versioning-enabled S3 buckets. In the event that an object is deleted, we can restore from the previous version.
- The S3 buckets are backed up by a script each night that copies them to a secondary bucket. In the event of an error that accidentally damages or destroys the S3 bucket that holds the content, the content can be restored from this backup.
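To show how a deleted object can be recovered from a versioning-enabled bucket, here is a minimal sketch of the selection step. In a versioned bucket, deleting an object adds a delete marker as the latest entry while the earlier versions remain. The function below makes no AWS calls: it works on plain dictionaries whose fields (`Key`, `VersionId`, `IsLatest`, `LastModified`) are assumed to mirror the entries S3 returns when listing object versions. The function name `version_to_restore` and the sample data are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional


def version_to_restore(
    key: str,
    versions: List[Dict],
    delete_markers: List[Dict],
) -> Optional[str]:
    """Return the VersionId of the newest surviving version of `key` if the object
    is currently deleted, or None if it is not deleted or has no versions left."""
    # The object counts as deleted when its latest entry is a delete marker.
    is_deleted = any(m["Key"] == key and m["IsLatest"] for m in delete_markers)
    if not is_deleted:
        return None
    candidates = [v for v in versions if v["Key"] == key]
    if not candidates:
        return None
    # The most recently modified real version is the one to restore.
    return max(candidates, key=lambda v: v["LastModified"])["VersionId"]


if __name__ == "__main__":
    versions = [
        {"Key": "course.zip", "VersionId": "v1", "IsLatest": False,
         "LastModified": datetime(2024, 1, 1)},
        {"Key": "course.zip", "VersionId": "v2", "IsLatest": False,
         "LastModified": datetime(2024, 2, 1)},
    ]
    delete_markers = [
        {"Key": "course.zip", "VersionId": "d1", "IsLatest": True,
         "LastModified": datetime(2024, 3, 1)},
    ]
    print(version_to_restore("course.zip", versions, delete_markers))
```

In practice the chosen version would then be copied back into place or the delete marker removed. The nightly copy to a secondary bucket covers the separate case where the bucket itself is damaged or destroyed.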