When we started looking at how to host SCORM Engine most efficiently, we first considered the set of common problems that all Engine installations face, namely:
- How do we spec our environments for a given number of users? What is a “user” anyway?
- How do we deal with traffic spikes? Our learners tend to hit the system all at once.
- How do we handle content delivery to people around the world? We want Australian users to have an experience that’s just as good as what our North American users get.
- If Engine is offline, our client's LMS might as well be offline. How do we build a High Availability SCORM Engine environment without it costing the Earth?
- How do we build a secure environment that’s also economical?
What’s a user? How many can we support?
When folks are integrating with SCORM Engine, a question that always comes up is “how do I build a system that can serve X number of users?” This is a great question, and sadly the answer is to ask more questions. <audience groans>
Sorry about that. So, what is a “user”? This is a harder question than you might think. Is a “user” a single person? Is it the sum total of that user’s interactions with the system over a period of time? How do we scale a system to serve a number of “users” if we don’t know what a user is?
When talking with our customers, it seemed like folks were most interested in two things:
- Concurrent Users - The number of users the system can serve at once
- System Population - The system’s ability to serve a given annual user base
Once we figured that out, we went and locked ourselves in a dungeon laboratory for a few weeks and ran load tests against every kind of AWS setup we could think of. We came out of that with a set of hardware specs that we could look at and say, “yup, this will do the job for X Concurrent Users and Y System Population.” The article where we enumerate this is long, boring, and of interest mostly to nerds, but you can find it here if you’re interested.
System Population is an interesting number. In a lot of Engine installations, you very rarely have every registered user of the system engaged at once. Rarely. But sometimes (like right before a deadline, the end of the semester, or the last week of December) they all pile on at once, and that’s what leads us to the need for...
Spike Protection
SCORM Engine traffic (and LMS traffic) is inherently spiky. Imagine a typical scenario: the folks that use your LMS all have end-of-year training deadlines, and many of them tend to put things off until the last minute. As a result, traffic to your LMS and to SCORM Engine blows up tenfold during the last few days of the year, and everything grinds to a halt under the load. Cue the angry support calls complete with torches and pitchforks.
Ugh. Not awesome. But how do you deal with it? You don’t want to have a bunch of expensive servers sitting around doing nothing 80% of the time just so you can handle big spikes.
Amazon provides Auto Scaling features that we use to solve this problem. We configure our Auto Scaling Groups to measure critical performance metrics, and we add new SCORM Engine application servers when these metrics pass critical thresholds. Once the traffic spike passes, the Auto Scaling Group sees that it’s time to cool off and shuts down the unneeded capacity.
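To make that a little more concrete, here's a stripped-down sketch of the kind of target-tracking policy involved, written in Terraform. The group name, metric, and threshold are illustrative (and the aws_autoscaling_group it points at is assumed to be defined elsewhere), not a copy of our production config:

```hcl
# Target-tracking scaling for the Engine app tier: add servers when average
# CPU climbs past the target, shed them again once the spike passes.
resource "aws_autoscaling_policy" "engine_app_cpu" {
  name                   = "engine-app-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.engine_app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      # Illustrative metric; pick whatever best reflects "the servers are busy."
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60
  }
}
```

The nice part is that scale-in comes for free: the same policy that adds capacity on the way up removes it again on the way down.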
Protecting against spikes at the database level is harder. SCORM Engine is inherently write-heavy, and it leans on master database servers pretty hard. For loads of a few hundred concurrent users, this isn't a big deal. However, if you're expecting really large spikes in traffic (as in many thousands of users all at once), load-testing your backend and having a protocol for upsizing and downsizing it is essential. The nuts and bolts of this are beyond the scope of this particular article, but if you want to chat about it, please drop us a line at support@rusticisoftware.com and we'll help you out.
High Availability
We all want 99.9999999% uptime for everything. Who wouldn’t? Things get more and more expensive for every nine you add, so building High Availability systems becomes a game of “where do I get the most benefit for the least risk?” We’ve built our Managed Environments around the idea that if the system must fail (and all systems do eventually) it should at least do so gracefully, and be able to recover without human intervention.
There are a few common challenges when considering High Availability configurations in Engine:
- How do we share content files between Application Servers? The typical answer here is a SAN (expensive) or NAS (poor performance).
- We can’t afford downtime, so we want to put servers in multiple datacenters for redundancy, but holy wow does that ever cost the Earth.
- Even if we get fancy and have servers in multiple datacenters, how do we handle failover? Especially of the database servers?
- How do we keep the configuration of all these servers in sync? Upgrades are a serious chore when you have to run them on 20 servers at once.
We solved the content file sharing problem by using Amazon’s S3 service for file storage. S3 is cheap, performs well, and is very, very reliable. Done.
AWS makes solving the problem of availability across multiple datacenters straightforward. We have servers in multiple physical locations (in AWS-speak, these are called “Availability Zones”). The SCORM Engine servers sit behind an Elastic Load Balancer. If one of those zones fails, the Load Balancer notes the failure and routes connections away from the failed systems, while the Auto Scaling Group spins up extra servers in the surviving zone to pick up the slack. A key feature here is that a human being doesn’t have to do anything for this to work - the system is smart enough to look after itself.
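For a rough picture of that shape in Terraform (the subnets, target group, and launch template here are placeholders assumed to exist elsewhere), the important bit is simply that both the load balancer and the Auto Scaling Group span subnets in more than one Availability Zone:

```hcl
# Both the load balancer and the app servers span two Availability Zones, so
# losing a zone just shifts traffic (and replacement capacity) to the survivor.
resource "aws_lb" "engine" {
  name    = "engine-lb"
  subnets = [aws_subnet.app_az_a.id, aws_subnet.app_az_b.id]
}

resource "aws_autoscaling_group" "engine_app" {
  name                = "engine-app"
  min_size            = 2
  max_size            = 8
  vpc_zone_identifier = [aws_subnet.app_az_a.id, aws_subnet.app_az_b.id]
  target_group_arns   = [aws_lb_target_group.engine.arn]
  health_check_type   = "ELB"

  launch_template {
    id      = aws_launch_template.engine_app.id
    version = "$Latest"
  }
}
```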
High availability for database servers is a little bit tougher - we can only have one master database running at a time (the technical reasons for this are super boring, so I’ll spare you). Happily, Amazon’s RDS service provides a mechanism that lets us automate failover to a database server in another zone in the event that we lose the primary. The failover happens without human intervention, and generally results in downtimes of less than 30 seconds. Not bad, given that it means we can survive a whole datacenter being taken out by an alien invasion, an asteroid strike, a horde of marauding Dothraki, or a truck knocking over the power pole on the datacenter’s loading dock.
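In Terraform terms, turning that on is close to a one-liner. Here's a stripped-down sketch (the engine, instance size, and credentials are placeholders, not our real settings):

```hcl
# Multi-AZ RDS: Amazon keeps a synchronous standby in a second Availability
# Zone and fails over to it automatically if the primary disappears.
resource "aws_db_instance" "engine" {
  identifier        = "engine-db"
  engine            = "postgres"        # placeholder engine and sizing
  instance_class    = "db.m5.large"
  allocated_storage = 100
  multi_az          = true

  username = var.db_username
  password = var.db_password

  skip_final_snapshot = true            # fine for a sketch, not for production
}
```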
All of this stuff requires a lot of orchestration and automation, particularly when you’re doing it at the scale we are. After a lot of experimentation, late nights, and spilled coffee, we settled on using Ansible for handling server build automation and Hashicorp’s Terraform and Consul tools to handle building the Amazon infrastructure.
Content Delivery
Over the last few years, we’ve gotten a lot of feedback from folks that use SCORM Engine about the need to provide high levels of service to a global audience. There was a lot of hunger for a system that would allow folks in Australia to have the same quality of experience with e-learning content as folks in Europe or North America. Content Delivery Network (CDN) support was the clear answer, but the implementation details were really tricky, and only a brave few attempted it.
In our Managed Environments, we solved this problem by integrating with Amazon’s CloudFront service. CloudFront allows us to cache SCORM and xAPI content in datacenters that are spread around the planet. When a user’s request for content doesn’t have to go halfway around the world in order to complete, it speeds things up a ton. As a bonus, it also greatly reduces the load on our main application servers.
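Here's a boiled-down Terraform sketch of what such a distribution looks like. The origin domain and cache settings are purely illustrative; a real Engine distribution has more behaviors and tighter cache rules than this:

```hcl
# Cache content at CloudFront edge locations around the world instead of
# serving every request from the origin servers.
resource "aws_cloudfront_distribution" "engine_content" {
  enabled = true

  origin {
    domain_name = "content.example.com"   # illustrative origin
    origin_id   = "engine-content"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  default_cache_behavior {
    target_origin_id       = "engine-content"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}
```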
Security
Security is one of those topics that can get really super nerdy really fast. It’s important, though, and we’ve put a lot of thought into how we make our Managed environments secure against abuse and intrusion. AWS provides a ton of really well-considered features that we’ve integrated with to wrap layers of security around our Managed environments.
We’ve built an integration with Amazon’s Web Application Firewall that we use as a layer of protection for SCORM Engine against naughty folks who would attempt to abuse it, and we use Amazon’s Shield service to defend against denial of service attacks.
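For the curious, the general shape of a WAF rule set in Terraform looks something like the sketch below. It attaches one of Amazon's managed rule groups and is illustrative only - it's not a picture of our actual rules:

```hcl
# Put AWS's managed "common" rule set in front of Engine; requests that don't
# trip a rule are allowed through by default.
resource "aws_wafv2_web_acl" "engine" {
  name  = "engine-waf"
  scope = "REGIONAL"

  default_action {
    allow {}
  }

  rule {
    name     = "aws-common-rules"
    priority = 1

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "engine-waf-common"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "engine-waf"
    sampled_requests_enabled   = true
  }
}
```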
We use Amazon’s VPC features to build our internal network bits, and to put walls between the various components of the system so that only the things that MUST talk to each other can communicate. When we build a new environment for a client, they get their own shiny new network infrastructure that’s isolated from everyone else.
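A small Terraform sketch of the idea (the VPC reference and database port are illustrative): the database tier only accepts connections from the app tier's security group, and everything else is denied by default.

```hcl
# Database servers only accept SQL traffic from the Engine app servers'
# security group; anything not explicitly allowed is blocked.
resource "aws_security_group" "engine_app" {
  name   = "engine-app"
  vpc_id = aws_vpc.engine.id
}

resource "aws_security_group" "engine_db" {
  name   = "engine-db"
  vpc_id = aws_vpc.engine.id

  ingress {
    description     = "Database traffic from the app tier only"
    from_port       = 5432          # illustrative port
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.engine_app.id]
  }
}
```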
We use Amazon’s Inspector Agent to monitor all of our servers for vulnerabilities and misconfigurations.
We build all of our servers from scratch every time we roll out an update to our software. This solves a whole set of availability and security problems. It also makes rolling updates with Terraform a LOT easier - there's no worrying about provisioning time for instances or taking stuff in and out of the load balancers piecemeal - we just stand up new AMIs in an autoscaling group, point the ELB at it, and off we go. If for some reason an instance is compromised (which, happily, hasn't happened yet), we can just destroy it and move on.
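Here's a rough Terraform sketch of that pattern (the names, AMI variable, and sizes are illustrative): each release points a fresh launch template at the newly baked AMI, and tying the group's name to the template version - together with create_before_destroy - means the replacement group comes up and registers with the load balancer before the old one is torn down.

```hcl
# Immutable deploys: every release is a brand-new set of servers built from a
# freshly baked AMI, not a patch applied to the old ones.
resource "aws_launch_template" "engine_app" {
  name_prefix   = "engine-app-"
  image_id      = var.engine_ami_id   # the AMI baked for this release
  instance_type = "m5.large"
}

resource "aws_autoscaling_group" "engine_app" {
  # Baking the template version into the name forces a replacement group on
  # every new AMI; create_before_destroy stands it up before the old one goes.
  name                = "engine-app-${aws_launch_template.engine_app.latest_version}"
  min_size            = 2
  max_size            = 8
  vpc_zone_identifier = var.app_subnet_ids
  target_group_arns   = [aws_lb_target_group.engine.arn]

  launch_template {
    id      = aws_launch_template.engine_app.id
    version = aws_launch_template.engine_app.latest_version
  }

  lifecycle {
    create_before_destroy = true
  }
}
```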