Canvas LMS deployment for the USAF

David Owen

Cloud Infrastructure Architect
Systems Engineer
Engineering Manager
Linux
PostgreSQL

Background

The United States Air Force runs an internal education program that most of its service members participate in. Its courses were spread across several courseware systems, none of which met operational requirements.
Their goal was to deploy Canvas LMS to a private, cloud-hosted environment, and migrate courses and accounts for approximately 220,000 students. The new system had to scale to their anticipated load and be highly reliable, secure, and cloud-agnostic.
David would lead the project team, assisted by a few other developers who were assigned to the project part-time. At the start of the project, David was the only project member with the technical skills and experience to deploy such a system, so besides meeting the technical requirements, he would also train the other project members on it.

Deployed services

The final deployed system consisted of over two dozen services. Directly supporting Canvas itself were the Canvas application servers and background workers, PostgreSQL, an NFS file server, Cassandra, Redis, HAProxy, Postfix, and Dovecot. WordPress, MariaDB, Etherpad, an SSO integration, Nagios, Graphite, Squid, Fail2Ban, ntpd, ISC BIND, ClamAV, AIDE, audit logging, a machine compliance scanner, and central rsyslog and backup servers were also deployed to support additional features and requirements.
David automated the deployment and configuration of all of these services. He designed the deployment scripts so that they could be safely re-run on already-deployed machines, either to bring them back into configuration compliance or to roll out configuration changes. His extensive experience was key to being able to deploy so many custom services.
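The scripts themselves are not shown here, but the safe-to-re-run property described above is the classic check-then-change pattern. A minimal sketch, with a placeholder file path and settings rather than the project's actual configuration:

```shell
#!/bin/sh
# Idempotent deployment step: only act when the machine has drifted from
# the desired state, so the script is safe to re-run at any time.
# (File path and sysctl values here are illustrative placeholders.)

ensure_line() {
    # ensure_line FILE LINE -- append LINE to FILE unless already present
    file=$1; line=$2
    grep -qxF "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

ensure_line /tmp/demo-sysctl.conf 'net.ipv4.tcp_syncookies = 1'
ensure_line /tmp/demo-sysctl.conf 'net.ipv4.tcp_syncookies = 1'  # second run is a no-op
```

Because every step checks state before changing it, running the whole script on a healthy machine makes no changes, while running it on a drifted machine repairs exactly the drifted parts.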

Reliability

Critical machines and services were either horizontally scaled and redundant or had standby machines available. Detection of machine outages and the subsequent failover were automatic. Because deployment was fully scripted, any machine could be replaced in minutes. In addition to this redundancy, all critical data was periodically dumped to backup storage.
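As a sketch of the periodic-dump idea (the destination path, job name, and demo command below are hypothetical; real jobs would invoke a tool such as pg_dump against the live databases):

```shell
#!/bin/sh
# Sketch of a periodic backup-dump job. The dump command is passed in as
# arguments so the date-stamped file handling can be shown on its own.
# (Paths and names are placeholders, not the deployment's actual layout.)

backup() {
    # backup DEST_DIR NAME CMD... -- write CMD's output to a dated dump file
    dest=$1; name=$2; shift 2
    mkdir -p "$dest"
    "$@" > "$dest/$name-$(date +%Y%m%d).dump"
}

# A real invocation might look like: backup /backups/pg canvas pg_dump -Fc canvas
backup /tmp/demo-backups demo echo "dump-contents"
```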
David engineered a logging and metrics system out of existing components and a few custom agents. Metrics were logged by local agents to syslog, allowing easy correlation between machine state and logged events for any given point in time. Logs were shipped to a central server where cross-machine events could be correlated. The central server also extracted the metrics and sent them to Graphite or Nagios. Logs and metric history in both Graphite and Nagios were kept for over a year, allowing the team to compare current behavior against historical norms.
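The actual agents and log schema are not public; the following sketch illustrates the general shape of such a pipeline under assumed names (the `metrics:` tag, host prefix, and Graphite target are invented for illustration):

```shell
#!/bin/sh
# Sketch of metrics-over-syslog: a local agent emits a tagged metric line,
# and the central server parses it into Graphite's plaintext protocol
# ("metric.path value timestamp"). All names here are illustrative.

emit_metric() {
    # On a real machine this would go through logger(1), e.g.:
    #   logger -t metrics "load1 $(cut -d' ' -f1 /proc/loadavg)"
    # Here we just print the line an agent would log.
    printf 'metrics: %s %s\n' "$1" "$2"
}

to_graphite() {
    # Convert a logged metric line into a Graphite plaintext record.
    # Real shipping might be: ... | nc graphite.example 2003
    awk -v now="$(date +%s)" '/^metrics: / { print "hosts.web1." $2, $3, now }'
}

emit_metric load1 0.42 | to_graphite
```

Because the metric lines travel through the same syslog stream as ordinary log events, a metric spike and the log messages surrounding it share one timeline, which is what makes the correlation described above cheap.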
He designed several custom metrics, along with agents to collect them. These additional metrics not only helped diagnose problems, but also helped the team improve machine utilization.
The metrics and status dashboards were detailed enough that the team was able to diagnose several problems from them alone.

Security

Security was critical. Machines were configured to meet the applicable DoD STIG guidelines. David designed the configuration scripts to emit compliance assertions for the aspects they controlled (e.g. the Apache config). He wrote an additional scanner to collect all remaining assertions and produce a machine-readable report that could be rolled up into the DoD's central compliance dashboard.
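The real scanner and STIG rule IDs are not reproduced here; a minimal sketch of the assertion-emitting idea, with invented rule IDs and an invented output format, might look like:

```shell
#!/bin/sh
# Sketch of machine-readable compliance assertions. Each check prints one
# "RULE_ID<TAB>pass|fail" line that a roll-up tool could aggregate.
# (Rule IDs and the line format are invented for illustration.)

assert_check() {
    # assert_check RULE_ID COMMAND... -- run the check, report pass/fail
    id=$1; shift
    if "$@" >/dev/null 2>&1; then
        printf '%s\tpass\n' "$id"
    else
        printf '%s\tfail\n' "$id"
    fi
}

assert_check DEMO-0001 true     # a check that passes
assert_check DEMO-0002 false    # a check that fails
```

A real check command would test an actual machine property, e.g. `assert_check DEMO-0003 grep -q '^PermitRootLogin no' /etc/ssh/sshd_config`.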
To prevent intrusions, the entire system was protected by an active firewall provided by the Air Force. Each machine additionally used fail2ban. All files, from system package updates to student uploads, were scanned by ClamAV.
System files were fingerprinted with AIDE to detect tampering, and rootkit scanners were run periodically. Kernel auditing logged all privileged operations to syslog, which was also streamed off-machine to help prevent log tampering.
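The deployment's specific rules and hosts are not public; the fragments below illustrate how auditd rules, rsyslog forwarding, and periodic AIDE checks are commonly wired together (all file paths, hostnames, keys, and schedules are placeholders):

```
# /etc/audit/rules.d/priv.rules: record every privileged exec
-a always,exit -F arch=b64 -S execve -F euid=0 -k priv-exec

# /etc/rsyslog.d/90-forward.conf: stream all logs to the central server over TCP
*.* @@loghost.example:514

# /etc/cron.d/aide: nightly file-integrity check, results into syslog
0 3 * * * root /usr/sbin/aide --check | logger -t aide
```

Streaming the audit log off-machine in near real time is what gives the tamper resistance: an intruder who gains root can scrub local logs, but the lines already shipped to the central server are out of reach.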

Knowledge-sharing

It was important to document the system and train other team members and new hires in its function and maintenance.
An important foundation of this effort was a knowledge base. David created the knowledge base and began documenting the system and its many components, organizing articles into a useful hierarchy. Project members could subscribe to articles or sets of articles to be automatically notified of changes.
He also ran weekly training sessions. He trained on the deployed services, tooling, cloud infrastructure, and overall mindset and approach. He added notes from each session into the knowledge base.
David also established a culture of retrospectives. Any time there was an issue with the system, the investigator would start a new entry in the knowledge base and update it as the investigation and fix progressed. At the end, the team would hold a retrospective and record actions to improve the system.
* Canvas LMS logo copyright Instructure Inc and made available under the terms of the AGPL-3.0. Original source at https://github.com/instructure/canvas-lms/blob/master/public/images/canvas-logo.svg. Format and aspect conversion by David Owen.