We work hard to make sure that you have reliable access to your data. Learn more about how we do it and get a system status check.
Current Status
Use this info page to get a report on our current server status. We regularly have 99.9% uptime.
Before relocating to Atlanta, Georgia, we were a South Florida company. Having weathered several hurricanes, we are well versed in the dangers of being unprepared for disaster. For this reason, we’ve never hosted servers of any kind in Florida or Georgia; our systems reside in the Amazon Web Services (AWS) cloud.
We’ve worked hard, sometimes at the expense of whiz-bang features, to make sure that you have reliable access to your data. This FAQ covers some of the ways we keep your data available, as well as how we prepare for issues and disaster recovery.
Operational Resiliency
Global High Speed Access
We’re proud of our international user base and strive to deliver fast page loads everywhere. Pages stay fast across the globe thanks to the Amazon CloudFront CDN, which has hundreds of edge locations worldwide.
99.99% Average Monthly Global Availability
When you operate in the cloud, you have to anticipate failure. This is why operating your own data centers in a small or medium business is begging for downtime (unless you’re in the S&P 500, you have no business running your own data center, and shame on you if you do).
We prepare for the inevitable interruption by operating our application stack (web servers and databases) in multiple AWS Availability Zones 24/7. We have multiple web servers in each zone we occupy and we always occupy at least two zones.
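The placement rule in the paragraph above (at least two zones, multiple web servers in each) can be expressed as a small check. This is only an illustrative sketch: the server names, zone labels, and the reading of "multiple" as "two or more" are assumptions, not a description of our actual tooling.

```python
from collections import Counter

# Thresholds taken from the text; "multiple" is assumed to mean at least two.
MIN_ZONES = 2
MIN_SERVERS_PER_ZONE = 2


def placement_problems(server_zones):
    """Return a list of problems with a layout; an empty list means it passes.

    server_zones maps a (hypothetical) server name to its Availability Zone.
    """
    per_zone = Counter(server_zones.values())
    problems = []
    if len(per_zone) < MIN_ZONES:
        problems.append(f"only {len(per_zone)} zone(s) in use, need {MIN_ZONES}")
    for zone, count in sorted(per_zone.items()):
        if count < MIN_SERVERS_PER_ZONE:
            problems.append(
                f"{zone} has {count} web server(s), need {MIN_SERVERS_PER_ZONE}"
            )
    return problems


# Hypothetical layout: two servers in each of two zones.
layout = {
    "web-1": "zone-a",
    "web-2": "zone-a",
    "web-3": "zone-b",
    "web-4": "zone-b",
}
print(placement_problems(layout))  # prints []
```

With a layout like this, losing an entire zone still leaves more than one web server running in the other.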
It’s worth noting that while our system status is published at status.tave.com, we consider our system down if email, or even just daily reminders, are down. Many of our downtime events are minor, but we remain committed to providing reliable information about our system status when our users need it most.
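To put an availability percentage in concrete terms, it helps to convert it into minutes of downtime. The sketch below does that arithmetic; the 30-day month is an assumption made for illustration, and the function name is ours.

```python
from decimal import Decimal


def downtime_budget_minutes(availability_percent, days=30):
    """Minutes of downtime allowed over `days` at the given availability.

    availability_percent is passed as a string (e.g. "99.99") so the
    arithmetic is exact rather than subject to floating-point rounding.
    """
    total_minutes = Decimal(days) * 24 * 60
    unavailable_fraction = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return total_minutes * unavailable_fraction


for target in ("99.9", "99.99"):
    budget = downtime_budget_minutes(target)
    print(f"{target}%: {budget:.2f} minutes per 30 days")
```

Over a 30-day month, 99.99% availability corresponds to 4.32 minutes of downtime, while 99.9% corresponds to 43.2 minutes.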
Availability Events
The following events do not involve any data loss.
Web Server Failures
When large AWS users such as Netflix and Reddit are down, there’s a chance that we’re affected too, as it suggests that several AWS zones are experiencing critical failures. If our zones are indeed affected, we will act quickly to provide updates on our system status at status.tave.com.
Our web servers sit behind Amazon’s Elastic Load Balancers (ELBs), which Amazon advertises as auto-scaling and auto-healing. In the event of a total ELB failure, our architecture is designed to automatically provision a new ELB, which generally takes 5-10 minutes. The system would appear fully offline during this type of event, but there is no risk of data loss, as these systems simply route requests to web servers across the AWS zones.
Database Failures
We use a multi-zone AWS Aurora database cluster and benefit from its built-in durability, including automatic failover in case of a database availability event. Failovers tend to complete within 60 seconds, though they could take longer during an AWS-wide disruption.
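One common way for application code to ride out a short failover window is to retry a failed operation for a bounded amount of time. The sketch below illustrates that general idea only; it is not a description of how our application handles failover. The 90-second limit, the 2-second delay, and the use of `ConnectionError` as the failure signal are all assumptions.

```python
import time


def call_with_failover_retry(operation, max_wait_seconds=90.0, delay_seconds=2.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Retry `operation` until it succeeds or `max_wait_seconds` has elapsed.

    `operation` is any zero-argument callable that raises ConnectionError
    while the database is unavailable (a hypothetical stand-in for a query).
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + max_wait_seconds
    while True:
        try:
            return operation()
        except ConnectionError:
            # Give up if another delay would take us past the deadline.
            if clock() + delay_seconds > deadline:
                raise
            sleep(delay_seconds)
```

The deadline is deliberately longer than the typical 60-second failover so that a normal failover ends in a successful retry instead of an error.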
EBS Failure: The AWS Achilles’ Heel
Amazon’s Elastic Block Store (EBS), which provides networked disk access to servers and databases, is quite possibly the most common culprit in AWS outages. Historically, this has been our largest risk, as a failure in the EBS system has often made it impossible to spin up new instances until the event is resolved. That wait often slows down the recovery process, but it has never involved data loss. Luckily, we’ve only witnessed this situation twice since moving the full application stack to AWS in March 2011, and in one of those two events we were unaffected because our primary database was not in an affected zone.
Disaster Recovery
These options are for true emergencies only. We’d never take these actions automatically, as they involve some level of data loss, and we have never had to perform any of them in our live production stack. We do practice these protocols regularly in order to provide full-stack replicas for DevOps and Quality Assurance testing, which gives us even more confidence in their soundness. They’re listed in our priority order, which favors data integrity over speed.
5-minute Database Snapshots
While using Amazon’s database services may yield occasional hiccups, one of the great features is filesystem-level snapshots of the database every 5 minutes. These snapshots can then be used to spin up a new server. In a true multi-client AWS disaster, it may take a while to spin up a new server, but once it is running, data loss would be limited to the 0-5 minutes since the last snapshot.
We’ve only had to consider executing this step once, and the primary database recovered while the new snapshot-based server was booting up.
This protocol is tested frequently, as these snapshots are used to reload our staging servers each morning.
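The 0-5 minute figure follows directly from the snapshot interval: whatever was written after the most recent snapshot is what would be lost. A minimal sketch of that calculation, assuming snapshots land exactly on a fixed 5-minute schedule and complete instantly (real snapshot timing will vary):

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(minutes=5)  # interval stated in the text


def latest_snapshot_before(failure_time, first_snapshot):
    """Time of the most recent snapshot taken at or before failure_time."""
    if failure_time < first_snapshot:
        raise ValueError("failure happened before the first snapshot")
    completed_intervals = (failure_time - first_snapshot) // SNAPSHOT_INTERVAL
    return first_snapshot + completed_intervals * SNAPSHOT_INTERVAL


def data_loss_window(failure_time, first_snapshot):
    """How much recent data a restore from the latest snapshot would miss."""
    return failure_time - latest_snapshot_before(failure_time, first_snapshot)


# Hypothetical timestamps, for illustration only.
first = datetime(2024, 1, 1, 0, 0, 0)
failure = datetime(2024, 1, 1, 0, 13, 20)
print(latest_snapshot_before(failure, first))  # 2024-01-01 00:10:00
print(data_loss_window(failure, first))        # 0:03:20
```

Under these assumptions the window is always shorter than one full interval, which matches the 0-5 minute range described above.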
Multi-Region Write-Once Database Backups
In addition to the built-in AWS features mentioned above, we use AWS Backup to store snapshots of the database in separate backup vaults across 2 regions. For both legal and recovery reasons, we keep hourly backups for 1 week, daily backups for 5 weeks, and monthly backups for 2 years. For security reasons, the backups use AWS Backup Vault Lock to prevent accidental or malicious modification or deletion.
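The retention schedule above can be written down as a small table and queried. This is a sketch for illustration, not our AWS Backup configuration: the names are hypothetical, "2 years" is approximated as 730 days, and the counts assume each tier is stored separately on a regular schedule.

```python
from datetime import timedelta

# Retention periods from the text; "2 years" is approximated as 730 days.
RETENTION = {
    "hourly": timedelta(weeks=1),
    "daily": timedelta(weeks=5),
    "monthly": timedelta(days=2 * 365),
}


def is_retained(kind, age):
    """True if a backup of this kind and age is still within its retention period."""
    return age <= RETENTION[kind]


def backups_on_hand():
    """Approximate number of backups of each kind held once the schedule is full."""
    return {
        "hourly": RETENTION["hourly"] // timedelta(hours=1),
        "daily": RETENTION["daily"] // timedelta(days=1),
        "monthly": 2 * 12,
    }


print(is_retained("hourly", timedelta(days=3)))   # True
print(is_retained("hourly", timedelta(days=10)))  # False
print(backups_on_hand())  # {'hourly': 168, 'daily': 35, 'monthly': 24}
```

The tiers trade granularity for reach: recent history can be restored to within the hour, while older history remains recoverable at daily and then monthly resolution.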
Prevention & Security
In addition to responding to infrastructure events, we heavily monitor our systems to observe the current state and trigger alarms when situations change. In this way, we often avoid potential issues by catching them before they result in a downtime event.
Our system logs are copied to S3 buckets with Write Lock to prevent malicious deletion or modification of access logs or change logs.
Our systems also meet or exceed security best practices. Access is always restricted to only those who need it and requires two-factor RSA authentication codes. This applies to our systems, our application source code, and our communications platform, all of which use separate services from separate companies.
Summary
We hope to maintain and improve our 99.99% average availability and hope these recovery methods are never necessary. Since you can never be too careful, we’ll continue to review and update these processes as our application evolves.