Our #1 priority is system availability. We've worked hard to make sure you have your data whenever you need it. Learn more about how we do it and get a system status check.
Our standby site resides on a completely independent server system in Oregon. Should anything happen to the main Táve servers, the standby site serves as an always-accessible, read-only version of the main production site.
Use this page for a report on our current server status; we consistently deliver 99.9%+ uptime.
Originally a South Florida company that weathered several hurricanes, we are well versed in the dangers of being unprepared for disaster. It's for this reason that we've never hosted servers of any kind in Florida; our systems reside in Amazon's AWS cloud, primarily in Virginia, with a hot standby in Oregon.
We've worked hard, sometimes at the expense of whiz-bang features, to make sure that you have access to your data whenever you need it. This FAQ covers some of the ways we make sure your data is available when it matters most: first, how we keep it operating resiliently; then, how we prepare for issues; and finally, how we prepare for disaster recovery.
Global High Speed Access
We're proud of our international user base and strive to deliver fast page loads everywhere. It all starts with lightning-quick page loads, which stay fast worldwide thanks to the Amazon CloudFront CDN and its more than 52 edge locations.
99.99% Average Monthly Global Availability
When you operate in the cloud, you have to anticipate failure. This is why operating your own data center as a small or medium business is begging for downtime (unless you're on the S&P 500, you have no business running your own data center, and shame on you if you do).
We prepare for the inevitable interruption by operating our application stack (web servers and databases) in multiple AWS Availability Zones 24/7. We have multiple web servers in each zone we occupy and we always occupy at least two zones.
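The multi-zone redundancy described above can be sketched in a few lines. This is an illustrative model, not Táve's actual code; the zone and server names are hypothetical.

```python
# Illustrative sketch: routing only considers servers that pass health
# checks, so losing a host (or a whole zone) leaves traffic flowing.

def healthy_servers(fleet):
    """Return every server across all zones that still reports healthy."""
    return [srv
            for zone in sorted(fleet)
            for srv in fleet[zone]
            if srv["healthy"]]

fleet = {
    "zone-a": [{"name": "web-1", "healthy": True},
               {"name": "web-2", "healthy": True}],
    "zone-b": [{"name": "web-3", "healthy": False},   # simulated failure
               {"name": "web-4", "healthy": True}],
}

# Even with web-3 down, three servers across two zones keep serving.
available = [s["name"] for s in healthy_servers(fleet)]
```

Because at least two zones are always occupied, a single-zone outage reduces capacity rather than availability.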
Our system status is published at status.tave.com. Please note that these availability checks scan all systems, including backend servers, so they are far slower than an average page load. The upside is that we consider the system down if email is down, or even just daily reminders, so many of our so-called downtime events are actually subsystems degrading gracefully, such as calendar feeds.
These events do not involve any data loss.
Web Server Failures
When large AWS users like Netflix and Reddit are down, there's a chance that we're affected, as it suggests that several AWS zones are experiencing critical failures. If one of our zones is indeed affected, web servers in other zones take over. In a severe event, when an entire data center goes down and all clients are affected, it can take several minutes more for new servers to become available, since every other AWS user is performing a similar recovery process.
While there may be a slight slow-down as fewer servers handle more traffic while new servers automatically spin up as part of our elastic architecture, we have always remained online during web server and EC2 zone failure events.
These multi-zone web servers sit behind Amazon’s elastic load balancers, which Amazon advertises as auto-scaling and auto-healing. In the event of a total ELB failure, our architecture is designed to automatically provision a new ELB, which generally takes 5-10 minutes. The system would appear fully offline during this type of event, but there is no risk of data loss as these systems simply route requests to web servers across the AWS zones.
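The recovery idea behind total ELB failure can be modeled as a simple watchdog: if the load balancer stops responding, provision a replacement and resume routing. This is a hedged simulation of the logic, not AWS API code; the function and field names are invented for illustration.

```python
# Simulated watchdog: a dead load balancer is replaced automatically.
# In practice, provisioning a new ELB takes roughly 5-10 minutes; the
# web servers behind it are untouched, so no data is at risk.

def ensure_load_balancer(state):
    """Replace the load balancer if it has failed; targets are unaffected."""
    if not state["elb_alive"]:
        state["elb_alive"] = True        # stand-in for creating a new ELB
        state["reprovisioned"] = True    # recovery, not repair
    return state

cluster = {"elb_alive": False, "reprovisioned": False,
           "web_servers": ["web-1", "web-2"]}
cluster = ensure_load_balancer(cluster)
```

The key property is that the balancer only routes requests; replacing it restores reachability without touching data.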
In the event of database failure, which happens in any IT environment, our setup is ready to automatically failover to a hot standby in another data center within the same region. When our database server crashes, which happens maybe once a year on average, our application is back up and running at full speed with ZERO data loss in 1-3 minutes and ZERO human intervention (we like it that way!).
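During a 1-3 minute automatic failover like the one described above, a well-behaved application retries its database connection rather than failing requests outright. Here is a generic retry-with-exponential-backoff sketch (the `connect` callable is a stand-in, not a real database driver):

```python
import time

def connect_with_backoff(connect, attempts=6, base_delay=1.0, sleep=time.sleep):
    """Try `connect` up to `attempts` times, doubling the delay each failure."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ... ~1 min total

# Simulated failover: the first three attempts fail, then the standby answers.
calls = {"n": 0}
def fake_connect():
    calls["n"] += 1
    if calls["n"] < 4:
        raise ConnectionError("primary is failing over")
    return "connected to standby"

result = connect_with_backoff(fake_connect, sleep=lambda s: None)
```

With six attempts the total wait comfortably covers a failover window of a few minutes, so users see a brief slowdown instead of errors.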
EBS Failure: The AWS Achilles' Heel
Amazon’s Elastic Block Storage, which provides networked disk access to servers and databases, is quite possibly the most common culprit in AWS outages. Historically, this has been our largest risk, as a failure in their EBS system has often resulted in an inability to spin up new instances until the event is resolved. This wait, even though it often slows down the recovery process, has not involved data loss to date. Luckily, we’ve only had to witness this situation twice since moving the full application stack to AWS in March of 2011, and in one of those two events we were unaffected as our primary database was not in an affected zone.
Disaster Recovery Options
These options are for true emergencies only. We would never take these actions automatically, as they involve some level of data loss, and we have never had to perform any of them in our live production stack. We practice these protocols regularly in order to provide full-stack replicas for DevOps and Quality Assurance testing, which gives us even more confidence in their soundness. They're listed in priority order, which favors data integrity over speed.
Global Hot Standby Database
As mentioned above, we run a hot standby of our database in another zone of the AWS Eastern region. If both the primary and secondary databases fail simultaneously, we have a third hot option available 24/7 in another region (currently US West, Oregon). While we don’t run a full hot stack in this region, our hot database means we can spin up new web servers and background servers and be ready to go with virtually no data loss (there is a slight risk of loss if a database commit was in process during failure, but it’s highly unlikely). We replicate only our database because spinning up new servers is a trivial process in our architecture.
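The failover priority described above is essentially an ordered list: regional standby first, then the cross-region Oregon copy. A minimal sketch of that decision, with made-up endpoint names:

```python
# Candidate databases in priority order (names/zones are illustrative).
CANDIDATES = [
    ("primary",              "east-zone-a"),
    ("regional-standby",     "east-zone-b"),
    ("cross-region-standby", "us-west-oregon"),
]

def pick_database(down):
    """Return the first candidate database not in the `down` set."""
    for name, zone in CANDIDATES:
        if name not in down:
            return name, zone
    raise RuntimeError("no live database; fall back to snapshot restore")

# Both east-coast databases failing simultaneously promotes the Oregon copy.
choice = pick_database(down={"primary", "regional-standby"})
```

Only the database needs this treatment: web and worker servers are stateless and can be re-provisioned around whichever copy wins.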
During this process, 1-800-560-TAVE would be opened for emergency data requests, as the recovery time is based on bringing new web servers and workers online, and not the database, which we can access internally.
Time: 1-3 hours from failure alarm.
Data loss: Highly unlikely.
5-minute Database Snapshots
While using Amazon's database services may yield occasional hiccups, one of their great features is filesystem-level snapshots of the database every 5 minutes. These snapshots can then be used to spin up a new server. In a true multi-client AWS disaster, it may take a while to spin up a new server, but when it returns there would be minimal data loss of 0-5 minutes.
We’ve only had to consider executing this step once, and the primary database recovered while the new snapshot-based server was booting up.
Time: 20-40 minutes from failure alarm.
Data loss: 0:01 – 5:00 minutes.
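The 0-5 minute bound follows directly from the snapshot interval: the restore point is the newest snapshot taken at or before the failure. A small sketch of that arithmetic (timestamps are illustrative):

```python
from datetime import datetime, timedelta

def latest_snapshot_before(failure, first, interval=timedelta(minutes=5)):
    """Most recent snapshot time at or before `failure`."""
    elapsed = failure - first
    return first + (elapsed // interval) * interval

first_snapshot = datetime(2024, 1, 1, 0, 0)
failure_time = datetime(2024, 1, 1, 12, 13)

restore = latest_snapshot_before(failure_time, first_snapshot)
loss = failure_time - restore   # worst-case data-loss window, always < 5 min
```

A failure at 12:13 restores from the 12:10 snapshot, losing at most the three minutes in between.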
Off-site Database Backups
In addition to the built-in AWS services, we perform automated, transactionally-locked backups of the database via a read-slave every hour. For both legal and recovery reasons, we store hourly backups for 3 months, daily backups for 36 months, and monthly backups indefinitely, going back to 2006. These backups are stored off-site from the application in a walled-off, isolated system. For security reasons, the application accounts are granted write-only access to the primary backup location and thus have no read or delete access.
This protocol is tested frequently, as these backups are used in our release testing process.
Time: 1-3 hours from failure alarm.
Data loss: 0:01-59:59 minutes.
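The retention policy above (hourlies for 3 months, dailies for 36 months, monthlies forever) can be expressed as a simple rule. This is an illustrative model of the policy, not the actual backup tooling:

```python
# Retention rule: each backup kind has an age limit in days; None = forever.
RETENTION_DAYS = {
    "hourly": 90,        # roughly 3 months
    "daily": 36 * 30,    # roughly 36 months
    "monthly": None,     # kept indefinitely (since 2006)
}

def keep(kind, age_days):
    """Decide whether a backup of the given kind and age is retained."""
    limit = RETENTION_DAYS[kind]
    return limit is None or age_days <= limit

hourly_recent = keep("hourly", 30)       # within 3 months: retained
hourly_stale = keep("hourly", 120)       # past 3 months: pruned
monthly_ancient = keep("monthly", 365 * 18)  # monthlies never expire
```

Write-only credentials mean the application itself could never run a prune like this against the primary backup store; retention is enforced from the isolated side.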
Prevention & Security
In addition to responding to infrastructure events, we heavily monitor our systems to observe the current state and trigger alarms when situations change. In this way, we often avoid potential issues by catching them before they result in a downtime event.
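Catching an issue before it becomes downtime usually means alarming on a trend, not just a hard failure. A minimal sketch of that idea, with invented thresholds:

```python
# Classify recent utilization samples (0.0-1.0) with a small moving
# average, so a climbing trend raises a warning before anything fails.

def evaluate(samples, warn=0.8, crit=0.95):
    """Return "ok", "warning", or "critical" for the latest samples."""
    window = samples[-3:]
    recent = sum(window) / len(window)
    if recent >= crit:
        return "critical"
    if recent >= warn:
        return "warning"
    return "ok"

# Utilization creeping upward trips a warning while there is still headroom.
status = evaluate([0.55, 0.60, 0.78, 0.82, 0.85])
```

An on-call engineer acting on the "warning" state, say by adding capacity, is how a potential outage becomes a non-event.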
Our systems also meet or exceed security best practices. Access is locked down to only those who need it and requires two-factor RSA authentication codes that recycle every 60 seconds. This covers our infrastructure, our application source code, and our communications platform, all of which use separate services from separate companies.
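The rotating codes described above behave like time-based one-time passwords. For illustration, here is a generic RFC 6238 TOTP sketch using only the standard library; it is not the vendor's proprietary algorithm, and real deployments vary the time step (30 or 60 seconds) and digest.

```python
import hashlib
import hmac
import struct

def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Derive a time-based one-time code from a shared secret (RFC 6238)."""
    counter = struct.pack(">Q", unix_time // step)           # time window index
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                               # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: this secret at t=59 yields the 8-digit code 94287082.
code = totp(b"12345678901234567890", 59, digits=8)
```

Because the code depends on the current time window, an intercepted code is useless within about a minute, which is the point of the 60-second recycle.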
We hope to maintain and improve our 99.99% average availability and hope these recovery methods are never necessary. Since you can never be too careful, we’ll continue to review and update these processes as our application evolves.