We work hard to make sure that you have reliable access to your data. Learn more about how we do it and get a system status check.

Current Status

Use this info page to get a report on our current server status. We regularly have 99.9% uptime.

We started out as a South Florida company and weathered several hurricanes, so we are well versed in the dangers of being unprepared for disaster. For this reason, we’ve never hosted servers of any kind in Florida; our systems reside in Amazon’s AWS cloud.

We’ve worked hard, sometimes at the expense of whiz-bang features, to make sure that you have reliable access to your data. This FAQ covers some of the ways we make sure your data is available, as well as how we prepare for issues and disaster recovery.

Operational Resiliency

Global High Speed Access

We’re proud of our international user base and strive to deliver lightning-quick page loads across the globe. Pages stay fast worldwide thanks to the Amazon CloudFront CDN, which has over 52 edge locations.

99.99% Average Monthly Global Availability

When you operate in the cloud, you have to anticipate failure. This is why operating your own data centers as a small or medium business is begging for downtime (unless you’re in the S&P 500, you have no business running your own data center, and shame on you if you do).

We prepare for the inevitable interruption by operating our application stack (web servers and databases) in multiple AWS Availability Zones 24/7. We have multiple web servers in each zone we occupy and we always occupy at least two zones.

It’s worth noting that while our system status is published at status.tave.com, we consider our system down if email, or even just daily reminders, are down. Many of our downtime events are minor, but we will always remain committed to providing reliable information about our system status when our users need it the most.
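
To put the availability figures on this page in perspective, the short sketch below converts a percentage into the downtime it allows per month. It is illustrative arithmetic only, and the 30-day month is an assumption rather than how our availability is officially measured.

```python
# Illustrative arithmetic only: converts an availability percentage into
# the downtime it allows over a 30-day month (an assumed month length).
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime budget, in minutes, for a 30-day month."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_30_DAY_MONTH * unavailable_fraction


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes per month")
```

Under these assumptions, 99.9% allows roughly 43 minutes of downtime in a month, while 99.99% allows a little over 4 minutes.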

Availability Events

The following events do not involve any data loss.

Web Server Failures

When large AWS users such as Netflix and Reddit are down, there’s a chance that we’re affected too, as it suggests that several AWS zones are experiencing critical failures. If our zones are indeed affected, we will act quickly to provide updates on our system status at status.tave.com.

Our web servers sit behind Amazon’s Elastic Load Balancers (ELBs), which Amazon advertises as auto-scaling and auto-healing. In the event of a total ELB failure, our architecture is designed to automatically provision a new ELB, which generally takes 5-10 minutes. The system would appear fully offline during this type of event, but there is no risk of data loss, as these systems simply route requests to web servers across the AWS zones.

Database Failures

In the event of a database failure, which happens in any IT environment, our setup is ready to automatically fail over to a hot standby in another data center within the same region. If our database server crashes, our application is back up and running at full speed with ZERO data loss in 1-3 minutes and ZERO human intervention (we like it that way!).
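
For readers who build integrations that talk to a database with this kind of failover, the sketch below shows one generic way a client could ride out a short failover window by retrying. It is not our application code: the function name `retry_until_available`, the 5-second pause, and the use of `ConnectionError` are all hypothetical choices for illustration, and the 180-second budget simply mirrors the upper end of the 1-3 minute window described above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_available(
    operation: Callable[[], T],
    max_wait_seconds: float = 180.0,  # upper end of the 1-3 minute window
    delay_seconds: float = 5.0,       # hypothetical pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry operation until it succeeds or the wait budget is spent."""
    waited = 0.0
    while True:
        try:
            return operation()
        except ConnectionError:
            if waited + delay_seconds > max_wait_seconds:
                raise  # budget exhausted: surface the last error
            sleep(delay_seconds)
            waited += delay_seconds
```

Passing `sleep` as a parameter keeps the sketch easy to test without actually waiting; a real client would usually also add jitter or exponential backoff.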

EBS Failure: The AWS Achilles’ Heel

Amazon’s Elastic Block Store (EBS), which provides networked disk access to servers and databases, is quite possibly the most common culprit in AWS outages. Historically, this has been our largest risk, as a failure in the EBS system has often resulted in an inability to spin up new instances until the event is resolved. Although this wait often slows down the recovery process, it has never involved data loss to date. Luckily, we’ve only witnessed this situation twice since moving the full application stack to AWS in March of 2011, and in one of those two events we were unaffected, as our primary database was not in an affected zone.

Disaster Recovery

These options are for true emergencies only. We’d never take these actions automatically, as they involve some level of data loss, and we have never had to perform any of them in our live production stack. We practice these protocols regularly in order to provide full stack replicas for DevOps and Quality Assurance testing, which gives us even more confidence in their soundness. They’re listed in our priority order, which favors data integrity over speed.

5-minute Database Snapshots

While using Amazon’s database services may yield occasional hiccups, one of their great features is filesystem-level snapshots of the database every 5 minutes. These snapshots can then be used to spin up a new server. In a true multi-client AWS disaster, it may take a while to spin up a new server, but when it returns, data loss would be limited to 0-5 minutes.

We’ve only had to consider executing this step once, and the primary database recovered while the new snapshot-based server was booting up.
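
The 0-5 minute figure follows directly from the snapshot interval: anything written after the most recent snapshot is what a restore would lose. The sketch below works through one example. The timestamps are made up for illustration, and the function name `data_loss_window` is hypothetical.

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(minutes=5)  # interval stated in this FAQ


def data_loss_window(last_snapshot: datetime, failure: datetime) -> timedelta:
    """Writes made after the last snapshot are what a restore would lose."""
    if failure < last_snapshot:
        raise ValueError("failure time precedes the last snapshot")
    return failure - last_snapshot


# Hypothetical times: snapshot at 12:00:00, failure at 12:03:30.
snapshot_time = datetime(2024, 1, 1, 12, 0, 0)
failure_time = datetime(2024, 1, 1, 12, 3, 30)
lost = data_loss_window(snapshot_time, failure_time)
print(lost)  # 0:03:30

# With snapshots every 5 minutes, the window never exceeds the interval.
assert lost <= SNAPSHOT_INTERVAL
```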

Off-site Database Backups

In addition to the built-in AWS services, we perform automated, transactionally locked backups of the database via a read-slave every hour. For both legal and recovery reasons, we store our hourly backups for 3 months and our daily backups for 36 months, and we keep our monthly backups indefinitely, going back to 2006. These backups are stored off-site from the application in a walled-off, isolated system. For security reasons, the application accounts are granted write-only access to the primary backup location and thus have zero read or delete access.

This protocol is tested frequently, as these backups are used in our release testing process.
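
The retention schedule above can be read as a simple rule applied to each backup. The sketch below is one possible way to express it, not a description of our actual tooling. It assumes (and these are assumptions, not facts from this FAQ) that the midnight backup counts as the daily backup, that the midnight backup on the 1st counts as the monthly backup, and that months can be approximated in days.

```python
from datetime import datetime, timedelta

HOURLY_RETENTION = timedelta(days=90)    # "3 months", approximated as 90 days
DAILY_RETENTION = timedelta(days=1095)   # "36 months", approximated as 3 x 365 days


def should_keep(backup_time: datetime, now: datetime) -> bool:
    """Decide whether a backup taken at backup_time is still retained.

    Assumptions (not from the FAQ): the midnight backup doubles as the daily
    backup, and the midnight backup on the 1st doubles as the monthly backup.
    """
    age = now - backup_time
    is_daily = backup_time.hour == 0
    is_monthly = is_daily and backup_time.day == 1
    if is_monthly:
        return True  # monthly backups are kept indefinitely
    if is_daily:
        return age <= DAILY_RETENTION
    return age <= HOURLY_RETENTION


now = datetime(2024, 6, 15, 9, 0)
print(should_keep(datetime(2024, 6, 1, 13, 0), now))   # recent hourly backup
print(should_keep(datetime(2024, 1, 10, 13, 0), now))  # hourly backup past 90 days
print(should_keep(datetime(2010, 3, 1, 0, 0), now))    # old monthly backup
```

With the assumptions above, the three calls print True, False, and True in that order.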

Prevention & Security

In addition to responding to infrastructure events, we heavily monitor our systems to observe the current state and trigger alarms when situations change. In this way, we often avoid potential issues by catching them before they result in a downtime event.

Our systems also meet or exceed security best practices. They are always locked down to only those who need access, and accessing them requires two-factor authentication with RSA codes that change every 60 seconds. This applies to our systems, our application source code, and our communications platform, all of which use separate services from separate companies.
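
To give a feel for how codes that change on a fixed schedule work, the sketch below implements the openly documented TOTP scheme (RFC 6238) with a 60-second window. This is a stand-in chosen for illustration: RSA tokens use their own algorithm, which is not shown here, and the secret and timestamps below are made-up examples.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step_seconds: int = 60, digits: int = 6) -> str:
    """Generate an RFC 6238 time-based one-time code (HMAC-SHA1 variant)."""
    counter = unix_time // step_seconds  # same value for a whole time window
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, per RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


secret = b"example-shared-secret"  # hypothetical secret, for illustration only

# Two moments inside the same 60-second window yield the same code.
print(totp(secret, 120) == totp(secret, 179))  # True
```

Both sides derive the code from a shared secret and the current time, so a code is only useful for the short window in which it was generated.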

Summary

We hope to maintain and improve our 99.99% average availability, and we hope these recovery methods are never necessary. Since you can never be too careful, we’ll continue to review and update these processes as our application evolves.
