Unsurprisingly, here at Rewind, we've got a lot of data to protect (over 2 petabytes' worth). One of the databases we use is called Elasticsearch (ES, or OpenSearch, as it is currently known in AWS). To put it simply, ES is a document database that facilitates lightning-fast search results. Speed is essential when customers are looking for a particular file or item that they need to restore using Rewind. Every second of downtime counts, so our search results need to be fast, accurate, and reliable.

Another consideration was disaster recovery. As part of our System and Organization Controls Level 2 (SOC2) certification process, we needed to ensure we had a working disaster recovery plan to restore service in the unlikely event that the entire AWS region was down. "An entire AWS region? That will never happen!" (Except for when it did.)

Anything is possible, things go wrong, and in order to meet our SOC2 requirements we needed to have a working solution. Specifically, what we needed was a way to replicate our customers' data securely, efficiently, and in a cost-effective manner to an alternate AWS region. The answer was to do what Rewind does so well: take a backup! Let's dive into how Elasticsearch works, how we used it to securely back up data, and our current disaster recovery process.

Snapshots

First, we'll need a quick vocabulary lesson. Snapshots are stored in a snapshot repository. There are multiple types of snapshot repositories, including one backed by AWS S3. Since S3 has the ability to replicate its contents to a bucket in another region, it was a perfect solution for this particular problem.

AWS ES comes with an automated snapshot repository pre-enabled for you. The repository is configured by default to take hourly snapshots, and you cannot change anything about it. This was a problem for us because we wanted a daily snapshot sent to a repository backed by one of our own S3 buckets, which was configured to replicate its contents to another region.

List of automated snapshots:

GET _cat/snapshots/cs-automated-enc?v&s=id

Our only choice was to create and manage our own snapshot repository and snapshots. Maintaining our own snapshot repository wasn't ideal, and sounded like a lot of unnecessary work. We didn't want to reinvent the wheel, so we searched for an existing tool that would do the heavy lifting for us.

The first tool we tried was Elastic's Snapshot lifecycle management (SLM), a feature which is described as:

"The easiest way to regularly back up a cluster. An SLM policy automatically takes snapshots on a preset schedule. The policy can also delete snapshots based on retention rules you define."

You can even use your own snapshot repository too. However, as soon as we tried to set this up in our domains, it failed. We quickly learned that AWS ES is a modified version of Elastic.co's ES and that SLM was not supported in AWS ES.

The next tool we investigated is called Elasticsearch Curator. It was open-source and maintained by Elastic themselves.

Curator is simply a Python tool that helps you manage your indices and snapshots. It even has helper methods for creating custom snapshot repositories, which was an added bonus. We decided to run Curator as a Lambda function driven by a scheduled EventBridge rule, all packaged in AWS SAM. Here is what the final solution looks like:

Part of SOC2 certification requires that you validate your production database backups for all critical services. Because we like to have some fun, we decided to hold a quarterly "Backup and Restore-a-thon". We would assume the original region was gone and that we had to restore each database from our cross-region replica and validate the contents. One might think "Oh my, that is a lot of unnecessary work!" and you would be half right. In each Restore-a-thon we have uncovered at least one issue with services not having backups enabled, or with the team not knowing how to restore or how to access the restored backup. It is a lot of work, but it is absolutely necessary!

Not to mention the hands-on training and experience team members gain by actually doing the work without the high pressure of a real outage. Like running a fire drill, our quarterly Restore-a-thons help keep our team prepped and ready to handle any emergency. The first ES Restore-a-thon took place months after the feature was complete and deployed in production, so there were many snapshots taken and many old ones deleted. We configured the tool to keep 5 days' worth of snapshots and delete everything else.

Any attempts to restore a replicated snapshot from our repository failed with an unknown error and not much else to go on.
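The retention rule mentioned in this post (keep 5 days' worth of snapshots and delete everything else) can be sketched in a few lines of Python. This is only an illustration of the rule, not Curator's API or our actual Lambda code: the function name, the `(name, start_time)` tuple shape, and the snapshot names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention window taken from the post: keep 5 days' worth of snapshots.
RETENTION_DAYS = 5


def snapshots_to_delete(snapshots, now, retention_days=RETENTION_DAYS):
    """Return the names of snapshots that started before the retention cutoff.

    `snapshots` is a list of (name, start_time) tuples. This shape is an
    assumption made for the sketch, not what Curator or ES returns.
    """
    cutoff = now - timedelta(days=retention_days)
    return [name for name, started in snapshots if started < cutoff]


if __name__ == "__main__":
    # Hypothetical daily snapshots, named by the day they were taken.
    now = datetime(2021, 6, 10, 12, 0, tzinfo=timezone.utc)
    snapshots = [
        ("snapshot-2021-06-09", datetime(2021, 6, 9, 12, 0, tzinfo=timezone.utc)),
        ("snapshot-2021-06-06", datetime(2021, 6, 6, 12, 0, tzinfo=timezone.utc)),
        ("snapshot-2021-06-04", datetime(2021, 6, 4, 12, 0, tzinfo=timezone.utc)),
        ("snapshot-2021-06-01", datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)),
    ]
    print(snapshots_to_delete(snapshots, now))
```

A rule like this deletes old snapshots from the source repository on schedule, so a restore drill held months later will only ever find the most recent few days of snapshots there.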