Backing Up Data Centre Hosted Data To AWS

Currently at work, we back up the majority of our critical on-premises data to Azure, with some local retention onsite, as most restores are for data from within the last week or so. This is done using a combination of Microsoft Azure Backup Server (MABS) and the standalone Microsoft Azure Recovery Services (MARS) agent.

In time, the majority of the on-prem data is likely to move to the cloud, but that takes time, with various products and business functions to migrate, so for now we still have a large data set that we need to back up from our data centres. The cloud all this on-prem data is moving to is, and will continue to be, AWS, but the backups go into Azure, mainly for historical reasons.

We started looking at whether we could move this from Azure to AWS to reduce complexity, simplify billing and potentially reduce cost, because, as I said, a lot of our estate runs in AWS already. We fairly quickly found that the actual “AWS Backup” service doesn’t cover on-premises data directly, and from there things started to get complicated.

So we looked at the various flavours of AWS Storage Gateway: Tape Gateway, File Gateway and Volume Gateway.

Tape Gateway is pretty much ruled out, as we don’t have any enterprise backup software that will write to virtual tape storage, so we would incur additional licensing costs to use it.

File Gateway does most of what we need: we can write backup output from things like MSSQL or MySQL servers running on-prem to the volume the file gateway presents, and have that written back into S3 and backed up from there. However, File Gateway’s bandwidth can’t be throttled, and as we don’t have a Direct Connect available for this, we can’t risk it annihilating our data centre egress bandwidth.
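
As a rough illustration of that flow, here's a minimal Python sketch that dumps MySQL databases to the share the file gateway presents; the /mnt/backup-gateway mount point and the database names are hypothetical, and it assumes mysqldump picks up its credentials from the usual option files:

    import subprocess
    from datetime import datetime, timezone
    from pathlib import Path

    # Hypothetical mount point for the NFS share the file gateway presents;
    # anything written here lands in the backing S3 bucket as an object.
    BACKUP_ROOT = Path("/mnt/backup-gateway/mysql")

    def dump_database(name: str) -> Path:
        """Dump one MySQL database to the file gateway share via mysqldump."""
        BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
        target = BACKUP_ROOT / f"{name}-{stamp}.sql"
        with target.open("wb") as out:
            subprocess.run(["mysqldump", "--single-transaction", name],
                           stdout=out, check=True)
        return target

    if __name__ == "__main__":
        for db in ["orders", "customers"]:  # example database names
            print(f"Wrote {dump_database(db)}")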

Volume Gateway would do what we need in terms of presenting storage to a VM that backups are written to, with upload bandwidth that can be throttled on its way into S3. From there we’d have to pick that data up and move it via AWS Backup into a proper backup with proper retention policies attached. However, as this bills as EBS rather than S3 storage, when we priced it all out it worked out considerably more expensive than our current solution of backing up into Azure, which again pretty much rules it out as an option.
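
For a sense of what “proper retention policies” would mean here, this is a hedged boto3 sketch of the kind of AWS Backup plan that could pick up the gateway's volumes; the plan name, schedule, tag and role ARN are all made up for illustration:

    import boto3

    backup = boto3.client("backup")

    # Create a backup plan: one daily rule with a 35-day retention lifecycle.
    plan = backup.create_backup_plan(
        BackupPlan={
            "BackupPlanName": "volume-gateway-daily",
            "Rules": [
                {
                    "RuleName": "daily-0200-utc",
                    "TargetBackupVaultName": "Default",
                    "ScheduleExpression": "cron(0 2 * * ? *)",
                    "Lifecycle": {"DeleteAfterDays": 35},
                }
            ],
        }
    )

    # Assign the gateway's volumes to the plan by tag.
    backup.create_backup_selection(
        BackupPlanId=plan["BackupPlanId"],
        BackupSelection={
            "SelectionName": "gateway-volumes",
            "IamRoleArn": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole",
            "ListOfTags": [
                {"ConditionType": "STRINGEQUALS",
                 "ConditionKey": "backup", "ConditionValue": "gateway"}
            ],
        },
    )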

Now, if only Amazon would add an on-prem option for AWS Backup, we’d be laughing – oh well, we can dream.

Serving Index Pages From non-root Locations With AWS CloudFront

Note: Adapted from someguyontheinter.net. I grabbed the content from web caches as the site appears to have been taken offline, but I did find it useful, so thought it was worth re-creating.

So, I was doing a quick experiment with hosting this site in static form in AWS S3. Details on how that works are readily available, so I’ll not go into them here. Once you’ve got a static website, it’s not hard to add a CloudFront distribution in front of it for content caching and other CDN features.

Once it’s set up and the DNS entries are in place, the CloudFront distribution will present cached copies of your website in S3, and if you’ve got a flat site structure, such as the example below:

http://website-bucket.s3-website-eu-west-1.amazonaws.com/content.html

this will work fine.

However, if you have data in subfolders, i.e. non-root locations, for example a folder in the bucket called “subfolder”, as in the example here:

http://website-bucket.s3-website-eu-west-1.amazonaws.com/subfolder/

and you want to be able to browse to

https://your-site.tld/subfolder/

and have the server automatically serve out the index page from within this folder, you’ll find you get a 403 error from CloudFront. This problem comes about because S3 doesn’t really have a folder structure, but rather a flat structure of keys and values with lots of cleverness that enables it to simulate a hierarchical folder structure. So your request to CloudFront gets converted into, “hey S3, give me the object whose key is subfolder/”, to which S3 correctly replies, “that doesn’t exist”.
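
You can see the two behaviours for yourself with a couple of plain HTTP requests; a quick sketch using the Python requests library against the example bucket names above:

    import requests

    # The REST endpoint treats "subfolder/" as a literal key lookup; since no
    # object has that exact key, S3 reports a miss (404, or 403 when the
    # caller isn't allowed to list the bucket).
    rest = requests.get(
        "https://website-bucket.s3.eu-west-1.amazonaws.com/subfolder/")
    print(rest.status_code)  # expect 403 or 404

    # The website endpoint applies the index-document transformation and
    # serves subfolder/index.html instead.
    site = requests.get(
        "http://website-bucket.s3-website-eu-west-1.amazonaws.com/subfolder/")
    print(site.status_code)  # expect 200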

When you enable S3’s static website hosting mode, however, some additional transformations are performed on inbound requests. These include translating a request for a “directory” into a request for the default index page inside that “directory”, which is exactly what we want to happen, and it’s the key to the solution.
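
For reference, enabling that mode programmatically is a one-liner; a minimal boto3 sketch, assuming the example bucket name and the usual index.html / error.html document names:

    import boto3

    s3 = boto3.client("s3")

    # Turn on static website hosting for the bucket. The IndexDocument suffix
    # is what lets S3 translate "subfolder/" into "subfolder/index.html".
    s3.put_bucket_website(
        Bucket="website-bucket",
        WebsiteConfiguration={
            "IndexDocument": {"Suffix": "index.html"},
            "ErrorDocument": {"Key": "error.html"},
        },
    )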

In brief: when setting up your CloudFront distribution, don’t set the origin to the name of the S3 bucket; instead, set the origin to the static website endpoint that corresponds to that S3 bucket. Amazon are clear that there is a difference between REST API endpoints and static website endpoints, but their documentation on this only looks at 403 errors coming from the root.

So, assuming you’ve already created the static site in S3 and it can be accessed on the usual http://website-bucket.s3-website-eu-west-1.amazonaws.com URL, it’s example time:

  1. Create a new CloudFront distribution.
  2. When creating the CloudFront distribution, set the origin hostname to the static website endpoint. Do NOT let the AWS console autocomplete an S3 bucket name for you, and do not follow the instructions that say “For example, for an Amazon S3 bucket, type the name in the format bucketname.s3.amazonaws.com”. (A scripted version of this setup follows after the list.)
  3. Also, do not configure a default root object for the CloudFront distribution; we’ll let S3 handle this.
  4. Configure the desired hostname for your site, such as your-site.tld, as an alternate domain name for the CloudFront distribution.
  5. Finish creating the CloudFront distribution; you’ll know you’ve done it correctly if the Origin Type of the origin is listed as “Custom Origin”, not “S3 Origin”.
  6. While the CloudFront distribution is deploying, set up the necessary DNS entries, either directly to the CloudFront distribution in Route 53 or as a CNAME in whatever DNS provider is hosting the zone for your domain.
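
If you'd rather script the above than click through the console, the same setup looks roughly like this in boto3. This is a sketch, not a drop-in: it assumes the example bucket's website endpoint and your-site.tld from above, and the ACM certificate ARN (which CloudFront requires, from us-east-1, when you attach an alternate domain name) is a placeholder:

    import time
    import boto3

    cloudfront = boto3.client("cloudfront")

    WEBSITE_ENDPOINT = "website-bucket.s3-website-eu-west-1.amazonaws.com"

    response = cloudfront.create_distribution(
        DistributionConfig={
            "CallerReference": str(time.time()),  # any unique string
            "Comment": "static site via S3 website endpoint",
            "Enabled": True,
            "Aliases": {"Quantity": 1, "Items": ["your-site.tld"]},
            "Origins": {
                "Quantity": 1,
                "Items": [
                    {
                        "Id": "s3-website-origin",
                        # The website endpoint, NOT the bucket name, so
                        # CloudFront treats this as a custom origin.
                        "DomainName": WEBSITE_ENDPOINT,
                        "CustomOriginConfig": {
                            "HTTPPort": 80,
                            "HTTPSPort": 443,
                            # Website endpoints only speak plain HTTP.
                            "OriginProtocolPolicy": "http-only",
                        },
                    }
                ],
            },
            "DefaultCacheBehavior": {
                "TargetOriginId": "s3-website-origin",
                "ViewerProtocolPolicy": "redirect-to-https",
                "ForwardedValues": {
                    "QueryString": False,
                    "Cookies": {"Forward": "none"},
                },
                "MinTTL": 0,
            },
            # Placeholder ACM certificate ARN covering your-site.tld.
            "ViewerCertificate": {
                "ACMCertificateArn": "arn:aws:acm:us-east-1:123456789012:certificate/example",
                "SSLSupportMethod": "sni-only",
            },
            # Note: no DefaultRootObject is set; S3's website hosting handles
            # index documents, including in subfolders.
        }
    )
    print(response["Distribution"]["DomainName"])  # e.g. dxxxx.cloudfront.net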

Once your distribution is fully deployed and the DNS record has propagated, browse around your site and you should see all of your content, served out from CloudFront. Essentially, CloudFront is acting as a simple caching reverse proxy, and all of the request routing logic is being implemented at S3, so you get the best of both worlds.
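
One quick way to confirm requests really are flowing through CloudFront is to check the response headers it stamps on the way past; a small sketch:

    import requests

    # CloudFront adds an X-Cache header ("Miss from cloudfront" on the first
    # fetch, "Hit from cloudfront" once cached) to responses it serves.
    resp = requests.get("https://your-site.tld/subfolder/")
    print(resp.status_code)             # expect 200
    print(resp.headers.get("X-Cache"))  # e.g. "Hit from cloudfront"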

Note: nothing comes without a cost, and in this case the cost is that you must make all of your content visible to the public Internet, as though you were serving directly from S3, which means it will be possible for others to bypass the CloudFront CDN and pull content directly from S3. So be careful not to put anything in the S3 bucket that you don’t want to publish.

If you need the CloudFront feature that lets you keep your S3 bucket access restricted, with CloudFront as the only point of entry (i.e. an origin access identity), then this method will not work for you.