Four Stages of CloudFormation

“AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.” - AWS CloudFormation Homepage

I’ve gone from never having used AWS CloudFormation to building multi-tier, cross-region deployments spanning many availability zones in a couple of months. While digging through official documentation, support requests, blog posts and sample templates I’ve put together what I’ve come to view as the ‘Four Stages of CloudFormation’. If I’d known about these when I first started I’d have saved myself some time and a few of those semi-precious remaining hairs.

One Template to rule them

It begins innocuously. You decide to use CloudFormation and you start to put your resources into what will become the all-encompassing JSON file of darkness. You add the VPC, a couple of subnets and then you do a test build. It fails. You make the corrections and continue. A couple of times over the day you make enough progress to warrant another test run. Sometimes it fails and rolls back, sometimes it passes. You end the day with enough VPC in place to run an autoscaling web server group and all is well. You tear down all the testing resources and go home.

You start to add the application autoscaling groups and their requirements (security groups, subnets, launch configs etc.) and then you do a test run. You watch your email as tens of SNS notifications come in as the stack builds itself and then… it fails. You start getting the rollback emails. Something went wrong and now you get to see each stage of the build unwind itself. Your testing time has now grown to maybe 30 minutes for a change. Sometimes you get an intermittent failure, like an EIP not getting attached, and a CloudFormation template that normally works folds like cheap paper in the rain. You can lose a day testing a handful of changes this way - especially when you involve RDS. So you decide to grow and change: you decide to have multiple templates.

Little Nightmares

So you look at your architecture diagram (or your wedge of JSON) and start to separate the resources into logical groupings. Basic VPC config, web servers and supporting functionality, RDS and option groups etc. You run the basic VPC template and it goes through quickly and easily. Too easily.

You move further in and run the bastion host template. The reference errors begin. When everything was in a single template, requiring a couple of "Parameters" and using references everywhere inside the template ("VpcId" : { "Ref" : "VPC" }) was easy. Now you have to pass in a parameter for each bit of state you need in this new template: VPC ID, public subnet IDs, NAT route table IDs. Your command lines start getting bigger but you decide the shorter testing cycles and component separation are worth it. Then you discover that SourceSecurityGroupName is a lie across templates and needs to be SourceSecurityGroupId, which you also need to pass in.
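As a rough sketch of what that carried state looks like, here is a second-stage template built as a Python dict. The function name, parameter names (VpcId, BastionSecurityGroupId) and resource name are made up for illustration; only the CloudFormation keys and resource type are real.

```python
import json

def web_tier_template():
    """Return a minimal template that receives its state through parameters."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Hypothetical names: state created by earlier templates
            # now has to be passed in by hand.
            "VpcId": {"Type": "String"},
            "BastionSecurityGroupId": {"Type": "String"},
        },
        "Resources": {
            "WebSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Web tier, SSH allowed from the bastion",
                    # Ref now points at a parameter, not a sibling resource.
                    "VpcId": {"Ref": "VpcId"},
                    "SecurityGroupIngress": [{
                        "IpProtocol": "tcp",
                        "FromPort": 22,
                        "ToPort": 22,
                        # The group ID, not the group name, crosses templates.
                        "SourceSecurityGroupId": {"Ref": "BastionSecurityGroupId"},
                    }],
                },
            },
        },
    }

print(json.dumps(web_tier_template(), indent=2))
```

Every value that used to be a simple Ref to a neighbouring resource becomes another entry under "Parameters", which is where the growing command lines come from.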

Between the carried state and the duplication (writing out one public subnet, NAT route and route association per AZ, then adding them all again with a 2 in the name because templates don’t have iteration) you decide that a little coding will make it all so much better.

The Wrapper

Some people start off with a script to build their infrastructure - often using boto or fog - and gradually add to it over time. Avoiding all the hassles of the built-in CloudFormation types and piles of JSON is an alluring prospect. However this leads to the same kind of problems that Puppet and Chef solved for provisioning shell scripts. Writing idempotent code against a big backend of different APIs is hard. You can end up with masses of exception handling code. It also makes scratching my personal itch a lot harder: I like to generate a view of the impact a changeset will have. That’s quite easy to do if you have CloudFormation as your intermediate format (you’re diffing two JSON files) but quite hard to do consistently well when you’re working directly against the API and making lots of individual calls.
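To show why having templates as the intermediate format makes the impact view easy, here is a minimal sketch that compares the "Resources" sections of two templates. The function name and the sample templates are my own inventions, and it only reports resource-level changes, not what CloudFormation would actually do with them.

```python
def diff_templates(old, new):
    """Summarise which resources a template change adds, removes or modifies."""
    old_res = old.get("Resources", {})
    new_res = new.get("Resources", {})
    return {
        "added": sorted(set(new_res) - set(old_res)),
        "removed": sorted(set(old_res) - set(new_res)),
        "changed": sorted(
            name for name in set(old_res) & set(new_res)
            if old_res[name] != new_res[name]
        ),
    }

# Two made-up template versions, as they would look after json.load().
old = {"Resources": {
    "VPC": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}},
}}
new = {"Resources": {
    "VPC": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.1.0.0/16"}},
    "SubnetA": {"Type": "AWS::EC2::Subnet",
                "Properties": {"VpcId": {"Ref": "VPC"}, "CidrBlock": "10.1.1.0/24"}},
}}

print(diff_templates(old, new))
```

Doing the same thing against live API calls would mean querying the current state of every resource type first, which is exactly the consistency problem described above.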

While I’ve been quite down on a pure script and API call approach, I think using a library and scripting your infrastructure using something that abstracts CloudFormation is the current winning approach for me. When configuring an app that needs to run over three availability zones I can call a method in a loop that generates the pile of CloudFormation boilerplate and keeps the three occurrences of it in perfect sync. I can even do the template upload and stack creation from the script itself and create a set of post actions that turn on name resolution or run basic Nagios checks to confirm the stack works.
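A minimal sketch of that loop, assuming the template already defines resources called VPC and PublicRouteTable elsewhere. The helper names, zone names and CIDR scheme are illustrative only.

```python
import json

def public_subnet_resources(index, az, cidr):
    """Build the subnet and route table association for one availability zone."""
    subnet = "PublicSubnet%d" % index
    return {
        subnet: {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": {"Ref": "VPC"},  # assumed to exist elsewhere
                "AvailabilityZone": az,
                "CidrBlock": cidr,
            },
        },
        subnet + "RouteAssociation": {
            "Type": "AWS::EC2::SubnetRouteTableAssociation",
            "Properties": {
                "SubnetId": {"Ref": subnet},
                "RouteTableId": {"Ref": "PublicRouteTable"},  # assumed to exist
            },
        },
    }

def build_resources(zones):
    """Generate identical boilerplate for every zone so the copies cannot drift."""
    resources = {}
    for index, az in enumerate(zones, start=1):
        resources.update(public_subnet_resources(index, az, "10.0.%d.0/24" % index))
    return resources

resources = build_resources(["eu-west-1a", "eu-west-1b", "eu-west-1c"])
print(json.dumps(resources, indent=2, sort_keys=True))
```

Adding a fourth zone is then a one-word change to the list rather than another hand-copied block with a 4 in the name.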

Compared to where I started this feels like a much better toolchain and removes a lot of the painful scutwork, but it does feel like an intermediate step. Which leads us to…

The Future

So what’s beyond this? I think that the libraries will improve in a couple of different directions. Firstly, something like an ActiveRecord / Clusto style syntax mapping of common stacks could save a lot of time and effort:

application = cf_lib('appname')
application.tier('web', asg=True, own_subnet=True)
application.tier('app', asg=True)

application.connect_sgs(src=application.tier('web'), dst=application.tier('app'), ports=[80, 443])

This tiny chunk of code would hide masses of configuration-by-convention boilerplate: subnets, basic CloudWatch, notification topics, network ACLs etc. It’d also allow easy specification of parameters across templates. Eventually it’d become available as a layer in Visio and then you can ‘draw’ your new applications, diff the JSON, get it back in a nice graphical form and I can put my pretty printing script to sleep.
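To make the idea a little more concrete, here is a toy version of such a library. Nothing here exists: the Application class, its methods and the naming conventions are all hypothetical, it only handles security groups, and it uses src and dst as keyword names because from is a reserved word in Python.

```python
import json

class Application:
    """Toy convention-driven template builder; every name here is hypothetical."""

    def __init__(self, name):
        self.name = name
        self.resources = {}

    def tier(self, tier_name, **options):
        """Create the tier's security group on first use; return its logical name.

        Options such as asg=True are accepted but ignored in this sketch.
        """
        logical = "%s%sSecurityGroup" % (self.name.title(), tier_name.title())
        if logical not in self.resources:
            self.resources[logical] = {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "%s %s tier" % (self.name, tier_name),
                    "VpcId": {"Ref": "VpcId"},  # assumed template parameter
                },
            }
        return logical

    def connect_sgs(self, src, dst, ports):
        """Add one ingress rule per port allowing src to reach dst over TCP."""
        for port in ports:
            self.resources["%sFrom%sPort%d" % (dst, src, port)] = {
                "Type": "AWS::EC2::SecurityGroupIngress",
                "Properties": {
                    "GroupId": {"Ref": dst},
                    "SourceSecurityGroupId": {"Ref": src},
                    "IpProtocol": "tcp",
                    "FromPort": port,
                    "ToPort": port,
                },
            }

application = Application("appname")
application.tier("web", asg=True, own_subnet=True)
application.tier("app", asg=True)
application.connect_sgs(src=application.tier("web"),
                        dst=application.tier("app"), ports=[80, 443])
print(json.dumps({"Resources": application.resources}, indent=2, sort_keys=True))
```

Even this stripped-down version turns four lines of intent into four resources, and the conventions (names, descriptions, one rule per port) live in one place.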

I think the second direction will be to fill in some of the gaps in the CloudFormation functionality. It’s currently impossible to turn on the per-instance public FQDN option for a VPC using CloudFormation. You can build your entire stack but to reach any of the hosts via an FQDN you have to use either the CLI or the web interface. I think library shims for this kind of missing functionality, masquerading as CloudFormation types, could be added and then listed as pre or post run actions or as a normal dependency. Once CloudFormation adds support you can then remove the shim.
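One way such a shim could work, sketched under heavy assumptions: the wrapper strips resources with a made-up "Shim::" type prefix out of the template before upload and runs a handler for each one afterwards. The prefix, type name and handler are all hypothetical, and the handler is a placeholder that returns a message instead of calling any API. A real shim would also need to resolve references from the finished stack rather than take a literal VPC ID.

```python
SHIM_PREFIX = "Shim::"  # hypothetical namespace, not a real CloudFormation prefix

def split_shims(template):
    """Separate shim resources from the ones CloudFormation can build itself."""
    native, shims = {}, {}
    for name, resource in template.get("Resources", {}).items():
        target = shims if resource["Type"].startswith(SHIM_PREFIX) else native
        target[name] = resource
    # Copy the template so the caller's original is left untouched.
    return dict(template, Resources=native), shims

def enable_dns_hostnames(properties):
    """Placeholder handler: a real one would make the API call here."""
    return "would enable DNS hostnames on %s" % properties["VpcId"]

HANDLERS = {"Shim::EC2::VpcDnsHostnames": enable_dns_hostnames}

def run_post_actions(shims):
    """Run each shim's handler, in name order, once the stack has been built."""
    return [HANDLERS[res["Type"]](res["Properties"])
            for _, res in sorted(shims.items())]

template = {"Resources": {
    "VPC": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}},
    # Literal ID used only to keep the sketch self-contained.
    "PublicDns": {"Type": "Shim::EC2::VpcDnsHostnames",
                  "Properties": {"VpcId": "vpc-12345678"}},
}}

cleaned, shims = split_shims(template)
print(sorted(cleaned["Resources"]))
print(run_post_actions(shims))
```

Once CloudFormation supports the feature natively, removing the shim means deleting one handler and changing the resource type in the template.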