Minimizing downtime on Amazon AWS

As we argued in another article, being fast is the secret to scalability. Automation makes you speedy. It helps eliminate as many commodity tasks as possible. So, we decided to share a few tricks for increasing your uptime with automation.

Downtime is bad. It moves your focus away from creating your awesome product to the arduous task of fixing broken things. It is a complete waste of time, whether it is caused by unexpected outages, crashes or your own software update procedure.

Our cloud printing company Peecho runs on Amazon Web Services. Every week, we deploy multiple new versions of our entire system. Still, our Pingdom statistics show 99.96% uptime over the past year. The following write-up by Marcel Panse shows our efforts to minimize downtime with AWS, based on some best practices and an automated deployment procedure for instances within an auto-scaling group.

Lightning zaps your cluster

Divine intervention is hard to predict. Volcanoes erupt, nuclear power plants flood and sometimes even Amazon goes down. Luckily, the AWS infrastructure is split into self-sufficient chunks to make it more robust during catastrophes.

Most apps live in a single AWS region. For example, everything Peecho is hosted in the European AWS region. However, to ensure maximum uptime in case of emergencies, all our applications run in at least two different availability zones. This is called a multi-AZ approach.

Availability zones are like separate hosting co-locations in a single region. If one of those burns to cinders, the others should still work. This means that your multi-AZ app stays available if somebody accidentally trips over a wire – or when lightning strikes.

Cloud servers die, too

Even in the dark dungeons of Amazon, physical servers eventually wear out and die. Such an unfortunate event may cause one of your running instances to get terminated as collateral damage. If you applied the rule of running at least two instances under a load balancer, you are fine. When one instance dies, the other instance will still be accepting connections.

The killed instance can be resurrected manually, but that is hardly scalable and rather annoying if it happens at night.

The good news is that you may be able to continue sleeping after all. This superior peace of mind can be achieved by using the really cool AWS auto-scaling feature to automatically launch a new EC2 instance if the number of active instances drops below a certain number.

You can find everything you need to know about auto-scaling in the excellent book Programming Amazon EC2 by Jurg van Vliet and Flavia Paganelli. It is a must-read before attempting this kind of stuff.

Automated updates with auto-scaling

Catastrophes do not happen that often. In almost all software projects, scheduled outages account for most of the downtime. The popularity of iterative development only makes it worse. In the eyes of your users, force majeure events may let you escape liability – but we find scheduled failures really hard to explain to our marketing department. There is no other option than to automate the deployment procedure.

For starters, you need at least two servers in a load-balanced set-up. This way, you can deploy your new code to server A while server B keeps running – and deploy to server B once A is back and ready with the new version. Again, you can simply use a load balancer to achieve this.

Now, it gets interesting. To leverage the AWS services fully and be as cost-efficient as possible, our cloud system is elastic. That means it scales up and down on demand using auto-scaling, and there is no way to predict the number of active "machines" at any given moment: it could be two, or ten, or a hundred – and then back to two. In the case of our processing engines, we even like to stick to zero as a minimum number, but that is an entirely different story.

This elasticity complicates automated software updates without downtime. However, if you execute the following steps, you can get the update automation up and running.

    • Create a self-provisioning AMI;
    • Create a template;
    • Configure a scaling group;
    • Programmatically achieve remote version listing;
    • Programmatically achieve remote deployment;
    • Secure it.

As an example case, we will take a look at the Printcloud administration interface – the console that controls our system for routing print orders to production facilities. It is a relatively simple web application, running from a standard CentOS Linux AMI and created with Java, Spring, Hibernate and jQuery.

Creating a self-provisioning AMI

The EC2 instances need to be self-provisioning to accomplish automated updates within an auto-scaling group. This means that when an instance gets created by the auto-scaling feature, it automatically updates itself to the correct version. The instance should be able to start, download the latest version, deploy it locally and start everything without user interaction. Therefore, we run an Ant deployment script every time one of our servers starts.

The script needs to know which environment it is supposed to deploy to. In our case, that is a choice between the test, acceptance and production environments. To this end, the script first retrieves an environment property called sys.env. This property is entered in the user-data field when launching a new instance. You can retrieve it from Ant using a URL:

<property url=""></property>

This IP address is an internal reference to the EC2 instance metadata service, which you can use in all your instances to retrieve instance-specific data like instance-id or user-data.

Next up is finding out which version of the software to download and deploy. We store the current version number in an S3 bucket, because it is easily accessible from all servers. Again, the version number in S3 sits in a one-line property file containing something like version.number=123. You can load the properties file from Ant the same way as before, using the property URL tag. Just replace the URL with the complete S3 URL. Take note: the file should be publicly accessible, but keep it read-only.

When you have both the version number and the environment properties, you can start downloading the correct build from your build server – we use Atlassian Bamboo. Another good practice is to create an environment-independent WAR file and create different zip files with environment-specific configurations. This way you don’t have to create multiple WAR files, which are slow to build and take lots of space, too.

In short, this is what the deployment script does.

    • Download latest version and configuration from Bamboo;
    • Stop the Tomcat server;
    • Clean webapps and work directories;
    • Unzip the download in the correct Tomcat folders;
    • Start the Tomcat server.
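
The lookups that feed these steps can be sketched in Java, the language of the application itself. This is only an illustration and not the Ant script described above: the class DeploymentSettings and the archive naming scheme in configArchiveName are hypothetical, and only the sys.env and version.number property names come from the text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Sketch of the two property lookups a start-up deployment script needs. */
final class DeploymentSettings {

    /** Parses property text such as "sys.env=test" or "version.number=123". */
    static Properties parse(String propertyText) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertyText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail on I/O.
            throw new IllegalStateException(e);
        }
        return properties;
    }

    /** Returns a required property, or fails loudly so the instance never deploys a guess. */
    static String require(Properties properties, String key) {
        String value = properties.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }

    /** Hypothetical naming scheme: one shared WAR plus one configuration zip per environment. */
    static String configArchiveName(String environment, String version) {
        return "config-" + environment + "-" + version + ".zip";
    }
}
```

Failing on a missing property is a deliberate choice in this sketch: an instance that cannot tell its environment or version is arguably better off not deploying at all.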



Creating a template

After your server has become self-provisioning, you can create the actual AMI. Be careful: if you create an AMI from the AWS console, it will restart your server. Instead, you could use Elasticfox, a Firefox plugin, to create an AMI from running instances. Use the option to create the AMI without restarting the instance. You could name the resulting AMI something like ‘website-project-2011-08-30’.

When the AMI is ready, you can start creating a template for it. We normally use Ylastic to do this, but note that it is a paid service. It is worth it, though. For the geeks, it is also possible to create a template from the command line, again explained in the book by Jurg and Flavia. When creating the template, you have to specify the user data. Create a template named ‘test-website-project’ and specify the user data – with sys.env=test.

The auto-scaling group

Next, you need to create an auto-scaling group. Again, we use the Ylastic UI to do this, but it can be done from the command line as well. We specified the scaling group with a minimum of 2 servers and a maximum of 2 servers. This means that when a server gets terminated, a new instance will get launched automatically – using the template you created before. If, for some reason, there are too many instances, then one will get terminated. You can do lots of really cool stuff using these auto-scaling rules, like scaling up and down along with traffic or CPU usage across instances under the load balancer.


Potentially, we can now have any number of running instances in a single auto-scaling group. Let’s assume that these instances need to be updated to a next version of the software. Of course, manually deploying the new versions using SSH or even terminating all of the instances and updating them automatically would not suffice. Also, the update procedure should normally take place sequentially, to prevent downtime. We decided to build a system where we could deploy automatically to all instances one by one directly from our admin user interface.

Load balancer setup

Let’s assume that you enter the Printcloud user interface. Because of the load balancer, you normally don’t know exactly which instance you are connected to. So, every instance should be able to perform the update sequence. A single monitoring instance with a master-slave model is inadvisable, because that would introduce a single point of failure.

Luckily, Amazon gives us an API to collect the data we need from anywhere and a cool Java SDK to make it even easier.

Configure the AWS load balancer client as follows.

<bean id="awsCredentials" class="com.amazonaws.auth.BasicAWSCredentials">
  <constructor-arg value="${platform.aws.aws_access_key}" />
  <constructor-arg value="${platform.aws.aws_secret_key}" />
</bean>

<bean id="amazonElasticLoadBalancingClient"
      class="com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancingClient">
  <constructor-arg ref="awsCredentials" />
  <property name="endpoint" value="elasticloadbalancing.eu-west-1.amazonaws.com" />
</bean>

Here is the code to list all instances in a load balancer.

DescribeLoadBalancersRequest describeLoadBalancersRequest =
    new DescribeLoadBalancersRequest().withLoadBalancerNames(loadBalancerName);
DescribeLoadBalancersResult describeLoadBalancers =
    elbClient.describeLoadBalancers(describeLoadBalancersRequest);
LoadBalancerDescription loadBalancerDescription =
    describeLoadBalancers.getLoadBalancerDescriptions().get(0);
List<Instance> instances = loadBalancerDescription.getInstances();

Use the Instance object to fetch the public DNS attribute.

DescribeInstancesRequest describeInstancesRequest =
    new DescribeInstancesRequest().withInstanceIds(instance.getInstanceId());
DescribeInstancesResult describeInstances =
    ec2Client.describeInstances(describeInstancesRequest);
String publicDnsName =
    describeInstances.getReservations().get(0).getInstances().get(0).getPublicDnsName();


Version listing

Basically, the load balancer should supply a list of all instances running behind it. After receiving the public DNS names of the instances, you need to figure out which version each one is running. To make that possible, a globally available version property can be injected during the build phase. To create a list of all instance versions in the load balancer, a URL requesting the version is called on each instance. An example: http://public-DNS-of-instance.com/webapp/version.
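
As a rough sketch of that listing step, the loop below asks each instance for its version and collects the answers. The class VersionListing and the httpGet parameter are hypothetical stand-ins: httpGet represents whatever HTTP client you use, taking a URL and returning the response body. Reporting "unknown" for an unreachable instance is also an assumption of this sketch, not something the original system is known to do.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch: collect the running version of every instance behind the load balancer. */
final class VersionListing {

    /** Builds the version URL for one instance; the path mirrors the example in the text. */
    static String versionUrl(String publicDnsName) {
        return "http://" + publicDnsName + "/webapp/version";
    }

    /**
     * Calls the version URL on each instance and maps public DNS name to version.
     * httpGet is a stand-in for a real HTTP client (URL in, response body out).
     */
    static Map<String, String> listVersions(List<String> publicDnsNames,
                                            Function<String, String> httpGet) {
        Map<String, String> versions = new LinkedHashMap<>();
        for (String dnsName : publicDnsNames) {
            String version;
            try {
                version = httpGet.apply(versionUrl(dnsName)).trim();
            } catch (RuntimeException e) {
                // An instance that is still starting up should not break the whole listing.
                version = "unknown";
            }
            versions.put(dnsName, version);
        }
        return versions;
    }
}
```

Passing the HTTP call in as a function keeps the sketch independent of any particular client library and makes it easy to try out with canned responses.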

The upgrade mechanism

There are two scenarios. The first is when there are no database changes, or all database changes are backwards compatible. In that case, updates will not break any running code. Now, all instances should be updated one by one. This sequence results in zero downtime.

The second scenario involves a database change that would break the currently running code, like a column name change. In this case, all instances should be updated at once to minimize downtime. This will result in about 10 to 30 seconds of downtime, mostly depending on the start-up time of the Tomcat server.

We can trigger the appropriate upgrade scenario directly from our admin interface. The admin interface knows which servers to deploy to, so our code triggers the instances to deploy depending on the scenario you choose. It triggers another instance to deploy by simply calling a URL on the instance, like http://public-DNS-of-instance.com/webapp/upgrade. By the way, we use the excellent Spring RestTemplate to create GET and POST requests on URLs.
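
The two scenarios could be coordinated along the following lines. This is a minimal sketch under assumptions, not the actual Printcloud code: UpgradeCoordinator, triggerUpgrade and isUpgraded are hypothetical names, where triggerUpgrade stands for calling the upgrade URL on an instance and isUpgraded stands for checking its version URL afterwards. Stopping the rolling upgrade when an instance does not come back is a design choice of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Sketch of the two upgrade scenarios: one by one, or all at once. */
final class UpgradeCoordinator {

    /** Scenario 1: upgrade one instance at a time and wait until it is back before moving on. */
    static List<String> upgradeOneByOne(List<String> instances,
                                        Consumer<String> triggerUpgrade,
                                        Predicate<String> isUpgraded,
                                        int maxChecks,
                                        long pauseMillis) {
        List<String> upgraded = new ArrayList<>();
        for (String instance : instances) {
            triggerUpgrade.accept(instance);
            if (!waitUntilUpgraded(instance, isUpgraded, maxChecks, pauseMillis)) {
                // Stop here so the remaining instances keep serving the old version.
                break;
            }
            upgraded.add(instance);
        }
        return upgraded;
    }

    /** Scenario 2: trigger every instance right away and accept a short downtime. */
    static void upgradeAllAtOnce(List<String> instances, Consumer<String> triggerUpgrade) {
        for (String instance : instances) {
            triggerUpgrade.accept(instance);
        }
    }

    /** Polls the instance a limited number of times; returns false if it never reports the upgrade. */
    private static boolean waitUntilUpgraded(String instance, Predicate<String> isUpgraded,
                                             int maxChecks, long pauseMillis) {
        for (int check = 0; check < maxChecks; check++) {
            if (isUpgraded.test(instance)) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```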


Both the version URL and the upgrade URL should be protected. If you don’t do this, evil hackers could start upgrading your servers, and although that seems a nice gesture, we prefer to do that ourselves.

To secure the upgrade features, we added two more parameters to both URLs: a timestamp and a secret. The timestamp is just a simple UNIX-based timestamp. The receiving instance checks that the timestamp is not older than a couple of minutes. This way the URL is only usable for a couple of minutes at most. To make sure nobody can fiddle with the timestamp, we hash the timestamp using a secret key that is only available on the instances themselves. The receiving instance creates a hash of the timestamp it received with its own secret key, and that hash should of course match the secret it received.
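
One way to implement such a check is sketched below. The text does not say which hash is used, so HMAC-SHA256 is an assumption here, as are the class name SignedTimestamp, the hex encoding and the 120-second window standing in for "a couple of minutes".

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Sketch: sign a UNIX timestamp with a shared secret key and verify it on the receiving instance. */
final class SignedTimestamp {

    /** Assumed validity window; the text only says "a couple of minutes". */
    static final long MAX_AGE_SECONDS = 120;

    /** Returns the hex-encoded HMAC-SHA256 of the timestamp (the "secret" URL parameter). */
    static String sign(long unixSeconds, String secretKey) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(Long.toString(unixSeconds).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 is unavailable", e);
        }
    }

    /** Accepts the request only if the timestamp is recent and the signature matches. */
    static boolean isValid(long unixSeconds, String signature, String secretKey, long nowSeconds) {
        // Reject timestamps outside the window in either direction.
        if (Math.abs(nowSeconds - unixSeconds) > MAX_AGE_SECONDS) {
            return false;
        }
        byte[] expected = sign(unixSeconds, secretKey).getBytes(StandardCharsets.UTF_8);
        byte[] received = signature.getBytes(StandardCharsets.UTF_8);
        // Constant-time comparison avoids leaking how many characters matched.
        return MessageDigest.isEqual(expected, received);
    }
}
```

The caller would append the timestamp and the result of sign to the upgrade or version URL, and the receiving instance would pass both values to isValid together with its own key and its current time.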

That’s all, folks

People like shiny stuff with colors and such, so to wrap things up here is a screenshot of our pretty admin interface deployment section.

We wish you eternal uptime.

