The butterfly effect – or how a single font can take you down
In chaos theory, the butterfly effect is the sensitive dependence on initial conditions, where a small change in one place can result in large differences in a later state.

On Thursday, May 29th, Peecho experienced a service degradation. Erratic outages affected most of our services, including the Peecho website, Simple Print Button, Simple Print Service and Simple Print API. Now that we have fully restored functionality to all affected services, we would like to share more details about the events that occurred.

For those of you who don’t know: Peecho connects websites with a distributed network of large print facilities, allowing them to sell digital content as printed objects. For example, you can add the embeddable Simple Print Button to your website and sell your photos as canvas prints, or offer your digital magazine as a hardcover photo book.

We are a global service, running entirely in the cloud of Amazon Web Services. Our systems are built to grow and shrink dynamically, along with the number of orders that are placed. As an AWS Technology Partner, we are expected to know our cloud stuff. Unfortunately, we make mistakes too.
In the server logs we could see an enormously increased number of requests being fired at our server cluster, resulting in longer response times and some downtime. The continuous requests appeared to be coming straight from one of our own CloudFront distributions, which has a custom origin pointing to the Peecho servers. The vast majority of these calls came from the website of one particular, brand-new customer. All of the requests caused the Peecho error page to appear, and they all originated from Internet Explorer.

After frantically eliminating many other scenarios, we finally found the root cause. Unbelievable as it may sound, our first real downtime in three years was caused by a reference to a single font file in one of our stylesheets.
The root cause
The customer added our embeddable Simple Print Button to every page of his website. That’s fine, because that is the way it works. Unlike most of our larger customers, he decided to keep the default styling of the Simple Print Button, rather than applying his own CSS. Normally, that’s fine too. This time, however, it went wrong.

When the default styling of our Simple Print Button is used, it loads a Peecho stylesheet containing a @font-face rule with a small icon font file, served through CloudFront. Coincidentally, the website of the new customer has many (and I mean many) visitors who use Internet Explorer. To support that browser, the default stylesheet contains a font-face hack called “#iefix”. By mistake, the font file extension was missing from this part of the rule, resulting in a redirect to the 404 error page instead of an actual font file. That amounts to one direct server hit for every single button view. Ouch.

The visible user impact of the error is practically zero, because a missing font degrades gracefully. Our servers, however, suffered the consequences. To make things worse, this particular stylesheet was cached pretty aggressively in nodes of the content delivery network across the globe, so browsers everywhere kept requesting the missing file. Since CloudFront does not cache 404 errors for very long, the requests continued to hit our servers.
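To make the mistake concrete, here is a minimal sketch in Python of a check that would have flagged it. The stylesheet in the example is a made-up instance of the common @font-face pattern with the “#iefix” hack, not our actual CSS: the font name, the paths and the function name `urls_without_extension` are all assumptions for illustration.

```python
import re
from posixpath import splitext
from urllib.parse import urlsplit

# Hypothetical stylesheet: the font name and paths are invented for this example.
# The second "src" line is missing ".eot" in front of "?#iefix".
STYLESHEET = """
@font-face {
  font-family: 'button-icons';
  src: url('fonts/button-icons.eot');
  src: url('fonts/button-icons?#iefix') format('embedded-opentype'),
       url('fonts/button-icons.woff') format('woff'),
       url('fonts/button-icons.ttf') format('truetype');
}
"""

# Matches url(...) with optional single or double quotes around the reference.
URL_PATTERN = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


def urls_without_extension(css: str) -> list[str]:
    """Return every url(...) reference whose path has no file extension."""
    missing = []
    for raw in URL_PATTERN.findall(css):
        path = urlsplit(raw).path  # strips the "?query" and "#fragment" parts
        if not splitext(path)[1]:
            missing.append(raw)
    return missing


print(urls_without_extension(STYLESHEET))
# prints ['fonts/button-icons?#iefix']
```

A check like this could run as part of a build or deployment step. It only looks at file extensions, so it would not catch a reference to a file that has an extension but does not exist.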
First, we increased the number of server instances behind our load balancer to buy time. Finally, after examining the log files again and again, we realized that the problem was a mistake of only three characters in the file name of a font. Quickly, we added a CloudFront cache behavior. A cache behavior is the set of rules you configure for a given URL pattern based on file extensions, file names, or any portion of a URL path on your website (e.g., *.jpg). With that rule in place, the broken link pointed to the actual file. Then, we fixed the typo. Problem solved.

Of course, cleaning up the code and rerouting the bad requests are only the first steps. Second, we will have to create a better, lightweight 404 page that does not require so much server activity. Third, we should apply better server-side caching as well. Relying on a content delivery network alone is just not safe enough!
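As an illustration of the second step, a lightweight 404 could look roughly like the sketch below: a tiny WSGI application in Python that returns a few bytes of plain text together with a Cache-Control header, so that a CDN or browser may keep the error for a while instead of asking again. This is not our production code. The names and the five-minute lifetime are assumptions, and how long a CDN actually keeps an error response depends on its own configuration.

```python
# Assumed values for illustration only.
ERROR_TTL_SECONDS = 300
NOT_FOUND_BODY = b"404 Not Found\n"


def not_found_app(environ, start_response):
    """WSGI app that answers every request with a tiny, cacheable 404."""
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(NOT_FOUND_BODY))),
        # Allow shared caches and browsers to reuse the error response.
        ("Cache-Control", f"public, max-age={ERROR_TTL_SECONDS}"),
    ]
    start_response("404 Not Found", headers)
    return [NOT_FOUND_BODY]
```

To try it locally, the application can be passed to `wsgiref.simple_server.make_server` from the Python standard library. The point of the design is that the response needs no templates, database calls or session handling, so each stray request costs almost nothing.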
Careful with your butterflies
We learned a lot today, and we will do our best not to let our customers down again. If you are a fellow developer, we hope that this story will help you prevent this from happening in your own systems.

Be careful with those butterflies and feel free to upvote, like and share this post with your friends and colleagues: we think that we can handle the load!