Cloudinary Blog

A story about production systems, Rails, monitoring and off-hour notifications

Cloudinary's image management service is used by thousands of websites and mobile apps worldwide. For many of our clients, Cloudinary has become a central, mission-critical component used for managing image uploads, transformations and delivery.
 
This is why we've built Cloudinary from the ground up to be a very robust service. We put a lot of emphasis on availability, scalability and support and we take our users' confidence in us extremely seriously.
 
So far, we've been quite satisfied with our ability to keep Cloudinary at an average of > 99.99% uptime.
 
However, on April 4th, the Cloudinary service experienced outages for a few hours. We want to explain what happened, what we concluded and the steps we've taken to make sure this won’t happen again.
 

The upgrade

Cloudinary's core service is built with Ruby on Rails. The service is tested thoroughly and upgrades are handled with the utmost care. This is why we preferred to stay with Rails v3.0 for a long time rather than rock the boat with an upgrade to the latest Rails 3.2.
 
A few weeks ago a security vulnerability was discovered in Rails. As always, we wanted to apply the security fix as soon as possible. However, the Rails team stopped releasing fixes for Rails 3.0. We had to upgrade to v3.2.
 
We upgraded to Rails 3.2 in our lab and modified our code to support it (Rails upgrades tend not to be backward compatible and can break code built with previous versions). We tested our code extensively and verified that our thousands of unit tests passed. We then completed a thorough manual QA of the system in our staging environment. It all went quite smoothly.
 
We scheduled the upgrade for April 4th. As usual, we deployed the system gradually to all of our production servers. Deployment went smoothly as well. We performed additional sanity testing after the system was deployed and closely monitored the system during the working day.
 
We went to sleep happy and relaxed.
 

The issues

At about 1am things started to shake.
 
Apparently, Rails 3.2 changed the default of one simple configuration parameter: response caching is now turned on by default when certain cache headers are returned.
 
As a result, after long hours of serving requests, the local application disk on some of our servers filled up with cached responses. This caused requests that required disk space to fail, depending on the exact request and the size of the response.
 
Annoyingly enough, the automatic monitoring service that regularly verifies our APIs was performing a request that required very little disk space, so it continued to succeed. This service is configured to send notifications to our engineering team's mobile phones during the night. But since no errors were detected, no notification was sent.
 
Fortunately for us, our co-founder's toddler woke him up early in the morning. He naturally (?) checked his inbox and understood that something was very wrong. He quickly cleared the disk space and modified the Rails 3.2 cache settings. The system was fully working again.
 
It's important to note that during these ~5 hours, all existing images and transformed images were delivered successfully to users through our delivery service and tens of thousands of worldwide CDN edges (Akamai + CloudFront). Still, some upload API calls did fail during this time, and we are very sorry for this.
 

Going forward

Naturally, we immediately started to improve our outage prevention mechanisms.
 
We've added disk space tests to our QA list and added abnormal disk usage monitoring to our urgent notification service. We're also adding a wider set of API requests to our automatic service monitoring.
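
To give a feel for what a disk usage check can look like, here is a minimal Ruby sketch. It is not our actual monitoring code: the method names `disk_usage_percent` and `disk_alert?` and the 85% threshold are assumptions made up for this illustration. It parses the POSIX-format output of `df -P` and reports whether usage has crossed the threshold.

```ruby
# Illustrative sketch only; method names and the default threshold are hypothetical.

# Returns the "Capacity" percentage of the first filesystem row in `df -P` output.
# POSIX format: a header line, then one row per filesystem:
#   Filesystem 1024-blocks Used Available Capacity Mounted on
# Assumes the filesystem name contains no spaces.
def disk_usage_percent(df_output)
  rows = df_output.lines.map(&:strip).reject(&:empty?)
  row = rows[1]
  raise ArgumentError, "unexpected df output" if row.nil?

  match = row.split[4].to_s.match(/\A(\d+)%\z/)
  raise ArgumentError, "no capacity column found" if match.nil?

  match[1].to_i
end

# True when usage is at or above the threshold and someone should be notified.
def disk_alert?(df_output, threshold_percent = 85)
  disk_usage_percent(df_output) >= threshold_percent
end
```

A scheduled job could pass in the output of running `df -P` against the application's disk and trigger the urgent notification path whenever `disk_alert?` returns true.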
 
We've integrated with Twilio to enhance our off-hour notifications. Specifically, our engineering team will now receive automatic voice calls to their mobile phones in addition to our previous notification methods.
 
To make sure we keep you in the know during outages, we've raised the priority of building a public status page. This page will include automatic monitoring details as well as human-written notes.
 

Summary & conclusions 

We are happy that Cloudinary has had nearly zero availability issues in almost 2 years of operation. On the other hand, no online service is perfect, and every service has experienced or will experience outages.
 
We will continue to enhance our service with additional image-related features. At the same time, we'll continue to work hard on keeping Cloudinary's uptime as close to 100% as possible.
 
Thank you for trusting us with your images!
