How Performance Feedback can Reduce Testing in Agile Development
Here at New Relic we’re an Agile shop. For the most part we follow scrum, though on occasion we’ve been known to break from the scrum "bible". New Relic is also a Rails shop: every part of our infrastructure is either Ruby or Ruby on Rails. We use RPM to optimize and performance-tune our application, so we are "eating our own dog food". The end result is that RPM enables us to spend more time building high-quality features. We think some of what we’re learning about performance management and Agile development along the way is really quite interesting and may be useful to others.
How our Development Lifecycle is Changing
The two tools that form the pillars of our development process are Pivotal Tracker and RPM. As you might have guessed, with Tracker we manage the development team’s activities and with RPM we manage the health of our production application (which happens to be RPM). Over time, however, a surprising change has occurred in our development lifecycle.
A funny thing happens when you have constant visibility into your application’s health. The team starts to orient itself around that as a primary driver. Every morning, most of us in development, unsolicited, sign into RPM to "see how things are doing". We often find something of interest. A new error, a slow page request, a strange fluctuation in DB response times, a growing heap. Right away these get entered into Tracker and we start working on them. At New Relic, you don’t need permission from anyone to make the site better.
Once we’ve massaged the site into a healthy state, it’s back to implementing our Agile stories. We all dump stories into Tracker and product management sorts them by priority. Then we start banging out the stories. And here’s where it starts to get interesting. Because we have great visibility into the application’s health with RPM, the need for finding problems in a pre-production environment is reduced. WHAT, you ask? What about the mantra that all Ruby code needs complete test coverage?
Stop Testing - What Did You Say?
Well, "complete" coverage is a fallacy anyway. With test coverage, it’s a probability game. At New Relic, we try to analyze the business value of everything we do - from feature building to test writing. For us, simple tests that at least execute the code are mandatory. Tests that probe for edge cases in important code are mandatory. However, stress tests, performance tests, and complex integration tests are not normally done.
Here’s what we’ve discovered - having constant deep visibility into the health of the production application is an acceptable substitute for some testing. It even finds things that you’d never find in a contrived test environment. We’ve seen time and again that many problems only happen in production. Of course, finding a problem in production can be expensive for the business. However, here’s where Agile comes to the rescue. Our team can quickly fix production problems, normally in under 5 minutes. So for us, the cost of letting a problem slip into production is relatively low.
What this means for our business is that because RPM gives us great visibility into the health of our production application, we can spend less time on integration tests, performance tests, stress tests, and edge case testing - while still having a high quality site. We can spend more time building features that deliver value to our customers. Agile is great because it allows teams to react quickly. Normally, we think of this only in the context of building new features. At New Relic, however, it has also allowed us to change our fundamental assumptions about how much time we invest in testing.
Others are Seeing this Trend
I can’t claim all the credit for this change in development process, or more accurately, I can’t claim to have first noticed this. Ward Cunningham of AboutUs.org introduced this notion when he spoke about New Relic at RailsConf 2008. He was talking about performance and stress testing and what he noticed was that his "team stopped trying to performance test new functionality and instead just pushed it to production and watched what happened. If RPM showed a significant problem, then it’s cap deploy:rollback." My first thought was "now, that’s just reckless". Now I realize that no, it’s just plain good business sense. Performance testing is expensive and it never really simulates what’s going to happen in production anyway. And how often is code so bad that you need to take serious action? Once a month? Fine, then roll back and make fixes once a month.
Then, a week later, Ward dropped another bomb. He told me they don’t even have a staging server. If the code passes unit tests, it’s straight to production. And if RPM says there’s a problem, you guessed it, cap deploy:rollback. At New Relic, we’re not quite there yet - we still have a staging environment. But I think Ward is onto something. If failure rates are low, if visibility is immediate, and if the repair cost is low, then why not push the envelope? It’s all about value to the business and while bugs are an expense, new features drive the revenue.
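To make the "push it and watch" decision concrete, here is a minimal sketch of the kind of check a team might apply after a deploy. It does not call RPM; the `HealthSample` struct, the `deploy_verdict` method, and both thresholds are hypothetical names and numbers chosen for illustration, and the samples are made up.

```ruby
# Hypothetical post-deploy check: compare health metrics sampled before and
# after a release and decide whether to roll back.
# Names and thresholds are illustrative assumptions, not RPM's API.

HealthSample = Struct.new(:avg_response_ms, :error_rate)

# Returns :rollback when the new release looks significantly worse than the
# baseline, otherwise :keep.
def deploy_verdict(before, after, max_slowdown: 1.5, max_error_rate: 0.02)
  # Response time grew by more than the allowed factor.
  slower = after.avg_response_ms > before.avg_response_ms * max_slowdown
  # Error rate is above the absolute ceiling.
  erroring = after.error_rate > max_error_rate
  (slower || erroring) ? :rollback : :keep
end

baseline = HealthSample.new(120.0, 0.001)
healthy  = HealthSample.new(130.0, 0.002)
degraded = HealthSample.new(400.0, 0.001)

puts deploy_verdict(baseline, healthy)   # keep
puts deploy_verdict(baseline, degraded)  # rollback
```

When the verdict is `:rollback`, the next step in the workflow Ward describes is `cap deploy:rollback`. In practice the "significant problem" call is made by a person looking at the charts, so a check like this is only a rough stand-in for that judgment.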
To illustrate the kind of visibility we get with RPM, below are some examples of problems we’ve noticed…

Some strange MySQL error we noticed one day. It’s on our backlog to investigate. You need fine-grained detail to dig out SQL errors.

We saw that the combination of a long time window and a lot of database detail caused some of our queries to take way too long. This poor guy had a 150 second query. We ended up restricting what length of time some views could display. RPM let us see the specific combination of queries that needed to be changed and restricted.

One afternoon I saw our DB activity spike. Without a tool like RPM, we might have wasted a bunch of time digging into our code or trying to see if we were under a hack attack. Turns out a background job was consuming too much of the database.
