Monday, July 30, 2007

How To Create a Major Software Outage

In the interests of creating opportunities for career enhancement for mid-level managers in the IT field, I am going to pass on some tips on how to create a major software outage. A major outage, particularly of some core business component, provides many opportunities for career advancement for an enterprising, ambitious manager.

The key, of course, is that the manager avoid having blame assigned to him or her. You must be seen as part of the solution, not part of the problem. In the wake of the outage, management will likely clear away anyone it perceives as a cause of the problem. Since upper management's analysis of most things is superficial, it will sometimes mistake anyone who warned there might be a problem for the cause of the problem. It will likely look to promote anyone who helped solve the problem, even if that person helped to cause it. Your task is to aggravate the problem without seeming to be the cause of it, never to warn there might be a problem, then to step in after the outage occurs and seem to be one who helped to fix it.

A major software outage usually begins with a zealous attitude towards some business objective that requires a major software change. As a manager, your job is to feed the frenzy. If developers tell you the job requires five months, encourage upper management to think it can be done in three. Asking the developers to work extra hours goes without saying. You can also suggest canceling vacations and holding mandatory Saturday afternoon meetings. Turnover is good. It would also be good if you need to bring in contractors to meet the aggressive schedule. Make sure you go low-bid (or even offshore) on the contractors to guarantee poor quality. A consulting company can become a convenient fall guy with a clear path of responsibility for many problems.

If you have an old, poorly designed, and poorly understood system that is essential to the software change, you are ahead of the game. Find ways to burden this system with additional requirements and load. This system probably barely works now, so it will be easy to topple and almost impossible to fix.

If you don't have such a system, then insist that a completely new system be created to support the change. Tell management you will use agile, iterative development to cut the normal development time in half. Make sure this new system has the most complicated architecture you can design. Use terms like "failover" and "horizontal scaling" to justify the architecture, but insist that nothing off-the-shelf will work. This will create the need to write thousands of lines of code in a very short period of time with very little chance to test even a fraction of the paths through the code. The first odd exception will immediately crash the system.

Requirements are extremely important. The more people you can have working on the requirements the better. Ideally people should be working on and changing the requirements right up to launch. It is also useful to have multiple groups working on the requirements to guarantee that some of the requirements are contradictory. Complex and volatile requirements mean that developers will miss most of them or understand them incorrectly.

Provided you are not a part of the QA team, you can use testing to your advantage. You must have testing of the software changes. You should be adamant about it. You don't want anyone to accuse you of installing untested changes into production after the outage. You have already paved the way to slip the bad software past the QA team with the complex requirements. Most likely the QA team will not develop test cases for most of the requirements. Since you are working with a tight time frame, they will have very little time to test and probably no time to retest fixes for any defects that get identified. Don't mention "load test". If someone brings it up, agree that it would be a good idea and create a list of new hardware and software licenses that will need to be purchased to carry out a "real" load test. Upper management will swiftly kill the idea or the purchasing will get caught up in the bureaucracy. When the day of the launch comes, you can reassure everyone having any doubts about the software changes that everything has been tested.

In deployment planning, make certain to involve as many teams as possible - the more the merrier when all hell breaks loose. This will help to provide many false leads as to the source of the problems after deployment. Each team will look to its own area for problems. System administrators will find obscure operating system bugs. Network engineers will find bottlenecks in routers and lines. Database administrators will find poorly optimized queries. It's not a problem that some of these things may be tied to the software changes. Most of them will be red herrings that many people can spend hours "fixing".

Deployment should be complicated. Rollback also should be complicated or, better yet, impossible. Make certain to set up multiple war rooms for the entertainment of upper IT managers, who have little idea what is changing. These are the people who ultimately will be held accountable, and consequently they will be the ones who jump in to make matters worse when the problems start.

When the day of the launch finally arrives, feel good. This is your time to shine. You must use every opportunity to sow confusion and obfuscation.

Try to find as many problems as possible. Upper management will appreciate your vigilance. If the problems are minor or not even problems, that's good. Undoubtedly management will dispatch teams to deal with them anyway. Hours will pass while the teams investigate the "problems".

Better yet, push for immediate solutions. Undoubtedly the "solutions" will involve the teams that may be most knowledgeable about the problems. Diverting them away from understanding the problems will assure that many more hours pass before a solution is found. As the teams work away for hours at the "solutions", they will tire and be unable to analyze the problems when they finally get back to them.

Whatever idea upper management comes up with, no matter how asinine, immediately agree with it and jump right into assisting with it. Assisting with it, of course, mainly means badgering the tired developers into implementing it. If the idea is good, you will be on the spot when the solution is reached. If it is bad, you can later tell upper management that you were convinced that the idea was good too and that will make them feel good about your judgment.

If upper management has a problem coming up with ideas, here are a few to suggest. Rewriting a major portion of the system in a few hours is always a good idea. You can test in production. Bringing up a new server is also good to suggest. That there are probably some corrupted settings on the old server will be a good argument for management. Building out a new server will keep the system administrators and hardware people busy for hours. Since they probably don't have well-documented procedures for doing this, there will be numerous failed efforts to bring the system up on the new server. Be sure to document these failures and bring them up in the post mortem. Suggesting that a team that has no understanding of the software causing the problem be sent to "help" the team having the problem is also good. Not only will this demoralize the team with the problem, it will also assure that many more hours pass while the assisting team struggles to understand the software.

Eventually the time will arrive when either the bad software is backed out, shut down, or made to work in some fashion. It may be costing the company millions a day but the company will push ahead with it anyway. This isn't the time for you to ease up. Carefully now record all of the failings of the various teams and fire off an email suggesting “steps for improvement”.

Now sit back and wait for the fun to begin. Within a month or so, you will likely find yourself in a new job at a higher salary.

Note: This little posting was inspired by Roedy Green's essay “How to Write Unmaintainable Code”.






© 2007 jimcross.com LLC