Much has been written (examples here and here) about the horrendous launch of Cuil.com this week. Despite extensive PR work that succeeded in getting people to the site, the user experience was so bad that it ensured none would return. Visitors ran into error messages, including one saying “too many people are currently searching,” and despite claiming to have an index larger than Google’s, the site could not find this humble blog, among other missing sites.
We’ve encountered many types of failures to launch and have tried to help others avoid them.
The most common is a site’s inability to handle traffic spikes. Sometimes a PR success can lead to availability nightmares.
When the much-anticipated Britannica.com launched in 1999, demand was so high that the hardware could not support it, and the site had to shut down for about 10 days before relaunching. When Burundis.com, a Mexican greeting card site, launched, it handled traffic well, but the team did not anticipate that on Valentine’s Day traffic would spike tenfold and bring down the system.
What can be done to ensure a successful launch?
- Sizing. Estimate the potential traffic and translate it into bandwidth, CPU, and memory utilization. The key to sizing is to correctly identify the resource utilization of a user session and how it scales. The best metric to use is concurrent sessions: a server’s ability to handle X concurrent sessions is the key to determining both individual server specs and the total number of servers required.
- Finding the bottleneck. What chokes first as the load increases? Do calls to the database start to queue up? Finding and addressing bottlenecks is the first method for optimizing performance.
- Software vs. Hardware. It is always surprising to see how little time and thought are put into performance planning up front on the software side. Because hardware provides amazing power at a reasonable cost, many software developers have forgotten the art of coding for performance. It is very common, in the late stages of performance testing and optimization, for a lot of code to be rewritten to optimize caching, reduce calls to the database, and move load into compiled objects.
- Planning for the spikes. Building hardware infrastructure that can withstand maximum spikes can be expensive. Now, with Amazon’s computing services and other providers, a virtual server can be up and running in 7 minutes, and you pay only for the resources you use.
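To make the sizing step concrete, here is a minimal sketch of the arithmetic. It assumes concurrent sessions can be estimated with Little's Law (arrival rate times average session length) and that load testing has produced a sessions-per-server figure. The function names, the 30% headroom, the single spare server, and all traffic numbers are hypothetical and only illustrate the calculation.

```python
import math


def concurrent_sessions(visits_per_hour: float, avg_session_minutes: float) -> float:
    """Little's Law: sessions in flight = arrival rate * average session duration."""
    visits_per_minute = visits_per_hour / 60.0
    return visits_per_minute * avg_session_minutes


def servers_required(peak_sessions: float, sessions_per_server: int,
                     headroom: float = 0.3, spares: int = 1) -> int:
    """Servers needed so each runs at (1 - headroom) of its measured capacity, plus spares."""
    if sessions_per_server <= 0:
        raise ValueError("sessions_per_server must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_per_server = sessions_per_server * (1.0 - headroom)
    return math.ceil(peak_sessions / usable_per_server) + spares


# Hypothetical inputs: 12,000 visits per hour, 5-minute sessions,
# and 400 concurrent sessions per server measured in load tests.
normal_load = concurrent_sessions(visits_per_hour=12_000, avg_session_minutes=5)
spike_load = normal_load * 10  # a tenfold spike, as in the greeting card example

print("normal day:", servers_required(normal_load, sessions_per_server=400))
print("10x spike: ", servers_required(spike_load, sessions_per_server=400))
```

The gap between the two results is the capacity that would sit idle most of the year if it were bought outright, which is the case for renting it on demand.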
Other methods that are often recommended to mitigate the risk of a failed launch include:
- Soft launch or a limited release. Build traffic gradually rather than in a big bang. Monitor behavior and optimize before the press release goes out.
- Monitoring and contingency plans. The amount of traffic to expect is often unknown. Monitoring tools that track actual performance and resource utilization, and that are watched constantly, are a must. A contingency plan has to be in place. From Amazon’s flexible computing to shutting down a resource-intensive feature, these plans need to keep you alive until adjustments in software or hardware can be made.
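One way a "shut down a resource-intensive feature" contingency could be wired to monitoring is sketched below. This is an illustration under assumptions, not a prescribed design: the `Feature` class, the `apply_contingency` function, the relative cost numbers, and the 85% / 60% thresholds are all hypothetical. Each call reacts to one utilization reading by switching off the most expensive non-essential feature under heavy load, or restoring the cheapest disabled one once load has dropped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    name: str
    cost: int                # relative resource cost; higher means more expensive
    essential: bool = False  # essential features are never switched off
    enabled: bool = True


def apply_contingency(features: List[Feature], utilization: float,
                      shed_above: float = 0.85,
                      restore_below: float = 0.60) -> Optional[str]:
    """Toggle at most one feature per reading; return its name, or None if unchanged."""
    if utilization > shed_above:
        candidates = [f for f in features if f.enabled and not f.essential]
        if candidates:
            victim = max(candidates, key=lambda f: f.cost)
            victim.enabled = False
            return victim.name
    elif utilization < restore_below:
        candidates = [f for f in features if not f.enabled]
        if candidates:
            restored = min(candidates, key=lambda f: f.cost)
            restored.enabled = True
            return restored.name
    return None  # between the thresholds, or nothing left to toggle


# Hypothetical feature list and utilization readings from a monitoring tool.
site = [
    Feature("search", cost=5, essential=True),
    Feature("recommendations", cost=8),
    Feature("live_preview", cost=3),
]
for reading in (0.92, 0.90, 0.70, 0.50):
    changed = apply_contingency(site, reading)
    print(f"utilization={reading:.2f} toggled={changed}")
```

The gap between the two thresholds is deliberate: without it, a feature could flap on and off while utilization hovers around a single cutoff.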
Finally, a few lessons from the trenches:
- Be absolutely, positively sure that your development and staging environments are identical to your production environment. When a large printer manufacturer launched their revamped dealer extranet site a few years back, it suffered from horrible performance problems in production that did not show up in the performance tests done in staging. The system administrators swore on their lives that the environments were identical. IBM consulting had to be brought in to investigate and, after 2 weeks of testing, found that a faulty DB2 patch had been applied to the production database. Once the correct patch was applied, the problems disappeared.
- Have software developers on site during the first 24 hours of the launch. It is easy to delegate monitoring to the NOC and let the developers, who may have worked 72 hours straight to get the launch done on time, go home to sleep. A fresh and ready senior member of the development team has to be there to react if anything goes wrong.
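On the first lesson, "identical" is easier to verify than to swear to. The sketch below assumes each environment can export an inventory of component versions and patch levels as a simple mapping. The `diff_environments` function, the component names, and the version labels are hypothetical placeholders, not the actual values from the extranet story.

```python
def diff_environments(reference: dict, candidate: dict) -> dict:
    """Return {component: (reference_version, candidate_version)} for every mismatch.

    A component missing from one side shows up with None on that side.
    """
    mismatches = {}
    for component in sorted(set(reference) | set(candidate)):
        ref_version = reference.get(component)
        cand_version = candidate.get(component)
        if ref_version != cand_version:
            mismatches[component] = (ref_version, cand_version)
    return mismatches


# Hypothetical inventories; in practice these would be collected from each environment.
staging = {"os": "build-41", "app_server": "6.0.2", "db2_patch": "patch-A"}
production = {"os": "build-41", "app_server": "6.0.2", "db2_patch": "patch-B"}

for component, (expected, actual) in diff_environments(staging, production).items():
    print(f"{component}: staging={expected} production={actual}")
```

Running a comparison like this before every launch turns a two-week investigation into a one-line report, provided the inventories include patch levels and not just major versions.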
If you have any other tips to share, we’d love to hear them.