There is nothing like having your application in production. You and your users will find bugs, surprising behavior, and new challenges in ways that simply don't happen in any other environment. Consider the initial period of time after launch — hours or days depending on when demand stabilizes — as sacrosanct, an opportunity to focus on operations, system availability, and providing reliable service to your users.
Ad Hoc has been a part of many high-profile launches in government, from helping rescue the troubled early launch of HealthCare.gov, to subsequent, increasingly smooth open enrollment periods for that same website, to the transformation of VA.gov, among others. We’ve seen first-hand the stakes of getting launch wrong, and the value in doing it right.
In order to properly prepare for launch, take the following hard-won experiential advice into account.
Adopt the appropriate posture
The most important thing you can do to prepare for launch is first to get into the right frame of mind. This means having a sense of urgency around the tasks that need to be done. It means being paranoid about your system, and not taking anything for granted. You should want to have things proven to you — that tasks are done, that things work or perform the way they are expected to — rather than assume them. You should hone your judgment, and be able to quickly prioritize issues and tasks that will have the biggest impact on the system.
Relatedly, you should shift your engineering focus from feature development to bug fixes, performance improvements, and system stability. You should consider your application largely feature-complete at this point, at least for the purposes of launch. Rapid churn on new functionality is destabilizing. Regardless of how important you think your new feature is, if it introduces bugs (which it likely will) or causes a performance regression or outage, its benefits will be swamped by the operational costs it incurs. New features can always be rolled out later.
Model load test on expected demand
Ideally, your load test would perfectly resemble the shape of the traffic your app will receive in production, but we can make informed guesses to craft load tests that will provide useful information. Use logging and monitoring in lower environments to determine frequently accessed and intensive resources. Generate request payloads for the load test that model expected user inputs. Let the load test run long enough to eliminate any cache warm-up that may skew results. Load test against an environment that reasonably looks like your production environment — if you’re pre-launch, you can load test against production itself. The load test results should inform what you prioritize in terms of code optimizations, database indexes, CDN caching, additional server capacity, or other mitigations. Re-run the load test after any significant changes. Ideally, the load test should run from outside the perimeter of your system, so that it fully tests your entire stack, from CDN on down. But an inside-the-perimeter load test can have value, too, for example, by concentrating on components such as app servers and databases.
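One way to turn access logs from a lower environment into a load test plan is to weight each route by how often it was actually requested. The sketch below is illustrative only: the route names and the `build_request_mix` and `sample_requests` helpers are hypothetical, and a real plan would also need realistic payloads per route.

```python
import random
from collections import Counter


def build_request_mix(logged_paths):
    """Turn a list of logged request paths into per-path traffic shares."""
    counts = Counter(logged_paths)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}


def sample_requests(mix, n, seed=0):
    """Draw n paths at random, weighted by their observed share of traffic."""
    rng = random.Random(seed)  # fixed seed so a test run can be repeated
    paths = list(mix)
    weights = [mix[p] for p in paths]
    return rng.choices(paths, weights=weights, k=n)


# Hypothetical paths pulled from a lower environment's access log.
logged = ["/", "/", "/", "/login", "/login", "/plans/search"]
mix = build_request_mix(logged)
plan = sample_requests(mix, 1000)
```

The resulting `plan` can be fed to whatever load generation tool you already use; the point is only that the mix of requests comes from observed behavior rather than a guess.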
Have target system serving goals
In the absence of more specific system metrics, a typical user-facing transactional web service should average no more than 250ms of latency (measured from when a request enters the perimeter of the service to when the response is fully rendered, i.e., not including time on the wire to the user), and the 99th percentile of requests should respond in no more than 1 second. Long-tail request latency matters because it can hide problems in your software design or architecture that could swamp the site under heavier load. These goals should be published so everyone on the team understands what is expected of the system. Enable alerting in monitoring tools like New Relic for when these thresholds are exceeded. Reverse-sort the worst offending routes and drill into what’s making them slow.
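As a rough illustration of checking those goals, the sketch below computes the mean and 99th percentile latency per route and sorts the worst routes to the top. The sample data, route names, and the `summarize` helper are made up for the example; in practice your monitoring tool would supply these numbers.

```python
import math
from statistics import mean

MEAN_GOAL_MS = 250   # average latency goal from the text
P99_GOAL_MS = 1000   # 99th percentile goal from the text


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_by_route):
    """Return one row per route, slowest 99th percentile first."""
    rows = []
    for route, samples in latencies_by_route.items():
        avg = mean(samples)
        p99 = percentile(samples, 99)
        rows.append({
            "route": route,
            "mean_ms": avg,
            "p99_ms": p99,
            "ok": avg <= MEAN_GOAL_MS and p99 <= P99_GOAL_MS,
        })
    return sorted(rows, key=lambda r: r["p99_ms"], reverse=True)


# Hypothetical latency samples in milliseconds.
sample = {
    "/": [40, 55, 60, 80, 120],
    "/plans/search": [200, 300, 450, 900, 1800],
}
report = summarize(sample)
```

Here `/plans/search` lands at the top of `report` and fails both goals, which is the route you would drill into first.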
Prepare a public communications plan
If something goes wrong and the system is down or degraded, your users will want to know what’s going on. You need a means by which to communicate system status, and what you’re working on to fix it. At the upper end of sophistication, this can be a public status page. But any channel to your users can work. Preparing this in advance will make it easier to roll out in the midst of an outage, especially if any public communications require approval from senior stakeholders.
Prioritize ruthlessly
It’s in your team’s interest to be ruthless when reviewing outstanding todos. Meet regularly (daily) to go over the list, reverse-sort them in order of potential impact to the system, and focus on the top of the list. It can be tempting to fix things for various motivations, and we’re all susceptible to shiny-thing syndrome, but some things just won’t make a significant measurable change, especially compared to more glaring issues. It’s your job to both find the big-impact items and keep the smaller-bore stuff from crowding out attention on the former.
Do not fling open the doors all at once
Related to “there’s nothing like production,” having live traffic for the first time may stress parts of the system you didn’t anticipate. If you meter in traffic slowly, you might catch an unindexed query that’s causing your database’s CPU usage to grow non-linearly before the database falls over. Ramp up to 100% over several days. Coordinate with your communications team on a soft launch strategy.
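One common way to meter traffic is to hash a stable identifier into a bucket from 0 to 99 and admit only the buckets below the current rollout percentage. This is a sketch of that general technique, not a description of any particular launch; the `RAMP_SCHEDULE` values and function names are assumptions for illustration.

```python
import hashlib

# Hypothetical schedule: percentage of users admitted on each day of the ramp.
RAMP_SCHEDULE = [5, 25, 50, 100]


def bucket(user_id: str) -> int:
    """Map an identifier to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def admitted(user_id: str, percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id) < percent


# A user admitted at one step stays admitted at every later, larger step.
user = "user-12345"
history = [admitted(user, pct) for pct in RAMP_SCHEDULE]
```

Because the bucket depends only on the identifier, a user who gets in at 5% is not bounced back out when you move to 25%, which keeps the experience consistent during the ramp.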
Focus on availability, correctness, performance, and functionality, in that order
Everyone’s priority should be to keep the site up and serving; that’s the whole reason for being. This should help focus you during an outage — it might be tempting to fix the bug, but your only job in that moment is to put the site back in service. After this, bugs that could corrupt data or serve inaccurate or incorrect results can clearly cause harm and should be escalated. A slow site is annoying and frustrating to users and could indicate deeper capacity problems, so that should be addressed next. Finally, new functionality, regardless of how mission-critical, is always nice-to-have relative to these other issues. Think of it like a layer cake that builds on the layers below — functionality builds on performance/capacity, which builds on correctness, which builds on availability. Deploying a feature when the underlying layers are not stable will exacerbate the problem.
Cache as much as possible
Any resource that can be generated once and served many times to multiple users should be cached. This includes HTML response bodies, static assets, API responses that don’t vary by whether a user is logged in or not, and pages constructed with complex queries (although unbounded database queries should be aggressively eliminated through use of indexes and rewriting). Use a CDN for all static assets. Pre-compute as much as possible and store it in an object store like S3 or NetStorage. Focus on pages that are at the top of the user flow funnel: landing pages, sign-up and login, pure static informative content, and form entry pages.
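To make the distinction concrete, here is a minimal sketch of choosing a `Cache-Control` header by resource type. The `/static/` prefix, the specific lifetimes, and the `cache_control` helper are assumptions for the example; appropriate values depend on how often your content changes and how your CDN is configured.

```python
def cache_control(path: str, varies_by_user: bool) -> str:
    """Pick a Cache-Control header value for a response (illustrative policy)."""
    if path.startswith("/static/"):
        # Assumes static assets have versioned file names, so a long lifetime is safe.
        return "public, max-age=31536000, immutable"
    if varies_by_user:
        # Per-user responses must not be stored by a shared cache such as a CDN.
        return "private, no-store"
    # Shared pages: a short lifetime still absorbs most repeat traffic.
    return "public, max-age=60"


headers = {
    "/static/app.css": cache_control("/static/app.css", varies_by_user=False),
    "/": cache_control("/", varies_by_user=False),
    "/account": cache_control("/account", varies_by_user=True),
}
```

Even a one-minute lifetime on a landing page means the origin sees roughly one request per minute per edge cache for that page, rather than one per visitor.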
Be ready to add serving capacity
If you’ve load tested properly, you should have a rough idea of the relationship between request volume and your main fixed server resources: CPU, memory, and network bandwidth. Be prepared to bring additional capacity online as soon as you see thresholds exceeded (60% CPU is a good rule of thumb). For resources like databases, think through what it would take to bring on additional read or write capacity. This may require redesigning how the app talks to the database. If you’re using auto-scaling, double-check the triggering thresholds.
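The arithmetic behind that rule of thumb can be sketched as follows, assuming CPU usage scales roughly linearly with request volume over the range you tested (an assumption worth verifying against your own load test results). The function name and the example numbers are hypothetical.

```python
import math


def servers_needed(peak_rps, measured_rps, measured_cpu_pct, cpu_target_pct=60.0):
    """Estimate servers required to keep each one at or under the CPU target.

    measured_rps and measured_cpu_pct come from a load test of a single server.
    Assumes CPU grows linearly with requests per second.
    """
    rps_at_target = measured_rps * cpu_target_pct / measured_cpu_pct
    return math.ceil(peak_rps / rps_at_target)


# Hypothetical numbers: one server handled 200 requests/sec at 40% CPU,
# and you expect a peak of 2000 requests/sec at launch.
estimate = servers_needed(peak_rps=2000, measured_rps=200, measured_cpu_pct=40)
```

With these numbers a single server reaches 60% CPU at about 300 requests per second, so the estimate comes out to 7 servers. Treat the result as a floor and leave headroom for the parts of the system that do not scale linearly.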
Over-communicate
There’s no such thing as being too chatty when it comes to launch prep. As the moment approaches, everyone will be increasingly focused on their tasks, so you can’t assume others know what you know. Surface important findings immediately. If the signal-to-noise ratio starts to degrade, find alternate channels, but the worst thing is to see something important and not say something. If you’re responding to an issue that surfaces during launch, narrate what you are changing in channels visible to your team and other stakeholders. This will help get additional eyes on the issue and can serve as a form of documentation for future reference.
Use a checklist
Finally, when you’re ready to launch, don’t leave anything to chance. Write down exactly the tasks that need to be performed or double-checked for completion, by whom, and by when. Publish a staffing schedule for the days around launch, so the team knows what is expected of them. Assign someone to keep track and follow up on completion.
Launches can be unpredictable, and there is an element of the unknown that is unavoidable and impossible to entirely eliminate. However, by having a plan for launch and preparing in the broad ways outlined above, you can focus on that which you can control, and set yourself up to recover more quickly should things go wrong.
Pre-launch checklist template appropriate to almost all launches:
- Database indexes created
(i.e. no unindexed columns queried)
- CDN caches populated
(static assets, other pre-computed resources)
- Systems monitoring in place and made available to all
(e.g. New Relic, CloudWatch)
- Business metrics analytics in place and made available to all
(e.g. Tealium, Google Analytics)
- Staffing plan in place for day -1, day 0, day +1, day +2
- Server hosting support staff alerted
(i.e. give technical account managers a heads-up you're having a major event)
- Switchover/DNS redirects in place and tested
(i.e. if you need to repoint what the root of the domain name references. Don't forget about DNS TTLs.)
- Coordinated with other stakeholders' teams
(i.e. any dependencies on third parties need to be explicitly spelled out. Assign key personnel to be the go-between.)