Solution Architecture Best Practice: Using System Availability and Recovery Metrics

Before embarking on an IT project that introduces a new software package or expands an existing one, business leaders need to know the impact of such an initiative on revenues, labor costs, and capital budgets. A solution architecture design document (aka SAD) can help as long as it is part of an overall business impact or disaster recovery planning process. When drafting a solution architecture design document, metrics such as system availability, recovery time objective (RTO) and recovery point objective (RPO) help determine the runtime characteristics the business wants to achieve. Non-technical business leaders and subject matter experts may not necessarily care about “the nines” (99.999% availability, for instance), but they do care about the revenue the company loses per hour, minute and second when the system (hardware and software as a whole) is offline, and about the labor costs of workers standing idle or having to resort to manual business process steps. Conversely, IT operations team members don’t necessarily care about these costs, but care more about the nines. For many, though, arriving at the right set of nines to assign to an IT project that introduces or expands a system is not exactly straightforward.

I’m offering an approach to help you assign a set of nines to your system availability objective. By “system,” I am referring to the combination of hardware and software. The following table provides industry-standard mappings of “nines” to the acceptable downtime for each availability level over a one-year period.

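Availability            Acceptable downtime per year (approximate)
90% (one nine)          40 days
99% (two nines)         4 days
99.9% (three nines)     9 hours
99.99% (four nines)     50 minutes
99.999% (five nines)    5 minutes
99.9999% (six nines)    30 seconds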
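The downtime figures in the table are rounded values that follow from simple arithmetic: the allowed downtime is the fraction of the year the system may be unavailable. The sketch below shows that calculation. It assumes a 365-day year, and the function name `downtime_seconds_per_year` is an illustrative choice rather than anything from a standard library.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the allowed downtime, in seconds per year, for an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        seconds = downtime_seconds_per_year(pct)
        print(f"{pct}% -> {seconds / 3600:.2f} hours ({seconds:.0f} seconds) per year")
```

Running it lets you compare the exact values against the rounded ones in the table, or compute the downtime for an availability target that the table does not list.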

How do we know which set of nines is applicable? It depends on the business subject matter experts (SMEs), who in turn may rely on the operations team to supply data. But where neither the business SMEs nor the operations team have such numbers, a good rule of thumb is to first have the business SMEs tally the Line of Business (LOB) revenue per hour, minute or second of any business process that would be impacted if the system in question went down. Have them do the same for labor costs per hour, minute and second. Don’t worry about downtimes just yet; we only want to know how much money the business process generates per hour/min/sec, and then the labor cost (or overhead costs, operating costs) per hour/min/sec.
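As a minimal sketch of that tally, the snippet below breaks an annual figure down into per-hour, per-minute and per-second rates. The function name `rate_breakdown`, the default of 8,760 operating hours per year (a process that runs around the clock), and the sample amounts are all assumptions made for illustration. Substitute the hours your business process actually operates.

```python
def rate_breakdown(annual_amount: float, operating_hours_per_year: float = 8760.0) -> dict:
    """Split an annual revenue or labor-cost figure into hourly, minute and second rates."""
    if operating_hours_per_year <= 0:
        raise ValueError("operating_hours_per_year must be positive")
    per_hour = annual_amount / operating_hours_per_year
    return {
        "per_hour": per_hour,
        "per_minute": per_hour / 60,
        "per_second": per_hour / 3600,
    }


if __name__ == "__main__":
    # Hypothetical figures for one business process.
    revenue = rate_breakdown(8_760_000)
    labor = rate_breakdown(2_080_000, operating_hours_per_year=2080)
    print("Revenue:", revenue)
    print("Labor cost:", labor)
```

The same function serves both tallies: call it once with the revenue figure and once with the labor (or overhead) cost figure.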

Next, identify the cost of maintaining each of the sets of nines (the greater the number of nines, the greater the maintenance cost).

Finally, if the loss of revenues per hour/min/sec noticeably exceeds the cost of maintaining the desired nines, then it might be advisable to absorb the maintenance costs. In the absence of revenue figures, the project’s maintenance budget can be used instead, but with caution: the budget is almost always smaller than the company’s revenues for the impacted business process, so it may not reflect the revenue lost when a system goes down.

Labor costs should be used in a separate metric to identify the amount of money a company pays its employees when the system is unavailable. To recap, we have three system availability decision metrics to use from a business standpoint to help us arrive at a decision on which of the nines to choose:

Availability Decision Per Revenue
  1. Tally Revenue generated per hour, min, seconds
  2. Identify the cost of maintaining each of the sets of nines
  3. Availability Decision Ratio (ADR) = Revenues (R) / Cost of Nines (CoN), where a number greater than 1 indicates that the chosen set of nines is doable


Availability Decision Per Labor Costs
  Similar to Availability Decision Per Revenue above, except you use Labor Costs (LC) instead of Revenues

Availability Decision Per Maintenance Budget
  Similar to Availability Decision Per Revenue above, except you use Maintenance Budget (MB) instead of Revenues
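
The three metrics above share one formula, so they can be sketched with a single function. This is only an illustration: the function names and all dollar figures are hypothetical, and it assumes the numerator (R, LC or MB) and the Cost of Nines are expressed in the same currency over the same period.

```python
def availability_decision_ratio(amount: float, cost_of_nines: float) -> float:
    """ADR = amount / CoN, where amount is Revenues (R), Labor Costs (LC) or Maintenance Budget (MB)."""
    if cost_of_nines <= 0:
        raise ValueError("cost_of_nines must be positive")
    return amount / cost_of_nines


def is_doable(amount: float, cost_of_nines: float) -> bool:
    """A ratio greater than 1 indicates the chosen set of nines is doable."""
    return availability_decision_ratio(amount, cost_of_nines) > 1


if __name__ == "__main__":
    # Hypothetical figures: revenue for the period and the cost of each set of nines.
    revenue = 500_000.0
    cost_by_nines = {"99.9%": 100_000.0, "99.99%": 250_000.0, "99.999%": 750_000.0}
    for nines, cost in cost_by_nines.items():
        adr = availability_decision_ratio(revenue, cost)
        print(f"{nines}: ADR = {adr:.2f}, doable = {is_doable(revenue, cost)}")
```

Passing labor costs or the maintenance budget as `amount` gives the other two decision metrics.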

Regarding recovery metrics, an article on Wikipedia does a great job of explaining them. I provide a snippet below, and invite you to read the rest of the article there. I have highlighted some sentences to call your attention to important principles.

The recovery time objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.[1] It can include the time for trying to fix the problem without a recovery, the recovery itself, testing, and the communication to the users. Decision time for users representative is not included. RTO is spoken of as a complement of RPO (or Recovery point objective) with the two metrics describing the limits of acceptable or “tolerable” ITSC performance in terms of time lost (RTO) from normal business process functioning, and in terms of data lost or not backed-up during that period of time (RPO) respectively. The rule in setting an RTO should be that the RTO is the longest period of time the business can do without the IT Service in question.

A “recovery point objective” or “RPO”, is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident.[1] The RPO gives systems designers a limit to work to. For instance, if the RPO is set to 4 hours, then in practice, offsite mirrored backups must be continuously maintained- a daily offsite backup on tape will not suffice. Care must be taken to avoid two common mistakes around the use and definition of RPO. Firstly, BC Staff use business impact analysis to determine RPO for each service – RPO is not determined by the existent backup regime. Secondly, when any level of preparation of offsite data is required, rather than at the time the backups are offsited- the period during which data is lost very often starts near the time of the beginning of the work to prepare backups which are eventually offsited.

How RTO and RPO values affect computer system design

The RTO and RPO form part of the first specification for any IT Service. The RTO and the RPO have a very significant effect on the design of computer services and for this reason must be considered in concert with all the other major system design criteria.

When assessing the abilities of system designs to meet RPO criteria, for practical reasons, the RPO capability in a proposed design is tied to the times backups are sent offsite- if for instance offsiting is on tape and only daily (still quite common), then 49 or better, 73 hours is the best RPO the proposed system can deliver, so as to cover for tape hardware problems (tape failure is still too frequent, one bad tape can write off a whole daily synchronisation point). Another example- if a service is to be properly set up to restart from any point (data is capable of synchronisation at all times) and offsiting is via synchronous copies to an offsite mirror data storage device, then the RPO capability of the proposed service is to all intents and purposes 0 hours- although it is normal to allow an hour for RPO in this circumstance to cover off any unforeseen difficulty.

If the RTO and RPO can be set to be more than 73 hours then daily backups to tapes (or other transportable media), that are then couriered on a daily basis to an offsite location, comfortably covers backup needs at a relatively low cost. Recovery can be enacted at a predetermined site. Very often this site will be one belonging to a specialist recovery company who can more cheaply provide serviced floor space and hardware as required in recovery because it manages the risks to its clients and carefully shares (or “syndicates”) hardware between them, according to these risks.

If the RTO is set to 4 hours and the RPO to 1 hour, then a mirror copy of production data must be continuously maintained at the recovery site and close to dedicated recovery hardware must be available at the recovery site- hardware that is always capable of being pressed into service within 30 minutes or so. These shorter RTO and RPO settings demand a fundamentally different hardware design- which is for instance, relatively much more expensive than tape backup designs.
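
The two designs described in the quoted passage can be compared against a required RPO with a small lookup. This is a sketch under stated assumptions: the design names and the function `designs_meeting_rpo` are hypothetical, the capability figures (73 hours for daily offsite tape, 1 hour for a synchronous offsite mirror) are taken from the quoted text, and a design is treated as adequate when its RPO capability is no greater than the required RPO.

```python
from typing import List

# RPO capability, in hours, of the two designs described in the quoted text.
RPO_CAPABILITY_HOURS = {
    "daily_offsite_tape": 73,   # quoted text: "49 or better, 73 hours"
    "synchronous_mirror": 1,    # quoted text: effectively 0, with an hour allowed
}


def designs_meeting_rpo(required_rpo_hours: float) -> List[str]:
    """Return the designs whose RPO capability is within the required RPO."""
    return [
        name
        for name, capability in RPO_CAPABILITY_HOURS.items()
        if capability <= required_rpo_hours
    ]


if __name__ == "__main__":
    for required in (100, 4, 0.5):
        print(f"Required RPO {required} h -> {designs_meeting_rpo(required)}")
```

An empty result means neither design, as characterized here, satisfies the requirement, which is a signal to revisit either the RPO or the design options.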

Author: John Conley III

I am a technology and business consultant who provides state-of-the-art cloud solution design services to rapidly growing and mature organizations using cutting-edge technologies. Information Technology Professional with over 20 years of industry experience as a Software Architect/Lead Developer and Project Management Coach using a service-oriented (SOA/EIB) view of the software development process (Use Case/Story View, Class Design View, Database Design View, and Infrastructure View) and software design (Model-View-Controller based (MVC pattern/framework)). Coached PMs on various aspects of task and resource management and requirements tracking and tracing, and even filled in for PMs. Led teams of varying sizes mainly from the architect viewpoint: translating non-technical requirements into concrete, technical components and work units, identifying and creating reusable frameworks and design patterns, creating skeletal IDE projects with MVC wiring and config files, assigning app tiers or horizontal components to developers, making sure test team members have use cases and other work unit inputs to create an executable test/quality assurance plan, organizing meetings, ensuring enterprise standards and practices are adhered to, enforcing any regulatory and security compliance traceable from requirements/Solution Architecture Documents (SADs) all the way down to core classes in code, and so on. Expertise includes designing and developing object-oriented, service/component-based software systems that are robust, high-performance and flexible for multiple platforms.
Areas of specialization include Internet (business-to-business and business-to-consumer) e-commerce and workflow using Microsoft .NET technologies (up to current Visual Studio 2010/.NET Framework 4.0, MVC3/Razor View Engine, LINQ), TFS, SharePoint 2007 (Task Mgmt, Build Script), Commerce Server 2007/2002 (basket and order pipeline), ASP.NET, ADO.NET, C#, Visual C++, Visual Basic.NET and Java EE/J2EE, service-oriented architecture (SOA) and messaging (MSMQ, MQSeries, SAP message handling) and more abstract enterprise service bus (ESB) designs, best patterns and practices, telecommunications and the offline processes of the enterprise. Provided detailed estimates on budgets, guided design and development tasks with offshore teams, technical assessments of third-party software tools and vendor selections, project/iteration planning and sprint product backlogs, and level of effort for statements of work (including for offshore-based development teams), including executive summary presentations as needed.
