Solution Architecture Best Practice: Using System Availability and Recovery Metrics

Before embarking on an IT project that introduces a new software package or expands an existing one, business leaders need to know the impact of such an initiative on revenues, labor costs, and capital budgets. A solution architecture design document (aka SAD) can help as long as it is part of an overall business impact or disaster recovery planning process. When drafting a solution architecture design document, metrics such as system availability, recovery time objective (RTO) and recovery point objective (RPO) help determine the runtime characteristics the business wants to achieve. Non-technical business leaders and subject matter experts may not necessarily care about “the nines” (99.999% availability, for instance), but they do care about the revenue the company loses per hour, minute and second when the system (hardware and software as a whole) is offline, and about the labor costs of workers standing idle or having to resort to manual business process steps. Conversely, IT operations team members don’t necessarily care about these costs, but care more about the nines. For many, though, arriving at the right set of nines to assign to an IT project that introduces or expands a system is not exactly straightforward.

I’m offering an approach to help you assign a set of nines to your system availability objective. By “system,” I am referring to the combination of hardware and software. The following table provides industry-standard mappings of “nines” to approximate acceptable downtimes over a given one-year period.

Availability    Acceptable downtime per year
90%             40 days
99%             4 days
99.9%           9 hours
99.99%          50 minutes
99.999%         5 minutes
99.9999%        30 seconds
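
The downtime figures in the table above are rounded. As a minimal sketch of the arithmetic behind them, the following computes the allowed downtime per year for each availability level. The function name and the choice of a 365-day year are my assumptions for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the maximum downtime per year, in seconds, for an availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    seconds = allowed_downtime_seconds(pct)
    print(f"{pct}%: {seconds / 3600:.2f} hours ({seconds:.0f} seconds)")
```

Running this shows, for example, that 90% availability allows 36.5 days of downtime per year, which the table rounds to 40 days.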

How do we know which of the sets of nines is applicable? It depends on the business subject matter experts (SMEs), who in turn may rely on the operations team to supply data. But in the case where neither the business SMEs nor the operations team have such numbers, a good rule of thumb is to first have the business SMEs tally the Line of Business (LOB) revenue per hour, minute and second of any business process that would be impacted if the system in question went down. Have them do the same for labor costs per hour, minute and second. Don’t worry about downtimes just yet; we only want to know how much money is generated by the business process per hour/min/sec, and then the labor cost (or overhead costs, operating costs) per hour/min/sec.

Next, identify the cost of maintaining each of the sets of nines (the greater the number of nines, the greater the maintenance cost).

Finally, if the loss of revenues per hour/min/sec noticeably exceeds the cost of maintaining the desired nines over the same period, then it might be advisable to absorb the maintenance costs. In the absence of revenue figures, the project’s maintenance budget can be used instead, but with caution: the budget is almost always smaller than the revenues of the impacted business process, so it may not reflect what the company actually loses when the system goes down.

Labor costs should be used in a separate metric to identify the amount of money a company pays its employees when the system is unavailable. To recap, we have three system availability decision metrics to use from a business standpoint to help us arrive at a decision on which of the nines to choose:

Availability Decision Per Revenue
  1. Tally revenue generated per hour, minute and second
  2. Identify the cost of maintaining each of the sets of nines
  3. Availability Decision Ratio (ADR) = Revenues (R) / Cost of Nines (CoN), with both expressed over the same time period, where a number greater than 1 indicates that the chosen set of nines is doable

Availability Decision Per Labor Costs
  Similar to Availability Decision Per Revenue above, except you use Labor Costs (LC) instead of Revenues

Availability Decision Per Maintenance Budget
  Similar to Availability Decision Per Revenue above, except you use Maintenance Budget (MB) instead of Revenues
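
To make the ratio concrete, here is a minimal sketch that computes the ADR for a few candidate sets of nines. All dollar figures are hypothetical and only there to show the calculation. The same function applies to Labor Costs or Maintenance Budget by passing that amount in place of revenue.

```python
def availability_decision_ratio(amount: float, cost_of_nines: float) -> float:
    """ADR = amount / cost of nines, both measured over the same time period."""
    if cost_of_nines <= 0:
        raise ValueError("cost of nines must be positive")
    return amount / cost_of_nines


# Hypothetical figures, in dollars per hour, for illustration only.
revenue_per_hour = 12_000.0
cost_of_nines_per_hour = {"99.9%": 2_000.0, "99.99%": 8_000.0, "99.999%": 15_000.0}

for nines, cost in cost_of_nines_per_hour.items():
    adr = availability_decision_ratio(revenue_per_hour, cost)
    verdict = "doable" if adr > 1 else "not justified by this metric"
    print(f"{nines}: ADR = {adr:.2f} ({verdict})")
```

With these made-up numbers, 99.9% and 99.99% come out above 1, while 99.999% falls below 1 and would need another justification.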

Regarding recovery metrics, an article on Wikipedia does a great job in explaining them. I provide a snippet below, and invite you to go to http://en.wikipedia.org/wiki/Recovery_point_objective to read the rest. I have highlighted some sentences to call your attention to important principles.

The recovery time objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.[1] It can include the time for trying to fix the problem without a recovery, the recovery itself, testing, and the communication to the users. Decision time for users representative is not included. RTO is spoken of as a complement of RPO (or Recovery point objective) with the two metrics describing the limits of acceptable or “tolerable” ITSC performance in terms of time lost(RTO) from normal business process functioning, and in terms of data lost or not backed-up during that period of time(RPO) respectively. The rule in setting an RTO should be that the RTO is the longest period of time the business can do without the IT Service in question.

A “recovery point objective” or “RPO”, is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident.[1] The RPO gives systems designers a limit to work to. For instance, if the RPO is set to 4 hours, then in practice, offsite mirrored backups must be continuously maintained- a daily offsite backup on tape will not suffice. Care must be taken to avoid two common mistakes around the use and definition of RPO. Firstly, BC Staff use business impact analysis to determine RPO for each service – RPO is not determined by the existent backup regime. Secondly, when any level of preparation of offsite data is required, rather than at the time the backups are offsited- the period during which data is lost very often starts near the time of the beginning of the work to prepare backups which are eventually offsited.

How RTO and RPO values affect computer system design

The RTO and RPO form part of the first specification for any IT Service. The RTO and the RPO have a very significant effect on the design of computer services and for this reason must be considered in concert with all the other major system design criteria.

When assessing the abilities of system designs to meet RPO criteria, for practical reasons, the RPO capability in a proposed design is tied to the times backups are sent offsite- if for instance offsiting is on tape and only daily (still quite common), then 49 or better, 73 hours is the best RPO the proposed system can deliver, so as to cover for tape hardware problems (tape failure is still too frequent, one bad tape can write off a whole daily synchronisation point). Another example- if a service is to be properly set up to restart from any point (data is capable of synchronisation at all times) and offsiting is via synchronous copies to an offsite mirror data storage device, then the RPO capability of the proposed service is to all intents and purposes 0 hours- although it is normal to allow an hour for RPO in this circumstance to cover off any unforeseen difficulty.

If the RTO and RPO can be set to be more than 73 hours then daily backups to tapes (or other transportable media), that are then couriered on a daily basis to an offsite location, comfortably covers backup needs at a relatively low cost. Recovery can be enacted at a predetermined site. Very often this site will be one belonging to a specialist recovery company who can more cheaply provide serviced floor space and hardware as required in recovery because it manages the risks to its clients and carefully shares (or “syndicates”) hardware between them, according to these risks.

If the RTO is set to 4 hours and the RPO to 1 hour, then a mirror copy of production data must be continuously maintained at the recovery site and close to dedicated recovery hardware must be available at the recovery site- hardware that is always capable of being pressed into service within 30 minutes or so. These shorter RTO and RPO settings demand a fundamentally different hardware design- which is for instance, relatively much more expensive than tape backup designs.
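
To tie the thresholds in the quoted passage together, here is a rough sketch that maps an RTO and RPO pair to one of the two backup designs the passage describes. The 73-hour cutoff comes from the passage. The function name and the two-way split are my simplifying assumptions, and a real design decision would weigh far more factors.

```python
TAPE_THRESHOLD_HOURS = 73  # cutoff taken from the quoted passage


def suggest_backup_design(rto_hours: float, rpo_hours: float) -> str:
    """Pick between the two designs described in the passage (simplified)."""
    if rto_hours < 0 or rpo_hours < 0:
        raise ValueError("RTO and RPO must be non-negative")
    # Daily couriered tapes only suffice when both objectives exceed 73 hours.
    if rto_hours > TAPE_THRESHOLD_HOURS and rpo_hours > TAPE_THRESHOLD_HOURS:
        return "daily offsite tape backup"
    # Anything tighter calls for a continuously maintained mirror copy.
    return "continuously maintained offsite mirror"


print(suggest_backup_design(rto_hours=96, rpo_hours=96))
print(suggest_backup_design(rto_hours=4, rpo_hours=1))
```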

Brief Note on Mobile Enterprise Architecture

By John Conley III

With over 70% of corporate enterprises adopting mobile technology in the workplace, and over 60% of adult consumers using the mobile internet every day, a coherent strategy for designing and deploying mobile solutions becomes very important. Workers usually want the same computing experiences at work that they enjoy as consumers, so a consumer-friendly approach to mobile technology adoption is key. Benefits to consider are increasing revenues, decreasing costs, boosting worker productivity and workplace satisfaction, as well as facilitating a better work-life balance.

But how does mobile technology fit into your company?

As with any app, you have to decide what business process is suitable for mobile technology, and what tasks within that process can be made more efficient if automated by deploying such technology. The worker who is responsible for that task would be a good subject matter expert (SME) for making this determination. Also, the worker’s manager can help in determining the impact on productivity from a team perspective and help with prioritizing which tasks to automate with mobile technology.

Once that decision is made, it is important to determine whether the content for the task should be in mobile website form or standard mobile app form. A mobile website is still just a website, but one that is designed to fit within the paradigm of the mobile device. The website should be able to automatically detect if an incoming web page request is from a mobile device or not. Your mobile website should also provide the user the option to download a mobile app, which can provide a better user experience as it is hosted directly on the mobile device, as opposed to having the content sent from the website to the device for each and every request. Judging by the fact that over 90% of those users who download mobile apps also access the mobile internet, it is sound modern business practice to incorporate a dual mobile app-website approach to your mobile enterprise architecture planning.
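
One simple way a website might detect a mobile request is to inspect the User-Agent header. The sketch below is a crude heuristic, not a production-grade detector: the marker list and the function name are my assumptions, and real sites often rely on a maintained device-detection library or on responsive design instead.

```python
# Assumed substrings that commonly appear in mobile User-Agent strings.
MOBILE_MARKERS = ("mobi", "android", "iphone", "ipad")


def is_mobile_request(user_agent: str) -> bool:
    """Return True if the User-Agent header looks like a mobile device."""
    ua = user_agent.lower()
    return any(marker in ua for marker in MOBILE_MARKERS)


iphone_ua = "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) Mobile/15E148 Safari/604.1"
desktop_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"

print(is_mobile_request(iphone_ua))
print(is_mobile_request(desktop_ua))
```

The web application would call such a check on each incoming request and route mobile visitors to the mobile version of the page, along with the offer to download the app.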

What features should go on mobile apps? For every business process, there are tasks in automated or manual form that are more commonly used than others. Some of these tasks are probably already realized in Excel spreadsheets that you may not be aware of. Interview the task’s SME to determine which features are more common than others, and prioritize and weight these items as a basis for a mobile app feature brainstorming session. The architect or designer will then have to determine at some point whether to create mobile presentation templates for a specific mobile platform (iPhone or Android, for instance), or design templates to be abstract enough to run on any mobile platform, the latter of which is ideal for optimal developer and tester efficiency and faster deployment turnaround time.

Let me know if you have any questions or need more specific planning assistance for your organization.

Business and Technical Architecture Planning for Corporate Mergers

Merging two corporate entities, especially sizable ones, is a very tricky and complicated process. The Source company will have a unique corporate culture and sets of business processes and IT processes that have grown imperfectly over time, and the same is true of the Target company (the one being acquired or merged into the Source). Without help from a qualified consultant, it can be easy for workers in the Source company to try to do a direct mapping between Business and IT processes in both companies. This is almost impossible and would take way too long to figure out. Instead, it is smarter and more efficient to brainstorm a Model Company that you want the post-merger organization to resemble. Rethink the business and IT processes from scratch. Develop user stories or use cases to help think through the business processes. Apply the IT industry best practices (design patterns, principles, enterprise architecture frameworks, hardware architecture, etc.) that you want the Model Company to embody.

Allow a team of up to 11 workers something like 2-4 weeks for every year the older of the two companies has been in business, counting no more than 40 years (we need an arbitrary, hard limit so as to avoid spending too much time on older companies). In other words, if the Source company (the one initiating the merger) is the older of the two, and has been in business for 40 years, then it may take something like 80-160 weeks to come up with a working Model Company. To cut the time, consider partitioning the Model Company and assigning a team to each Partition.
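
The rule of thumb above reduces to a short calculation. This sketch applies the 2-4 weeks per year and the 40-year cap from the paragraph. The function name is my own, and the result is only a rough planning range for a single team.

```python
MAX_YEARS_COUNTED = 40        # hard cap from the rule of thumb
WEEKS_PER_YEAR_RANGE = (2, 4)  # low and high estimate per year in business


def model_company_weeks(older_company_age_years: int) -> tuple[int, int]:
    """Return the (low, high) estimate in weeks for building the Model Company."""
    if older_company_age_years < 0:
        raise ValueError("company age must be non-negative")
    years = min(older_company_age_years, MAX_YEARS_COUNTED)
    low, high = WEEKS_PER_YEAR_RANGE
    return (low * years, high * years)


print(model_company_weeks(40))  # the 40-year example from the text
print(model_company_weeks(55))  # ages beyond 40 years are capped
```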

Once you have the Model Company modeled and documented, you can determine which company (Source or Target) has the best competency for a particular Model Company Process (Business or IT process). Use a decision matrix approach such as Architecture Alternative Analysis (AAA) to help decide which company is voted the most competent at a Process. Such an approach will eliminate tie-breaker situations when both companies are deemed to have equal competence.
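
As an illustration of a decision matrix, here is a generic weighted-scoring sketch. It is not a definition of AAA itself: the criteria, weights, and ratings are all hypothetical, and the team would substitute its own for each Model Company Process.

```python
def weighted_scores(
    weights: dict[str, int], ratings: dict[str, dict[str, int]]
) -> dict[str, int]:
    """Return each company's total score: the sum of weight * rating per criterion."""
    return {
        company: sum(weights[criterion] * company_ratings[criterion] for criterion in weights)
        for company, company_ratings in ratings.items()
    }


# Hypothetical criteria weights and 1-5 ratings for one Process.
weights = {"process maturity": 5, "automation": 3, "staff expertise": 2}
ratings = {
    "Source": {"process maturity": 4, "automation": 3, "staff expertise": 5},
    "Target": {"process maturity": 3, "automation": 5, "staff expertise": 4},
}

scores = weighted_scores(weights, ratings)
most_competent = max(scores, key=scores.get)
print(scores, most_competent)
```

With these made-up numbers the Source company scores 39 and the Target 38, so the Source would be treated as the more competent at this Process.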

At the conclusion of the decision matrix iteration, do a Gap Analysis to determine what gaps exist between the Actual Company (Source or Target) and the Model Company. Document how to close the gap. Identify the risks and rewards of closing each gap. Be sure to include Subject Matter Experts from both companies, with an outside consultant (Merger Architect Coach) to act as facilitator, tie-breaker, business architect and IT architect – sort of an IT go-to specialist to keep the team moving forward.

If you have any questions or feedback, please let me know.

John Conley III

Chief Technology Architect

Samsona Software

© 2013 by John Conley III for Samsona Corporation