Synthetic Monitoring: Helpful Info for Web Developers, Architects and Admins

Every website requires some kind of real-time monitoring to stay abreast of how its web applications behave in production. We all need to see how our websites hold up when a simulated customer clicks through our site pages and launches various transactions or complex requests. We also need to see how our apps respond when things go wrong. That’s what synthetic monitoring (aka active monitoring) helps with.

Microsoft’s TechNet site has some helpful info. Here’s a snippet, followed by a link for more detailed info:

In Operations Manager 2007, synthetic transactions are actions, run in real time, that are performed on monitored objects. You can use synthetic transactions to measure the performance of a monitored object and to see how Operations Manager reacts when synthetic stress is placed on your monitoring settings.

For example, for a Web site, you can create a synthetic transaction that performs the actions of a customer connecting to the site and browsing through its pages. For databases, you can create transactions that connect to the database. You can then schedule these actions to occur at regular intervals to see how the database or Web site reacts and to see whether your monitoring settings, such as alerts and notifications, also react as expected.

http://technet.microsoft.com/en-us/library/dd440885.aspx
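To make the database example concrete, here is a minimal sketch of a query-based synthetic transaction in Python. It uses the standard-library sqlite3 module as a stand-in for whatever database you actually monitor; the database path, query, and threshold are illustrative assumptions, not part of the TechNet guidance.

```python
import sqlite3
import time
from contextlib import closing

# Hypothetical values; substitute your real database and health query.
DB_PATH = "app.db"
HEALTH_QUERY = "SELECT 1"
MAX_SECONDS = 2.0  # alert threshold for this synthetic transaction

def run_db_synthetic_transaction() -> bool:
    """Connect, run one read-only query, and time the round trip."""
    start = time.perf_counter()
    with closing(sqlite3.connect(DB_PATH, timeout=MAX_SECONDS)) as conn:
        conn.execute(HEALTH_QUERY).fetchone()
    elapsed = time.perf_counter() - start
    ok = elapsed <= MAX_SECONDS
    print(f"db check: {'OK' if ok else 'SLOW'} ({elapsed:.3f}s)")
    return ok

if __name__ == "__main__":
    run_db_synthetic_transaction()
```

Scheduling this to run at a regular interval (cron, a timer loop, or your monitoring tool of choice) gives you the "regular intervals" behavior TechNet describes.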

Wikipedia has the following info (click the link for more info):

Synthetic monitoring (also known as active monitoring) is website monitoring that is done using a web browser emulation or scripted real web browsers. Behavioral scripts (or paths) are created to simulate an action or path that a customer or end-user would take on a site. Those paths are then continuously monitored at specified intervals for performance, such as: functionality, availability, and response time measures.

Synthetic monitoring is valuable because it enables a webmaster to identify problems and determine if his website or web application is slow or experiencing downtime before that problem affects actual end-users or customers. This type of monitoring does not require actual web traffic so it enables companies to test web applications 24×7, or test new applications prior to a live customer-facing launch.

http://en.wikipedia.org/wiki/Synthetic_monitoring
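To illustrate such a scripted path, here is a minimal sketch in Python using the third-party requests library. The URLs and response-time budget are placeholders; a real script would follow the same pages and transactions your customers actually use.

```python
import requests  # pip install requests

# Hypothetical path a customer might take; replace with your real pages.
PATH = [
    "https://www.example.com/",
    "https://www.example.com/products",
    "https://www.example.com/cart",
]
MAX_SECONDS = 3.0  # response-time budget per page

def run_path_check():
    """Walk the scripted path, checking availability and response time."""
    session = requests.Session()  # keeps cookies, like a real browser session
    for url in PATH:
        resp = session.get(url, timeout=10)
        elapsed = resp.elapsed.total_seconds()
        ok = resp.status_code == 200 and elapsed <= MAX_SECONDS
        print(f"{url}: status={resp.status_code} time={elapsed:.2f}s "
              f"{'OK' if ok else 'FAIL'}")

if __name__ == "__main__":
    run_path_check()
```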

Website monitoring is the process of testing and verifying that end-users can interact with a website or web application as expected. Website monitoring is often used by businesses to ensure website uptime, performance, and functionality is as expected.

Website monitoring companies provide organizations the ability to consistently monitor a website, or server function, and observe how it responds. The monitoring is often conducted from several locations around the world to a specific website, or server, in order to detect issues related to general Internet latency and network hop issues, and to pinpoint errors. Monitoring companies generally report on these tests in a variety of reports, charts and graphs. When an error is detected, monitoring services send out alerts via email, SMS, phone, SNMP trap, or pager, and these alerts may include diagnostic information, such as a network trace route, a code capture of a web page’s HTML file, a screen shot of a webpage, and even a video of a website failing. These diagnostics allow network administrators and webmasters to correct issues faster.

Monitoring gathers extensive data on website performance, such as load times, server response times, and page element performance, which is often analyzed and used to further optimize website performance.

http://en.wikipedia.org/wiki/Website_monitoring 
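The alert-and-diagnostics flow described above can be sketched as a simple scheduled loop; on failure it captures the page’s HTML for later inspection. Probing from multiple worldwide locations and the email/SMS channels are left out, and the URL, interval, and capture filename are assumptions.

```python
import time
from datetime import datetime, timezone

import requests  # pip install requests

URL = "https://www.example.com/"  # hypothetical monitored site
INTERVAL_SECONDS = 300            # check every five minutes

def check_once():
    """One availability check; on error, save an HTML capture as a diagnostic."""
    try:
        resp = requests.get(URL, timeout=10)
        if resp.status_code == 200:
            print(f"{datetime.now(timezone.utc).isoformat()} OK "
                  f"({resp.elapsed.total_seconds():.2f}s)")
            return
        body = resp.text
    except requests.RequestException as exc:
        body = f"request failed: {exc}"
    # Diagnostic capture, akin to the HTML code capture mentioned above.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    with open(f"capture-{stamp}.html", "w", encoding="utf-8") as f:
        f.write(body)
    print(f"ALERT: {URL} failed; diagnostics saved to capture-{stamp}.html")

if __name__ == "__main__":
    while True:
        check_once()
        time.sleep(INTERVAL_SECONDS)
```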

Best Practices for Active Response Time Monitoring by Chung Wu

First, unless carefully designed, the tests may not be representative of actual end-user activities, reducing the usefulness of the measurements. Therefore, you must be very careful in defining those tests. It would be a good idea to sit down with real users to observe how they use the applications. If the application has not been launched yet, work with the developers or, if there is one, the UI interaction designer to define the flow. In addition, work with your business sponsors to understand where the application will be used and the distribution of the user population. You would want to place your synthetic test drivers at locations where it is important to measure user experience.

Second, some synthetic transactions are very hard to create and may introduce noise into business data. While it is usually relatively easy to create query-based synthetic transactions, it is much harder to create transactions that create or update data. For example, if synthetic transactions are to test for successful checkouts on an e-commerce website, the tests must be constructed carefully so that the test orders are not mis-categorized as actual orders.

To mitigate these potential problems, you should set up dedicated test account(s) to make it easier to tell whether something running on the application came from real users or the synthetic tests…

Read the rest at http://it.toolbox.com/blogs/app-mgt-blog/best-practices-for-active-response-time-monitoring-23265 
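Wu’s dedicated-test-account mitigation can be sketched in a few lines of Python; the orders feed, field names, and account IDs below are hypothetical:

```python
# Hypothetical dedicated test accounts used only by synthetic transactions.
TEST_ACCOUNT_IDS = {"synthetic-us-east", "synthetic-eu-west"}

def split_orders(orders):
    """Separate real customer orders from synthetic test orders.

    `orders` is assumed to be an iterable of dicts with an "account_id" key.
    """
    real, synthetic = [], []
    for order in orders:
        (synthetic if order["account_id"] in TEST_ACCOUNT_IDS else real).append(order)
    return real, synthetic

# Example: only the first order should count toward business reporting.
orders = [
    {"account_id": "customer-123", "total": 59.90},
    {"account_id": "synthetic-us-east", "total": 1.00},
]
real, synthetic = split_orders(orders)
print(len(real), "real,", len(synthetic), "synthetic")
```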

Related articles

Solution Architecture Best Practice: Using System Availability and Recovery Metrics

Before embarking on an IT project involving the introduction of a new software package or the expansion of an existing one, business leaders need to know the impact of such an initiative on revenues, labor costs, and capital budgets. A solution architecture design document (aka SAD) can help, as long as it is part of an overall business impact or disaster recovery planning process. When drafting a solution architecture design document, metrics such as system availability, recovery time objective (RTO), and recovery point objective (RPO) can help determine the runtime characteristics the business wants to achieve. Non-technical business leaders and subject matter experts may not necessarily care about “the nines” (99.999% availability, for instance), but they do care about the revenue lost per hour, minute, and second that the system (hardware and software as a whole) is offline, and about the labor costs of workers standing idle or having to resort to manual business process steps. Conversely, IT operations team members don’t necessarily think in terms of these costs, but care a great deal about the nines. For many, though, arriving at the right set of nines to assign to an IT project that introduces or expands a system is not exactly straightforward.

I’m offering an approach to help you assign a set of nines to your system availability objective. By “system,” I am referring to the combination of hardware and software. The following table provides industry-standard mappings of “nines” to the acceptable downtime over a given one-year period.

Availability   Acceptable downtime per year
90%            40 days
99%            4 days
99.9%          9 hours
99.99%         50 minutes
99.999%        5 minutes
99.9999%       30 seconds
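These figures are rounded; the underlying arithmetic is simply (1 - availability) multiplied by one year. Here is a minimal sketch of that calculation in Python, assuming a 365-day year:

```python
# Allowed downtime per year: (1 - availability) * one year.
YEAR_SECONDS = 365 * 24 * 60 * 60  # assuming a 365-day year

def allowed_downtime_seconds(availability: float) -> float:
    """availability is a fraction, e.g. 0.999 for 'three nines'."""
    return (1.0 - availability) * YEAR_SECONDS

for nines in ("90%", "99%", "99.9%", "99.99%", "99.999%", "99.9999%"):
    secs = allowed_downtime_seconds(float(nines.rstrip("%")) / 100)
    print(f"{nines:>9}: {secs / 86400:7.2f} days = {secs / 3600:8.2f} hours "
          f"= {secs:10.0f} seconds")
```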

How do we know which set of nines is applicable? It depends on the business subject matter experts, who in turn may rely on the operations team to supply data. But in the case where neither the business SMEs nor the operations teams have such numbers, a good rule of thumb is to first have the business SMEs tally the Line of Business (LOB) revenue per hour, minute, or second of any given business process that would be impacted if the system in question went down. Then have them do the same for labor cost per hour, minute, and second. Don’t worry about downtimes just yet; we only want to know how much money is generated by the business process per hour/min/sec, and then the labor cost (or overhead costs, operating costs) per hour/min/sec.

Next, identify the cost of maintaining each of the sets of nines (the greater the number of nines, the greater the maintenance cost).

Finally, if the revenue lost per hour/min/sec noticeably exceeds the cost of maintaining the desired nines, then it might be advisable to absorb the maintenance costs. In the absence of revenue figures, the project’s maintenance budget can be used instead, but with caution: the budget is almost always smaller than the company’s revenues for the impacted business process, so it may understate what is actually lost when the system goes down.

Labor costs should be used in a separate metric to identify the amount of money a company pays its employees when the system is unavailable. To recap, we have three system availability decision metrics to use from a business standpoint to help us arrive at a decision on which of the nines to choose:

Availability Decision Per Revenue
  1. Tally the revenue generated per hour, minute, and second
  2. Identify the cost of maintaining each of the sets of nines
  3. Availability Decision Ratio (ADR) = Revenues (R) / Cost of Nines (CoN), where a number greater than 1 indicates that the chosen set of nines is doable

 

Availability Decision Per Labor Costs: similar to Availability Decision Per Revenue above, except you use Labor Costs (LC) instead of Revenues.

Availability Decision Per Maintenance Budget: similar to Availability Decision Per Revenue above, except you use Maintenance Budget (MB) instead of Revenues.
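Here is a minimal sketch of the three decision ratios in Python. All dollar figures below are invented for illustration; substitute the numbers your SMEs and operations team actually produce.

```python
def availability_decision_ratio(benefit_per_hour: float,
                                cost_of_nines_per_hour: float) -> float:
    """ADR = benefit / cost of maintaining the nines; > 1 means doable."""
    return benefit_per_hour / cost_of_nines_per_hour

# Invented example figures, normalized to cost per hour of the chosen nines.
cost_of_nines = 500.0       # hourly cost of maintaining, say, 99.99%
revenue = 12_000.0          # LOB revenue per hour for the impacted process
labor_cost = 1_800.0        # hourly labor cost incurred while the system is down
maintenance_budget = 300.0  # project maintenance budget, per-hour equivalent

for name, value in [("revenue", revenue),
                    ("labor cost", labor_cost),
                    ("maintenance budget", maintenance_budget)]:
    adr = availability_decision_ratio(value, cost_of_nines)
    print(f"ADR per {name}: {adr:.2f} -> {'doable' if adr > 1 else 'not doable'}")
```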

Regarding recovery metrics, an article on Wikipedia does a great job in explaining them. I provide a snippet below, and invite you to go to http://en.wikipedia.org/wiki/Recovery_point_objective to read the rest. I have highlighted some sentences to call your attention to important principles.

The recovery time objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.[1] It can include the time for trying to fix the problem without a recovery, the recovery itself, testing, and the communication to the users. Decision time for the users’ representative is not included. RTO is spoken of as a complement of RPO (or recovery point objective), with the two metrics describing the limits of acceptable or “tolerable” ITSC performance: in terms of time lost (RTO) from normal business process functioning, and in terms of data lost or not backed up during that period of time (RPO), respectively. The rule in setting an RTO should be that the RTO is the longest period of time the business can do without the IT service in question.

A “recovery point objective”, or “RPO”, is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident.[1] The RPO gives systems designers a limit to work to. For instance, if the RPO is set to 4 hours, then in practice offsite mirrored backups must be continuously maintained; a daily offsite backup on tape will not suffice. Care must be taken to avoid two common mistakes around the use and definition of RPO. Firstly, BC staff use business impact analysis to determine RPO for each service; RPO is not determined by the existent backup regime. Secondly, when any level of preparation of offsite data is required, the period during which data is lost very often starts near the beginning of the work to prepare the backups that are eventually taken offsite, rather than at the time the backups are offsited.

How RTO and RPO values affect computer system design

The RTO and RPO form part of the first specification for any IT Service. The RTO and the RPO have a very significant effect on the design of computer services and for this reason must be considered in concert with all the other major system design criteria.

When assessing the ability of a system design to meet RPO criteria, for practical reasons the RPO capability of a proposed design is tied to the times backups are sent offsite. If, for instance, offsiting is on tape and only daily (still quite common), then 49 hours, or, to cover for tape hardware problems, 73 hours is the best RPO the proposed system can deliver (tape failure is still too frequent; one bad tape can write off a whole daily synchronisation point). Another example: if a service is properly set up to restart from any point (data is capable of synchronisation at all times) and offsiting is via synchronous copies to an offsite mirror data storage device, then the RPO capability of the proposed service is, to all intents and purposes, 0 hours, although it is normal to allow an hour for RPO in this circumstance to cover any unforeseen difficulty.
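One plausible reading of the 49- and 73-hour figures is worst-case arithmetic along the following lines; the component delays (24-hour backup interval, 24-hour courier delay, 1-hour margin) are assumptions for illustration, not stated in the source.

```python
def worst_case_rpo_hours(backup_interval_h: float = 24.0,
                         offsite_delay_h: float = 24.0,
                         margin_h: float = 1.0,
                         failed_tapes: int = 0) -> float:
    """Worst-case data loss for periodic tape offsiting.

    Data written just after a backup waits up to one full interval for the
    next backup, then up to offsite_delay_h before that tape is safely
    offsite. Each bad tape pushes recovery back one more interval.
    """
    return (backup_interval_h + offsite_delay_h + margin_h
            + failed_tapes * backup_interval_h)

print(worst_case_rpo_hours())                # 49.0 hours
print(worst_case_rpo_hours(failed_tapes=1))  # 73.0 hours
```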

If the RTO and RPO can be set to more than 73 hours, then daily backups to tapes (or other transportable media) that are couriered daily to an offsite location comfortably cover backup needs at a relatively low cost. Recovery can be enacted at a predetermined site. Very often this site will be one belonging to a specialist recovery company, which can more cheaply provide serviced floor space and hardware as required in recovery because it manages the risks to its clients and carefully shares (or “syndicates”) hardware between them according to those risks.

If the RTO is set to 4 hours and the RPO to 1 hour, then a mirror copy of production data must be continuously maintained at the recovery site, and close-to-dedicated recovery hardware must be available there, always capable of being pressed into service within 30 minutes or so. These shorter RTO and RPO settings demand a fundamentally different hardware design, one that is, for instance, much more expensive than tape backup designs.
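Pulling the two scenarios together, here is a toy decision helper that uses only the reference points the passage above mentions; anything between them genuinely needs detailed design work rather than a lookup.

```python
def suggest_design(rto_hours: float, rpo_hours: float) -> str:
    """Map RTO/RPO targets to the two design points discussed above.

    The 73-hour and 4h/1h thresholds come from the quoted article; real
    designs sit on a continuum and need proper engineering, not a lookup.
    """
    if rto_hours > 73 and rpo_hours > 73:
        return "daily tape (or transportable media) couriered offsite"
    if rto_hours <= 4 and rpo_hours <= 1:
        return "continuous offsite mirror plus near-dedicated recovery hardware"
    return "needs detailed design (between the two reference points)"

print(suggest_design(96, 96))  # tape design
print(suggest_design(4, 1))    # mirror design
```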