About This Case

Closed

29 Nov 2009, 11:59PM PT

Bonus Detail

  • Each Selected Insight
    Earns a $100 Bonus

Posted

22 Nov 2009, 9:24PM PT

Industries

  • Enterprise Software & Services
  • Hardware

Essential Datacenter Tips On Application Performance Monitoring

 

Closed: 29 Nov 2009, 11:59PM PT

Earn up to $100 for Insights on this case.

We are looking for engaging content, and for experts to feature, who can help educate IT decision makers on the management of mission-critical applications in datacenters.

The topics for this case will focus on application performance monitoring and testing. We're looking for at least 300 words in the form of a blog post that can serve as a discussion starter, and we'd also like to encourage commenting on the submitted insights. Appropriate topics for these discussions include:

  • tips for datacenter managers on improving efficiency of existing resources and an overview of the methods to do so;
  • a review of software tools and metrics for predicting future capacity and development;
  • a compilation of the common risks and benefits of automating application monitoring and testing tools; and
  • guidelines on the development of internal benchmarks to assess current performance and to set future performance goals.

These topics are not exhaustive, and you do not need to address all of these suggested conversations. We welcome additional proposals for alternative subjects, and if you have any questions, please do not hesitate to ask.

6 Insights

In my experience, the keystone to deploying useful application performance monitoring is to first define and build a hierarchical model of the target application as a service.  We have long since left a world of standalone servers and atomic applications, and good performance monitoring solutions should be designed from the ground up to account for each technology involved in the application’s delivery.

 

Naturally the scope of data and events involved in monitoring a modern application’s performance as a service can quickly become unwieldy, with information being collated and (hopefully) aggregated from a large number of distinct technologies. To add further to the complexity, the software systems used to target and extract data will also vary from technology to technology.  While one might choose to license a single vendor’s “monolithic framework” instead of programming solutions internally, the vendor’s code to gather data from a router will be distinct from the code used to gather storage information, and so on.  Contrary to whatever marketing claims may be made, there is no “magic button” in any monitoring product that will sort out the relevance of data coming from a subsystem as it relates to an application as implemented in the environment.

 

Defining a common named hierarchy of the application’s involved technology upfront enables engineers to systematically “tag” relevant data and events coming from a multitude of sources and correlate them to the appropriate application service (or in some cases… multiple applications).  Correlation is the key to all-inclusive monitoring, as information without context is useless.

 

If the company has not already started using named hierarchies elsewhere in monitoring, it might be a good idea to develop a basic hierarchical naming standard in the open and let the other parties who will (or should) be using the standard provide input to the project.  Once an environment starts successfully treating and monitoring one application as a service, there will be a drive from all application owners to have the same done for theirs.
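
To make the idea concrete, here is a minimal sketch (hypothetical service, tier and host names throughout) of how events from different technologies could be tagged against a shared, named service hierarchy and then correlated back to the application they support:

    # Hypothetical sketch: tag raw monitoring events with a named service hierarchy
    # so they can be correlated back to the application(s) they support.
    from collections import defaultdict

    # Naming standard agreed upfront: service -> tier -> components
    SERVICE_HIERARCHY = {
        "billing": {
            "web": ["web01", "web02"],
            "database": ["db01"],
            "network": ["core-rtr-1"],
        },
    }

    def tag_event(event):
        """Attach every matching service.tier.component path to a raw event."""
        event["tags"] = [
            f"{service}.{tier}.{event['source']}"
            for service, tiers in SERVICE_HIERARCHY.items()
            for tier, components in tiers.items()
            if event["source"] in components
        ]
        return event

    def correlate(events):
        """Group events by the service they belong to, giving them context."""
        by_service = defaultdict(list)
        for event in events:
            for tag in tag_event(event)["tags"]:
                by_service[tag.split(".")[0]].append(event)
        return by_service

    if __name__ == "__main__":
        raw = [
            {"source": "core-rtr-1", "metric": "if_errors", "value": 42},
            {"source": "db01", "metric": "query_time_ms", "value": 950},
        ]
        for service, evts in correlate(raw).items():
            print(service, [e["metric"] for e in evts])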

Specializes in network, application and service monitoring systems design and rescue.

I have great concerns about whether or not mission-critical applications are having their SLAs met in datacenters, whether they are hosted in-house, supported by a third party, or run under any other form of datacenter-based hosting.  First, consider the alternative: the server sits in a room next to your expert developers.  Sure, it's probably a SOX violation, but I can tell you this much: that server will not go down often, and if it does, you can be sure that it will be restored as fast as humanly possible.  That's the advantage of having an expert babysit your system.  If you have two experts in different geographic locations and each babysits a server in case one goes down, then you have about the best support possible for the money.  However, for large systems this may not be practical.

But how do you know that a datacenter-hosted app has this type of support?  First, you need to know for sure what the SLA spells out in terms of support and monitoring.  Look for this in your SLA:

"if your app encounters event W, person X will do Y about that specific event within Z amount of time"

I guarantee that anything less specific than that, or anything that specific which is not in writing in the SLA, will not be honored.  Vague responses equal no responses, because why would the datacenter host open themselves up to liability by initiating a response that wasn't specified in writing?  Only specific, measurable responses with named responsible parties allow the datacenter host to be held accountable for any failure to respond as specified.

So now you've fixed your SLA and you know what they're supposed to do.  How can you be sure they'll actually do the thing they say they'll do?  Well, you obviously need to know that before you are counting on the app for something mission-critical, so while the mission-critical app is still running somewhere else (i.e. being babysat by an expert), you set out to prove that the support can respond by staging various types of failures.  You could tell the host about the staged failure attempts, but then they'll know and they will certainly staff up and respond appropriately.  I would stage failures and not tell the host that the failures are a test.  After all, from the host's perspective, any failure is a failure.  Be sure to measure the response closely and check whether the SLA was honored as expected.  Any failure to honor it, for any reason, should be a strong indication that the host is not prepared to honor the SLA, thus potentially costing you your mission-critical app.
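
As a trivial illustration of "measure the response closely", the timestamps and window below are hypothetical; the point is simply to record when the failure was staged and compare the host's response time against what the SLA promises:

    # Hypothetical sketch: check whether a staged failure was handled within the SLA window.
    from datetime import datetime, timedelta

    SLA_RESPONSE_WINDOW = timedelta(minutes=30)         # "within Z amount of time" from the SLA

    failure_staged_at = datetime(2009, 11, 25, 2, 15)   # when we quietly pulled the plug
    host_responded_at = datetime(2009, 11, 25, 3, 5)    # when person X actually did Y

    elapsed = host_responded_at - failure_staged_at
    if elapsed <= SLA_RESPONSE_WINDOW:
        print(f"SLA honored: response in {elapsed}")
    else:
        print(f"SLA breached: response took {elapsed}, allowed {SLA_RESPONSE_WINDOW}")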

Do not allow a complicated rollover or automated monitoring to imply that the datacenter can respond to any event with seamless mission-critical app coverage.  An inexperienced datacenter admin simply hitting the wrong button can send any app to Davy Jones' locker in a big hurry.  If you truly want mission-critical backup performance, ask yourself what would happen if the datacenter were completely unresponsive.  For example, what if it were hit by a hurricane and completely wiped out?  How soon could you be back up and running, and at what capacity?  If you can't answer that, you had better find an answer before some random nonsense* knocks out your one server running everything.

* Consider the following events I and others have encountered that knocked out servers that were supposed to be failover/seamless mission-critical:

  • direct lightning strikes on the building
  • false fire alarms setting off sprinklers that weren't supposed to be connected
  • power cut due to people hitting the BRB on the datacenter wall thinking it opened the door
  • a failover cluster patched by the provider, causing a software conflict resulting in irreparable damage to failover functionality
  • floods from broken pipes
  • an admin yanking the plug out of the back of a running server

I am a Sr. Systems Engineer for a major telecommunications company. I have a long family history in the Computer Science field.

What type of monitor is best: Scripting/API or SNMP?

When you are monitoring Windows servers, your engineer will have two fundamental choices: SNMP or Scripting. Using SNMP on Windows means installing the SNMP agent software, and then performing some simple configuration. The Application Monitoring system would be configured to poll a given SNMP variable. The variable polled will either return a number or a text value. The Application Monitoring software needs to take this data and turn it into something useful.

For a simple variable you can:

  • use the variable as a point on a graph showing trend information
  • use the variable to show the current status of a service, e.g. up, down, loading, locked, etc.
  • compare the variable against a threshold and alert as needed (sketched below)
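
To illustrate the threshold case, here is a minimal sketch that shells out to net-snmp's snmpget (assumed to be installed, with the Windows SNMP agent configured); the host name, OID index and threshold are illustrative only:

    # Minimal sketch: poll one SNMP variable and alert on a threshold.
    # Assumes net-snmp's snmpget is on the path and the target runs an SNMP agent.
    import subprocess

    HOST = "winserver01"                 # hypothetical server name
    COMMUNITY = "public"
    OID = "1.3.6.1.2.1.25.3.3.1.2.1"     # hrProcessorLoad (HOST-RESOURCES-MIB); index varies by device
    THRESHOLD = 85                       # illustrative alert threshold (percent)

    def poll_cpu_load():
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, OID],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip())

    if __name__ == "__main__":
        load = poll_cpu_load()
        if load > THRESHOLD:
            print(f"ALERT: CPU load {load}% exceeds {THRESHOLD}%")
        else:
            print(f"OK: CPU load {load}%")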

The alternative is to use a script or API to poll specific data from the Windows operating system. Microsoft has delivered APIs such as Windows Management Instrumentation (WMI) to provide rich data about the status of an operating system, including the status of individual services and a more detailed view of memory and CPU resources.
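
For comparison, a minimal sketch of the richer agent/API side using the third-party Python wmi package (which wraps WMI via pywin32); the service name is only an example and the script must run on, or authenticate to, a Windows host:

    # Minimal sketch: query WMI for the state of one Windows service and free memory.
    # Requires the third-party "wmi" package (and pywin32) on a Windows host.
    import wmi

    SERVICE_NAME = "W3SVC"   # example only: the IIS web publishing service

    c = wmi.WMI()

    # Per-service detail that SNMP alone does not expose as richly.
    for svc in c.Win32_Service(Name=SERVICE_NAME):
        print(f"{svc.Name}: state={svc.State}, start mode={svc.StartMode}")

    # Operating-system-level memory detail.
    for os_info in c.Win32_OperatingSystem():
        print(f"Free physical memory: {os_info.FreePhysicalMemory} KB")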

The choice between agentless (SNMP) and agent-based (WMI or proprietary) monitoring will depend on the requirements. The rich data set that can be accessed from an agent can provide a much-enhanced service console with multiple levels of information. However, the installation of an agent requires compatibility testing, vendor certification and possibly licensing fees.

An agentless install can often meet the requirements of the business, but only when extensive thinking and analysis has been performed to make the number meaningful and relevant. On its own, the value has no context to business performance and, like the profit on a balance sheet, needs to be set against previous or projected performance to be relevant. An agent install provides an extended data set and, usually, a lot more information and context, and needs less interpretation to be relevant to the business.

Greg Ferro is a freelance Network Architect for large organisations specializing in Data Centre and Security design and operations. A focus on operational outcomes, advanced technologies and creative thinking has proven valuable for customers.


The area of greatest complexity for Application Performance Monitoring is the mapping of business expectations to the deliverables from the technology. When a business spends capital on a monitoring system, it does so in the expectation of solving a business problem, but the software does not readily provide status reports that are meaningful.

For example, let's consider using Cacti (http://cacti.net) to monitor network performance. Cacti is an open source package and often the first monitoring system deployed when monitoring is needed. In a real sense, Cacti and its widespread use set a base standard for Performance Monitoring, a standard that is readily exceeded by commercial tools.

Cacti, in a basic install, can deliver only graphs of performance information. That is, by polling devices or servers in the data center at regular intervals, it can populate a round-robin database with the values. The data points can then be plotted onto a simple graph. For example, Cacti can use SNMP to poll the current CPU utilisation of a firewall. This value is typically the instantaneous CPU value at the time of the poll, and gives a very good view over time of the utilisation of the firewall CPU.

Cacti could also be configured to poll the number of concurrent connections, also a key indicator of firewall performance. Charting the number of concurrent connections over periods of days, weeks and months will give a baseline suitable for budgeting for the replacement firewall.
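
As a rough sketch of turning those polled values into a budgeting input (all figures hypothetical), a simple linear fit over monthly peak connection counts can project when the firewall's rated capacity would be reached:

    # Hypothetical sketch: project when concurrent connections reach the firewall's
    # rated capacity, using a least-squares linear trend over monthly peak values.
    monthly_peaks = [52000, 55000, 59000, 61000, 66000, 70000]  # illustrative samples
    RATED_CAPACITY = 100000                                     # vendor-quoted limit

    n = len(monthly_peaks)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(monthly_peaks) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, monthly_peaks)) \
            / sum((x - x_mean) ** 2 for x in xs)

    months_left = (RATED_CAPACITY - monthly_peaks[-1]) / slope
    print(f"Growth of roughly {slope:.0f} connections/month; "
          f"rated capacity reached in about {months_left:.1f} months")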

An experienced engineer also knows that these criteria are not the only factors that determine firewall upgrades. Other issues relating to firmware upgrades, growth in firewall rules and new security features are more likely to have a major impact on firewall performance.

From this example, it can be seen that Application Performance Monitoring is a key tool in understanding the performance of devices, but the deliverable, firewall performance, requires interpretation and expertise to arrive at the answer the business requires. This means that Application Performance Monitoring needs both tools and knowledge to meet the expectations of the business.

Greg Ferro is a freelance Network Architect for large organisations specializing in Data Centre and Security design and operations. A focus on operational outcomes, advanced technologies and creative thinking has proven valuable for customers.

Monitoring of IT infrastructures involves the abstraction of the detailed technical implementation into a collation of information that matches the business requirements. It’s not as simple as ‘installing the system’ and away we go.

I have spoken before about the challenges of selecting and monitoring variables for a given system, and that the results may need to be interpreted by an engineer who can consider the operational environment and understand the complete system before drawing conclusions. This relies on the engineer having knowledge of the business platform at both an operational and a strategic level.

In many businesses, an engineer is not a part of the business-level processes. Dissemination of the current business strategy and forward planning is not typically part of an engineer's life. Yet to get the most from your Application Monitoring, you need to be able to interpret the data in the light of the current business expectations. That makes for a tidy conundrum.

For Application Monitoring to be effective, Business Managers and Service Owners will need to get involved and learn to comprehend the data that is being presented to them. This will require their technical involvement to understand the data collection processes, the limitations these place on the reporting, and how the system can be adapted to meet their requirements.

This can be a challenge for management types, who have often moved into soft skills and lost touch with the harsh realities of hard skills (i.e. "it works / it doesn't work" versus "we can define what works"), and who will resist becoming part of the solution.

Part of deploying an Application Performance Monitoring system is getting engagement from stakeholders, but, specifically, managers will need to engage technically to make the most of the system. This is no different from the accounting system; it's just not yet part of the manager's job description, and it takes a while to gain acceptance.

Greg Ferro is a freelance Network Architect for large organisations specializing in Data Centre and Security design and operations. A focus on operational outcomes, advanced technologies and creative thinking has proven valuable for customers.

A Case Study on the specific aspects of monitoring a Wordpress installation.

A WordPress Content Management System can be separated into three key parts:

  • the server and its availability
  • the software stack, such as PHP / Apache
  • the MySQL database

When reviewing the requirements for analysing the performance of WordPress, you could configure the following probes for availability (a minimal sketch of these probes in code follows the list):

  • a ping probe to confirm that the server OS is running and has access to the Internet
  • a simple PHP script that performs a single database query to show that the MySQL server is running
  • polling a single image from the web server to validate that the web server is operational
  • polling the home page of the WordPress site to confirm that the full software stack is operational

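Here is that minimal probe sketch in Python; the hostname, image URL and the db-check.php health-check path are assumptions, and the HTTP checks use the requests package. Each probe also returns a response time, so the same checks can double as the performance baselines discussed below:

    # Hypothetical sketch of the four availability probes, each returning a pass/fail
    # flag plus a response time so the checks also serve as performance baselines.
    import subprocess
    import time
    import requests

    HOST = "blog.example.com"                                   # assumed hostname
    IMAGE_URL = f"http://{HOST}/wp-content/uploads/probe.jpg"   # any known static image
    HOME_URL = f"http://{HOST}/"
    DB_CHECK_URL = f"http://{HOST}/db-check.php"                # assumed PHP script doing one SELECT

    def ping_probe():
        start = time.time()
        ok = subprocess.run(["ping", "-c", "1", HOST],
                            capture_output=True).returncode == 0
        return ok, time.time() - start

    def http_probe(url):
        resp = requests.get(url, timeout=10)
        return resp.status_code == 200, resp.elapsed.total_seconds()

    if __name__ == "__main__":
        print("ping         ", ping_probe())
        print("static image ", http_probe(IMAGE_URL))
        print("db check     ", http_probe(DB_CHECK_URL))
        print("home page    ", http_probe(HOME_URL))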

The interesting part of this monitoring is that it can also be used to track the performance of the same subsystems. For example, by monitoring the response of the SQL server for a single, simple query you will have a baseline for SQL performance. If you notice that the WordPress response time increases, and a corresponding increase in SQL response time also occurs, then you can make the assumption that the SQL server is the performance bottleneck. If the SQL response time does not increase, then something in the WordPress code is slowing down the performance.

The other baseline is the image file test. By downloading an image file, you are baselining the operating system and web server on their own. Thus, a slowdown of the single image indicates that the OS and web server are having performance problems. If WordPress is also impacted, then it's clearly the server/OS that needs investigation. If the MySQL probe is also slow, then this confirms the original diagnosis. However, it is more common for the SQL server to be hosted on a separate machine, in which case this correlation would not hold.

Many people would opt for an Application Monitor that only monitors the home page of their WordPress site. However, this doesn't provide a baseline to analyse performance slowdowns of each subsystem. The correct use of Application Monitoring requires a test for each functional area as a subsystem, and then collection of data relating to that system so that comparisons can be made between the performance of each functional area.