Masterful Monitoring

One of the challenges that most MSPs face is managing the alerts from the systems that they monitor. All too often, the default or sample monitor sets that are provided by the monitoring platform are deployed without customization. Kaseya, like every monitoring platform from Nagios to HP OpenView, provides sample monitor sets. These are ideal for demonstrating the capabilities of the monitoring platform, but are usually unsuited for production use. When these sample monitors are deployed, the result is an overwhelming mix of useful and meaningless tickets arriving at your help desk. How do you filter the critical information from the noise, short of just turning all of the monitor sets off? Hopefully, you’ll glean some insight from this article!

Monitor What is Important!

First and foremost, you need to develop a monitor set that alerts on events that you can respond to and resolve. A monitor that creates an informational alert or ticket serves no real purpose other than to drain resources from your help desk.

Start developing good monitor sets by just logging the alerts without creating tickets. The sample monitors provide a good starting point for this task. Collect the data for a few weeks and review the alerts with the technical team to determine the following:

  • Does the alert represent a problem that requires action? If the alert is informational or reports a failure because an optional feature is not present, it should be eliminated.
  • Can automation be used to resolve the condition?
    • Develop scripts that are run manually at first to verify the remediation process.
    • Use Service Desk (or other automation methods) to recognize these events and trigger the remediation scripts.
  • Identify the priority of the alert. Is it important enough to wake someone up if it occurs after hours?
  • Create a general classification so that similar alerts can be grouped later when creating production monitor sets. For example: Baseline, Backup, Active Directory, and other application-specific types.
  • Decide on parameters that will help identify and route the ticket. For example: “SERVICE+servicename”, “EVENT-APP+Source+EventID”, or “APPLICATION+AppName+FailureType”. This information can be used to determine the type of remediation process as well as classify the ticket in your PSA.
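
As a minimal sketch of the routing parameters described in the last item, the Python below builds and splits keys in the “SERVICE+servicename” style. The function names, the upper-casing of the class, and the sample source and event ID are assumptions made for illustration; they are not part of Kaseya or any PSA.

```python
# Hypothetical helpers: the "+"-delimited layout follows the examples in the
# text, but the function names and validation rules are assumptions.

def build_key(alert_class: str, *fields: str) -> str:
    """Join an alert class and its detail fields into one routing key."""
    parts = [alert_class.upper(), *fields]
    if any(not p or "+" in p for p in parts):
        raise ValueError("fields must be non-empty and must not contain '+'")
    return "+".join(parts)


def split_key(key: str) -> tuple[str, list[str]]:
    """Split a routing key back into (alert class, detail fields)."""
    alert_class, *fields = key.split("+")
    return alert_class, fields


print(build_key("service", "Spooler"))
print(split_key("EVENT-APP+ExampleSource+1001"))
```

Keeping the class as the first field means the automation can choose a remediation path from that field alone, then use the remaining fields to classify the ticket.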

Once you narrow down the list of monitors to those that meet the criteria above, you can start creating the actual monitor sets that you will use in production. Recognize that this is an evolutionary process! Start with the most critical monitors, then revisit them and add the next level. This will keep your help desk from being overwhelmed, and allow the monitor sets to be “zeroed-in” to an optimal set for your organization and customer base.

Organization

Creating multiple monitor sets keeps each one smaller, more focused, and more manageable, and lets you apply each set only where it is needed. Start with a “core” set that generically covers the services or events that occur on every platform, regardless of what special services are installed. Separate core sets for Servers and Workstations are appropriate. Next, create a monitor set that represents all of the services or events associated with a specific application or service role. A monitor set should contain monitors of like priority – don’t mix high and low priority events in a single set. Separating the monitors by priority allows your automation to decide if a special notification (call or page) should be made for alerts above a certain priority.

Another key concept for organization and automation is a naming standard for alerts. Most platforms can parse the header or subject line of an alert, but not the body. Thus, a subject of “service failed on host <hostname>” might be meaningful to your technician, but is nearly useless for parsing by automated procedures! A machine-readable subject of “Alert,Hostname,Service,ServiceName” might be less friendly to your technician, but allows your Service Desk to quickly identify this as a Service alert, and that “ServiceName” has failed and should be restarted.
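
To illustrate why the comma-delimited subject is easier to automate, here is a small Python sketch that parses the “Alert,Hostname,Service,ServiceName” layout. The field names, the `ParsedSubject` type, and the sample host and service are hypothetical; a real Service Desk would do this with its own parsing tools.

```python
from typing import NamedTuple, Optional


class ParsedSubject(NamedTuple):
    hostname: str
    alert_class: str
    detail: str


def parse_subject(subject: str) -> Optional[ParsedSubject]:
    """Return the parsed fields, or None if the subject is not in the
    assumed four-field "Alert,Hostname,Class,Detail" layout."""
    fields = [f.strip() for f in subject.split(",")]
    if len(fields) != 4 or fields[0] != "Alert" or not all(fields):
        return None
    _, hostname, alert_class, detail = fields
    return ParsedSubject(hostname, alert_class, detail)


parsed = parse_subject("Alert,SRV01,Service,Spooler")
if parsed and parsed.alert_class == "Service":
    # In a real workflow this is where a restart procedure would be triggered.
    print(f"restart {parsed.detail} on {parsed.hostname}")

# The technician-friendly subject cannot be parsed the same way.
print(parse_subject("service failed on host SRV01"))
```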

Performance Monitors

When creating monitors for performance, consider that not all customers have the same platform specifications. Consider creating duplicate performance monitors adjusted for different thresholds so that they can be applied based on the customer platform capabilities and the level of tuning applied. The well-publicized “performance thresholds” are often based on highly optimized platforms that many smaller customers simply don’t implement! Your thresholds should be based on real-world observations within your customer base.

Use a “Monitor, Tune, Adjust, Alert” method for performance monitors:

  • Monitor – Create the monitor set but don’t enable alerting. Track the alerts that would be generated for a week or more.
  • Tune – Identify the worst performing systems and tune the platform as much as possible to reduce or eliminate potential alerts. Disk configuration and pagefile settings along with basic application tuning go a long way!
  • Adjust – Adjust the monitor alert thresholds to match the customer environment and accommodate tuning limitations. When defining thresholds, consider how long a performance issue will be permitted before alerting – avoid alerts from transient conditions.
  • Alert – Enable alerting, knowing that when an alert arrives, it most likely needs your attention!
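
The “how long will a performance issue be permitted” point in the Adjust step can be sketched as a run-length check: alert only when the threshold is exceeded for several consecutive samples. The sample values, the 90% threshold, and the function name below are assumptions chosen for illustration.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if at least `min_consecutive` samples in a row exceed
    `threshold`; isolated spikes are ignored."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical CPU utilization samples, one per polling interval.
cpu = [40, 97, 45, 92, 95, 96, 50]
print(sustained_breach(cpu, 90, 3))  # True: 92, 95, 96 are three in a row
print(sustained_breach(cpu, 90, 4))  # False: no run of four
```

The single 97% spike never produces an alert, which is the transient condition the Adjust step is meant to filter out.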

Also consider restricting when performance alerts can be submitted. Our RMM Suite allows specific alert classes to be ignored after-hours when backups and daily close operations are expected to cause higher than normal utilization.
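
A time-window rule of this kind might look like the following Python sketch. It does not show how the RMM Suite implements the feature; the business hours, the weekday-only rule, and the set of suppressed classes are all assumptions.

```python
from datetime import datetime, time

# Assumed values for illustration only.
BUSINESS_START = time(7, 0)
BUSINESS_END = time(19, 0)
SUPPRESSED_AFTER_HOURS = {"PERFORMANCE"}


def should_submit(alert_class: str, when: datetime) -> bool:
    """Submit every alert during business hours; outside them, drop the
    classes listed in SUPPRESSED_AFTER_HOURS."""
    in_hours = (
        when.weekday() < 5  # Monday to Friday
        and BUSINESS_START <= when.time() < BUSINESS_END
    )
    return in_hours or alert_class.upper() not in SUPPRESSED_AFTER_HOURS


late = datetime(2024, 3, 6, 23, 30)   # a Wednesday night
midday = datetime(2024, 3, 6, 10, 0)
print(should_submit("Performance", late))    # False
print(should_submit("Service", late))        # True
print(should_submit("Performance", midday))  # True
```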

Why Is This Important?

Focused monitor sets simplify deployment, customization, and management. When a monitor set needs to be updated after a new application version is released, the engineer knows exactly which monitor set to update – no need to scroll through hundreds of unrelated definitions.

Testing takes time and effort, and can pull an engineer away from customer service, but good monitors will significantly reduce the noise from unmanaged alerts, which simply drain your tech team’s resources! Taking the time to validate monitors and focus on result-based alerts improves overall productivity.

Developing standards – even with simple documentation – provides the ability for any engineer on your team to efficiently maintain the monitor sets and support the overall RMM platform.

Examples of Good Monitoring Practices

  • Define the alert as specifically as possible. Using wildcards can introduce unwanted alerts. For Event Logs, this means defining the Event Source, Event ID, and possibly a message text fragment.
  • Use standard, “machine readable” subject lines and detailed body content to clearly convey the issue to the help desk.
    • Good subject lines allow the RMM and PSA to parse the information to determine the response, priority, possible remediation, and provide data for classifying and routing the ticket.
    • The subject should provide all necessary information, including agent, alert classification fields, and priority.
  • Implement “Smart Monitors” – scripts that intelligently monitor a condition, auto-adapt to local configuration, suppress transient conditions, and alert only after attempting to remediate the event. A smart monitor may do any or all of these things. For example:
    • Disk capacity – adjust the threshold based on disk size. No more issues from applying the same 10% free threshold to both 100 GB and 100 TB volumes, or from deploying a separate monitor per disk.
    • Antivirus – force a check-in and update when definitions are outdated. This alone has reduced our tickets from 30/day to 3-4/week!
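
One way to picture the size-aware disk threshold is a percentage rule bounded by an absolute floor and cap, as in the Python sketch below. This is not the logic of any particular smart monitor; the 10% figure comes from the text, while the 2 GiB floor, the 50 GiB cap, and the function names are assumptions.

```python
import shutil

GIB = 1024 ** 3


def min_free_bytes(total_bytes, percent=10.0, floor_gib=2, cap_gib=50):
    """Required free space: `percent` of the volume, but never below the
    floor or above the cap (floor and cap are assumed values)."""
    by_percent = total_bytes * percent / 100
    return int(min(max(by_percent, floor_gib * GIB), cap_gib * GIB))


def disk_alert(total_bytes, free_bytes):
    """Return True when free space has dropped below the adaptive threshold."""
    return free_bytes < min_free_bytes(total_bytes)


print(min_free_bytes(100 * GIB) // GIB)          # 10 (100 GiB volume)
print(min_free_bytes(100 * 1024 * GIB) // GIB)   # 50 (100 TiB volume, capped)

# The same rule applied to a local volume.
usage = shutil.disk_usage("/")
print(disk_alert(usage.total, usage.free))
```

With a cap in place, a very large volume no longer alerts while terabytes are still free, and the same monitor can be applied to every disk.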

Summary

  • Invest in quality monitor sets. This takes time or money, but the payback can be rapid and significant.
  • Focus on alerts that can be addressed and resolved. Everything else simply drains your resources.
  • Start small by selecting the most critical events first. Allow time for these monitors to be reviewed and adjusted, then add less-critical events.
  • Set appropriate thresholds.
  • Use “Smart Monitors” instead of or in addition to built-in monitors:
    • Auto-adapt disk capacity alert to disk size.
    • Remediate before Alert – the monitor triggers a remediation process and alerts only if unsuccessful. Scripts, BAT files, and EXEs can communicate with Kaseya for alerting.
    • Perform transient suppression – alert only when the condition persists for a specific period of time.
  • Reduce labor cost and free up technical resources with good RMM design, leveraging available tools.
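
The “Remediate before Alert” pattern from the list above can be sketched in a few lines of Python. The three callables are placeholders: in practice `check` might query a service state, `remediate` might restart it, and `alert` would hand off to the RMM. None of these names come from Kaseya.

```python
from typing import Callable


def remediate_then_alert(
    check: Callable[[], bool],
    remediate: Callable[[], None],
    alert: Callable[[], None],
    attempts: int = 1,
) -> bool:
    """Return True if the condition is healthy, possibly after remediation.
    Raise the alert and return False only if every attempt fails."""
    if check():
        return True
    for _ in range(attempts):
        remediate()
        if check():
            return True
    alert()
    return False


# Simulated service that recovers once it is restarted.
state = {"running": False}
sent = []
ok = remediate_then_alert(
    check=lambda: state["running"],
    remediate=lambda: state.update(running=True),
    alert=lambda: sent.append("Alert,SRV01,Service,Spooler"),
)
print(ok, sent)  # True []
```

Because the alert fires only after remediation has failed, a ticket that does reach the help desk represents a problem that automation could not fix.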

MSP Builder Solutions

There are several ways that MSP Builder can assist with streamlining your Kaseya Service Desk platform. Two of our most popular are the Multi-Tool and the RMM Suite.

  • The Multi-Tool is designed for DIY solutions for Service Desk automation. It provides over 60 functions in a single executable for high-precision math and comparisons, string manipulation, logic, networking, and time calculations. The time functions alone allowed us to perform complex calculations to eliminate performance alerts outside of production hours, as well as determine the operational state of the help desk to decide when to send after-hours alerts. This tool provides days of coding effort for the cost of a few hours of professional services.
  • The RMM Suite provides a collection of tools that “Jump Start” a Kaseya VSA platform, including starter packs for KAV & KAM profiles, patching policies, KNM monitor templates with multiple performance tiers, agent monitor sets, and a broad collection of agent procedures, including a sub-group for remediation and NOC management. The core components are a highly automated service desk that integrates tightly with both agent and KNM monitors, including our “smart monitors” and a large collection of autonomous maintenance tools. The maintenance package includes an end-user interface that provides status (answering the “what have you done for me lately” question) and information about upcoming maintenance and reboot tasks. We also provide an extensive set of System Policies and Views that together automate a large part of agent management – we detect specific system roles with a daily audit process and use that role data to automatically apply appropriate monitor sets. These policies also auto-apply patch configuration, application updates, daily maintenance, and other management tasks without any manual intervention – only the exceptions need to be touched, and most management is done via agent procedures.

 
