How to Maximize All Monitoring Tools

Every time I walk into a new engagement with a Fortune 500 company, I ask the customer which issues they want to address. I always get stories about how performance bottlenecks like high CPU and memory usage slowed down their environment, how things are breaking left and right without anyone knowing about it, and how they would like to find out what is causing all their problems and outages. Every company wants a Magic 8-Ball solution: something that instantly identifies and solves all of its problems.

In the end, I am always handed a dozen requirements. Sure, I can resolve those, finish the engagement, and wait until new problems occur a few months later. By then, the customer might ask for help again or, even worse, stop using the tool altogether and either go back to doing things manually or buy another tool, thinking it will solve all their problems.

In my experience, all companies’ pain points are the same, and the typical customer mentions only about 10% of what today’s top monitoring applications, such as vROps, can actually do. It is up to the expert to guide them toward what I call “Monitoring 360 Degrees”, which means getting a monitoring tool to do everything it should be doing. I say the pain points are the same because every requirement falls under one of the six categories below. (Disclaimer: this is all my own personal opinion.)

All Monitoring Use Cases

  1. Being Proactive (Fix problems now to prevent future outages)
  2. Root Cause Analysis in minutes (Troubleshooting VM, Host, vCenter, etc.)
  3. 24/7 Critical Monitoring (Monitoring critical Apps, Websites, Infrastructure)
  4. Capacity Management (Realizing Utilization, Growth, Inventory)
  5. Optimize Resources (Reclaiming resources, identifying which objects need more CPU and memory)
  6. Compliance (Hardening checks, Environment consistency checks)

Now you are probably wondering: where are the reports, alerts, ticketing, and auto-remediation? These are not use cases; they are bells and whistles that apply to all six use cases above. Any one of those use cases can involve reports and alerting if the customer wants it. I can easily create alerts on performance, capacity, compliance, and so on.

With that said, this is what every monitoring tool should provide at a bare minimum.

Monitoring 360 Degree Checklist

  1. Proactive Dashboard
  2. Troubleshooting Dashboards
  3. 24/7 Critical Monitoring Dashboard
  4. Capacity Management Dashboards
  5. Optimization Dashboards
  6. Critical alerts sent as email
  7. Selected alerts generated as tickets
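Items 6 and 7 come down to a routing decision for each alert. Here is a minimal Python sketch of that decision, assuming a simple policy: critical alerts go to email, and a hand-picked set of alert names becomes tickets. The `Alert` fields, the alert names, and the `route` function are all hypothetical and are not part of any vROps API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical alert record; the field names are illustrative only.
@dataclass
class Alert:
    name: str
    criticality: str  # assumed values: "critical", "warning", "info"
    object_type: str  # e.g. "VM", "Host", "Datastore"

# Assumed policy: alert names the team has chosen to track as tickets.
TICKETED_ALERTS = {"Datastore running out of space", "Host hardware fault"}

def route(alert: Alert) -> List[str]:
    """Return the destinations for one alert under the assumed policy."""
    destinations = []
    if alert.criticality == "critical":
        destinations.append("email")   # engineers fix these immediately
    if alert.name in TICKETED_ALERTS:
        destinations.append("ticket")  # tracked in the ticketing system
    return destinations
```

An alert can land in both places, in one, or in neither. The point is that the policy is written down once instead of being decided alert by alert.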

Let me explain how some of this should work.

  1. The majority of companies are reactive, but they all want to be proactive. To achieve this, you need the following:
    1. Health dashboard – you can’t prevent problems before they happen if you have no idea what problems you have to begin with. That is why you need a dashboard that displays all the problems you have today (hosts, vCenters, VMs, datastores, clusters, etc.). Sounds difficult? The good news is that I have already created an environment health checker dashboard for vROps. It is the number one most downloaded dashboard on VMware Code. Download it here
    2. Alerts being sent out – next, you need to decide which alerts should be forwarded to your ticketing system and which should be emailed to engineers to fix immediately. If an alert for the condition that caused an outage doesn’t exist yet, you will need to create it! I wrote a guide on this here:
  2. When outages do happen, you need to figure out what caused them immediately. You will need the following:
    1. Troubleshooting dashboards – an excellent troubleshooting dashboard should let you find the root cause in a minute! For example, to troubleshoot a problem VM in a minute, I need to be able to eliminate the host, network, physical server, and storage from the equation. On top of that, I need to be able to identify all VM bottlenecks, such as CPU, memory, contention, disk latency, application, and configuration issues. Does finding the root cause in a minute sound impossible? I have been able to do this with all my customers. Sorry, I can’t share this dashboard, but you can download some light versions of my troubleshooting dashboards on my download page.
  3. Management, engineers, and operators all need insight into the entire environment in a single pane of glass.
    1. 24/7 Critical Monitoring Dashboard – let’s say you want to monitor, in a single pane of glass, an environment that consists of 20 vCenters, 1,000 hosts, 1,000 datastores, 100 clusters, and 20,000 VMs. On top of that, you want to monitor critical websites, vSAN, and NSX as well. And if a site is having problems, the same dashboard should show which objects are causing the problem. This is a dashboard that I have created for my customers, and they have dedicated monitors and even TVs displaying it throughout the company. Here is a simple version I created; the more advanced one is for my customers:
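The idea behind a single-pane dashboard that also shows which objects are causing a site's problem can be sketched as a rollup: each site takes the worst status of its objects, and the objects at that status are listed as the offenders. This is a rough Python illustration under my own assumptions, not how any particular tool implements it. The status colors, their ordering, and the `rollup` function are hypothetical.

```python
from collections import defaultdict

# Assumed severity ordering; higher means worse.
SEVERITY = {"green": 0, "yellow": 1, "orange": 2, "red": 3}

def rollup(objects):
    """Summarize health per site.

    objects: iterable of (site, object_name, status) tuples.
    Returns {site: (worst_status, [objects at that status])};
    the offender list is empty when the whole site is green.
    """
    by_site = defaultdict(list)
    for site, name, status in objects:
        by_site[site].append((name, status))

    summary = {}
    for site, members in by_site.items():
        # The site is only as healthy as its least healthy object.
        worst = max((status for _, status in members), key=SEVERITY.__getitem__)
        offenders = [name for name, status in members
                     if status == worst and worst != "green"]
        summary[site] = (worst, offenders)
    return summary
```

Feeding it a few made-up objects, such as `("DC1", "esx-02", "red")` alongside green VMs in `DC1` and `DC2`, marks `DC1` red with `esx-02` as the object to look at, while `DC2` stays green with no offenders.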

Once again, all of this is my own personal opinion, but it is based on my 8 years of experience working for VMware as a consultant. To learn more about the granular features a powerful monitoring tool can offer, read the following

Updated: 1/09/2022

