vROPS – Alerting Do’s and Don’ts
In this post, I will explain how I personally handle alerting for my customers. Once you install an enterprise monitoring tool such as vROPS, you will see that your environment has hundreds of active alerts for Virtual Machines, Hosts, Datastores, etc. I fully understand the frustration when my customers say they are overwhelmed by all these alerts and have no idea where to start fixing them. Management looks at these alerts and says, “go fix them.” Believe it or not, I rarely go to the alerting tab at all. After I show my customers my dashboards, they stop relying on the alerting tab to fix issues and start using the dashboards instead, because my Health-Check Dashboard covers most of the alerts that matter and more. With that said, I will answer the questions customers most often ask about alerting, based on my 9 years of experience at VMware.
First, I need to add a disclaimer. This post is based on my own personal experiences and opinions. It is not an official VMware best practice guide.
In this post I will answer the following:
- Do’s and Don’ts for alerting (Experience sharing)
- Too many active alerts, how do I begin to fix all of them? Management expects us to fix all of these. (Resolving alerts)
- How to manage these alerts? (Alert Management)
- How to best define what alerts get ticketed and what alerts get emailed? (Prioritizing Alerts)
- What is my recommended strategy for alerting? (Alerting Strategy)
Do’s (based on personal experience)
- Do create more alerts as needed, and make sure you test all possible scenarios – if you need an alert that is not out of the box, create it yourself. It does take logic and lots of testing.
- Do use dashboards instead as they make the process much easier – see my healthcheck dashboard to understand why (link below)
- Do separate your alerts into 3 categories (ticketing, email, and dashboard worthy) – I will explain this more in the last section.
- Do put a prefix in front of any custom alerts you create to help distinguish them from out-of-the-box alerts
- Do use policies to disable and modify alerts as needed
Don’ts (based on personal experience)
- Don’t create a lot of vROPS policies – they are too hard to manage, even for me. In my opinion, if you have more than 2 policies, you are making things more complicated for yourself. 90% of my customers only need one. The only time they need a second is when they have a vRA or VDI environment that needs different alerting, metric thresholds, etc. I have been doing this for 9 years, and I can tell you that reverse engineering multiple policies is not easy, even for me and possibly even for the person who created them.
- Do not forward all alerts to ticketing or email – the recipients will end up filtering them out, critical ones included!
- Do not delete any out-of-the-box alerts; disable them in the policies instead – deleted alerts will come back once you do an upgrade
- Do not modify any of the out-of-the-box alerts – modify the symptoms from the policies, or clone the alert and modify the clone, then disable the original alert in the policy. The reason is that your changes will be lost once you update vROPS, as those alerts will be overwritten back to the default.
Too many active alerts, how do I begin to fix them?
A superior monitoring tool like vROPS will have 1000+ alerts out of the box. Having many canned alerts is a good thing; in fact, this is what you paid for. You paid for the vendor to provide you with preconfigured alerts for all the objects you want to monitor (VMs, Hosts, vCenter, etc). I would rather have 1,000 alerts out of the box than 200, because the fewer alerts you have, the more you must create manually. So, this is not a valid complaint.
Now, I fully understand that management looks at these alerts and expects you to use them to make everything better than it was before. That is easier said than done: you look at all the active alerts and you are confused about where to begin. So how do I deal with this? I rarely go to the active alerts section at all. Yes, I have been doing this long enough to know what the alerts mean, but even I find it hard to start fixing everything from that section. That is why I skip it entirely and create a dashboard.
Why do I use a dashboard instead? First, the goal of all these alerts is to tell you what is wrong with your environment today. A dashboard gives me the best way to organize the data, modify the thresholds, and view everything in one place. Download my dashboard and you will understand what I mean. Best of all, it is free to download on my site, and it is also the number 1 most downloaded dashboard on VMware Code. You can download it below; make sure to read the user guide.
Also, you can use the alert widget to enhance any dashboard.
How do I manage these alerts?
Now that I have answered how to remediate alerts (choose the ones you want to address and add them to a dashboard), the next question is: how do I best manage the alerts (add/remove/modify)? To answer this, I will break it into 3 parts below.
Do I ever Remove Alerts?
The answer is no, mainly because the out-of-the-box alert definitions are my way of asking vROPS, “what do you see?” So I don’t remove any alerts personally. However, I do have customers who want to remove some alerts for very good reasons. Most of the time, the alert doesn’t apply to their specific environment, so they don’t want to see it at all. Do not delete the alert! Go to your active policy, look for the alert, and disable it there. The reason you don’t want to delete the alert is that it will come back once you update vROPS.
What do I do if I want to modify the alert?
If it is something simple, like changing a symptom threshold from 80% to 90%, you can change that in the policy. If you want to add to the alert, you will need to clone the alert and add the extra symptoms to the clone. Then you will need to disable the original alert in the policy. The reason is that once you update vROPS, the default alert goes back to its original state, so any modifications you made directly to it will be reset.
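The rules for removing and modifying alerts can be summed up as a small decision helper. This is only a sketch of the reasoning, not anything vROPS provides: the function name `plan_alert_change` and the change labels (`disable`, `threshold`, `add_symptom`) are made up for illustration, and the returned steps are things you would still carry out by hand in the vROPS UI.

```python
def plan_alert_change(is_out_of_box, change):
    """Return the manual steps for a change to an alert definition.

    change is one of the hypothetical labels 'disable', 'threshold',
    or 'add_symptom'. Nothing here talks to vROPS; it only encodes
    the rules of thumb from this post.
    """
    if change not in ("disable", "threshold", "add_symptom"):
        raise ValueError(f"unknown change: {change!r}")

    # Custom alerts are not overwritten by an upgrade, so edit them directly.
    if not is_out_of_box:
        return ["edit the custom alert definition directly"]

    if change == "disable":
        # Never delete: a deleted default alert comes back after an update.
        return ["disable the alert in the active policy"]
    if change == "threshold":
        return ["override the symptom threshold in the policy"]
    # Adding symptoms means working on a clone, then hiding the original.
    return [
        "clone the alert",
        "add the extra symptoms to the clone",
        "disable the original alert in the policy",
    ]
```

For example, `plan_alert_change(True, "add_symptom")` returns the three clone-and-disable steps, while a simple threshold tweak stays a one-step policy change.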
Do I add more alerts?
The answer is yes; some alerts are specific to what a customer asks for. Make sure you put a prefix in front of the name of any custom alert, such as VMignite.com – VM Alert. This way you will know that it is a custom alert, and it also makes filtering easier: all you need to do is type VMignite in the search and it will show you all the custom alerts you created.
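To show why a consistent prefix pays off, here is a minimal sketch of the naming convention and the search it enables. The helper names and the plain " - " separator are my own assumptions for illustration; the search function only imitates a case-insensitive substring filter and is not the vROPS search itself.

```python
CUSTOM_PREFIX = "VMignite.com"  # example prefix from this post; use your own


def custom_alert_name(base_name, prefix=CUSTOM_PREFIX):
    """Return the alert name with the custom prefix, without doubling it."""
    if base_name.startswith(prefix):
        return base_name
    return f"{prefix} - {base_name}"


def find_custom_alerts(alert_names, search="VMignite"):
    """Imitate typing a search term into an alert list (substring match)."""
    term = search.lower()
    return [name for name in alert_names if term in name.lower()]


alerts = [
    "vCenter Service is Down",            # out of the box
    custom_alert_name("VM Alert"),        # custom
    custom_alert_name("Host NIC Alert"),  # custom, hypothetical name
]
custom_only = find_custom_alerts(alerts)
```

Here `custom_only` holds just the two prefixed names, which is exactly the list you want when you review or export what you built yourself.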
How do I best define what alerts get ticketed and what alerts get emailed?
Out of the 1000+ alerts vROPS has out of the box, a company must gather all its Engineers, Operators, and management to decide which alerts matter. All alerts should be sorted into three categories:
- Dashboard Worthy: These alerts should only be shown in vROPS. Some examples would be VMTools outdated or VM Snapshots. These are worth showing on a dashboard, but they will be a nightmare if they get ticketed or emailed every time they occur, because they happen daily and often. These lower priority alerts often get sent to ticketing systems and email notifications, which leads people to filter all vROPS alerts into a folder. So when something critical, such as a vCenter is down alert, gets sent, they will not see it in their inbox because it was filtered into that folder. This is one of the biggest mistakes most of my customers made before I got there.
- Ticket Worthy: These alerts are actionable by the Operations and Engineering teams. Examples would be Datastore is running out of space, ESXi Host has contention that is affecting all VMs on that host, etc. These tickets are actionable, as someone needs to address them before they become a greater problem. However, they are not something you would drop what you are doing to fix. Which leads us to our next category.
Email Notification Worthy
These are alerts where you drop everything and go fix the problem immediately. In other words, these are alerts that would make you excuse yourself in the middle of a meeting or leave your lunch uneaten. Examples would be Email server is down, Datastore is 100% full which caused VMs to fail, ESXi Host crashed which brought down production VMs, E-Commerce site is down so customers can’t buy anything from our site, etc. These are events that Engineering should be the first to know about and will need to fix immediately, before things spiral out of control. A lot of the time these alerts get missed because Engineering receives too many non-critical alerts and ends up filtering everything from vROPS into an email folder. This leads to the worst case scenario, where the paying customers are the first to find out their servers are broken and call to complain, which leads to upper management being furious and asking why IT Operations did not catch it and why we were the last to know. The answer is simple: you bought the right tool, but you haven’t put the right strategy in place. Has this happened to you before? Well, read below on how to put the right strategy in place.
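The three categories are not mutually exclusive, which makes them a natural fit for a set of flags. The sketch below is one possible way to write the classification down; the `Route` type and the specific assignments are illustrative assumptions (your team decides the real ones), and anything unlisted defaults to dashboard only so that nothing is forwarded by accident.

```python
from enum import Flag, auto


class Route(Flag):
    DASHBOARD = auto()
    TICKET = auto()
    EMAIL = auto()


# Illustrative assignments only; agree on the real ones with your team.
ROUTING = {
    "Virtual machine snapshot longer than 2 days old": Route.DASHBOARD,
    "Datastore is running out of disk space.": Route.DASHBOARD | Route.TICKET,
    "vCenter Service is Down": Route.DASHBOARD | Route.TICKET | Route.EMAIL,
}


def destinations(alert_name):
    """List where an alert goes; unknown alerts stay on the dashboard only."""
    route = ROUTING.get(alert_name, Route.DASHBOARD)
    return [r.name.lower() for r in Route if r in route]
```

Defaulting to dashboard only is the design choice that keeps the inbox quiet: an alert has to be deliberately promoted before it can create a ticket or an email.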
How to implement the right strategy (based on personal experience)
Make a list of the alerts that matter for each object (Virtual Machines, vCenter, Host, Datastores, etc). Lost? Just copy and paste my spreadsheet below into Excel. Add and remove as needed. Make sure you tell management you got it from VMignite.com. This will at least give you more credibility, as I have noticed people don’t like or trust something one person put together. This list was refined through many engagements with my Fortune 500 customers, which makes it more credible.
Set up multiple meetings to get this list done. Notice I said multiple: this is a time-consuming process, but it is worth it in the end
- It is a team effort! No one person can make all these decisions. You need Operations, Engineering, and management all 100% involved
- Go through each line item one by one and ask: is this dashboard worthy? Is this also ticket worthy? Should this alert be sent out by email notification? Make sure you explain the clear difference between the three to them, or show them this blog post if you must. You may use your VMware TAM to coordinate this, as I find it always good to have an outsider manage the meetings.
- One line item could fall in all three categories, for example “vCenter server is down” should be on a dashboard, should get a ticket sent, and an email alert as well.
Fill out which team is responsible to resolve the alert. This will prevent conflict later, as everyone has agreed on who will take ownership of that alert.
- Once everything is all filled out, use the spreadsheet to create the forwarding rules/notifications in vROPS.
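As a sketch of that last step, the completed spreadsheet can be turned into a per-channel list of rules to create. The column names (Dashboard, Ticket, Email, Owner), the Y/N values, and the owners shown here are assumptions about how you might lay out the sheet, not part of my spreadsheet or of vROPS. The code only builds a plan; you still create the notification rules in vROPS yourself.

```python
import csv
import io

# Hypothetical export of the filled-in spreadsheet (CSV from Excel).
SHEET = """Object Type,Alert,Dashboard,Ticket,Email,Owner
vCenter Server,vCenter Service is Down,Y,Y,Y,Engineering
Virtual Machine,Virtual machine snapshot longer than 2 days old,Y,N,N,Operations
Datastore,Datastore is running out of disk space.,Y,Y,N,Operations
"""


def build_rules(sheet_text):
    """Group alerts marked Y into the ticket and email forwarding lists."""
    rules = {"ticket": [], "email": []}
    for row in csv.DictReader(io.StringIO(sheet_text)):
        for channel, column in (("ticket", "Ticket"), ("email", "Email")):
            if row[column].strip().upper() == "Y":
                rules[channel].append(
                    {
                        "object_type": row["Object Type"],
                        "alert": row["Alert"],
                        "owner": row["Owner"],
                    }
                )
    return rules


RULES = build_rules(SHEET)
```

Dashboard-only rows produce no forwarding rule at all, which matches the intent: they stay visible in vROPS without ever reaching a ticket queue or an inbox.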
|IaaS vROPS Alerting|
|vCenter Server||vCenter Service is Down|
|vCenter Server||Certificate for VASA Provider(s) will expire soon|
|vCenter Server||Number of Ips to be pinged exceeds the limit|
|vCenter Server||Duplicate object name found in vCenter|
|vCenter Server||A problem occurred with a vCenter Server component.|
|vCenter Server||Refreshing CA certificates and CRLs for VASA Provider(s) failed|
|vCenter Server||The vCenter Server Storage data collection failed.|
|vCenter Server||VASA Provider(s) disconnected|
|vCenter Server||vCenter HA health is degraded|
|vCenter Server||vCenter data collection is slow|
|vCenter Server||vCenter NTP Status is Down|
|vCenter Server||vCenter Backup Job failed|
|vCenter Server||vCenter License is Overused|
|Cluster||vSphere HA failover resources are insufficient.|
|Cluster||vSphere HA master missing|
|Cluster||Proactive HA provider has reported health degradation on the underlying hosts.|
|Cluster||Cluster has CPU Contention caused by Virtual Machines|
|Cluster||Cluster has high CPU workload|
|Cluster||Cluster has Memory Contention caused by Virtual Machines|
|Cluster||Cluster has high Memory workload|
|Host System||Host has CPU Contention for longer than 24 hours|
|Host System||Host has Memory contention for longer than 24 hours|
|Host System||ESXi host has detected a link status ‘flapping’ on a physical NIC|
|Host System||ESXi host has detected a link status down on a physical NIC.|
|Host System||Path redundancy to storage device degraded|
|Host System||vSphere High Availability (HA) has detected a network-isolated host.|
|Host System||The host has lost connectivity to the physical network|
|Host System||vSphere High Availability (HA) has detected a possible host failure.|
|Host System||The host has lost connectivity to a dvPort|
|Host System||A fatal error occurred on a PCIe bus during system reboot.|
|Host System||A fatal memory error was detected at system boot time.|
|Host System||A PCIe error occurred during system boot, but the error is recoverable.|
|Host System||A recoverable memory error has occurred on the host.|
|Host System||Host has lost connection to vCenter Server|
|Host System||Host is experiencing high number of packets dropped|
|Host System||Uplink redundancy on DVPorts degraded|
|Host System||The host lost connectivity to a Network File System (NFS) server|
|Host System||The host has lost redundant uplinks to the network|
|Host System||vSphere High Availability (HA) has detected a network-partitioned host|
|Host System||The host has lost redundant connectivity to a dvPort|
|Virtual Machine||VM Low in Disk Space|
|Virtual Machine||VM High CPU for over a day|
|Virtual Machine||VM High Memory for over a day|
|Virtual Machine||Virtual machine is experiencing memory compression, ballooning or swapping due to memory limit.|
|Virtual Machine||Virtual machine snapshot longer than 2 days old|
|Virtual Machine||Virtual machine has memory contention due to swap wait and high disk read latency.|
|Virtual Machine||Virtual machine has CPU contention caused by IO wait.|
|Virtual Machine||Virtual machine has memory contention due to memory compression, ballooning or swapping.|
|Virtual Machine||Virtual machine has disk I/O read latency problem.|
|Virtual Machine||Virtual machine has disk I/O write latency problem.|
|Virtual Machine||Virtual machine has disk I/O latency problem caused by snapshots.|
|Virtual Machine||Not enough resources for vSphere HA to start the virtual machine.|
|Virtual Machine||vSphere HA failed to restart a network isolated virtual machine.|
|Virtual Machine||vSphere HA cannot perform a failover operation for the virtual machine|
|Virtual Machine||Virtual machine has CPU contention due to memory page swapping in the host.|
|Virtual Machine||Virtual machine has CPU contention due to multi-vCPU scheduling issues (co-stop) caused by snapshots|
|Virtual Machine||Virtual machine has CPU contention caused by co-stop.|
|Datastore||High Disk Latency for over 1 hour|
|Datastore||Datastore is running out of disk space.|
|Datastore||Datastore has lost connectivity to a storage device.|
|Datastore||Datastore has one or more hosts that have lost redundant paths to a storage device.|
|Datastore||A storage device for a datastore has been detected to be off|