Log Insight – VM Monitoring Dashboard (Download)

This is a must-have dashboard for anyone who wants to know who did what to their virtual machines. With this dashboard alone you will be able to see who created, deleted, modified, updated, power cycled, moved, remotely accessed, or exported a VM. It’s a 360-degree audit dashboard for everything related to virtual machines. Details below.

What you will be able to monitor

  • VMs Created/Deleted
  • VMs Powered On/Off
  • VMs Rebooted
  • VMs Configured (Disk, Network, CPU, Memory)
  • VMs Renamed
  • VMs that got vMotioned
  • VMs that need Disk consolidation
  • Reservations
  • Limits
  • Snapshots
  • VMs Exported
  • VM Configuration Parameters changes
  • ISO Mount
  • VMs moved to folders
  • VMs converted to a template
  • Remote Console used to access a VM
  • VM Hot Add Modifications (CPU/Memory)
  • VM Versions updated
  • VMs Customized
  • VM HA event

Download Here: https://code.vmware.com/samples?id=7667

Install Guide

To import, go to Content Packs > Import Dashboard and import the file as a Content Pack. Then go to Dashboards to view the dashboard.



vROPS 8.4+ – Executive Dashboard (Download)

With the new features of 8.4, I was finally able to finish my Executive Dashboard the way I envisioned it. In one pane of glass, executives can see how much capacity is left, what the current inventory is, how fast the environment is growing, whether the infrastructure has any CPU or memory bottlenecks, whether storage has enough space and is running at optimal speed, whether any ESXi hosts are down, whether Cluster HA/DRS is enabled, and whether VMs have enough resources to prevent major outages. Read the user guide below to fully understand how to use the dashboard. It can also serve as a 24/7 critical monitoring dashboard. It should work with older versions as well; it just won’t look as nice.

Download the Dashboard here: https://code.vmware.com/samples?id=7628

What the dashboard covers

Capacity

  1. Inventory
    1. vCenters, Hosts, Clusters
    2. Datastores
    3. VMs (Powered On, Powered Off, VM to host ratio)
  2. Cluster Capacity Remaining
  3. Capacity Growth in the past 6 months

Infrastructure Health

  1. Hosts with high CPU usage %
  2. Hosts with high memory usage %
  3. Hosts that are down or powered off
  4. Datastores out of space
  5. Datastores with disk latency
  6. Clusters with HA/DRS turned Off

VM Health

  1. C: Drive low space
  2. Root Drive low space
  3. Disconnected VMs

Note:

  1. The entire dashboard auto-refreshes every 5 minutes, so you always have the latest data. It can therefore also be used as a NOC dashboard.
  2. You must be on vROPS 8.4 to have the best experience. The dashboard should work with older versions as well; it just won’t look as nice.
  3. You will need to enable these metrics to see C: Drive low space and Root Drive low space.

    http://www.vmignite.com/2021/02/vrops-8-how-to-enable-hidden-metrics-and-properties/

  4. To remove any clusters, hosts, or environments you don’t want to see in a widget, just edit the widget and filter them out.
  5. Click the expand button on any widget to maximize it and see more values.

User Guide

View what you have in your entire environment

Capacity Remaining % is the lowest of the remaining values for CPU, memory, and disk. By default, it uses actual usage % for each of them. If you like, you can switch to the allocation model in the policies, and you can also add a buffer to CPU, memory, and disk. I set the thresholds to 20% for yellow, 15% for orange, and 10% for red. Everything here should be green.
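To make the calculation concrete, here is a minimal Python sketch of the "lowest remaining value wins" rule and the color thresholds described above. It is only an illustration, not how vROPS computes the value internally; the function names, the sample cluster numbers, and the choice to treat each threshold as inclusive are my assumptions.

```python
from typing import Dict

# Thresholds from the text: 20% yellow, 15% orange, 10% red.
# Treating each boundary as inclusive is an assumption.
THRESHOLDS = [(10.0, "red"), (15.0, "orange"), (20.0, "yellow")]


def capacity_remaining(remaining_pct: Dict[str, float]) -> float:
    """Return the lowest remaining percentage across CPU, memory, and disk."""
    return min(remaining_pct.values())


def color_for(remaining: float) -> str:
    """Map a remaining-capacity percentage to a widget color."""
    for limit, color in THRESHOLDS:
        if remaining <= limit:
            return color
    return "green"


# Hypothetical cluster: memory is the most constrained resource.
cluster = {"cpu": 42.0, "memory": 18.5, "disk": 61.0}
lowest = capacity_remaining(cluster)
print(lowest, color_for(lowest))
```

In this example memory drives the result, so the cluster shows yellow even though CPU and disk have plenty of headroom.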

Measures VM growth over the past 6 months, for both Total VMs and Running VMs. Hover your mouse over the graph to see the exact number of VMs at a given time.

Make sure utilization is green for all ESXi hosts. Anything above 80% is yellow, 85% is amber, and 90% is red. High utilization may cause CPU or memory bottlenecks, and if a host reaches its maximum, it may cause outages for VMs.

If the Power State is Unknown, this is bad. It means that the ESXi host is either disconnected, orphaned, or not responding. This is usually unplanned, so you should address it immediately. If the host is Powered Off, this is usually planned.

Do not let any datastores get into the red zone. Even when you think it is planned and under control, I have seen many customers fill up disk space, which caused outages for many VMs on the same datastore. Thresholds are 85% for yellow, 90% for orange, and 95% for red.

Disk latency can degrade performance for VMs on that datastore. Thresholds are 10ms for Yellow, 15ms for Orange, and >20ms for Red.
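The host utilization, datastore space, and datastore latency widgets all follow the same "higher is worse" pattern, so their thresholds can be summarized in one small lookup table. The sketch below is only a way to read the numbers quoted in this guide; the dictionary keys and function name are hypothetical, and treating each threshold as inclusive is an assumption.

```python
# (yellow, orange, red) thresholds as quoted in the user guide.
WIDGET_THRESHOLDS = {
    "host_utilization_pct": (80, 85, 90),
    "datastore_used_pct": (85, 90, 95),
    "datastore_latency_ms": (10, 15, 20),
}


def widget_color(widget: str, value: float) -> str:
    """Return the color a value would get on the named widget."""
    yellow, orange, red = WIDGET_THRESHOLDS[widget]
    if value >= red:
        return "red"
    if value >= orange:
        return "orange"
    if value >= yellow:
        return "yellow"
    return "green"


print(widget_color("host_utilization_pct", 87))
print(widget_color("datastore_latency_ms", 4))
```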

This widget catches any cluster that has HA or DRS turned off. If it is blank, none were found, which is a good thing.

Having no space on the C: drive for Windows or the root drive for Linux can cause outages and degraded performance. Make sure no VMs get to that point. You should see values here no matter what. If you don’t, you must enable the metrics in the policies: http://www.vmignite.com/2021/02/vrops-8-how-to-enable-hidden-metrics-and-properties/

Any VMs shown here should be investigated immediately to find out why they are disconnected. If this is blank, none were found, which is a good thing.

Instructions on how to Import Dashboard

To import in version 7.0 and above

  1. First, unzip the file you just downloaded. It contains a dashboard file and a view file
  2. Go to Dashboards > Actions > Manage Dashboards

  3. Hit the dropdown and select Import Dashboards. Import the Dashboard.zip file

  4. Next, go to Views > Dropdown > Import. Import the View.zip file

  5. If you get any errors during the process, make sure to click overwrite before importing


vROPS – Alerting Do’s and Don’ts

In this post, I will explain how I personally handle alerting for my customers. Once you install an enterprise monitoring tool such as vROPS, you will see that your environment has hundreds of alerts for virtual machines, hosts, datastores, and so on. I fully understand the frustration when my customers say they are overwhelmed by all these active alerts and have no idea where to begin fixing them. Management looks at these alerts and says, go fix them. Believe it or not, I rarely go to the alerting tab at all. After I show my customers my dashboards, they stop relying on the alerting tab to fix issues and start using the dashboards instead, because my Health-Check Dashboard covers most of the alerts that matter and more. With that being said, I will answer the questions customers have about alerting based on my 9 years of experience at VMware.

First, a disclaimer: this is based on my own personal experiences and opinions. It is not an official VMware best practice guide.

In this post I will answer the following:

  • Do’s and Don’ts for alerting (Experience sharing)
  • Too many active alerts, how do I begin to fix all of them? Management expects us to fix all of these. (Resolving alerts)
  • How to manage these alerts? (Alert Management)
  • How to best define what alerts get ticketed and what alerts get emailed? (Prioritizing Alerts)
  • What is my recommended strategy for alerting? (Alerting Strategy)

Do’s (based on personal experience)

  • Do create more alerts as needed, and make sure you test all possible scenarios – if you need an alert that is not out of the box, create it yourself. It does take logic and lots of testing.
  • Do use dashboards instead as they make the process much easier – see my healthcheck dashboard to understand why (link below)
  • Do separate your alerts into 3 categories (ticketing, email, and dashboard worthy) – I will explain this more in the last section.
  • Do put a prefix in front of any custom alerts you create to help distinguish them from out-of-the-box alerts
  • Do use policies to disable and modify alerts as needed

Don’ts (based on personal experience)

  • Don’t create a lot of vROPS policies – they are way too hard to manage, even for me. If you have more than 2 policies, you are making things more complicated for yourself, in my opinion. 90% of my customers only need one. The only time they need a second is if they have a vRA or VDI environment that needs different alerting, metric thresholds, etc. I have been doing this for 9 years, and I can tell you that reverse engineering multiple policies is not easy, even for me and possibly even for the person who created them.
  • Do not forward all alerts to ticketing or email – recipients will end up filtering them all into a folder!
  • Do not delete any out-of-the-box alerts; disable them in the policies instead – deleted alerts will come back once you do an upgrade
  • Do not modify any of the out-of-the-box alerts – modify the symptoms in the policies, or clone the alert and modify the clone, then disable the original alert in the policy. The reason is that your changes will be lost once you update vROPS, as those alerts will be overwritten with the defaults.

Too many active alerts, how do I begin to fix them?

A superior monitoring tool like vROPS will have 1000+ alerts out of the box. Having many canned alerts is a good thing; in fact, this is what you paid for. You paid for the vendor to provide you with preconfigured alerts for all the objects you want to monitor (VMs, hosts, vCenter, etc.). I would rather have 1,000 alerts out of the box than 200, because the fewer alerts you have, the more you must create manually. So, this is not a valid complaint.

Now, I fully understand that management looks at these alerts and expects you to use them to make everything better than it was before. That is easier said than done: you look at all the active alerts and are confused about where to begin. So how do I deal with this? I rarely go to the active alerts section at all. Yes, I have been doing this long enough to know what the alerts mean, but even I find it hard to start fixing everything from this section. That is why I skip it entirely and create a dashboard.

Now, why do I use a dashboard instead? First, let me explain that the goal of all these alerts is to tell you what is wrong with your environment today. A dashboard gives me the best way to organize the data, modify the thresholds, and view everything in one place. Download my dashboard and you will understand what I mean. Best of all, it is free to download on my site, and it is also the number 1 most downloaded dashboard on VMware Code. You can download it below; make sure to read the user guide.

http://www.vmignite.com/2021/02/vrops-vsphere-health-checker-dashboard-2-0/

Also, you can use the alert widget to enhance any dashboard

How do I manage these alerts?

Now that I have answered the question of how to remediate alerts (choose which ones you want to address and add them to a dashboard), the next question is: how do I best manage the alerts (add/remove/modify)? To answer this, I will break it up into 3 parts below.

Do I ever Remove Alerts?

The answer is no, mainly because the alert definitions created out of the box are my way of asking vROPS what it sees. So I don’t remove any alerts personally. However, I do have customers who would like to remove some alerts for very good reasons. Most of the time, the alert doesn’t apply to their specific environment, so they don’t want to see it at all. Do not delete the alert! Go to your active policy, look for the alert, and disable it there. The reason you don’t want to delete the alert is that it will come back once you update vROPS.

What do I do if I want to modify the alert?

If it is something simple, like changing a symptom threshold from 80% to 90%, you can change that in the policy. If you want to add to the alert, you will need to clone it and add the extra symptoms to the clone, then disable the original alert in the policy. The reason is that once you update vROPS, the default alert goes back to its original state, so all the modifications you made to it would be reset.

Do I add more alerts?

The answer is yes; some alerts are specific to what a customer asks for. Make sure you put a naming prefix in front of any custom alert, such as VMignite.com – VM Alert. This way you will know that it is a custom alert, and it also makes filtering easier: just type VMignite in the search and it will show you all the custom alerts you created.

How do I best define what alerts get ticketed and what alerts get emailed?

Out of the 1000+ alerts vROPS has out of the box, a company must gather all its engineers, operators, and management to decide which alerts matter. All alerts should be sorted into three categories:

  • Dashboard Worthy
    • These alerts should only be shown in vROPS. Examples would be outdated VMTools or VM snapshots. These are worth showing on a dashboard but would be a nightmare if they were ticketed or emailed every time they occur, because they happen daily and often. When these lower-priority alerts get sent to ticketing systems and email notifications, people end up filtering all vROPS alerts into a folder. Then, when something critical such as a vCenter is down alert gets sent, nobody sees it in their inbox because it was filtered into that folder. This is one of the biggest mistakes most of my customers made before I got there.
  • Ticket Worthy
    • These alerts are actionable by the Operations and Engineering teams. Examples would be a datastore running out of space, an ESXi host with contention that is affecting all VMs on that host, etc. Someone needs to address these before they become a greater problem. However, they are not something you would drop what you are doing to fix, which leads us to our next category.
  • Email Notification Worthy
    • These are alerts that make you drop everything and go fix the problem immediately. In other words, these are alerts for which you would ask to be excused in the middle of a meeting, or abandon your uneaten lunch. Examples would be: the email server is down, a datastore is 100% full and has caused VMs to fail, an ESXi host crashed and brought down production VMs, the e-commerce site is down so customers can’t buy anything, etc. These are events that Engineering should be the first to know about and must fix immediately, before things spiral out of control. A lot of the time these get missed because Engineering receives too many non-critical alerts and has filtered everything from vROPS into an email folder. This leads to the worst-case scenario, where the paying customers are the first to find out their servers are broken and call to complain, which leads to upper management being furious and asking why IT Operations did not catch it and why they are the last to know. The answer is simple: you bought the right tool, but you haven’t got the right strategy in place. Has this happened to you before? Read below on how to put the right strategy in place.

How to implement the right strategy (based on personal experience)

  1. Make a list of the alerts that matter for each object (virtual machines, vCenter, hosts, datastores, etc.). Lost? Just copy and paste my spreadsheet below into Excel, and add and remove rows as needed. Make sure you tell management you got it from VMignite.com. This will at least get you more credibility, as I have noticed people don’t like or trust something one person put together. This list was refined through many engagements with my Fortune 500 customers, which makes it more credible.
  2. Set up multiple meetings to get this list done. Notice I said multiple: this is a time-consuming process, but worth it in the end.
    1. It is a team effort! No one person can make all these decisions. You need Operations, Engineering, and management all 100% involved.
    2. Go through each line item one by one and ask: is this dashboard worthy? Is this also ticket worthy? Should this alert be sent out by email notification? Make sure you explain the difference between the three clearly, or show them this blog post if you must. You may use your VMware TAM to coordinate this, as I find it always good to have an outsider manage the meetings.
    3. One line item could fall into all three categories. For example, “vCenter server is down” should be on a dashboard, should generate a ticket, and should send an email alert as well.
    4. Fill out which team is responsible for resolving the alert. This will prevent conflict later, as everyone has agreed on who will take ownership of that alert.
  3. Once everything is filled out, use the spreadsheet to create the forwarding rules/notifications in vROPS.
IaaS vROPS Alerting

Source | Name | Dashboard | Ticket? | Email | Team Responsible?
vCenter Server | vCenter Service is Down | | | |
vCenter Server | Certificate for VASA Provider(s) will expire soon | | | |
vCenter Server | Number of Ips to be pinged exceeds the limit | | | |
vCenter Server | Duplicate object name found in vCenter | | | |
vCenter Server | A problem occurred with a vCenter Server component. | | | |
vCenter Server | Refreshing CA certificates and CRLs for VASA Provider(s) failed | | | |
vCenter Server | The vCenter Server Storage data collection failed. | | | |
vCenter Server | VASA Provider(s) disconnected | | | |
vCenter Server | vCenter HA health is degraded | | | |
vCenter Server | vCenter data collection is slow | | | |
vCenter Server | vCenter NTP Status is Down | | | |
vCenter Server | vCenter Backup Job failed | | | |
vCenter Server | vCenter License is Overused | | | |
Cluster | vSphere HA failover resources are insufficient. | | | |
Cluster | vSphere HA master missing | | | |
Cluster | Proactive HA provider has reported health degradation on the underlying hosts. | | | |
Cluster | Cluster has CPU Contention caused by Virtual Machines | | | |
Cluster | Cluster has high CPU workload | | | |
Cluster | Cluster has Memory Contention caused by Virtual Machines | | | |
Cluster | Cluster has high Memory workload | | | |
Host System | Host has CPU Contention for longer than 24 hours | | | |
Host System | Host has Memory contention for longer than 24 hours | | | |
Host System | ESXi host has detected a link status ‘flapping’ on a physical NIC | | | |
Host System | ESXi host has detected a link status down on a physical NIC. | | | |
Host System | Path redundancy to storage device degraded | | | |
Host System | vSphere High Availability (HA) has detected a network-isolated host. | | | |
Host System | The host has lost connectivity to the physical network | | | |
Host System | vSphere High Availability (HA) has detected a possible host failure. | | | |
Host System | The host has lost connectivity to a dvPort | | | |
Host System | A fatal error occurred on a PCIe bus during system reboot. | | | |
Host System | A fatal memory error was detected at system boot time. | | | |
Host System | A PCIe error occurred during system boot, but the error is recoverable. | | | |
Host System | A recoverable memory error has occurred on the host. | | | |
Host System | Host has lost connection to vCenter Server | | | |
Host System | Host is experiencing high number of packets dropped | | | |
Host System | Uplink redundancy on DVPorts degraded | | | |
Host System | The host lost connectivity to a Network File System (NFS) server | | | |
Host System | The host has lost redundant uplinks to the network | | | |
Host System | vSphere High Availability (HA) has detected a network-partitioned host | | | |
Host System | The host has lost redundant connectivity to a dvPort | | | |
Virtual Machine | VM Low in Disk Space | | | |
Virtual Machine | VM High CPU for over a day | | | |
Virtual Machine | VM High Memory for over a day | | | |
Virtual Machine | Virtual machine is experiencing memory compression, ballooning or swapping due to memory limit. | | | |
Virtual Machine | Virtual machine snapshot longer than 2 days old | | | |
Virtual Machine | Virtual machine has memory contention due to swap wait and high disk read latency. | | | |
Virtual Machine | Virtual machine has CPU contention caused by IO wait. | | | |
Virtual Machine | Virtual machine has memory contention due to memory compression, ballooning or swapping. | | | |
Virtual Machine | Virtual machine has disk I/O read latency problem. | | | |
Virtual Machine | Virtual machine has disk I/O write latency problem. | | | |
Virtual Machine | Virtual machine has disk I/O latency problem caused by snapshots. | | | |
Virtual Machine | Not enough resources for vSphere HA to start the virtual machine. | | | |
Virtual Machine | vSphere HA failed to restart a network isolated virtual machine. | | | |
Virtual Machine | vSphere HA cannot perform a failover operation for the virtual machine | | | |
Virtual Machine | Virtual machine has CPU contention due to memory page swapping in the host. | | | |
Virtual Machine | Virtual machine has CPU contention due to multi-vCPU scheduling issues (co-stop) caused by snapshots | | | |
Virtual Machine | Virtual machine has CPU contention caused by co-stop. | | | |
Datastore | High Disk Latency for over 1 hour | | | |
Datastore | Datastore is running out of disk space. | | | |
Datastore | Datastore has lost connectivity to a storage device. | | | |
Datastore | Datastore has one or more hosts that have lost redundant paths to a storage device. | | | |
Datastore | A storage device for a datastore has been detected to be off | | | |
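
One way to think about the completed spreadsheet is as a routing table: each row says which of the three channels an alert goes to and who owns it. The Python sketch below models that idea only; it does not call any vROPS API. The three sample rows follow the examples given in this post (vCenter down goes everywhere, snapshots stay on the dashboard, datastore space gets a ticket), while the team names and the exact yes/no choices are hypothetical placeholders you would replace with your own decisions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlertRule:
    source: str
    name: str
    dashboard: bool
    ticket: bool
    email: bool
    team: str  # hypothetical team names, for illustration only


def channels(rule: AlertRule) -> List[str]:
    """List the channels a rule is routed to, in a fixed order."""
    out = []
    if rule.dashboard:
        out.append("dashboard")
    if rule.ticket:
        out.append("ticket")
    if rule.email:
        out.append("email")
    return out


RULES = [
    AlertRule("vCenter Server", "vCenter Service is Down", True, True, True, "Engineering"),
    AlertRule("Virtual Machine", "Virtual machine snapshot longer than 2 days old", True, False, False, "Operations"),
    AlertRule("Datastore", "Datastore is running out of disk space.", True, True, False, "Operations"),
]


def rules_for(channel: str) -> List[str]:
    """Names of all alerts that should be forwarded to the given channel."""
    return [r.name for r in RULES if channel in channels(r)]


print(rules_for("email"))
print(rules_for("ticket"))
```

Listing the rules per channel like this gives you exactly the set of alerts each forwarding rule or notification needs to include.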


vROPS 8.4 – How to use Automation Central to schedule tasks

New in vROPS 8.4 is an Automation Central section. Automation Central allows you to automate tasks such as rebooting VMs, reclaiming resources, and rightsizing VMs on a recurring schedule of your choice. In this guide I will show you how to use this feature to set up a recurring job that deletes snapshots older than 7 days. I will also show you how to filter out VMs whose snapshots I don’t want removed, for example Windows templates.

Automation Central can schedule the following tasks

Reclamation

  • Delete old snapshots
  • Delete idle VMs
  • Power off idle VMs
  • Delete powered off VMs

Performance Optimizations

  • Downsize oversized VMs
  • Scale-up undersized VMs

General

  • Reboot VMs

How to Schedule a weekly deletion of VM Snapshots older than 7 days

As you can see in my Healthchecker Dashboard, I have 3 VMs with snapshots. I only want to delete snapshots more than 7 days old, but I also want to leave the Windows 2016 VM alone since it is a template. If we do everything correctly, the ws1connect VM should be the only VM whose snapshot gets deleted.

Go to Home > Automation Central

Select a date on the calendar and click on Add Job

Provide a name and select Delete Old Snapshots. Modify the days as needed. Click on Next

Add the clusters you want to apply the job to, and then click on Preview Scope to see which VMs will be affected.

After previewing, you can see that the job will apply to all VMs in the cluster, even the three Windows templates that I don’t want it to apply to. That is because we haven’t yet added a filter to exclude them.

To remove the Windows templates, I add a Filter Criteria of Object Name does not contain win. This eliminates any VM that has “win” in its name. You can also filter based on tags and other properties.
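
The scope logic of this job can be sketched in a few lines of Python. This is only a model of the two conditions (snapshot older than 7 days, object name does not contain "win"); it is not how Automation Central evaluates filters. The snapshot ages, the third VM name `testvm01`, and the case-insensitive match are my assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class VmSnapshot:
    vm_name: str
    age_days: int


def in_scope(snap: VmSnapshot, max_age_days: int = 7, exclude_substring: str = "win") -> bool:
    """True if the snapshot is old enough and the VM name is not excluded."""
    old_enough = snap.age_days > max_age_days
    # Case-insensitive matching is an assumption, not confirmed product behavior.
    excluded = exclude_substring.lower() in snap.vm_name.lower()
    return old_enough and not excluded


# Hypothetical ages; testvm01 is a made-up name for the third VM.
snapshots = [
    VmSnapshot("ws1connect", 12),
    VmSnapshot("Windows 2016", 30),
    VmSnapshot("testvm01", 2),
]

to_delete = [s.vm_name for s in snapshots if in_scope(s)]
print(to_delete)
```

Note that a substring filter like this also excludes any other VM that happens to have "win" in its name, which is why it is worth checking Preview Scope before saving the job.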

Now I click on the Preview Scope again to verify that the Windows Templates are indeed removed. Click on Next

Below is a sample of how to schedule this job to run weekly, every Monday morning at 3:45am. I set it to run indefinitely, which means it has no end date. Click on Create when done.
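
If you want to sanity-check when a weekly schedule like this will fire, the next run times are easy to compute. The helper below is a hypothetical Python sketch of a "weekly, Monday 3:45am, no end date" recurrence; it says nothing about how vROPS stores or evaluates the schedule, and it ignores time zones.

```python
from datetime import datetime, timedelta
from typing import List


def next_runs(after: datetime, count: int = 3, weekday: int = 0,
              hour: int = 3, minute: int = 45) -> List[datetime]:
    """Return the next `count` weekly run times strictly after `after`.

    weekday follows datetime.weekday(): 0 is Monday.
    """
    candidate = after.replace(hour=hour, minute=minute, second=0, microsecond=0)
    days_ahead = (weekday - candidate.weekday()) % 7
    candidate += timedelta(days=days_ahead)
    if candidate <= after:
        # This week's slot has already passed, so move to next week.
        candidate += timedelta(days=7)
    return [candidate + timedelta(weeks=i) for i in range(count)]


# Example: job created on a Wednesday at noon.
for run in next_runs(datetime(2021, 3, 3, 12, 0)):
    print(run.strftime("%A %Y-%m-%d %H:%M"))
```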

Click on Jobs to view the job you just created. You can also disable and edit your job from this section.

After the scheduled time has passed, you can click on History to see that the automation job did indeed run. It deleted only the snapshot of the one VM that had a snapshot over 7 days old, and not the snapshot on our Windows template, which was also old.

I also verified that in vCenter as well as the Healthcheck dashboard.

You can use this feature to reclaim CPU and memory as well, but make sure you test this feature on non-essential VMs before you push it out to production.



vROPS – How to monitor VM uptime using vSphere Tags

In this guide I will show you how to attach an uptime tag to any VM you like so that vROPS automatically alerts you if the VM power state is not Powered On. This is a useful trick for monitoring critical applications that always need to be powered on. The guide walks you through creating a vSphere tag in vCenter and then creating the alert in vROPS. Once the alert is created, all you need to do is tag away in vCenter. You also have the option to view the alert in vROPS and/or get an email notification.

Note: this will not detect application crashes or a blue screen of death inside the VM.

Creating the vSphere Tag

First, we need to create the tag that we will use to identify which VMs will have power state monitoring. Click on Menu > Tags & Custom Attributes

Click on Add and fill out a name

Click on Create New Category and click on OK

Now the Tag is ready, click on OK to save it

You should see your tag shown below

Now select any VM you like and Assign the Tag we just created. Click on Tags & Custom Attributes > Assign Tag

Select the tag we just created and click on Assign

I also added the tag to a VM called Win10; you can view the tag field in the VM summary.

Creating the Alert in vROPS

Next we need to create the alert in vROPs. Go to Alerts > Alert Definitions > Add

Fill out the following

  • Name
  • Base Object Type: Virtual Machine
  • Criticality: Critical
  • Alert Type: Virtualization/Hypervisor: Availability

Click on Next

Click on Create New Symptom

Change Symptom Type to Properties and double click on the vSphere Tag

Fill out the following.

  • Name
  • Condition: Equals
  • Value: Tag we created
  • Trigger: Info

Add another symptom to monitor the power state. Double click on Summary > Runtime > Power State

Fill out the following

  • Name
  • Condition: Not equal to
  • Value: Powered On
  • Trigger: Critical

Add both symptoms to the left as shown. You can either drag or double click on each symptom we just created
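
The intent of the alert is that it fires only when both conditions hold: the VM carries the uptime tag and its power state is not Powered On. Here is a minimal Python sketch of that logic. It is a model for reasoning about the alert, not vROPS code; the tag name `Uptime`, the `Vm` class, and representing tags as a set are all assumptions.

```python
from dataclasses import dataclass
from typing import Set

UPTIME_TAG = "Uptime"  # hypothetical tag name; use whatever you created in vCenter


@dataclass
class Vm:
    name: str
    tags: Set[str]
    power_state: str


def should_alert(vm: Vm, tag: str = UPTIME_TAG) -> bool:
    """Fire only when the VM is tagged AND it is not powered on."""
    has_tag = tag in vm.tags
    not_powered_on = vm.power_state != "Powered On"
    return has_tag and not_powered_on


vms = [
    Vm("Win10", {"Uptime"}, "Powered Off"),   # tagged and down: alert
    Vm("Win10-b", {"Uptime"}, "Powered On"),  # tagged and up: no alert
    Vm("scratch", set(), "Powered Off"),      # untagged: ignored
]
print([vm.name for vm in vms if should_alert(vm)])
```

Untagged VMs never trigger the alert, which is what lets you control monitoring purely by adding or removing the tag in vCenter.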

Click on Next and then Next again until you reach the Policies page. Make sure to check all policies that this alert should apply to.

Click on Create or Update to create the alert

Now let’s shut down our Win10 VM that has the tag.

Wait a few minutes and you will see that the VM is now shown in vROPS alerts

If you would like to receive an email notification, set up Notifications and point the alert to all Virtual Machines as the object type.



vROPS – VMware Appliance Monitoring Dashboard

Monitor the performance and configuration of the following appliances: vCenter Servers, NSX, NSX-T, vRA, vROPS, Log Insight, Orchestrator, Life Cycle Manager, Network Insight (vRNI), VMware SRM, vIDM, Air Watch, and Cloud Proxy appliances. Quickly compare performance stats such as CPU, memory, contention, disk performance, and more. You can also compare configuration details such as CPU, memory, IP addresses, VM Tools versions, VM version, and more.

Download here: https://code.vmware.com/samples?id=7599

Monitors the following products

  • vCenter Server Appliance
  • NSX, NSX-T
  • vRA
  • vROPS
  • Log Insight
  • Orchestrator
  • Life Cycle Manager
  • Network Insight (vRNI)
  • VMware SRM
  • vIDM
  • Air Watch
  • Cloud Proxy appliances

User Guide

Compare products to each other based on performance metrics (CPU, Memory, Disk Latency, IOPS, Contention, etc.)

Scroll over to the right to get configuration metrics

Highlight any VM and scroll to the bottom to view alerts and properties of the VM

Instructions on how to Import Dashboard

To import in version 7.0 and above

  1. First, unzip the file you just downloaded. It contains a dashboard file and a view file
  2. Go to Dashboards > Actions > Manage Dashboards

  3. Hit the dropdown and select Import Dashboards. Import the Dashboard.zip file

  4. Next, go to Views > Dropdown > Import. Import the View.zip file

  5. If you get any errors during the process, make sure to click overwrite before importing


vROPS – How to monitor VMware SDDC Stack (vCenter, vSAN, NSX, vRA, vIDM, vRLI, vRO, SRM, vROPS)

My customers ask me this question all the time: how do I monitor services and get alerted on VMware SDDC components such as vCenter, vSAN, NSX, vRA, vIDM, vRLI, vRO, SRM, and vROPS? The answer is very easy: I ask them to download and install the SDDC Health Monitoring Solution Pack. The latest version makes this a must-have install for anyone who owns vROPS. Below is a guide on why you need it, what it contains, and how to get the most out of the tool.

Reasons why you need to install the SDDC Health Monitoring Solution Pack right now

  1. Totally free and easy to install. Just download it using the link below and install it the same way you would install any management pack.

    https://marketplace.cloud.vmware.com/services/details/sddc-health-monitoring-solution-8-4111?slug=true

  2. The SDDC Management Health Overview dashboard that comes out of the box is super useful. The top part of the dashboard auto-detects all the SDDC appliances you have installed and monitors the health automatically.

  3. Provides dozens of additional monitoring metrics for each component, including individual service monitoring (see example below)

  4. The built-in alerts cover services going down and critical components outages. See the full list below

  5. This dashboard is always improving, so check back every so often for updates


vROPS 8 – How to Enable Hidden Metrics and properties

In this guide I will show you how to enable hidden metrics in vROPS 8.0 and above. There are over 300 combined metrics and properties for VMs out of the box. That alone is a lot of metrics for one object, but what most people don’t know is that there are an additional 200 hidden metrics for VMs as well. That is a total of 500 metrics/properties for virtual machines alone. I will also show you which hidden metrics I enable to get the best vROPS experience.

What will we be enabling?

Virtual Machine

  • Guest File System Free (GB) – this metric displays the amount of free space left for each partition of the Operating System
  • Subnet – displays the subnet information for the Virtual machine
  • Gateway – displays the gateway information for the virtual machine
  • Memory Guest Active (%) – a better metric to measure the memory the guest Operating System is using

How to enable hidden metrics and properties

Go to Administration > Policies

This should bring up a list of all policies. Click on the Status Icon to sort by which ones are active. In most cases, whatever changes we make to a single policy must be made again in all active policies. If you only have one active Policy then the process will be much simpler as you only have to make changes to one policy. In the example below we have two policies therefore we need to make changes to both.

To edit the active policy, just highlight it and click on Edit

Select Metrics and Properties

In the following example we want to enable these VM metrics and properties: Guest File System Free (GB), Subnet, and Gateway. Under Object Type select vCenter Adapter > Virtual Machine

This will bring up all Metrics, Properties, and Super Metrics for Virtual Machines. Expand Metrics > Guest File System. Notice how the Guest File System Free (GB) is Disabled

To enable the metric select the dropdown next to the metric and select Enabled. This will enable the metric in the next vROPS cycle.

Repeat this process for Subnet and Default Gateway, which are located under Properties > Network, and for Guest Active Memory (%), which is under Memory.

Once completed, click on Save at the bottom of the screen. Give it about 5-10 minutes and your new metrics and properties will be available.

Repeat this process for all active Policies as needed.



vROPS – vSphere Health Checker Dashboard 2.0

First, thank you everyone for making this the number 1 most downloaded dashboard on VMware Code. Also, a big thanks to the VMware TAM community for all the positive feedback on how my website has helped their customers. The feedback will drive me to write even more useful content. With that being said, what better way to start than to share my latest 2.0 version of the vSphere Health Check dashboard, which has many major enhancements and much more detail than before. In this post I have also written a full guide on how to resolve some of the main issues.

Download on VMware Code Exchange Here https://code.vmware.com/samples?id=5639#

Purpose:

Does a full health check of problems found in the environment (VMs, Host, Clusters, Datastores, vCenter).  This dashboard will help prevent issues before they happen (being proactive) by identifying everything wrong with your environment today so you can fix it before it causes a problem in the future.

What it monitors:

Monitors capacity issues, configuration issues, and performance bottlenecks (CPU, Memory, Contention, Disk Latency)

User Guide

Select any vCenter Servers, Datacenter, Clusters, or entire environment (vSphere World) to do a complete health check on it

You can also use the search box to search what you are looking for

Any of these widgets can be easily exported to Excel. They are also report ready, meaning you can add any of the filters I’ve created to a custom report.

Monitoring vROPS

Monitor the health of all vROPS nodes. Make sure the vROPs DB Usage % doesn’t reach over 90%. You will need to add more disk to vROPS if this gets high. It will also monitor any adapters that are down and any vROPS alerts.

Monitoring vCenter

This monitors vCenter for any disk space issues and vCenter alerts.

Monitoring Virtual Machines

When monitoring high CPU and high memory usage, it is important to see the 7-day average as well. If the average is high, then the VM most likely needs more resources immediately.

If any VM has high contention, CPU ready time, CPU co-stop, or memory ballooning, it basically means something is constraining the VM from getting the resources it needs. There is no one simple way to fix this. These are the checks I would perform to troubleshoot the issue.

  • Check to see if there are any memory or cpu limits on the VM
  • Check to see if the VM has an alert that states that the Host Power settings are causing contention. ESXi Hosts that are not set to high performance in the power settings usually cause contention
  • Check if the VM is on a highly utilized ESXi Host
  • Check if the VM is on a resource pool that has resource limits assigned
  • Check to see if the VM has enough CPU and/or memory resources. Notice how this is the last step. This should be your last resort

If the VM Disk latency is high, this means your disk performance is suffering.  Check to see if the datastore that the VM is on has high latency as well. If the datastore is overworked, it will cause latency on other VMs that are hosted by it as well. Another good way to find out if it’s the VM or the datastore that is causing the latency is by doing a storage vMotion to an isolated datastore. If the latency drops down dramatically you will know it was the datastore that caused it

For VM Disk IOPS and Network Usage, notice how this widget says awareness only. High disk IOPS doesn’t mean there is a disk performance issue. It just means that the VM has lots of disk activity. A typical VM usually doesn’t have more than 1000 IOPS. A busy VM, such as a heavy database server, file server, or Exchange server, will have IOPS in the range of 1000-8000 depending on how busy it is. Any VM with high IOPS or high network usage that looks off to you should be investigated. Anything over 10,000 IOPS is extremely rare and should be investigated immediately. None of the Fortune 500 companies that I know of has a VM that is higher than 10,000 IOPS.
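To make the IOPS ranges above concrete, here is a minimal Python sketch that buckets a VM by its disk IOPS. The function name and bucket labels are made up for illustration, and the post does not say how to treat values between 8,000 and 10,000, so this sketch simply assumes they still count as "busy".

```python
def classify_iops(iops: float) -> str:
    """Bucket a VM's disk IOPS using the rough ranges from this post.

    Assumption: values between 8,000 and 10,000 are not covered by the
    post, so they are treated as "busy" here.
    """
    if iops > 10_000:
        return "investigate immediately"  # extremely rare per the post
    if iops >= 1_000:
        return "busy"  # e.g. heavy database, file, or Exchange servers
    return "typical"


# Hypothetical sample values, not real measurements.
for name, iops in [("web01", 350), ("sql01", 4200), ("mystery01", 15000)]:
    print(f"{name}: {iops} IOPS -> {classify_iops(iops)}")
```

Remember this is awareness only: a "busy" result is not a problem by itself, it is just a prompt to check whether the activity looks right for that VM.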

If any VM has 0 free capacity on the C: drive or the root drive, the performance of the VM will suffer dramatically until you add more partition space. For snapshots, most of the time snapshots should not be older than 7 days.

Note: If this is blank you will need to enable the Guest File Free metric for VMs in all the active policies. This is not enabled by default in vROPS.  Use this guide to enable this metric.  http://www.vmignite.com/2021/02/vrops-8-how-to-enable-hidden-metrics-and-properties/

If you see any type of VM limit, this is not a good sign. A limit on a VM basically means the VM won’t perform beyond what the limit is set to. For example, the VM I highlighted below has a memory limit of 8GB. However, the configured memory is set to 24GB. Even though the VM has 24GB of memory configured, it will only be consuming 8GB of it because of the 8GB limit set on the VM. You won’t get full performance until you remove the limit on the VM in vCenter. Hopefully you now understand why Limits are not good at all for any reason.
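The 24GB versus 8GB example can be written as a quick check. This is only a sketch: the `VmMemory` class and its field names are hypothetical and do not come from vROPS or vCenter, and it assumes you have already collected the configured and limit values yourself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VmMemory:
    name: str
    configured_gb: float
    limit_gb: Optional[float]  # None means no limit is set


def usable_memory_gb(vm: VmMemory) -> float:
    """Memory the VM can actually consume once the limit is applied."""
    if vm.limit_gb is None:
        return vm.configured_gb
    return min(vm.configured_gb, vm.limit_gb)


def is_limited(vm: VmMemory) -> bool:
    """True when a limit keeps the VM below its configured memory."""
    return usable_memory_gb(vm) < vm.configured_gb


# The example from this post: 24GB configured, 8GB limit.
vm = VmMemory(name="example-vm", configured_gb=24, limit_gb=8)
print(vm.name, "usable GB:", usable_memory_gb(vm), "limited:", is_limited(vm))
```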

Monitoring Clusters

In almost all cases, Cluster HA and DRS should be enabled. A large company could have 100s of clusters, and it would be quite a nightmarish task to check all of these manually. Luckily, I’ve got you covered, as all you need to do is look at my dashboard. Another thing to look for is DRS Policies not set to automatic. Although DRS is enabled, if it is not set to automatic, resource balancing will not automatically occur. This setting is often overlooked.

Monitoring Datastores

Any datastore that has high latency is not a good sign. Most of the time it will affect VM performance as well. Do not let your datastores run more than 80% utilized. A datastore that is out of space will cause outages.
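If you export datastore capacity to a file or script, the 80% rule is easy to apply yourself. The sketch below assumes a simple dictionary of datastore name to (used GB, capacity GB); the names and numbers are invented for illustration.

```python
def datastores_over_threshold(datastores, threshold_pct=80.0):
    """Return (name, used %) for datastores above the threshold, fullest first."""
    flagged = []
    for name, (used_gb, capacity_gb) in datastores.items():
        if capacity_gb <= 0:
            continue  # skip entries with no usable capacity value
        used_pct = used_gb / capacity_gb * 100
        if used_pct > threshold_pct:
            flagged.append((name, round(used_pct, 1)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical inventory: name -> (used GB, capacity GB)
inventory = {
    "ds-prod-01": (850, 1000),
    "ds-prod-02": (400, 1000),
    "ds-test-01": (1900, 2000),
}
print(datastores_over_threshold(inventory))
```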

Monitoring ESXi Host

This is one of my favorite widgets because it reports back on physical ESXi host failures that vCenter detects. Customers have told me that this sometimes detects issues that even the vendor software didn’t detect.

Download on VMware Code Exchange Here https://code.vmware.com/samples?id=5639#

Instructions on how to Import Dashboard

To import in version 7.0 and above

  1. First unzip the file you just downloaded. It will contain a dashboard and a view file
  2. Go to Dashboards > Actions > Manage Dashboards

  3. Hit the dropdown and select Import Dashboards. Import the Dashboard.zip file

  4. Next go to Views > Dropdown > Import. Import the View.zip file

  5. If you get any errors during the process, make sure to click overwrite before importing


Download – vROPS Complete 360 Inventory Dashboard

The new vCenter Inventory is even more complete than the last one. In one click you can get a complete inventory of your environment and performance stats. I’ve enhanced this dashboard with dozens more metrics, features, and inventory objects in this new update. Also new is the ability to highlight any object (vCenter, Host, Cluster, etc.) and have the list update to show the objects related to it (super useful for unlimited drilldown capability). Read the full guide below to see how detailed this dashboard is.

Download here https://code.vmware.com/samples?id=5629#

What it provides in one click?

  • Count of how many objects of each type in the environment (folders, switches, VMs, etc)
  • Environment Capacity Total, capacity used, and capacity provisioned
  • Chart on all Physical Host types
  • Chart on all Operating Systems
  • Chart on all ESXi Host versions
  • Graph on VM memory configurations
  • Charts on Cluster HA, DRS, Admission Control, and DRS Policy settings
  • Latest VMs, Host, Datastore, vCenter, and Clusters added in the environment
  • Complete vCenter Inventory in a list view with performance metrics and properties
    • vCenters
    • Datacenters
    • Clusters
    • Hosts
    • VMs
    • Datastores
    • vDS Switches
    • Port Groups
    • Datastore Clusters
    • Resource Pools

User Guide

Select any vCenter or all of them combined (vSphere World). Also shows you inventory and performance stats

Get an object count of everything in that environment

Get Inventory and capacity details down to the used, total provisioned level for your selected environment

Get a great overview of what is in your environment. What are all the Physical Servers models? Which server model is the majority? Also shows ESXi Host versions, Operating System, and VM Memory configured.

Clicking on the pie chart will show you what those objects are on the right side

Shows latest Objects added to the environment (VMs, Host, Clusters, Datastores)

Shows List of Inventory along with useful performance metrics

Can instantly export most of the content to Excel by just clicking on the export button

Highlighting anything in the inventory will update the related objects below it. For example I highlighted a Cluster and it instantly updated the Host that are on it below. (Note: Do not click on the object name but highlight the metrics on the right side of it to highlight it.)

Download here https://code.vmware.com/samples?id=5629#



vROPS 8.2 – Three ways to backup vROPS Content

With the new release of vROPS 8.2, backing up all your content is easier than ever before. In this post, I have written a full backup guide using the new Content Management feature of vROPS 8.2. However, using this method will back up all the content as one big file. Since there are limitations to this, I have also listed two other ways to back up everything just in case you need more options.  Note, everything mentioned in this post only backs up content.  No historical data or metrics are backed up.

Using vROPS Content Management export (Best method)

You must be on at least vROPS 8.2. This will back up everything listed below as one file. Currently there are no scheduling options and you can’t customize what you want to back up. It is either all or nothing. It will take about 10 minutes for a complete backup to be done. On the plus side, it backs up Policies as well. One more thing to note: before restoring to another vROPS environment, make sure all your Active Directory Users and Groups are set up beforehand. If you import a dashboard, for example, belonging to a user that doesn’t exist in the new environment, that user’s files will not show.

What gets backed up?

  • Alert Definitions
  • Custom Compliance Benchmarks
  • Custom Groups
  • Dashboards
  • Metric Configurations
  • Notification Rules
  • Policies
  • Recommendations
  • Report Templates
  • Super Metrics
  • Symptom Definitions
  • Views

  1. Go to Administration > Content Management > Generate Export Content

  2. Wait about 10 minutes and once it is completed you will see a Download Zip file. Click on it and save the file.

  3. To import to another environment, just click on the Import Content tab and browse for your backup file. One more thing to note: before restoring to another vROPS environment, make sure all your Active Directory Users and Groups are set up beforehand. If you import a dashboard belonging to a user that doesn’t exist in the new environment, those particular user files will not show.

Use Life Cycle Manager to backup vROPS

vRealize Suite Lifecycle Manager can be used to back up lots of VMware products. One of them, of course, is vROPS. However, Policies do not get backed up.

What gets backed up?

  • Alert Definitions
  • Dashboards
  • Metric Configurations
  • Recommendations
  • Report Templates
  • Super Metrics
  • Symptom Definitions
  • Views

For instructions on how to use this, use the link below

https://vedaa.net/vmware/backup-dashboards-in-vrealize-operations-with-vrealize-suite-lifecycle-manager/

How to manually backup vROPS

If you are on an older version of vROPS and don’t have vRealize Suite Lifecycle Manager, you can always back up the old classic way. The good thing is I have already written instructions on how to back things up manually here.

http://www.vmignite.com/2018/11/vrops-7-how-to-manually-backup-vrops-customized-work/



vROPS 8.2 – How to Ping almost anything with the Ping Adapter

The ping adapter allows you to ping and monitor any IP address, almost any DNS name, a complete subnet, or even a range of IPs. Best of all, it is already built into vROPS 8.2. All you need to do is activate the adapter and configure what you would like to monitor. Installing and setting up the ping adapter is so easy that using it is a no-brainer. It also measures latency and packet drops as a bonus. It is highly recommended to use it to ping critical URLs, physical servers, switches, and more.

This guide will walk you through the following steps

  1. VMignite.com Best Practices and Notes
  2. How to install the VMware vRealize Ping adapter
  3. How to ping a single IP or DNS server
  4. How to ping a full subnet, URL, or even a range of IP Addresses
  5. How to ping an existing ESXi Host or VM
  6. How to use a metric config file to ping many objects

VMignite.com Best Practices and Notes

  • You can ping up to 5,000 IP’s per adapter
  • Only valid Top Level Domains accepted for FQDN (sorry .local or .lan fans)
  • When creating an alert, use the condition “Average Latency (ms) is not greater than 0” to indicate that the object is down
  • Use the metric config file method if you would like to have everything backed up using the vROPS 8.2 content management backup feature. This will save you tons of work just in case you lose your vROPS infrastructure.
  • Use it to monitor critical URLs and physical servers. Then add it to a 24/7 critical monitoring dashboard. Just remember to use scoreboards and set the refresh to On

How to install the Ping Adapter

No need to download any additional files. It is pre-installed in vROPS 8.2.

First you need to activate the adapter. Go to Administration > Repository > VMware vRealize Ping > Activate

How to Ping a single DNS Server or IP Address

Once activated you can configure the Ping Adapter. Click on Add Account

In this example we will ping a simple external server by IP address. You can also use DNS as needed. Just type in the name and the address you want to ping.

Expanding Advanced Settings will give you more options. Just click on the information icon to see what each one does. Click on Validate Connection and then click on Add to save it once you are done.

Give it about 15 minutes. Now go to Dashboards > Ping Overview Dashboard. You should see your server listed in the dashboard.

It is also ready to be added to an alert or dashboard as needed. As you can see if you go to Environment > Ping Adapter > IP you will see your server listed there. You will get some valuable metrics under the metrics section.

How to Ping a full subnet, URL, or even a range of IP Addresses

Adding multiple objects just requires adding more to what you already have. Notice how I use commas to separate the entries. In the following example:

  • 1.10.0.1/24 will ping the entire subnet
  • 10.16.1.3-10.16.1.254 will ping that exact range of IP addresses
  • Vmware.com will ping the URL
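Since each adapter can ping up to 5,000 IPs, it can help to estimate how many targets an address list expands to before you paste it in. The sketch below uses Python's standard `ipaddress` module. The function names are made up for illustration, and how the adapter itself counts network and broadcast addresses is not stated in this post, so the subnet count here (usable hosts only) is an assumption.

```python
import ipaddress


def count_entry(entry: str) -> int:
    """Estimate how many targets one comma-separated entry expands to."""
    if "/" in entry:
        try:
            network = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            return 1  # not CIDR notation, e.g. a URL with a path
        # Assumption: count usable hosts only (no network/broadcast address).
        return sum(1 for _ in network.hosts())
    if "-" in entry:
        start_text, _, end_text = entry.partition("-")
        try:
            start = ipaddress.ip_address(start_text.strip())
            end = ipaddress.ip_address(end_text.strip())
        except ValueError:
            return 1  # a DNS name that happens to contain a hyphen
        count = int(end) - int(start) + 1
        if count < 1:
            raise ValueError(f"range end is before range start: {entry}")
        return count
    return 1  # single IP address or DNS name


def count_targets(address_list: str) -> int:
    """Sum the estimated targets across a comma-separated address list."""
    entries = [e.strip() for e in address_list.split(",") if e.strip()]
    return sum(count_entry(e) for e in entries)


example = "1.10.0.1/24, 10.16.1.3-10.16.1.254, vmware.com"
total = count_targets(example)
print(total, "targets; within the 5,000 limit:", total <= 5000)
```

This only estimates the size of the list. It does not ping anything and does not check whether a DNS name uses a valid top-level domain.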

How to ping an existing Host or VM

Search for your Host or VM and then select it. This will bring you to the summary page of that object. Click on Enable Ping Monitoring in the Ping Statistics box. In the example below, we are monitoring a single ESXi Host.

Give it a good 30 minutes and refresh the object. You should now see Ping statistics for that ESXi Host in the summary page

You will also have a new set of metrics for the ESXi Host as well. Look for Ping Statistics under Metrics for that ESXi Host

Using Metric Config File to monitor servers

Another alternative way to monitor many servers is to use a metric config file (XML File). Using a Metric config file is easier to manage if you have over a dozen or even hundreds of servers you would like to monitor. Since it is file based, it is easier to copy and manage. Also, it can get backed up using the vROPS content management backup feature.

Go to Administration > Metric Configurations > SolutionConfig. Take a look at the ping_adapter_config to get a good baseline on how to configure the ping pack. You can choose to modify or add a new one. Whichever way you choose just remember the name of the config file. We will need it in the next steps.

Once configured, go back to Administration > Repository > VMware vRealize Ping adapter. Click on Add Account

Fill in the name and unique name section but this time we will leave the Address List empty. Instead we will enter in the name of the config file we would want to use. Click on Add to save



Download – VM Uptime Dashboard for vROPS 8.2+

The VM uptime dashboard will keep track of any VMs that are currently down but have been up more than 80% of the time over the last 30 days. A production VM should be up the majority of the time, so requiring an uptime of over 80% for 30 days will eliminate a lot of VMs that are used for temporary testing, are powered off the majority of the time, are VM templates, etc.

(Note you must have vROPS 8.2 for this to work)
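The filter the dashboard applies can be summarized in a few lines. This is just a sketch of the logic described above, not the actual view or supermetric definition, and the dictionary keys (`powered_on`, `uptime_pct_30d`) are hypothetical names chosen for illustration.

```python
def production_vms_down(vms, min_uptime_pct=80.0):
    """Names of VMs that are powered off now but were mostly up recently."""
    return [
        vm["name"]
        for vm in vms
        if not vm["powered_on"] and vm["uptime_pct_30d"] > min_uptime_pct
    ]


# Hypothetical sample data, not real output from vROPS.
sample = [
    {"name": "prod-db01", "powered_on": False, "uptime_pct_30d": 97.5},
    {"name": "test-tmp01", "powered_on": False, "uptime_pct_30d": 12.0},
    {"name": "prod-web01", "powered_on": True, "uptime_pct_30d": 99.9},
]
print(production_vms_down(sample))
```

The temporary test VM drops out because of its low uptime, and the running web server drops out because it is still powered on, which leaves only the production VM that just went down.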

Instructions

  1. First you will need to install and activate the uptime supermetric, which you can download here https://code.vmware.com/samples?id=7421
  2. Next download and install the dashboard and view here https://code.vmware.com/samples?id=7476

User Guide

  1. If any production VMs are down you will see it listed here. As you can see in the sample, the VMs are currently in a Powered Off State and the VM has a high uptime in the last 30 days.
  2. Select any of the VMs and you will see the uptime history in the graph below it. Hover over the point where it went down to see the exact date and time.

How to Install the dashboard

To import in version 7.0 and above

  1. First unzip the file you just downloaded, it will contain a dashboard and a view file
  2. Go to Dashboards > Actions > Manage Dashboards

  3. Hit the dropdown and select Import Dashboards. Import the Dashboard.zip file

  4. Next go to Views > Dropdown > Import. Import the View.zip file



vROPS Super Metric – Virtual Machine Uptime

In my experience, Virtual Machine uptime is by far the hardest metric to calculate and create. Luckily, my co-worker Iwan has already created one using his own logic. All I had to do was update it and improve it as needed. Rather than rewrite everything he wrote, I will point you to his full explanation of how the metrics work here: http://virtual-red-dot.info/vm-availability-monitoring/

You can download the new version that I modified below. In my next post, I will show you how to monitor and get an alert on a Virtual Machine going down using this metric in vROPS 8.2. If you want this capability, it is best to download and activate this supermetric now.

https://code.vmware.com/samples?id=7421



How to Maximize all Monitoring Tools

Every time I walk into a new engagement with a Fortune 500 company, I always ask the customer what issues they want to address. Of course, I always get these cool stories on how performance bottlenecks like high CPU and Memory slowed down their environment, how things are breaking left and right and no one knows about it, and how they would like to find out what is causing all their problems and outages. All companies want that magic 8-ball solution, something that will instantly identify and solve all their problems magically.

In the end, I am always handed a dozen requirements, and sure, I can just resolve those, be done with the engagement, and then wait a few months until new problems occur. By then, they might ask for help again, or even worse, not use the tool at all and go back to doing things the manual way, or buy another tool thinking it will solve all their problems.

Based on my experience, all companies’ pain points are the same, and the typical customer only mentions 10% of what today’s top monitoring applications like vROPs can actually do. It is up to the expert to guide them towards what I call “Monitoring 360 degrees”, which basically means maximizing what a monitoring tool should be doing. When I say all companies’ pain points are the same, it is because all their requirements will fall under the 6 categories below. (Disclaimer: this is all my own personal opinion)

All Monitoring Use Cases

  1. Being Proactive (Fix problems now to prevent future outages)
  2. Root Cause Analysis in minutes (Troubleshooting VM, Host, vCenter, etc)
  3. 24/7 Critical Monitoring (Monitoring critical Apps, Websites, Infrastructure)
  4. Capacity Management (Realizing Utilization, Growth, Inventory)
  5. Optimize Resources (Reclaiming Resources, which resources need more cpu & memory)
  6. Compliance (Hardening check, Environment consistency checks)

Now you are probably thinking: where are the reports, alerts, ticketing, and auto remediation? These are not use cases. These are bells and whistles that apply to all 6 use cases above. For example, any one of the use cases mentioned above can involve reports and alerting if the customer wants it. I can easily create alerts on performance, capacity, compliance, etc.

With this being said, this is what all monitoring tools should consist of at a bare minimum.

VMignite.com Monitoring 360 Degree Checklist

  1. Proactive Dashboard
  2. Troubleshooting Dashboards
  3. 24/7 Critical Monitoring Dashboard
  4. Capacity Management Dashboards
  5. Optimization Dashboards
  6. Critical alerts being sent as email
  7. Selected Alerts being generated as tickets

Let me explain how some of this should work.

  1. Majority of companies are reactive, but they all want to be proactive – To achieve this, you need the following
    1. Health dashboard – you can’t begin to prevent problems before they happen when you have no idea what problems you have to begin with. That is why you need a dashboard that displays all the problems you have today (Host, vCenter, VMs, Datastores, Clusters, etc). Sounds difficult? Good thing is I created an environment health checker dashboard already if you have vROPs. It is the number one most downloaded dashboard on VMware code. Download it here http://www.vmignite.com/2021/02/vrops-vsphere-health-checker-dashboard-2-0/
    2. Alerts being sent out – Next you need to decide what alerts should be forwarded to your ticketing system and what should be sent as email alerts to engineers to fix immediately. If the alert that caused the outage doesn’t exist yet you will need to create it! I wrote a guide on this here:  http://www.vmignite.com/2021/06/vrops-alerting-dos-and-dont/
  2. When outages do happen, you need to figure out what caused it immediately. You will need the following:
    1. Troubleshooting dashboards – An excellent troubleshooting dashboard should be able to find the root cause in a minute! For example, to troubleshoot a problem VM in a minute, I will need to be able to eliminate the Host, Network, Physical Server, and Storage from the equation. On top of that I will need to be able to identify all VM bottlenecks such as CPU, Memory, contention, Disk Latency, application, configurations, etc. Sounds impossible to find the root cause in a minute? I have been able to do this with all my customers. Sorry, I can’t share this dashboard, but you can download some light versions of my troubleshooting dashboard on my download page
  3. All management, engineers, and operators need insight of the entire environment in one pane of glass
    1. 24/7 Critical Monitoring Dashboard – let’s say you want to monitor in one pane of glass an environment that consists of 20 vCenters, 1000 Hosts, 1000 Datastores, 100 Clusters, and 20,000 VMs. On top of that you want to monitor critical websites, vSAN, and NSX as well. Also, if a site is having problems, the same dashboard should show what objects are causing the problem. This is a dashboard that I have created for my customers, and they have dedicated monitors and even TVs to display it throughout the company. Here is a simple version I created (the more advanced one is for my customers): http://www.vmignite.com/2021/06/vrops-8-4-executive-dashboard-download/

Once again, all this is my own personal opinion, but it is based on my 8 years of experience working for VMware as a consultant. To learn about more granular features of what a powerful monitoring tool can do, read the following:

http://www.vmignite.com/2020/03/15-features-that-makes-vrops-the-best-monitoring-tool-period/

Updated: 1/09/2022
