(Last Updated On: April 3, 2019)

Greetings, guys! Here is an efficient way to monitor your VMware ESXi infrastructure with Grafana, Telegraf, and InfluxDB. The setup is pretty straightforward, and you should have your VMware metrics visualized in Grafana in less than 30 minutes. Our previous VMware monitoring guide was How To Monitor VMware ESXi Host Using LibreNMS.

This setup uses the official vSphere input plugin for Telegraf to pull metrics from vCenter. It covers vSphere host compute (RAM and CPU), networking, datastores, and the virtual machines running on vSphere hypervisors. So let’s get started.

Step 1: Install InfluxDB and Grafana

All collected metrics are stored in an InfluxDB database. Grafana connects to InfluxDB to query the metrics and display them on its dashboards. You need to install both InfluxDB and Grafana before anything else.

How to install InfluxDB on Ubuntu and CentOS

How to Install Grafana on Ubuntu and CentOS

Once both InfluxDB and Grafana are installed, proceed to install and configure Telegraf, a metrics collector written in Go.

Step 2: Install and Configure Telegraf

If you used the links in Step 1 to install InfluxDB, the repository required for Telegraf has already been added. Use the command for your distribution to install Telegraf.

# CentOS
sudo yum -y install telegraf

# Ubuntu
sudo apt-get -y install telegraf

After installation, we need to configure Telegraf to pull monitoring metrics from vCenter. Edit the main Telegraf configuration file:

sudo vim /etc/telegraf/telegraf.conf

1. Add the InfluxDB output, the storage backend where metrics will be stored.

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
    urls = ["http://10.10.1.20:8086"]
    database = "vmware"
    timeout = "0s"
    username = "monitoring"
    password = "DBPassword"

Replace 10.10.1.20 with your InfluxDB server IP address. If you don’t have authentication enabled on InfluxDB, you can safely remove the username and password lines from the configuration.
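Before continuing, it can be worth confirming that the InfluxDB endpoint you put in urls is reachable from the Telegraf host. The sketch below assumes InfluxDB 1.x, whose /ping endpoint replies with HTTP 204, and that curl is installed; influx_ping is a made-up helper name used only for this example.

```shell
# Minimal sketch: check that an InfluxDB 1.x server answers on /ping.
# influx_ping is a hypothetical helper, not part of InfluxDB or Telegraf.
influx_ping() {
    base_url="$1"
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
    # curl prints 000 as the status code when it cannot connect at all.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${base_url}/ping") || true
    if [ "$code" = "204" ]; then
        echo "InfluxDB reachable at ${base_url}"
    else
        echo "InfluxDB not reachable at ${base_url} (HTTP ${code})" >&2
        return 1
    fi
}

# Example call, using the address from the configuration above:
# influx_ping "http://10.10.1.20:8086"
```

If the check fails, look at firewalls and the InfluxDB bind address before debugging Telegraf itself.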

2. Configure the vsphere input plugin for Telegraf. The complete configuration should look similar to this:

# Read metrics from VMware vCenter
[[inputs.vsphere]]
## List of vCenter URLs to be monitored. These three lines must be uncommented
## and edited for the plugin to work.
vcenters = [ "https://10.10.1.2/sdk" ]
username = "[email protected]"
password = "AdminPassword"
#
## VMs
## Typical VM metrics (if omitted or empty, all metrics are collected)
vm_metric_include = [
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.run.summation",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.wait.summation",
"mem.active.average",
"mem.granted.average",
"mem.latency.average",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.usage.average",
"power.power.average",
"virtualDisk.numberReadAveraged.average",
"virtualDisk.numberWriteAveraged.average",
"virtualDisk.read.average",
"virtualDisk.readOIO.latest",
"virtualDisk.throughput.usage.average",
"virtualDisk.totalReadLatency.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.write.average",
"virtualDisk.writeOIO.latest",
"sys.uptime.latest",
]
# vm_metric_exclude = [] ## Nothing is excluded by default
# vm_instances = true ## true by default
#
## Hosts
## Typical host metrics (if omitted or empty, all metrics are collected)
host_metric_include = [
"cpu.coreUtilization.average",
"cpu.costop.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.swapwait.summation",
"cpu.usage.average",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.utilization.average",
"cpu.wait.summation",
"disk.deviceReadLatency.average",
"disk.deviceWriteLatency.average",
"disk.kernelReadLatency.average",
"disk.kernelWriteLatency.average",
"disk.numberReadAveraged.average",
"disk.numberWriteAveraged.average",
"disk.read.average",
"disk.totalReadLatency.average",
"disk.totalWriteLatency.average",
"disk.write.average",
"mem.active.average",
"mem.latency.average",
"mem.state.latest",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.totalCapacity.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.errorsRx.summation",
"net.errorsTx.summation",
"net.usage.average",
"power.power.average",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"sys.uptime.latest",
]
# host_metric_exclude = [] ## Nothing excluded by default
# host_instances = true ## true by default
#
## Clusters
cluster_metric_include = [] ## if omitted or empty, all metrics are collected
# cluster_metric_exclude = [] ## Nothing excluded by default
# cluster_instances = false ## false by default
#
## Datastores
datastore_metric_include = [] ## if omitted or empty, all metrics are collected
# datastore_metric_exclude = [] ## Nothing excluded by default
# datastore_instances = false ## false by default for Datastores only
#
## Datacenters
datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
# datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
# datacenter_instances = false ## false by default
#
## Plugin Settings
## separator character to use for measurement and field names (default: "_")
# separator = "_"
#
## number of objects to retrieve per query for realtime resources (vms and hosts)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
# max_query_objects = 256
#
## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
# max_query_metrics = 256
#
## number of go routines to use for collection and discovery of objects and metrics
# collect_concurrency = 1
# discover_concurrency = 1
#
## whether or not to force discovery of new objects on initial gather call before collecting metrics
## when true for large environments this may cause errors for time elapsed while collecting metrics
## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
# force_discover_on_init = false
#
## the interval before (re)discovering objects subject to metrics collection (default: 300s)
# object_discovery_interval = "300s"
#
## timeout applies to any of the api request made to vcenter
# timeout = "60s"
#
## Optional SSL Config
# ssl_ca = "/path/to/cafile"
# ssl_cert = "/path/to/certfile"
# ssl_key = "/path/to/keyfile"
## Use SSL but skip chain & host verification
insecure_skip_verify = true

The only variables to change on your end are:

  • 10.10.1.2 should be replaced with the vCenter IP address
  • [email protected] should match your vCenter user account
  • AdminPassword should be replaced with the password for that account
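A quick way to catch a forgotten edit is to check that those three settings are present and uncommented inside the [[inputs.vsphere]] section. The following is only a sketch: check_vsphere_conf is a hypothetical helper, and it does a simple line-based scan with awk rather than real TOML parsing.

```shell
# Minimal sketch: verify that vcenters, username and password are set
# (uncommented) in the [[inputs.vsphere]] section of a Telegraf config.
# check_vsphere_conf is a hypothetical helper name.
check_vsphere_conf() {
    conf="$1"
    missing=""
    for key in vcenters username password; do
        if ! awk -v key="$key" '
            # Any table header starts or ends the section we care about.
            /^[ \t]*\[/ { in_vsphere = ($0 ~ /^[ \t]*\[\[inputs\.vsphere\]\]/); next }
            # An uncommented "key =" line inside the section counts as set.
            in_vsphere && $0 ~ ("^[ \t]*" key "[ \t]*=") { found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$conf"; then
            missing="$missing $key"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing in [[inputs.vsphere]]:$missing" >&2
        return 1
    fi
    echo "vcenters, username and password are set"
}

# Example call:
# check_vsphere_conf /etc/telegraf/telegraf.conf
```

The scan is restricted to the vsphere section on purpose, because the [[outputs.influxdb]] section has its own username and password keys.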

If your vCenter server uses a self-signed certificate, make sure the insecure_skip_verify flag is set to true.

insecure_skip_verify = true
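To try the plugin before starting the service, Telegraf can be run once in test mode, which gathers metrics and prints them to the terminal instead of sending them to InfluxDB. This sketch wraps that call in a hypothetical vsphere_dry_run helper and counts the vsphere_ lines in the output. As the comment on force_discover_on_init in the configuration notes, the first collection may return few or no metrics while objects are still being discovered, so a count of 0 is not necessarily an error.

```shell
# Minimal sketch: run only the vsphere input once and count the metric lines.
# vsphere_dry_run is a hypothetical helper; the default path is the one used above.
vsphere_dry_run() {
    conf="${1:-/etc/telegraf/telegraf.conf}"
    # --test gathers once and prints each metric prefixed with "> ".
    # grep -c prints the number of vsphere_ lines; its exit status is
    # non-zero when there are none.
    telegraf --config "$conf" --input-filter vsphere --test | grep -c '^> vsphere_'
}

# Example call:
# vsphere_dry_run /etc/telegraf/telegraf.conf
```

Login or certificate errors are written to stderr, so they still show up on the terminal next to the count.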

Restart and enable the Telegraf service after making the changes.

sudo systemctl restart telegraf
sudo systemctl enable telegraf
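If metrics do not show up later, the Telegraf log is the first place to look. The sketch below assumes a systemd host where Telegraf logs to the journal, and it relies on the E! and W! markers Telegraf puts in error and warning lines; telegraf_problems is a hypothetical filter name.

```shell
# Minimal sketch: keep only error (E!) and warning (W!) lines from a Telegraf log.
# telegraf_problems is a hypothetical helper that reads the log on stdin.
telegraf_problems() {
    grep -E ' [EW]! ' || echo "no errors or warnings found"
}

# Example calls (assuming logs go to the systemd journal):
# systemctl is-active telegraf
# journalctl -u telegraf --since "15 minutes ago" --no-pager | telegraf_problems
```

Lines mentioning [inputs.vsphere] usually point at wrong credentials, an unreachable vCenter URL, or certificate verification.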

Step 3: Check InfluxDB Metrics

We need to confirm that our metrics are being pushed to InfluxDB and that we can see them.

Open the InfluxDB shell:

With Authentication:

$ influx -username 'username' -password 'StrongPassword'
Connected to http://localhost:8086 version 1.6.4
InfluxDB shell version: 1.6.4
  • 'username': InfluxDB authentication username
  • 'StrongPassword': InfluxDB password

Without Authentication:

$ influx
Connected to http://localhost:8086 version 1.6.4
InfluxDB shell version: 1.6.4

Switch to the vmware database we configured in Telegraf.

> USE vmware
Using database vmware

Check whether time series metrics are flowing in.

> SHOW MEASUREMENTS
name: measurements
name
----
cpu
disk
diskio
kernel
mem
processes
swap
system
vsphere_cluster_clusterServices
vsphere_cluster_mem
vsphere_cluster_vmop
vsphere_datacenter_vmop
vsphere_datastore_datastore
vsphere_datastore_disk
vsphere_host_cpu
vsphere_host_disk
vsphere_host_mem
vsphere_host_net
vsphere_host_power
vsphere_host_storageAdapter
vsphere_host_sys
vsphere_vm_cpu
vsphere_vm_mem
vsphere_vm_net
vsphere_vm_power
vsphere_vm_sys
vsphere_vm_virtualDisk
> 

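The same check can be scripted without opening the interactive shell, which is handy on a headless server. This sketch uses the influx 1.x CLI's -database and -execute options; both function names are hypothetical, and the -username and -password options shown earlier would need to be added if authentication is enabled.

```shell
# Minimal sketch: non-interactive checks against the vmware database.
# Both helpers are hypothetical names; the database defaults to "vmware".

# Print how many measurements created by the vsphere plugin exist.
vsphere_measurement_count() {
    db="${1:-vmware}"
    # grep -c exits non-zero on a count of 0, so "|| true" keeps the status clean.
    influx -database "$db" -execute 'SHOW MEASUREMENTS' | grep -c '^vsphere_' || true
}

# Show the five most recent host CPU points to confirm data is fresh.
latest_host_cpu() {
    db="${1:-vmware}"
    influx -database "$db" -execute 'SELECT * FROM vsphere_host_cpu ORDER BY time DESC LIMIT 5'
}

# Example calls:
# vsphere_measurement_count vmware
# latest_host_cpu vmware
```

A count of 0 a few minutes after restarting Telegraf suggests the vsphere input is not collecting, even if the generic cpu and mem measurements are present.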
Step 4: Add InfluxDB Data Source to Grafana

Log in to Grafana and add an InfluxDB data source. Specify the server IP, database name, and authentication credentials if applicable.

Give it a name, choose the type, and specify the server IP.

Provide the database name and authentication credentials if applicable.

Save and test the settings.

Step 5: Import Grafana Dashboards

We have configured all dependencies and tested that they work. The last action is to create or import Grafana dashboards that will display vSphere metrics.

In this post, we will use the Grafana dashboards created by Jorge de la Cruz.

Log in to Grafana and navigate to the dashboard import section. Use the dashboard IDs to import.

After a successful import, you should start seeing data on the dashboards.

The visualizations may need a little extra effort to display exactly the metrics that matter for your environment.

Check out other Grafana-related articles available on our blog.

How to Monitor Zimbra Server with Grafana, Influxdb and Telegraf

How to Monitor Redis Server with Prometheus and Grafana in 5 minutes

Monitoring Ceph Cluster with Prometheus and Grafana

How to Monitor BIND DNS server with Prometheus and Grafana

Monitoring MySQL / MariaDB with Prometheus in five minutes

How to Monitor Apache Web Server with Prometheus and Grafana in 5 minutes