Greetings guys! Here is an efficient way to monitor your VMware ESXi infrastructure with Grafana, Telegraf, and InfluxDB. The setup is pretty straightforward, and you should have your VMware metrics visualized in Grafana in less than 30 minutes. Our previous VMware monitoring guide was How To Monitor VMware ESXi Host Using LibreNMS.
This setup uses the official vSphere input plugin for Telegraf to pull metrics from vCenter. These include compute (RAM and CPU), networking, and datastore metrics for vSphere hosts, plus metrics for the virtual machines running on them. So let's get started.
Step 1: Install InfluxDB and Grafana
All collected metrics are stored in an InfluxDB database. Grafana connects to InfluxDB to query the metrics and display them on its dashboards. You need to install both InfluxDB and Grafana before anything else.
Once both InfluxDB and Grafana are installed, proceed to install and configure Telegraf, a metrics collector written in Go.
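Before moving on, it can help to confirm that both services answer over HTTP. The following is a minimal sketch, assuming InfluxDB 1.x on port 8086 (its `/ping` endpoint answers with 204) and Grafana on its default port 3000 (its `/api/health` endpoint answers with 200). The function names are made up for this example, and the hosts and ports are placeholders to adjust.

```shell
#!/bin/sh
# Hypothetical helpers; adjust hosts and ports to your environment.

# Print the HTTP status code for a URL ("000" if the connection fails).
http_code() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Succeed for the codes a healthy service returns:
# 204 from InfluxDB 1.x /ping, 200 from Grafana /api/health.
is_up_code() {
  case "$1" in
    200|204) return 0 ;;
    *)       return 1 ;;
  esac
}

# Report whether a named service responds at the given URL.
check_service() {
  name=$1 url=$2
  if is_up_code "$(http_code "$url")"; then
    echo "$name is up"
  else
    echo "$name is NOT reachable at $url"
    return 1
  fi
}
```

For example, `check_service InfluxDB http://10.10.1.20:8086/ping` and `check_service Grafana http://localhost:3000/api/health` each print one line and return non-zero when the service does not answer.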
Step 2: Install and Configure Telegraf
If you installed InfluxDB from its package repository in Step 1, the repository required for Telegraf is already configured. Install Telegraf with the command that matches your distribution.
# CentOS / RHEL
sudo yum -y install telegraf
# Ubuntu / Debian
sudo apt-get -y install telegraf
After installation, we need to configure Telegraf to pull monitoring metrics from vCenter. Edit the main Telegraf configuration file:
sudo vim /etc/telegraf/telegraf.conf
1. Add the InfluxDB output, the storage backend where metrics will be stored.
# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  urls = ["http://10.10.1.20:8086"]
  database = "vmware"
  timeout = "0s"
  username = "monitoring"
  password = "DBPassword"
Replace 10.10.1.20 with your InfluxDB server IP address. If you don't have authentication enabled on InfluxDB, you can safely remove the username and password lines from the configuration.
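The output block above expects a `vmware` database and, with authentication enabled, a `monitoring` user. Telegraf tries to create the database itself when its user is allowed to, but creating both up front avoids permission surprises. The following is a minimal sketch, assuming InfluxDB 1.x, its `influx` CLI, and an existing admin account. The admin name and the function names are placeholders, and a password containing a single quote would need extra escaping.

```shell
#!/bin/sh
# Print the InfluxQL statements that create the database and a user for Telegraf.
# Arguments: database name, user name, password.
influx_setup_statements() {
  db=$1 user=$2 pass=$3
  printf 'CREATE DATABASE "%s"\n' "$db"
  printf "CREATE USER \"%s\" WITH PASSWORD '%s'\n" "$user" "$pass"
  printf 'GRANT ALL ON "%s" TO "%s"\n' "$db" "$user"
}

# Run the statements one by one through the influx CLI as an admin user.
# Arguments: admin user, admin password, then the same three as above.
influx_setup() {
  admin=$1 admin_pass=$2
  shift 2
  while IFS= read -r stmt; do
    # </dev/null keeps influx from consuming the remaining statements.
    influx -username "$admin" -password "$admin_pass" -execute "$stmt" </dev/null || return 1
  done <<EOF
$(influx_setup_statements "$@")
EOF
}
```

With the values used in this guide, the call would be `influx_setup admin 'AdminDBPassword' vmware monitoring 'DBPassword'`, where the admin credentials are whatever you set when enabling authentication.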
2. Configure the vsphere input plugin for Telegraf. The complete configuration should look similar to this:
# Read metrics from VMware vCenter
[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ "https://10.10.1.2/sdk" ]
  username = "[email protected]"
  password = "AdminPassword"
  #
  ## VMs
  ## Typical VM metrics (if omitted or empty, all metrics are collected)
  vm_metric_include = [
    "cpu.demand.average",
    "cpu.idle.summation",
    "cpu.latency.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "cpu.run.summation",
    "cpu.usagemhz.average",
    "cpu.used.summation",
    "cpu.wait.summation",
    "mem.active.average",
    "mem.granted.average",
    "mem.latency.average",
    "mem.swapin.average",
    "mem.swapinRate.average",
    "mem.swapout.average",
    "mem.swapoutRate.average",
    "mem.usage.average",
    "mem.vmmemctl.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.usage.average",
    "power.power.average",
    "virtualDisk.numberReadAveraged.average",
    "virtualDisk.numberWriteAveraged.average",
    "virtualDisk.read.average",
    "virtualDisk.readOIO.latest",
    "virtualDisk.throughput.usage.average",
    "virtualDisk.totalReadLatency.average",
    "virtualDisk.totalWriteLatency.average",
    "virtualDisk.write.average",
    "virtualDisk.writeOIO.latest",
    "sys.uptime.latest",
  ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default
  #
  ## Hosts
  ## Typical host metrics (if omitted or empty, all metrics are collected)
  host_metric_include = [
    "cpu.coreUtilization.average",
    "cpu.costop.summation",
    "cpu.demand.average",
    "cpu.idle.summation",
    "cpu.latency.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "cpu.swapwait.summation",
    "cpu.usage.average",
    "cpu.usagemhz.average",
    "cpu.used.summation",
    "cpu.utilization.average",
    "cpu.wait.summation",
    "disk.deviceReadLatency.average",
    "disk.deviceWriteLatency.average",
    "disk.kernelReadLatency.average",
    "disk.kernelWriteLatency.average",
    "disk.numberReadAveraged.average",
    "disk.numberWriteAveraged.average",
    "disk.read.average",
    "disk.totalReadLatency.average",
    "disk.totalWriteLatency.average",
    "disk.write.average",
    "mem.active.average",
    "mem.latency.average",
    "mem.state.latest",
    "mem.swapin.average",
    "mem.swapinRate.average",
    "mem.swapout.average",
    "mem.swapoutRate.average",
    "mem.totalCapacity.average",
    "mem.usage.average",
    "mem.vmmemctl.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.errorsRx.summation",
    "net.errorsTx.summation",
    "net.usage.average",
    "power.power.average",
    "storageAdapter.numberReadAveraged.average",
    "storageAdapter.numberWriteAveraged.average",
    "storageAdapter.read.average",
    "storageAdapter.write.average",
    "sys.uptime.latest",
  ]
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default
  #
  ## Clusters
  cluster_metric_include = [] ## if omitted or empty, all metrics are collected
  # cluster_metric_exclude = [] ## Nothing excluded by default
  # cluster_instances = false ## false by default
  #
  ## Datastores
  datastore_metric_include = [] ## if omitted or empty, all metrics are collected
  # datastore_metric_exclude = [] ## Nothing excluded by default
  # datastore_instances = false ## false by default for Datastores only
  #
  ## Datacenters
  datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
  # datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
  # datacenter_instances = false ## false by default for Datastores only
  #
  ## Plugin Settings
  ## separator character to use for measurement and field names (default: "_")
  # separator = "_"
  #
  ## number of objects to retrieve per query for realtime resources (vms and hosts)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_objects = 256
  #
  ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_metrics = 256
  #
  ## number of go routines to use for collection and discovery of objects and metrics
  # collect_concurrency = 1
  # discover_concurrency = 1
  #
  ## whether or not to force discovery of new objects on initial gather call before collecting metrics
  ## when true for large environments this may cause errors for time elapsed while collecting metrics
  ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
  # force_discover_on_init = false
  #
  ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
  # object_discovery_interval = "300s"
  #
  ## timeout applies to any of the api request made to vcenter
  # timeout = "60s"
  #
  ## Optional SSL Config
  # ssl_ca = "/path/to/cafile"
  # ssl_cert = "/path/to/certfile"
  # ssl_key = "/path/to/keyfile"
  ## Use SSL but skip chain & host verification
  insecure_skip_verify = true
The only values to change on your end are:
- 10.10.1.2 should be replaced with your vCenter server IP address or hostname
- [email protected] should be replaced with your vCenter user account
- AdminPassword should be replaced with the password for that account
If your vCenter server has a self-signed certificate, make sure the insecure_skip_verify flag is set to true:
insecure_skip_verify = true
Restart the telegraf service and enable it at boot after making the changes.
sudo systemctl restart telegraf
sudo systemctl enable telegraf
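The vsphere input can also be exercised once in the foreground to see whether it returns anything. The following is a minimal sketch, assuming Telegraf's `--test` mode, which gathers once and prints each metric to stdout as a line starting with `> ` instead of writing to InfluxDB. The helper names are made up for this example. Because `force_discover_on_init` is false by default, a single test run may print few or no vSphere lines while objects are still being discovered.

```shell
#!/bin/sh
# Count metric lines per vSphere measurement in `telegraf --test` output.
# Reads stdin; each metric line looks like "> measurement,tags fields timestamp".
summarize_vsphere() {
  awk '/^> vsphere_/ {
         sub(/^> /, "")            # drop the leading marker
         split($0, part, /[, ]/)   # the measurement name ends at the first comma or space
         count[part[1]]++
       }
       END { for (m in count) print count[m], m }' | sort -k2
}

# Run one collection with only the vsphere input enabled and summarize it.
test_vsphere_input() {
  telegraf --config /etc/telegraf/telegraf.conf --input-filter vsphere --test 2>/dev/null |
    summarize_vsphere
}
```

Calling `test_vsphere_input` prints one line per measurement with the number of metric lines seen. An empty result on the first try is not necessarily an error, so check the Telegraf service log before changing the configuration.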
Step 3: Check InfluxDB Metrics
We need to confirm that our metrics are being pushed to InfluxDB and that we can see them.
Open InfluxDB shell:
$ influx -username 'username' -password 'StrongPassword'
Connected to http://localhost:8086 version 1.6.4
InfluxDB shell version: 1.6.4
- username – InfluxDB authentication username
- StrongPassword – InfluxDB password
Switch to the vmware database we configured in Telegraf:
> USE vmware
Using database vmware
Check that time series metrics are flowing in:
> SHOW MEASUREMENTS
name: measurements
name
----
cpu
disk
diskio
kernel
mem
processes
swap
system
vsphere_cluster_clusterServices
vsphere_cluster_mem
vsphere_cluster_vmop
vsphere_datacenter_vmop
vsphere_datastore_datastore
vsphere_datastore_disk
vsphere_host_cpu
vsphere_host_disk
vsphere_host_mem
vsphere_host_net
vsphere_host_power
vsphere_host_storageAdapter
vsphere_host_sys
vsphere_vm_cpu
vsphere_vm_mem
vsphere_vm_net
vsphere_vm_power
vsphere_vm_sys
vsphere_vm_virtualDisk
>
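The same check can be run non-interactively, for example from a script. The following is a minimal sketch, assuming the InfluxDB 1.x `influx` CLI with its `-database`, `-execute` and `-format` options. The list of expected measurements is a subset picked from the output above, and the function names are illustrative.

```shell
#!/bin/sh
# A few of the measurements the vsphere input should create (subset of the list above).
EXPECTED="vsphere_host_cpu vsphere_host_mem vsphere_vm_cpu vsphere_vm_mem vsphere_datastore_disk"

# Read a measurement listing on stdin and print every expected name that is absent.
missing_measurements() {
  listing=$(cat)
  for m in $EXPECTED; do
    printf '%s\n' "$listing" | grep -qw -- "$m" || printf '%s\n' "$m"
  done
}

# List the measurements in a database (default: vmware) and report gaps.
# Add -username/-password to the influx call if authentication is enabled.
check_vmware_db() {
  influx -database "${1:-vmware}" -execute 'SHOW MEASUREMENTS' -format csv |
    missing_measurements
}
```

Running `check_vmware_db` prints nothing when all expected measurements exist, and otherwise prints the names that have not arrived yet.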
Step 4: Add InfluxDB Data Source to Grafana
Log in to Grafana and add an InfluxDB data source:
- Give it a name, choose InfluxDB as the type, and specify the server IP.
- Provide the database name and authentication credentials, if applicable.
- Save and test the settings.
Step 5: Import Grafana Dashboards
We have configured all the dependencies and tested that they work. The last step is to create or import Grafana dashboards that display the vSphere metrics.
In this post, we will use the great Grafana dashboards created by Jorge de la Cruz.
- Grafana vSphere Overview Dashboard – 8159
- Grafana vSphere Datastore Dashboard – 8162
- Grafana vSphere Hosts Dashboard – 8165
- Grafana vSphere VMs Dashboard – 8168
Log in to Grafana and navigate to the dashboard import section. Use the dashboard IDs above to import each one.
After a successful import, you should start seeing data on the dashboards.
The visualizations may need a little extra tuning to suit your environment and show the specific metrics you care about.
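Both Grafana steps, adding the data source and fetching the dashboards, can also be scripted. The following is a minimal sketch, assuming Grafana's HTTP API (`POST /api/datasources`) with basic authentication and the grafana.com download URL pattern for community dashboards. The URL pattern, the data source name and the function names are assumptions to verify against your Grafana version. Recent Grafana releases may expect the database name under `jsonData.dbName` rather than the top-level `database` field.

```shell
#!/bin/sh
# Dashboard IDs listed above: Overview, Datastore, Hosts, VMs.
DASHBOARD_IDS="8159 8162 8165 8168"

# Print the JSON body for an InfluxDB (InfluxQL) data source.
# Arguments: name, InfluxDB URL, database, user, password.
# Values are inserted verbatim, so they must not contain quotes or backslashes.
datasource_payload() {
  printf '{"name":"%s","type":"influxdb","access":"proxy","url":"%s","database":"%s","user":"%s","secureJsonData":{"password":"%s"}}\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Create the data source. Arguments: Grafana base URL, admin user:password,
# then the five arguments of datasource_payload.
add_datasource() {
  grafana=$1 auth=$2
  shift 2
  datasource_payload "$@" |
    curl -sS -u "$auth" -H 'Content-Type: application/json' \
         -X POST --data-binary @- "$grafana/api/datasources"
}

# Print the assumed grafana.com download URL for the latest revision of a dashboard.
dashboard_url() {
  printf 'https://grafana.com/api/dashboards/%s/revisions/latest/download\n' "$1"
}

# Save each dashboard as vsphere-<id>.json for upload on the import page.
download_dashboards() {
  for id in $DASHBOARD_IDS; do
    curl -sSfL -o "vsphere-$id.json" "$(dashboard_url "$id")" || return 1
  done
}
```

A call such as `add_datasource http://localhost:3000 admin:YourGrafanaPassword vmware-influx http://10.10.1.20:8086 vmware monitoring DBPassword` would create the data source, and `download_dashboards` would leave four JSON files that can be uploaded on the import page instead of typing the IDs.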
Check out other Grafana related articles available on our blog.