When working with GPU-enabled Azure virtual machines — such as those used for AI training, inference, or video rendering — you often need to monitor GPU utilization. While Azure Monitor provides basic metrics for CPU, memory, and disk, it doesn’t capture GPU-level data out of the box.
In this post, I’ll show you how to collect GPU metrics using nvidia-smi and Telegraf, and push them to Azure Monitor for analysis, alerting, or even autoscaling.
🧱 Use Case
Let’s say you’re running machine learning workloads on an Azure NC-series or ND-series VM, and want to:
- Monitor GPU usage
- Trigger autoscaling when GPU load is high or low
- Visualize GPU metrics in Azure Workbooks
- Set alerts when GPU memory is close to full
Azure Monitor doesn’t natively collect GPU metrics — but with Telegraf, you can.
🧰 Prerequisites
- An Azure VM with a GPU (e.g., NC-, ND-, or NV-series)
- NVIDIA drivers and nvidia-smi installed
- The Microsoft.Insights resource provider registered in your subscription (a CLI snippet follows this list)
- Outbound internet access to Azure Monitor endpoints
- Root / Admin access to install Telegraf
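If the Microsoft.Insights provider still needs to be registered, one way to do it is with the Azure CLI, run against your subscription:
# Check the current registration state
az provider show --namespace Microsoft.Insights --query registrationState -o tsv
# Register the provider if it reports "NotRegistered"
az provider register --namespace Microsoft.Insights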
🛠 Step-by-Step Guide
1. Create Virtual Machine
- Choose an image with nvidia-smi preinstalled
- Choose a VM size with GPU
- Enable a managed identity, either during deployment or later on the Identity blade of the existing VM (a CLI example follows this list)

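If you prefer the command line, enabling a system-assigned managed identity on an existing VM can look like this (resource group and VM name are placeholders):
# Enable a system-assigned managed identity on an existing VM
az vm identity assign --resource-group <resource-group> --name <vm-name>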
2. Install Telegraf
Before you install Telegraf, I recommend checking that the nvidia-smi tool is available.
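The following one-liner works from PowerShell on Windows as well as from a Linux shell, assuming nvidia-smi is on the PATH (adjust the path if it isn't):
# Query GPU name, utilization and memory once to confirm the driver and tool are working
nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total --format=csv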

Windows
- Open PowerShell
- Execute the following commands
# Set the URL for the PowerShell script
$scriptUrl = "https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpumon-win.ps1"
# Download and execute the script
Invoke-WebRequest -Uri $scriptUrl -UseBasicParsing | Invoke-Expression
This automatically downloads and installs Telegraf and starts uploading the GPU data to the monitoring section of your VM.

You can adjust your telegraf.conf if you want specific data, but by default GPU data is sent to the Azure Monitor metrics of your virtual machine. You can find the config file in the Telegraf install path.
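For orientation, here is a minimal sketch of what the relevant sections can look like, assuming the config uses Telegraf's built-in nvidia_smi input and azure_monitor output plugins; the install script may instead wire nvidia-smi up via the exec plugin, so treat this as a reference rather than the exact generated file:
# Read metrics from nvidia-smi (adjust bin_path, e.g. on Windows)
[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"
  timeout = "5s"

# Write the collected metrics as Azure Monitor custom metrics of this VM
# (authenticates via the VM's managed identity)
[[outputs.azure_monitor]]
  namespace_prefix = "Telegraf/"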
To test your config use the following command:
telegraf --config <installpath>/telegraf.conf --test
Linux
- Open your preferred shell
- Execute the following commands
wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpumon-setup.sh -O gpumon-setup.sh
chmod +x gpumon-setup.sh
./gpumon-setup.sh
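Once the script has finished, you can verify that everything works; the commands below assume Telegraf was installed as a systemd service with its default config path:
# Check that the Telegraf service is running
sudo systemctl status telegraf
# Run the config once in test mode and print the collected GPU metrics
telegraf --config /etc/telegraf/telegraf.conf --test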
3. Review / Monitor
It can take some time for the first data to arrive, but it should show up within 5-10 minutes.
- Open the VM resource
- Select the Monitoring / Metrics blade
- Change the Metric Namespace to the namespace you configured in your telegraf.conf
- By default it is Telegraf/
- Select Telegraf/nvidia-smi for GPU metrics
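You can also query the data from the command line. The example below is a sketch using the Azure CLI; the metric name utilization_gpu is an assumption based on the default nvidia_smi field names, so check the Metrics blade for the names your VM actually emits:
# List recent GPU utilization values from the custom Telegraf namespace
az monitor metrics list --resource <vm-resource-id> --namespace "Telegraf/nvidia-smi" --metric "utilization_gpu" --interval PT5M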


📈 Using Metrics in Azure
You can now:
- Visualize GPU load over time, as described above
- Create alerts for thresholds (e.g., GPU memory usage > 90%); a CLI sketch follows this list
- Build scaling rules (via Azure VMSS or Automation)
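As a sketch, a metric alert on the custom namespace can be created with the Azure CLI; the namespace and metric name (utilization_gpu) are assumptions based on the default Telegraf setup, and the condition syntax for custom namespaces may need adjusting to what your VM actually emits:
# Alert when average GPU utilization stays above 90% over a 5-minute window
# (namespace/metric names and condition quoting are assumptions)
az monitor metrics alert create --name gpu-load-high --resource-group <resource-group> --scopes <vm-resource-id> --condition "avg 'Telegraf/nvidia-smi'.utilization_gpu > 90" --window-size 5m --evaluation-frequency 1m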
🚀 Bonus: Autoscaling by GPU Usage
While Azure doesn’t offer native GPU-based autoscaling, you can build a solution by:
- Exporting GPU metrics to Log Analytics
- Creating alerts that trigger an Azure Automation Runbook
- Letting that Runbook scale your VMSS or start/stop VMs; a sketch follows below
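A minimal Runbook sketch in PowerShell could look like the following; the resource names are placeholders, and it assumes the Automation account's managed identity has permission to modify the scale set:
# Sign in with the Automation account's managed identity
Connect-AzAccount -Identity
# Add one instance to the scale set (names are placeholders)
$vmss = Get-AzVmss -ResourceGroupName "<resource-group>" -VMScaleSetName "<vmss-name>"
$vmss.Sku.Capacity = $vmss.Sku.Capacity + 1
Update-AzVmss -ResourceGroupName "<resource-group>" -VMScaleSetName "<vmss-name>" -VirtualMachineScaleSet $vmss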
Hint: Combine with Azure Monitor Alerts and Logic Apps for a low-code option.
You can also try to use the custom metrics directly in your VMSS autoscale settings.
📝 Summary
Telegraf provides a powerful workaround for collecting GPU metrics on Azure VMs. By leveraging nvidia-smi and Telegraf’s exec plugin, you can get deep GPU insights into Azure Monitor — enabling better dashboards, alerts, and even autoscaling decisions.