
Collecting GPU Metrics on Azure VMs with Telegraf

When working with GPU-enabled Azure virtual machines — such as those used for AI training, inference, or video rendering — you often need to monitor GPU utilization. While Azure Monitor provides basic metrics for CPU, memory, and disk, it doesn’t capture GPU-level data out of the box.

In this post, I’ll show you how to collect GPU metrics using nvidia-smi and Telegraf, and push them to Azure Monitor for analysis, alerting, or even autoscaling.

🧱 Use Case

Let’s say you’re running machine learning workloads on an Azure NC-series or ND-series VM, and want to:

  • Monitor GPU usage
  • Trigger autoscaling when GPU load is high or low
  • Visualize GPU metrics in Azure Workbooks
  • Set alerts when GPU memory is close to full

Azure Monitor doesn’t natively collect GPU metrics — but with Telegraf, you can.

🧰 Prerequisites

  • An Azure VM with GPU (e.g. NC, ND, NV series)
  • NVIDIA drivers and nvidia-smi installed
  • The “Microsoft.Insights” resource provider registered in your subscription (a CLI example follows after this list)
  • Outbound internet access to Azure Monitor endpoints
  • Root / Admin access to install Telegraf
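
For the resource provider, a one-time registration per subscription is enough; with the Azure CLI this looks like the following (any shell works):

# Register the Microsoft.Insights resource provider (one-time per subscription)
az provider register --namespace Microsoft.Insights
# Verify the registration state
az provider show --namespace Microsoft.Insights --query registrationState -o tsv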

🛠 Step-by-Step Guide

1. Create Virtual Machine

  • Choose an image with the NVIDIA drivers and nvidia-smi preinstalled, or install them afterwards
  • Pick a GPU-enabled VM size (e.g. NC, ND or NV series)
  • Enable a system-assigned managed identity during deployment, or later via the Identity blade of the existing VM (a CLI sketch follows below)
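
As a rough Azure CLI sketch (resource group, VM name, image and size are placeholders; pick whatever fits your workload):

# Create a GPU VM with a system-assigned managed identity (all names/values are placeholders)
az vm create \
  --resource-group my-rg \
  --name my-gpu-vm \
  --image Ubuntu2204 \
  --size Standard_NC6s_v3 \
  --assign-identity \
  --generate-ssh-keys
# Note: a plain Ubuntu image does not ship nvidia-smi; install the NVIDIA drivers afterwards
# or choose a GPU-ready marketplace image instead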

2. Install Telegraf

Before you install Telegraf, I recommend checking that the nvidia-smi tool is available, for example with the command below.
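
A quick check that works in both PowerShell and a Linux shell (assuming the NVIDIA drivers are installed and nvidia-smi is on the PATH):

# Show GPU name, driver version and current utilization/memory usage
nvidia-smi --query-gpu=name,driver_version,utilization.gpu,memory.used,memory.total --format=csv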

Windows

  1. Open PowerShell
  2. Execute the following commands
# Set the URL for the PowerShell script
$scriptUrl = "https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpumon-win.ps1"
# Download and execute the script
Invoke-WebRequest -Uri $scriptUrl -UseBasicParsing | Invoke-Expression

This automatically downloads and installs Telegraf and starts uploading GPU data to the Monitoring section of your VM.

You can adjust your telegraf.conf if you want to collect specific data; by default, GPU data is sent to the Azure Monitor metrics of your virtual machine. You can find the config file within the install path of Telegraf.
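
If you prefer to build the config yourself, Telegraf also ships a built-in nvidia_smi input plugin. A minimal sketch of such a telegraf.conf (the file generated by the script above may look different, e.g. it may use the exec plugin instead):

[[inputs.nvidia_smi]]
  # bin_path = "/usr/bin/nvidia-smi"   # adjust on Windows to the nvidia-smi.exe location

[[outputs.azure_monitor]]
  # namespace_prefix = "Telegraf/"     # default prefix for the custom metric namespace
  # region and resource_id are auto-detected from the Azure Instance Metadata Service;
  # the VM's managed identity is used to authenticate against Azure Monitor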

To test your config use the following command:

telegraf --config <installpath>/telegraf.conf --test

Linux

  1. Open your preferred shell
  2. Execute the following commands
wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpumon-setup.sh -O gpumon-setup.sh
chmod +x gpumon-setup.sh
./gpumon-setup.sh

3. Review / Monitor

It can take a few minutes for the first data points to arrive, but they should show up after 5-10 minutes.

  1. Open the VM resource
  2. Select the Monitoring / Metrics blade
  3. Change the Metric Namespace to the namespace you configured in your telegraf.conf
    • By default it is Telegraf/
  4. Select Telegraf/nvidia-smi for GPU metrics

📈 Using Metrics in Azure

You can now:

  • Visualize GPU load over time as described before
  • Create alerts for thresholds (e.g., mem_used > 90%; see the example after this list)
  • Build scaling rules (via Azure VMSS or Automation)
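
As a sketch, such an alert could be created on the VM's custom metric namespace with the Azure CLI (resource group, VM name and metric name are placeholders; verify the exact namespace and metric names in the Metrics blade first, as the condition syntax below may need adjusting):

# Alert when GPU memory usage exceeds 90% (all names are placeholders)
az monitor metrics alert create \
  --name gpu-memory-high \
  --resource-group my-rg \
  --scopes $(az vm show -g my-rg -n my-gpu-vm --query id -o tsv) \
  --condition "avg 'Telegraf/nvidia-smi'.mem_used > 90" \
  --window-size 5m --evaluation-frequency 1m \
  --description "GPU memory usage above 90%"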

🚀 Bonus: Autoscaling by GPU Usage

While Azure doesn’t offer native GPU-based autoscaling, you can build a solution by:

  1. Exporting GPU metrics to Log Analytics
  2. Creating alerts that trigger an Azure Automation Runbook
  3. Letting that Runbook scale your VMSS (or start/stop VMs), as sketched below
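
At its core, such a Runbook just changes the capacity of the scale set. As a CLI illustration of that step (an actual Runbook would typically do the equivalent with the Az PowerShell module; resource group and scale set names are placeholders):

# Scale the VM scale set out by one instance (names are placeholders)
current=$(az vmss show -g my-rg -n my-gpu-vmss --query sku.capacity -o tsv)
az vmss scale -g my-rg -n my-gpu-vmss --new-capacity $((current + 1))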

Hint: Combine with Azure Monitor Alerts and Logic Apps for a low-code option.

You can also try using the metric data directly within your VMSS scaling settings.

📝 Summary

Telegraf provides a powerful workaround for collecting GPU metrics on Azure VMs. By leveraging nvidia-smi and Telegraf’s exec plugin, you can push deep GPU insights into Azure Monitor, enabling better dashboards, alerts, and even autoscaling decisions.

Simon

Cloud Engineer focused on Azure, Terraform & PowerShell. Passionate about automation, efficient solutions, and sharing real-world cloud projects and insights.
