Monitoring GPU temperatures with nvidia-smi and Check MK (OMD)
In the previous post on this subject we used code from Technische Universität Kaiserslautern to monitor our GPUs with OMD Checkmk (now Checkmk Raw). With some new RTX 2080s installed this broke, because on these cards the nvidia-smi check doesn’t report anything for ECC errors, whereas previous cards reported 0. The solution was to remove the ECC checking completely.
The new scripts are:
On the client system in /usr/lib/check_mk_agent/local/ (or plugins/)
nvidia-smi
#!/bin/bash
# Checkmk agent script: prints one line per GPU in the nvidia_smi section.
# Needs bash (for arrays) and xml_grep.
if command -v nvidia-smi >/dev/null; then
    echo '<<<nvidia_smi>>>'
    # Dump the full report as XML once, then pull the individual fields out of it.
    xml=/tmp/.check_mk_nvidia_smi
    nvidia-smi -q -x > "$xml"
    cards=$(xml_grep --text_only 'nvidia_smi_log/attached_gpus' "$xml" | tr -d ' ')
    # One array element per GPU: split on newlines only, and strip spaces
    # so that each value stays a single column in the output line.
    IFS=$'\n'
    names=($(xml_grep --text_only 'nvidia_smi_log/gpu/product_name' "$xml" | tr -d ' '))
    fan_speed=($(xml_grep --text_only 'nvidia_smi_log/gpu/fan_speed' "$xml" | tr -d ' '))
    gpu_utilization=($(xml_grep --text_only 'nvidia_smi_log/gpu/utilization/gpu_util' "$xml" | tr -d ' '))
    mem_utilization=($(xml_grep --text_only 'nvidia_smi_log/gpu/utilization/memory_util' "$xml" | tr -d ' '))
    temperature=($(xml_grep --text_only 'nvidia_smi_log/gpu/temperature/gpu_temp' "$xml" | tr -d ' '))
    power_draw=($(xml_grep --text_only 'nvidia_smi_log/gpu/power_readings/power_draw' "$xml" | tr -d ' '))
    power_limit=($(xml_grep --text_only 'nvidia_smi_log/gpu/power_readings/power_limit' "$xml" | tr -d ' '))
    for ((index = 0; index < cards; index++)); do
        # Drop the unit suffixes (%, C, W) so that only the numbers remain.
        fan_speed[$index]=${fan_speed[$index]/\%/}
        gpu_utilization[$index]=${gpu_utilization[$index]/\%/}
        mem_utilization[$index]=${mem_utilization[$index]/\%/}
        temperature[$index]=${temperature[$index]/C/}
        power_draw[$index]=${power_draw[$index]/W/}
        power_limit[$index]=${power_limit[$index]/W/}
        echo "$index ${names[$index]} ${fan_speed[$index]} ${gpu_utilization[$index]} ${mem_utilization[$index]} ${temperature[$index]} ${power_draw[$index]} ${power_limit[$index]}"
    done
fi
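The loop in the script relies on bash pattern substitution to strip the unit from each reading. Here is a minimal sketch of that step on its own. The sample values are made up for illustration and are not real nvidia-smi output; the N/A case assumes a card that does not report a fan speed.

```shell
#!/bin/bash
# Invented sample readings, with spaces already removed as after `tr -d ' '`.
fan_speed='35%'
temperature='41C'
power_draw='22.51W'
no_fan='N/A'   # assumption: a field the card does not report

# ${var/pattern/} deletes the first match of pattern.
# The % must be escaped, otherwise bash treats it as an end-of-string anchor.
fan_speed=${fan_speed/\%/}
temperature=${temperature/C/}
power_draw=${power_draw/W/}
no_fan=${no_fan/\%/}

echo "$fan_speed $temperature $power_draw $no_fan"
```

Note that a non-numeric value such as N/A passes through unchanged, so whatever reads these columns on the server has to cope with it.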
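Don’t forget to make the script executable! You also need xml_grep installed; it ships with the Perl XML::Twig tools (the xml-twig-tools package on Debian and Ubuntu, for example).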
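Both prerequisites are easy to check before waiting for the next agent run. The following is only a rough sketch: preflight is a hypothetical helper, not part of Checkmk, and the path assumes the script was saved as nvidia-smi in the local/ directory mentioned above.

```shell
#!/bin/bash
# Hypothetical helper: report anything that would stop the script from running.
# First argument is the script path, the remaining arguments are commands it needs.
preflight() {
    local plugin=$1 cmd status=0
    shift
    if [ ! -x "$plugin" ]; then
        echo "not executable: $plugin (fix with: chmod +x $plugin)"
        status=1
    fi
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd"
            status=1
        fi
    done
    return $status
}

# Assumed location; adjust if the script lives in plugins/ instead.
preflight /usr/lib/check_mk_agent/local/nvidia-smi nvidia-smi xml_grep || true
```

If this prints nothing, the script is executable and both commands are on the PATH.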