Nvmlfan – a thinkfan-like daemon to control NVIDIA GPU fans

I recently upgraded my home server, and as an indirect result, I now have three more PCIe slots than before. So, I decided to add a GPU for video transcoding and machine learning acceleration (though I’m still debating whether I really need it).

I bought an NVIDIA T600 but noticed some strange behavior: under load, it heats up to over 80°C. Even then, nvidia-smi reports that the fan is set below maximum speed. I tried adjusting the GPU’s target temperature and encountered two issues:

  1. The temperature setting immediately resets to default. (It turns out that persistence mode needs to be enabled to prevent this.)
  2. Even with persistence mode, the adjustment doesn’t work.

The GPU (or its drivers) ignores the target temperature setting, which is a problem. First, I don’t want anything running hot in my server. Second, and more importantly, it causes thermal stress.

I tried searching for something like thinkfan for GPUs but didn’t find anything useful. Most solutions for controlling the fan speed on NVIDIA GPUs seem to rely on nvidia-settings, which requires a functioning X server with NVIDIA drivers. That feels like overkill for me.

Fortunately, I discovered that it’s still possible to control the fans without an X server by using libnvidia-ml. After spending 15 minutes with ChatGPT and half a day making it work (half a day and 15 minutes of hating Go in total), I finally got it running.

The key difference from thinkfan is that nvmlfan has two modes. The first is the standard curve mode, which is defined like this:

cards:
  0:
    mode: curve
    curve:
      - [ 60, 30 ]
      - [ 65, 50 ]
      - [ 75, 100 ]

This maps the GPU’s temperature to a corresponding fan speed. The curve specifies anchor points in the format [ temperature, fan_speed ], and values between anchor points are interpolated linearly.

  • Temperatures below the first anchor point are mapped to the fan speed of the first point or the minimum speed allowed by the GPU BIOS/driver.
  • If the last anchor point’s temperature is below the GPU’s maximum threshold, the fan speed is interpolated linearly from the last point’s value to 100%.

Note: Fan speeds are controlled as percentages, not RPM values.
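
The curve lookup described above can be sketched in a few lines of Python. This only illustrates the interpolation rule and is not nvmlfan's actual Go code. The function name curve_speed and the max_temp parameter (standing in for the GPU's maximum threshold) are assumptions made for this example, and the minimum speed enforced by the BIOS/driver is ignored.

```python
def curve_speed(temp, curve, max_temp=None):
    """Map a temperature to a fan speed in percent by linear interpolation.

    curve    -- [temperature, fan_speed] anchor points, strictly increasing by temperature
    max_temp -- hypothetical stand-in for the GPU's maximum temperature threshold
    """
    first_t, first_s = curve[0]
    if temp <= first_t:
        # Below the first anchor point: use the first point's speed.
        return float(first_s)

    points = [tuple(p) for p in curve]
    last_t, last_s = points[-1]
    if max_temp is not None and max_temp > last_t and last_s < 100:
        # Extend the curve from the last anchor point up to 100% at max_temp.
        points.append((max_temp, 100))

    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if t0 <= temp <= t1:
            return s0 + (s1 - s0) * (temp - t0) / (t1 - t0)

    # Above the last point: hold its speed.
    return float(points[-1][1])
```

With the curve from the config above, curve_speed(62.5, [[60, 30], [65, 50], [75, 100]]) lands halfway between the 30% and 50% anchors.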

The second mode does what the nvidia-smi GPU target temperature setting should do (shame on you, NVIDIA): in this mode nvmlfan tries to maintain a constant temperature.

cards:
  0:
    mode: target
    target: 65
    pid: [ 20, 0.1, 0 ]

Of course, it won’t heat up the GPU if the temperature is below the target. However, it will actively counteract the heat generated under load, which helps minimize thermal stress.

Unfortunately, there are two drawbacks:

  1. I haven’t found a better way to achieve this than using a PID controller.
  2. Tuning PID parameters is notoriously difficult—it’s practically rocket science.

There are countless articles and entire books on how to tune PID parameters, and even a whole fucking discipline called control theory. Master PID tuning on your GPU and it will help when you try to build a rocket.

There are no universal PID parameters that work for everyone (though, when properly tuned, they should be the same for identical models of GPUs). Fortunately, controlling GPU temperatures with a fan creates a relatively inertial system, so it’s less prone to oscillation.

In the [ 20, 0.1, 0 ] array:

  1. The first number (P) is the proportional parameter. It controls how much the fan speed changes when the error (the difference between the target temperature and the actual temperature) equals one. For example, with a target temperature of 65°C and an actual temperature of 70°C, the error is 5, so with P = 20 the proportional term is 20 × 5 = 100 and the fan speed would be set to 100%. However, setting the fan to 100% counters the heat, causing the temperature to decrease. This, in turn, reduces the fan speed, which can lead to the temperature rising again, and so on. If the P parameter is too high, this cycle can cause the system to oscillate.
  2. The second number (I) is the integral parameter. Since the proportional component is quite rough, the integral component adjusts slowly over time to ensure the fan speed perfectly matches the load, maintaining the target temperature.
  3. The third number (D) is the derivative parameter. It reacts to the rate at which the temperature changes. For systems with significant inertia, such as this one, the derivative component can often be omitted. If you’re curious about how to use it effectively, you’ll need to dive into some control theory books.
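
To make the three terms concrete, here is a minimal PID sketch in Python. It is not nvmlfan's implementation: the class name FanPID, the one-second default time step, the clamping to 0..100% and the simple integral limit are all assumptions chosen for illustration.

```python
class FanPID:
    """Toy PID controller that turns a temperature reading into a fan speed in percent."""

    def __init__(self, target, kp, ki, kd, min_speed=0.0, max_speed=100.0):
        self.target = target
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_speed, self.max_speed = min_speed, max_speed
        self.integral = 0.0
        self.prev_error = None

    def step(self, temperature, dt=1.0):
        # Positive error means the GPU is hotter than the target.
        error = temperature - self.target

        # I term: accumulate the error over time.
        self.integral += error * dt
        if self.ki > 0:
            # Assumed anti-windup: keep the I contribution within the fan's range.
            lo, hi = self.min_speed / self.ki, self.max_speed / self.ki
            self.integral = max(lo, min(hi, self.integral))

        # D term: rate of change of the error (zero on the first sample).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error

        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(self.min_speed, min(self.max_speed, raw))
```

With the parameters from the config, FanPID(65, 20, 0.1, 0).step(70) saturates at 100%, which matches the worked example for the P term.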

Another drawback is that target mode can create additional noise by frequently changing the fan speed. Since the target speed is recalculated every second, the variations might be noticeable, especially if the system begins to oscillate.

With that said, here’s the repository for the project: https://github.com/IvanBayan/nvmlfan

PS
I take no responsibility if someone damages their GPU using this. Use it at your own risk.

How to just send logs from files to graylog2

This solution reads logs from files and simply sends them to a remote syslog/Graylog server. The logs do not interfere with your current syslog settings, and you won’t need to filter them out of any syslog facility (like local7). All you need is rsyslog (I used v8).

My task was to ship logs written by a Java application (log4j, if I remember correctly). They were rotated by logrotate with truncation, so a few specific options were added.
I replaced %APP-NAME% in rsyslog’s template (RSYSLOG_SyslogProtocol23Format) with %syslogtag%%$.suffix% so that I could tell which file each log message was read from.

In my opinion, it’s better to write logs in a format that can be parsed easily, or to send them straight to the remote location. But if you need to do it quickly without modifying the application, this is an appropriate solution. Just copy the config below into a file like /etc/rsyslog.d/99-graylog.conf and adjust TARGET.ADDRESS, TARGET.PORT, the app_ tag and the File setting for your environment.

module(load="imfile")

template(
    name="SyslogProtocol23Format_modified" type="string"
    string="<%PRI%>1 %TIMESTAMP:::date-rfc3339% %HOSTNAME% %syslogtag%%$.suffix% %PROCID% %MSGID% %STRUCTURED-DATA% %msg%\n"
)

ruleset(name="sendToLogserver") {
    action(type="omfwd" Target="TARGET.ADDRESS" Port="TARGET.PORT" Template="SyslogProtocol23Format_modified")
}

ruleset(name="app_logs") {
    set $.suffix=re_extract($!metadata!filename, "(.*)/([^/]*)", 0, 2, "unknown.log");
    call sendToLogserver
    stop
}

input(
    type="imfile"
    File="/var/log/app_logs/*.log"
    Tag="app_"
    Ruleset="app_logs"
    freshStartTail="on"
    addMetadata="on"
)
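
To see what the re_extract() call in the app_logs ruleset produces, the same regular expression can be tried in Python. This is only an illustration: rsyslog uses POSIX regular expressions, which behave the same way for this pattern, and the file name billing.log and the helper name tag_suffix are made up for the example.

```python
import re


def tag_suffix(filename, default="unknown.log"):
    """Mimic re_extract(filename, "(.*)/([^/]*)", 0, 2, "unknown.log")."""
    match = re.search(r"(.*)/([^/]*)", filename)
    # Submatch 2 is everything after the last slash, i.e. the file's base name.
    return match.group(2) if match else default


# The template concatenates the tag and the suffix, so the receiver sees "app_billing.log".
print("app_" + tag_suffix("/var/log/app_logs/billing.log"))
```

Each file matched by the wildcard therefore shows up under its own application name on the log server.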

In my case the application wrote multi-line log messages, so startmsg.regex was used. The logs were also rotated by logrotate with the truncate method, so the additional option reopenOnTruncate was used. My input section looked like this:

input(
    type="imfile"
    File="/var/log/app_logs/*.log"
    Tag="app_"
    Ruleset="app_logs"
    freshStartTail="on"
    addMetadata="on"
    startmsg.regex="^[0-9]{4}-[0-9]{2}-[0-9]{2} "
    reopenOnTruncate="on"
)
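
The effect of startmsg.regex can be illustrated with a small Python sketch: every line that matches the pattern starts a new message, and all other lines are appended to the previous one. The function name group_messages and the sample log lines are hypothetical, and joining with a plain newline is a simplification of what imfile actually emits.

```python
import re

# Same pattern as in the input() section: a line starting with a date like "2016-08-26 ".
START = re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2} ")


def group_messages(lines):
    """Group raw log lines into messages, starting a new one at every date-prefixed line."""
    messages = []
    for line in lines:
        if START.match(line) or not messages:
            messages.append(line)
        else:
            # Continuation line (e.g. part of a stack trace): glue it to the current message.
            messages[-1] += "\n" + line
    return messages


sample = [
    "2016-08-26 12:00:00 ERROR something failed",
    "java.lang.RuntimeException: boom",
    "\tat Foo.bar(Foo.java:1)",
    "2016-08-26 12:00:01 INFO recovered",
]
for message in group_messages(sample):
    print(repr(message))
```

The stack trace lines end up in the same message as the ERROR line instead of arriving as three separate syslog messages.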

Converting SNMP enumerations to Zabbix value mappings

Many of those who have tried to use Zabbix to monitor SNMP-capable devices have faced the need to create value mappings. It’s fine to create them by hand if a mapping contains only a few values and you don’t have many metrics that use ‘named numbers’.
For those who have not had the fortune to run into this, I will explain. An enumeration is an agreement on how to encode different states, types or similar things using only integer values. For example, let’s look at SNMPv2-MIB::snmpEnableAuthenTraps:

% snmptranslate -Td SNMPv2-MIB::snmpEnableAuthenTraps
SNMPv2-MIB::snmpEnableAuthenTraps
snmpEnableAuthenTraps OBJECT-TYPE
 -- FROM SNMPv2-MIB
 SYNTAX INTEGER {enabled(1), disabled(2)} 
 MAX-ACCESS read-write
 STATUS current
 DESCRIPTION "Indicates whether the SNMP entity is permitted to
 generate authenticationFailure traps. The value of this
 object overrides any configuration information; as such,
 it provides a means whereby all authenticationFailure
 traps may be disabled.
 
Note that it is strongly recommended that this object
 be stored in non-volatile memory so that it remains
 constant across re-initializations of the network
 management system."
::= { iso(1) org(3) dod(6) internet(1) mgmt(2) mib-2(1) snmp(11) 30 }

Here you can see that the integer ‘1’ encodes ‘enabled’ and ‘2’ encodes ‘disabled’, so if you want to see a human-friendly ‘enabled/disabled’ in Zabbix, you first need to create a value mapping. It’s not a difficult task if your mapping is as small as this one, but it’s a pain in the ass if it consists of many values. For example, IF-MIB::ifType consists of 254 values. For completeness, I should say that before Zabbix 3.0 there was no supported way to automate this.

When I first searched for a solution, I found the script in feature request ZBXNEXT-1424.
Unfortunately, it will break your DB (I wrote about that in “Dirty hack to add values mappings in Zabbix”). Zabbix 3.0 introduced a value mapping API, so now you can import and export mappings in XML format or manage them via RPC.

Looks like it’s time for some Perl magic. Ta-da! A script that generates a value mapping in XML format for a specified OID. I put it on GitHub: https://github.com/IvanBayan/Zabbix-oid2valuemapping, where you will find the requirements and usage examples. In short, you type something like this in the console:

% perl ./oid2valuemapping.pl --oid SNMPv2-MIB::snmpEnableAuthenTraps

And it will generate something like this:

<?xml version='1.0' standalone='yes'?>
<zabbix_export>
  <date>2016-08-26T14:51:09Z</date>
  <value_maps>
    <value_map>
      <name>snmpEnableAuthenTraps</name>
      <mappings>
        <mapping>
          <newvalue>disabled</newvalue>
          <value>2</value>
        </mapping>
        <mapping>
          <newvalue>enabled</newvalue>
          <value>1</value>
        </mapping>
      </mappings>
    </value_map>
  </value_maps>
  <version>3.0</version>
</zabbix_export>

You only need a few additional Perl modules and a working SNMP configuration.
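
The core of the conversion is small enough to sketch in Python. This is not the Perl script from the repository, just an illustration of the idea: take the SYNTAX line printed by snmptranslate, pull out the label(number) pairs and wrap them in the export XML. The function names parse_enum and value_map_xml are made up for this example, and the <date> element is left out.

```python
import re
import xml.etree.ElementTree as ET


def parse_enum(syntax_line):
    """Extract {number: label} pairs from a 'SYNTAX INTEGER {...}' line."""
    pairs = re.findall(r"([\w-]+)\((\d+)\)", syntax_line)
    return {int(number): label for label, number in pairs}


def value_map_xml(name, enum):
    """Build a Zabbix 3.0 style value map export for one enumeration."""
    root = ET.Element("zabbix_export")
    value_map = ET.SubElement(ET.SubElement(root, "value_maps"), "value_map")
    ET.SubElement(value_map, "name").text = name
    mappings = ET.SubElement(value_map, "mappings")
    for value, label in sorted(enum.items()):
        mapping = ET.SubElement(mappings, "mapping")
        ET.SubElement(mapping, "newvalue").text = label
        ET.SubElement(mapping, "value").text = str(value)
    ET.SubElement(root, "version").text = "3.0"
    return ET.tostring(root, encoding="unicode")


enum = parse_enum("SYNTAX INTEGER {enabled(1), disabled(2)}")
print(value_map_xml("snmpEnableAuthenTraps", enum))
```

The real script has to do more than this (resolving the OID through the MIB files, for one), so treat the sketch only as a picture of the data flow.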

Dirty hack to add values mappings in Zabbix

“I’ll be brief.” ©
There are two things to know about the script published in ZBXNEXT-1424: first, it can help you automate the creation of large mappings (and that’s cool); second, it will break your DB (not so cool, maaan).
When you try to add a mapping in a broken DB, you will see an error like this:

Error in query [INSERT INTO valuemaps (name,valuemapid) VALUES ('Test mapping','50')] [Duplicate entry '50' for key 'PRIMARY']

This means that the valuemaps table already has an entry with valuemapid = 50. I’ll explain why it happened after we fix the DB.

To fix the DB, you need to update a couple of entries in the ‘ids’ table. First, update nextid where table_name = 'valuemaps':

mysql> update ids set nextid = (select max(valuemaps.valuemapid)+1 from valuemaps) where table_name = 'valuemaps';
Query OK, 1 row affected (0.22 sec)
Rows matched: 1 Changed: 1 Warnings: 0

Second, update nextid for mappings:

mysql> update ids set nextid = (select max(mappings.mappingid)+1 from mappings) where table_name = 'mappings';
Query OK, 1 row affected (0.22 sec)
Rows matched: 1 Changed: 1 Warnings: 0

Here it is!

This happened because the script does not update the ids table. Maybe that’s OK for Zabbix 2.0, which is mentioned in the feature request, but it breaks the database for Zabbix 2.2 and newer. Unfortunately, Zabbix before version 3.0 has neither an API nor the ability to import mappings, so the script is still useful.

Here is the fixed script; I hope the author will not be offended:

#!/usr/bin/perl

use warnings;
use strict;

my $usage = "$0 valueMapName number newvalue [number2 newvalue2 [...]]
E.g.:
  $0 'Alarm Status' 1 ok 2 unknown 3 stale 4 problem
  $0 'Aliveness' 0 dead 1 alive
";

my $valueMapName = shift() || die "No new valuemap name";
my @mapList = @ARGV;
die "No mappings given. Usage: $usage\n" if scalar(@mapList) == 0;

my $isEvenNumber = scalar(@mapList) % 2 == 0;
die "Must give mapping->value pairs. Usage: $usage\n" if not $isEvenNumber;
my %mappings = @mapList;

# The script treats ids.nextid as the last ID already used, so new rows start at nextid + 1.
my $newValueMapId = int(qx/mysql -N -s -e 'select nextid from zabbix.ids where field_name = "valuemapid"'/) ||
    die("Can't fetch max valuemapid\nUsage: $usage\n");
$newValueMapId++;
my $newMappingId = int(qx/mysql -N -s -e 'select nextid from zabbix.ids where field_name = "mappingid"'/) ||
    die("Can't fetch max mappingid\nUsage: $usage\n");
$newMappingId++;

eval {
    my $valueMapCmd = qq/mysql -e "insert into zabbix.valuemaps (valuemapid, name) values ('$newValueMapId', '$valueMapName');"/;
    print "$valueMapCmd\n";
    # system() does not die on failure, so check the exit status explicitly.
    system($valueMapCmd) == 0 or die "mysql exited with status $?\n";
    eval {
        for my $from (keys %mappings) {
            my $to = $mappings{$from};
            my $mappingCmd = qq/mysql -e "insert into zabbix.mappings (mappingid, valuemapid, value, newvalue) values ('$newMappingId', '$newValueMapId', '$from', '$to');"/;
            print "$mappingCmd\n";
            system($mappingCmd) == 0 or die "mysql exited with status $?\n";
            $newMappingId++;
        }
    };
    if ($@) {
        die "something went wrong inserting into mappings $@";
    }
};
if ($@) {
    die "something went wrong inserting into valuemaps $@";
}

# The fix: store the last used IDs back into the ids table.
my $valueMapUpdCmd = qq/mysql -e 'update zabbix.ids set nextid = "$newValueMapId" where field_name = "valuemapid";'/;
print "$valueMapUpdCmd\n";
system $valueMapUpdCmd;
# $newMappingId was incremented once past the last inserted row, so step back.
$newMappingId--;
my $mappingUpdCmd = qq/mysql -e 'update zabbix.ids set nextid = "$newMappingId" where field_name = "mappingid";'/;
print "$mappingUpdCmd\n";
system $mappingUpdCmd;

 

Bug in munin

For a few weeks I had been seeing strange network usage graphs produced by Munin, but I did not attach any importance to them. A few days ago, when I again saw 600 Mbit of bandwidth usage on a host with a 10 Mbit connection, I remembered that I had made some changes to munin.conf earlier. I looked at the config and found that I had changed the ‘graph_period’ directive to ‘minute’ (I don’t remember why). That would explain the number: 10 Mbit per second is 600 Mbit per minute. When I changed it back to ‘second’, the graphs went back to normal.