I recently upgraded my home server, and as an indirect result, I now have three more PCIe slots than before. So, I decided to add a GPU for video transcoding and machine learning acceleration (though I’m still debating whether I really need it).
I bought an NVIDIA T600 but noticed some strange behavior: under load, it heats up to over 80°C. Even then, nvidia-smi reports that the fan is set below maximum speed. I tried adjusting the GPU’s target temperature and encountered two issues:
- The temperature setting immediately resets to default. (It turns out that persistence mode needs to be enabled to prevent this.)
- Even with persistence mode, the adjustment doesn’t work.
The GPU (or its drivers) ignores the target temperature setting, which is a problem. First, I don’t want anything running hot in my server. Second, and more importantly, it causes thermal stress.
I tried searching for something like thinkfan for GPUs but didn’t find anything useful. Most solutions for controlling the fan speed on NVIDIA GPUs seem to rely on nvidia-settings, which requires a functioning X server with NVIDIA drivers. That feels like overkill for me.
Fortunately, I discovered that it’s still possible to control the fans without an X server by using libnvidia-ml. After spending 15 minutes with ChatGPT and half a day making it work (so half a day and 15 minutes of hating Go in total), I finally got it running.
The key difference from thinkfan is that nvmlfan has two modes. The first is the standard curve mode, which is defined like this:
    cards:
      0:
        mode: curve
        curve:
          - [ 60, 30 ]
          - [ 65, 50 ]
          - [ 75, 100 ]
This maps the GPU’s temperature to a corresponding fan speed. The curve specifies anchor points in the format [ temperature, fan_speed ], with values in between interpolated linearly.
- Temperatures below the first anchor point are mapped to the fan speed of the first point or the minimum speed allowed by the GPU BIOS/driver.
- If the last anchor point’s temperature is below the GPU’s maximum threshold, the fan speed is linearly interpolated from the last point’s value to 100%.
Note: Fan speeds are controlled as percentages, not RPM values.
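To make the curve rules above concrete, here is a minimal Python sketch of the interpolation. It is not nvmlfan’s actual code (that is written in Go); the function name fan_speed_for and the max_temp parameter are hypothetical, and the sketch assumes the anchor points are sorted by temperature. It also ignores the minimum speed enforced by the GPU BIOS/driver.

```python
def fan_speed_for(temp, curve, max_temp=None):
    """Map a temperature (deg C) to a fan speed (%) by linear interpolation.

    curve    -- list of [temperature, fan_speed] anchors, sorted by temperature
    max_temp -- hypothetical GPU maximum threshold; if it lies above the last
                anchor, the curve is extended linearly to 100% at max_temp
    """
    points = [tuple(p) for p in curve]
    if max_temp is not None and points[-1][0] < max_temp:
        points.append((max_temp, 100))

    # Below the first anchor: hold the first anchor's fan speed.
    if temp <= points[0][0]:
        return float(points[0][1])
    # At or above the last anchor: hold the last anchor's fan speed.
    if temp >= points[-1][0]:
        return float(points[-1][1])

    # Otherwise interpolate between the two neighbouring anchors.
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if t0 <= temp <= t1:
            return s0 + (s1 - s0) * (temp - t0) / (t1 - t0)


curve = [[60, 30], [65, 50], [75, 100]]
print(fan_speed_for(50, curve))    # 30.0 (below the first anchor)
print(fan_speed_for(62.5, curve))  # 40.0 (halfway between 30% and 50%)
print(fan_speed_for(70, curve))    # 75.0 (halfway between 50% and 100%)
```

With the example config, 70°C lands halfway between the [ 65, 50 ] and [ 75, 100 ] anchors, so the fan is set to 75%.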
The second mode does what the nvidia-smi GPU target temperature setting should do (shame on you, NVIDIA): in this mode, nvmlfan tries to maintain a constant temperature.
    cards:
      0:
        mode: target
        target: 65
        pid: [ 20, 0.1, 0 ]
Of course, it won’t heat up the GPU if the temperature is below the target. However, it will actively counteract the heat generated under load, which helps minimize thermal stress.
Unfortunately, there are two drawbacks:
- I haven’t found a better way to achieve this than using a PID controller.
- Tuning PID parameters is notoriously difficult—it’s practically rocket science.
There are countless articles and entire books on how to tune PID parameters, and even a whole fucking discipline called control theory. Master PID tuning on your GPU and it will help when you try to build a rocket.
There are no universal PID parameters that work for everyone (though, when properly tuned, they should be the same for identical models of GPUs). Fortunately, controlling GPU temperatures with a fan creates a relatively inertial system, so it’s less prone to oscillation.
In the [ 20, 0.1, 0 ] array:
- The first number (P) is the proportional parameter. It controls how much the fan speed changes when the error (the difference between the target temperature and the actual temperature) equals one. For example, with P = 20, a target temperature of 65°C, and an actual temperature of 70°C, the error is 5°C, so the fan speed would be set to 20 × 5 = 100%. However, setting the fan to 100% counters the heat, causing the temperature to decrease. This, in turn, reduces the fan speed, which can lead to the temperature rising again, and so on. If the P parameter is too high, this cycle can cause the system to oscillate.
- The second number (I) is the integral parameter. Since the proportional component is quite rough, the integral component adjusts slowly over time to ensure the fan speed perfectly matches the load, maintaining the target temperature.
- The third number (D) is the derivative parameter. It reacts to the rate at which the temperature changes. For systems with significant inertia, such as this one, the derivative component can often be omitted. If you’re curious about how to use it effectively, you’ll need to dive into some control theory books.
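The three terms described above can be sketched as a small PID loop in Python. This is an illustration under stated assumptions, not nvmlfan’s implementation: the class name FanPID is hypothetical, the error is taken as actual minus target so that positive gains raise the fan speed when the GPU is too hot, the time step is one second, and the clamp on the integral term is a simple anti-windup choice of mine.

```python
class FanPID:
    """Illustrative PID controller that outputs a fan speed in percent."""

    def __init__(self, target, kp, ki, kd, dt=1.0):
        self.target = target      # target temperature, deg C
        self.kp, self.ki, self.kd = kp, ki, kd
        self.dt = dt              # seconds between updates (assumed 1 s)
        self.integral = 0.0       # accumulated error, deg C * s
        self.prev_error = None    # no derivative on the first update

    def update(self, temp):
        # Assumption: error is positive when the GPU is hotter than the target.
        error = temp - self.target

        # Integral term: accumulate the error, but keep its contribution
        # within 0..100% so it cannot wind up while the fan is saturated.
        self.integral += error * self.dt
        if self.ki > 0:
            self.integral = max(0.0, min(100.0 / self.ki, self.integral))

        # Derivative term: rate of change of the error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error

        output = (self.kp * error
                  + self.ki * self.integral
                  + self.kd * derivative)
        # Fan speed is a percentage, so clamp to the valid range.
        return max(0.0, min(100.0, output))


pid = FanPID(target=65, kp=20, ki=0.1, kd=0)
print(pid.update(70))  # 100.0: P term alone is 20 * 5 = 100, then clamped
print(pid.update(65))  # 0.5: P term is 0, only the integral term remains
```

The second call shows what the integral term is for: once the temperature is back on target the proportional term drops to zero, and the accumulated integral is what keeps the fan spinning.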
Another drawback is that target mode can create additional noise by frequently changing the fan speed. Since the target speed is recalculated every second, the variations might be noticeable, especially if the system begins to oscillate.
With that said, here’s the repository for the project: https://github.com/IvanBayan/nvmlfan
PS
I take no responsibility if someone damages their GPU using this. Use it at your own risk.