Nvmlfan – a thinkfan-like daemon to control NVIDIA GPU fans

I recently upgraded my home server, and as an indirect result, I now have three more PCIe slots than before. So, I decided to add a GPU for video transcoding and machine learning acceleration (though I’m still debating whether I really need it).

I bought an NVIDIA T600 but noticed some strange behavior: under load, it heats up to over 80°C. Even then, nvidia-smi reports that the fan is set below maximum speed. I tried adjusting the GPU’s target temperature and encountered two issues:

  1. The temperature setting immediately resets to default. (It turns out that persistence mode needs to be enabled to prevent this.)
  2. Even with persistence mode, the adjustment doesn’t work.

The GPU (or its drivers) ignores the target temperature setting, which is a problem. First, I don’t want anything running hot in my server. Second, and more importantly, it causes thermal stress.

I tried searching for something like thinkfan for GPUs but didn’t find anything useful. Most solutions for controlling the fan speed on NVIDIA GPUs seem to rely on nvidia-settings, which requires a functioning X server with NVIDIA drivers. That feels like overkill for me.

Fortunately, I discovered that it’s still possible to control the fans without an X server by using libnvidia-ml. After spending 15 minutes with ChatGPT and half a day making it work (half a day and 15 minutes of hating Go in total), I finally got it running.

The key difference from thinkfan is that nvmlfan has two modes. The first is the standard curve mode, which is defined like this:

cards:
  0:
    mode: curve
    curve:
      - [ 60, 30 ]
      - [ 65, 50 ]
      - [ 75, 100 ]

This maps the GPU’s temperature to a corresponding fan speed. The curve specifies anchor points in the format [ temperature, fan_speed ], with values in between interpolated linearly.

  • Temperatures below the first anchor point are mapped to the fan speed of the first point or the minimum speed allowed by the GPU BIOS/driver.
  • If the last anchor point’s temperature is below the GPU’s maximum threshold, the fan speed is interpolated linearly from the last point’s value up to 100% at that threshold.

Note: Fan speeds are controlled as percentages, not RPM values.
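
The mapping described above can be sketched in a few lines of Python. This is only an illustration of the interpolation rule, not nvmlfan's actual code (which is written in Go). The function name curve_speed and the max_temp default of 90°C are assumptions made for the example, and the driver's minimum fan speed is ignored here.

```python
from bisect import bisect_right


def curve_speed(temp, curve, max_temp=90.0):
    """Map a temperature (deg C) to a fan speed (%) using [temp, speed] anchors."""
    points = sorted(curve)
    # Assumption: the GPU's maximum threshold acts as an implicit final 100% anchor.
    if points[-1][0] < max_temp and points[-1][1] < 100:
        points.append([max_temp, 100])
    # Below the first anchor: use the first anchor's speed.
    if temp <= points[0][0]:
        return float(points[0][1])
    # At or above the last anchor: use the last anchor's speed.
    if temp >= points[-1][0]:
        return float(points[-1][1])
    # Otherwise interpolate linearly between the two surrounding anchors.
    i = bisect_right([p[0] for p in points], temp)
    (t0, s0), (t1, s1) = points[i - 1], points[i]
    return s0 + (s1 - s0) * (temp - t0) / (t1 - t0)


curve = [[60, 30], [65, 50], [75, 100]]
print(curve_speed(70, curve))  # halfway between the 65 and 75 anchors
```

With the curve from the config above, 70°C lands halfway between the [65, 50] and [75, 100] anchors, so the result is 75%.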

The second mode does what the nvidia-smi GPU target temperature setting should do (shame on you, NVIDIA): in this mode nvmlfan tries to maintain a constant temperature.

cards:
  0:
    mode: target
    target: 65
    pid: [ 20, 0.1, 0 ]

Of course, it won’t heat up the GPU if the temperature is below the target. However, it will actively counteract the heat generated under load, which helps minimize thermal stress.

Unfortunately, there are two drawbacks:

  1. I haven’t found a better way to achieve this than using a PID controller.
  2. Tuning PID parameters is notoriously difficult—it’s practically rocket science.

There are countless articles and entire books on how to tune PID parameters, and even a whole fucking discipline called control theory. Master PID tuning on your GPU and it will help when you try to build a rocket.

There are no universal PID parameters that work for everyone (though, when properly tuned, they should be the same for identical models of GPUs). Fortunately, controlling GPU temperatures with a fan creates a relatively inertial system, so it’s less prone to oscillation.

In the [ 20, 0.1, 0 ] array:

  1. The first number (P) is the proportional parameter. It controls how much the fan speed changes when the error (the difference between the target temperature and the actual temperature) equals one. For example, with P = 20, a target temperature of 65°C and an actual temperature of 70°C, the error is 5, so the fan speed would be set to 5 × 20 = 100%. However, setting the fan to 100% counters the heat, causing the temperature to decrease. This, in turn, reduces the fan speed, which can lead to the temperature rising again, and so on. If the P parameter is too high, this cycle can cause the system to oscillate.
  2. The second number (I) is the integral parameter. Since the proportional component is quite rough, the integral component adjusts slowly over time to ensure the fan speed perfectly matches the load, maintaining the target temperature.
  3. The third number (D) is the derivative parameter. It reacts to the rate at which the temperature changes. For systems with significant inertia, such as this one, the derivative component can often be omitted. If you’re curious about how to use it effectively, you’ll need to dive into some control theory books.
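
To make the three terms concrete, here is a minimal PID sketch in Python. It is a textbook form under stated assumptions, not nvmlfan's implementation. The class name FanPID is made up for the example. The error is taken as actual minus target, so it is positive when the GPU is too hot. The output is clamped to 0..100% and the step is one second.

```python
class FanPID:
    """Textbook PID controller that turns a temperature into a fan speed (%)."""

    def __init__(self, target, kp, ki, kd, dt=1.0):
        self.target, self.kp, self.ki, self.kd, self.dt = target, kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, temp):
        # Assumption: error is positive when the GPU is hotter than the target.
        error = temp - self.target
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Fan speed is a percentage, so clamp the result to 0..100.
        return max(0.0, min(100.0, output))


pid = FanPID(target=65, kp=20, ki=0.1, kd=0)
print(pid.update(66))  # one degree over target: roughly 20% from P plus a little from I
```

This sketch has no anti-windup: if the GPU idles below the target for a long time, the integral term keeps growing negative and delays the reaction to the next load spike. A real controller would probably want to bound it.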

The second drawback is that target mode can create additional noise by frequently changing the fan speed. Since the target speed is recalculated every second, the variations might be noticeable—especially if the system begins to oscillate.

With that said, here’s the repository for the project: https://github.com/IvanBayan/nvmlfan

PS
I take no responsibility if someone damages their GPU using this. Use it at your own risk.

What no one tells you about argocd applicationset and argocd-image-updater

I had a simple task: automatically deploy the newest image behind the ‘latest’ tag to Kubernetes. Sounds simple, right?
ArgoCD plus argocd-image-updater and the task is solved, so can I go drink coffee?
NO!

Almost every second how-to says that if you want to automatically update an image to the newest build of a specific tag, you just need to set the image update-strategy to digest, and the job is done. When I followed that advice, I observed the following:
1. argocd-image-updater detects the new image and happily reports that it has been updated:

time="2024-11-22T01:22:09Z" level=info msg="Starting image update cycle, considering 2 annotated application(s) for update"
time="2024-11-22T01:22:10Z" level=info msg="Setting new image to registry.gitlab.com/example/code/app/app:latest@sha256:0ddfbecb19e71511a2c0f5ead7f8334de127816001adb3faa002ccbee713bfcc" alias=app application=dev-app image_name=example/code/app/app image_tag=dummy registry=registry.gitlab.com
time="2024-11-22T01:22:10Z" level=info msg="Successfully updated image 'registry.gitlab.com/example/code/app/app@dummy' to 'registry.gitlab.com/example/code/app/app:latest@sha256:0ddfbecb19e71511a2c0f5ead7f8334de127816001adb3faa002ccbee713bfcc', but pending spec update (dry run=false)" alias=app application=dev-app image_name=example/code/app/app image_tag=dummy registry=registry.gitlab.com
time="2024-11-22T01:22:10Z" level=info msg="Committing 1 parameter update(s) for application dev-app" application=dev-app
time="2024-11-22T01:22:10Z" level=info msg="Successfully updated the live application spec" application=dev-app
time="2024-11-22T01:22:10Z" level=info msg="Processing results: applications=2 images_considered=2 images_skipped=0 images_updated=1 errors=0"

2. ArgoCD happily reports that everything is in sync.
3. The image is not updated.

I spent half a day searching for what I was doing wrong, without success.
The first clue I found was a new event every 2 minutes in the ArgoCD app:

And after a while, if you check the application resource in Kubernetes, you will see over a thousand deploys in its history:

  - deployStartedAt: "2024-11-21T22:54:55Z"
    deployedAt: "2024-11-21T22:54:56Z"
    id: 1407
    initiatedBy:
      automated: true
    revision: 9ca777c6397102b7599fce31d05a6fe73f81954c
    source:
      helm:
        valueFiles:
        - dev-values.yaml
      path: .
      repoURL: https://gitlab.com/example/deploys/app.git
      targetRevision: dev
  - deployStartedAt: "2024-11-21T22:56:56Z"
    deployedAt: "2024-11-21T22:56:56Z"
    id: 1408
    initiatedBy:
      automated: true
    revision: 9ca777c6397102b7599fce31d05a6fe73f81954c
    source:
      helm:
        valueFiles:
        - dev-values.yaml
      path: .
      repoURL: https://gitlab.com/example/deploys/app.git
      targetRevision: dev

That made me start asking the right questions: something is being changed, but what? When image-updater updates the image, how does it do that? And how does the ApplicationSet controller work?

I will answer these questions starting from the end:
1. The ApplicationSet controller generates application resources from templates. It automatically overwrites an application resource if it doesn’t match the generated one.
2. When write-back-method is set to argocd, argocd-image-updater changes the application resource.

Now it’s clear what happened: they just fight each other. Image-updater sees a new image and updates the application resource; the ApplicationSet controller sees that the application resource differs from the generated one and overwrites it.

OK, what to do next? ApplicationSet has an “elegant” solution called ignoreApplicationDifferences, which allows it to ignore differences between the actual application and the generated one, but what should be ignored?
That’s the most complicated question. At the time of writing I was unable to find the answer in the documentation. I just don’t see an easy way to find out what exactly image-updater changes in the application and what the ApplicationSet controller reverts. There is no diff between manifests, and the changes happen so quickly that I was unable to see the manifests themselves. I also found nothing in the logs (at least without enabling debug logs for ArgoCD).
Thanks to this issue, I learned about ApplicationSet controller policies: there is a way to forbid the ApplicationSet controller from patching/updating application resources. Once I changed the policy, I was finally able to see the diff:

...
spec:
  source:
    helm:
      parameters:
      - forceString: true
        name: image.name
        value: registry.gitlab.com/example/code/app/app
      - forceString: true
        name: image.tag
        value: latest@sha256:0ddfbecb19e71511a2c0f5ead7f8334de127816001adb3faa002ccbee713bfcc
      valueFiles:
      - dev-values.yaml
...

The answer to the last question is: the image updater adds Helm parameters (at least if you are using Helm).
The “elegant” solution, which I “really like,” is to ignore Helm parameters. Since I planned to decouple Helm variables from ArgoCD applications and don’t want to use parameters at all, there’s not much harm in ignoring them. Nevertheless, each time I think about it, it annoys me how ugly this approach feels:

kind: ApplicationSet
spec:
  ignoreApplicationDifferences:
  - jqPathExpressions:
      - .spec.source.helm.parameters
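
As a mental model of what this setting does, here is a small Python sketch. It is not how the ApplicationSet controller is implemented, and the helper names drop_path and differs are made up for the illustration. The idea is only that the ignored path is removed from both the generated and the live application before they are compared.

```python
import copy

# Path that corresponds to the jq expression .spec.source.helm.parameters
IGNORED = ["spec", "source", "helm", "parameters"]


def drop_path(obj, path):
    """Return a deep copy of obj with the nested key at `path` removed, if present."""
    result = copy.deepcopy(obj)
    node = result
    for key in path[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # path does not exist, nothing to remove
    node.pop(path[-1], None)
    return result


def differs(generated, live, ignored_path=IGNORED):
    """Compare two application specs while ignoring one nested field."""
    return drop_path(generated, ignored_path) != drop_path(live, ignored_path)


generated = {"spec": {"source": {"helm": {"valueFiles": ["dev-values.yaml"]}}}}
live = copy.deepcopy(generated)
live["spec"]["source"]["helm"]["parameters"] = [
    {"name": "image.tag", "value": "latest@sha256:...", "forceString": True}
]
print(differs(generated, live))  # parameters are ignored, so no difference is seen
```

With the parameters ignored, the application patched by image-updater no longer looks different from the generated one, so there is nothing for the controller to revert.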

PS
To be fair, this issue does not affect those who use write-back-method: git. However, since I only need the newest image for the latest tag and don’t care which specific latest image it is, I don’t need to save its hash in git. Moreover, I don’t want a commit each time someone builds a new image.

Debian dual boot with full encryption on LVM and enabled secure boot

Just a short note on how to install Debian in dual boot with full encryption (without a separate unencrypted boot partition).

I needed to preserve the installed Windows and didn’t want to touch BIOS settings, so the first thing needed is unallocated disk space.
During disk partitioning, the free space should be dedicated to an encrypted partition.
After the key is provided and the partition is initialized, the volume group and the corresponding logical volumes should be created on the encrypted partition (it will be listed as something like /dev/nvme0n1p3_crypt).
At the last stage of installation GRUB will fail. To get GRUB installed, switch to the second terminal and add GRUB_ENABLE_CRYPTODISK=y to /target/etc/default/grub:

echo GRUB_ENABLE_CRYPTODISK=y >> /target/etc/default/grub

Then repeat the GRUB installation from the menu.

I did the next steps in recovery mode, because GRUB kept insulting me with the messages “error: Invalid passphrase. error: no such cryptodisk found.”

The most important step (and probably the only required one) is to convert the LUKS key from argon2i (which turned out not to be supported by GRUB) to pbkdf2 with the command:

cryptsetup luksConvertKey --pbkdf pbkdf2 /dev/nvme0n1p3

/dev/nvme0n1p3 should be replaced with the path to the actual encrypted partition.

The last two steps are listed in decreasing order of importance. They are probably not needed, but I did them before the key conversion, so I’m not 100% sure.
Add cryptdevice=/dev/nvme0n1p3:lvm to the end of GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.

Re-install and update grub:

mount -t efivarfs none /sys/firmware/efi/efivars/ && \
grub-install --target=x86_64-efi --uefi-secure-boot --force-extra-removable /dev/nvme0n1 && \
update-grub

I’m most skeptical about the last step, since it’s usually needed when the PC can’t load GRUB because of enabled secure boot and an incorrect installation. But I saw the message that it doesn’t accept the LUKS password, so GRUB was definitely loaded.

The story of how I spent the evening enabling TMC2208 spreadCycle on Creality 1.1.5 board

I have an Ender 5 which came with a Creality 1.1.5 board and one little surprise: Marlin’s linear advance doesn’t work on it (Klipper doesn’t seem happy either).
The reason is the TMC2208 drivers, which default to stealthChop mode, and that mode doesn’t cope well with rapid speed and direction changes.

The TMC2208 is highly configurable compared to old drivers like the A4988, but it is configured over a half-duplex serial interface. It also has a default configuration stored in OTP (one-time programmable memory), which can likewise be changed via the serial interface. So there are two options: connect the TMC2208 to the onboard microcontroller and let Marlin/Klipper configure it, or change the OTP.
It’s not so easy to find a spare pin on this board (at least I thought so), so I decided to change the OTP register.

Serial interface is exposed on PIN14 (PDN_UART) of TMC2208 chip:
TMC2208 package

On popular stepstick-type drivers, which look like this:

this pin is exposed and easily accessible, but that’s not the case here: on the Creality 1.1.5 board the drivers are integrated.
I didn’t find a schematic for revision 1.1.5, but I found a PCB view of an older revision. I visually compared the traces, vias, components and designators around the driver and found them very similar, if not identical.

Here is the PCB view of the extruder’s driver:

Extruder's driver

And here is a photo of the board I have:

Creality 1.1.5 board

The needed pin 14 is connected to pin 12 and to a 10K pull-up resistor.

To change the OTP register I needed a half-duplex serial connection, and three obvious options came to mind:

  • Use a USB-to-serial adapter and join the TX and RX lines
  • Use a separate controller and bit-bang the protocol
  • Use the onboard controller and upload an Arduino sketch to do the same (or even use the TMC2208Stepper library to just write the OTP register)

I had no spare Arduino around and wasn’t sure I would be able to get at Marlin’s calibration stored in EEPROM, so I decided to go with the first option (it didn’t work well, for a few different reasons which I describe at the end).

First, you need ScriptCommunicator to send commands to the TMC2208; get it from here: https://sourceforge.net/projects/scriptcommunicator/
Next, you need the TMC2208.scez bundle from here: https://github.com/watterott/SilentStepStick/tree/master/ScriptCommunicator
Download both somewhere; they will be used later.

The top Google result for making a half-duplex interface from a USB-to-serial adapter looks like this:

And here is my initial implementation:

Half-duplex implementation
The resistor is simply pushed into the headers connected to RX and TX; only the wire connected to RX is used to communicate with the TMC2208.
My first idea was to solder a wire to R24 (I only need to enable spreadCycle for the extruder’s driver) and use the USB-to-serial adapter like this:

1st attempt to solder wire directly to R24

The whole construction (5V and GND were connected to ISP header’s pins 2 and 6 respectively):
FTDI to Creality board connection

When everything is ready, it’s time to open TMC2208.scez. I used the Linux version, so for me the command was something like:

/PATH/TO/ScriptCommunicator.sh /PATH/TO/TMC2208.scez

But unfortunately it didn’t work. Each time I hit the connect button, I got the message “Sending failed. Check hardware connection and serial port.” First I tried lowering the connection speed (the TMC2208 detects the baud rate automatically; 115200 was configured in TMC2208.scez), but without success. Next I checked all the connections between the FTDI adapter, the resistor and the TMC chip: no success. Unplugging VCC from the FTDI adapter and powering the board from an external PSU: still no connection.

I started to think about what could have gone wrong. The fact that the old board revision for A4988 drivers looks pretty similar made me think that Creality just put the new chip in place of the old one, and here is the obvious suspect: the INDEX pin (12), which is connected to PDN. According to the datasheet, INDEX is a digital output, so if it is push-pull, it will definitely interfere with serial communication. The only way to fix it is to cut the trace between them and solder the wire directly to PDN. Luckily it’s just a two-layer board, so the trace can be easily located on the back side:

Cut like that:

Back cut PDN to INDEX trace

And solder the wire. The wire should be thin and soft, otherwise there is a risk of peeling the trace off completely. It’s also worth checking that there is no continuity between the wire and R24 after soldering:

Back of the board, wire soldered to PDN

I thought I would finally be able to configure the TMC, but to my surprise the only change I observed was a checksum error message that I got from time to time instead of “Sending failed”.
It was around 1:30 after midnight and I had almost given up when I recalled at the very last moment that I have a CH341-based programmer. I gave it a try and it finally worked:

Configurator finally connected
The only additional change I made was powering the board from an external supply, because it was easier than searching for 5V on the programmer:
CH341 connected to board

Next, change the OTP (a step-by-step video may be found there).

OTP bits can be written only once; the action is irreversible, so extra care is needed here.

On the “OTP Programmer” tab, byte #2 bit #7 should be written to enable spreadCycle mode. After that the driver goes into a disabled state until the “duration of slow decay phase” is set to a value other than 0. It’s still unclear to me which value should be written; the SilentStepStick configurator suggests 3, the same value used as the default for stealthChop mode. Having no better ideas, I wrote the same. The first 4 bits of byte #1 control the duration, so to write the value 3, bit #0 and bit #1 should be written.
The complete sequence is below:

Byte 2 bit 7 Byte 1 bit 1 Byte 1 bit 0
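
The bit arithmetic behind this sequence can be sketched in Python. This is only a helper for working out which bits to tick. It follows the description above (the duration sits in the low 4 bits of byte #1, spreadCycle is byte #2 bit #7) and does not talk to the driver. The function name otp_bits_to_write is made up for the example.

```python
def otp_bits_to_write(byte_index, value, width=4, shift=0):
    """List the (byte, bit) pairs that must be set so that a field holds `value`."""
    return [(byte_index, shift + i) for i in range(width) if (value >> i) & 1]


# Duration of slow decay phase = 3 = 0b0011, stored in the low 4 bits of byte #1.
duration_bits = otp_bits_to_write(1, 3)
# spreadCycle enable is a single bit: byte #2, bit #7.
sequence = [(2, 7)] + sorted(duration_bits, reverse=True)

for byte, bit in sequence:
    print(f"write byte #{byte} bit #{bit} (mask 0x{1 << bit:02X})")
```

Since each OTP bit can be set only once, it is worth working the list out on paper like this before clicking anything.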

To make sure that OTP is configured correctly, click the “Read all Registers” button on the “Register Settings” tab (I’m not sure why my screenshot shows OTP_PWM_GRAD equal to 2; I probably took it after writing only byte #1 bit #1):

Read OTP bits

Or disconnect and reconnect to the driver; the “Tuning” tab should show spreadCycle enabled and TOFF set to 3:

Mission complete

PS

Looking back, I see that there is not much sense in changing OTP this way, or in doing it at all.
First, making half-duplex serial just by connecting TX and RX with a 1k resistor seems wrong. Atmel’s app note AVR947 suggests that it should look like this:

Correct half-duplex joining

That makes more sense and explains the strange voltage of around 2.8V I saw on the PDN pin while troubleshooting the FTDI adapter. Possible explanations for why the FTDI adapter didn’t work for me: the CH341 has different threshold/voltage levels, or it has a pull-up, or my FTDI adapter was partially damaged after a series of unfortunate incidents.

Next, if for some reason OTP has to be changed, it’s easier to use the MISO, MOSI or SCK pin from the ISP pin header and write an Arduino sketch.

And finally, I found that the board has a partially populated 3-pin footprint whose unused pin is connected to pin #35 (PA2) of the ATmega installed on the board. Without a BLTouch, it’s the easiest way to get a permanent connection between the controller and the driver, which allows dynamic configuration. Moreover, with Klipper it’s possible (though I don’t know why you would) to have a permanent connection to each driver and even a BLTouch by using SCK, MOSI, MISO (bye, SD card), BEEPER and PA2:

Unused GPIO

So far I have no BLTouch, so even with OTP configured I’m going to solder a wire from PA2 to PDN just to have the option of adjusting the driver configuration on the fly.

Thanks for reading.

How to fix “Encryption credentials have expired” on xerox b215

Looks like I have a new hobby donated by xerox (if you can avoid greedy lying xerox, do it): fixing my printer.
This time it just suddenly stopped working with the message “Encryption credentials have expired”. I had previously seen a ‘Create new certificate’ option on the printer’s web page, and my assumption was that the certificate installed on the printer had expired. I have faced this issue many times on embedded hardware like BMCs. I clicked the ‘Create new certificate’ button, but it didn’t help.
Let’s say thank you to xerox engineers and launch wireshark to figure out what happened. When I tried to resume the print queue, I saw communication on port 631 (IPP), which I was able to decode as TLS in wireshark. openssl s_client showed an expired certificate. There is no option to upload your own key and certificate, but there is an option to download a certificate signing request under Properties->Security->Machine Digital Certificate. So, I just created a CA certificate:

$ openssl req -x509 -sha256 -days 3650 -newkey rsa:2048 -keyout rootCA.key -out rootCA.crt

Signed the printer’s request using the following config:

$ cat > ./printer.conf << EOF
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
subjectAltName = @alt_names
[alt_names]
DNS.1 = printer
DNS.2 = printer.local
IP.1 = 192.168.1.1
EOF
$ openssl x509 -req -CA rootCA.crt -CAkey rootCA.key -in PRINTER_request_sslCertificate.pem -out printer.crt -days 3649 -CAcreateserial -extfile printer.conf

And uploaded the signed certificate to the printer.
Bonus points for the SAN.
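
To avoid being surprised by the same message again in ten years, the expiry date can be checked with a few lines of Python. This is a sketch under assumptions: the function name is_expired is made up, and it expects the single line printed by `openssl x509 -noout -enddate -in printer.crt` rather than fetching the certificate from the printer itself.

```python
import ssl
import time


def is_expired(enddate_line, now=None):
    """Check a line like 'notAfter=Jan  1 00:00:00 2030 GMT' against the clock."""
    # Strip the 'notAfter=' prefix that openssl prints before the date.
    not_after = enddate_line.strip().split("=", 1)[-1]
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return current >= expires_at


print(is_expired("notAfter=Jan  1 00:00:00 2020 GMT"))  # a date in the past
```

The `now` parameter is there only to make the function easy to test; normally the current time is used.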

Make xerox b215 work with samba 4 again

Recently I bought a xerox b215 (if you can, buy something other than xerox or hp) and wanted to make it scan to an SMB share. I already had samba configured in a container using the servercontainers/samba image.
So, it’s just a matter of adding another share and configuring a user for the scanner, right? Wrong!
It just didn’t work. Thanks to xerox’s engineers, who decided not to burden the end user with diagnostic messages. It started scanning and after a second returned to the scan screen. Samba with log level 10 didn’t help either; I just saw that the client tried to connect, and that’s all.
The tool that helped me was wireshark: I found that after the NTLMSSP_AUTH request from the scanner, samba sends STATUS_LOGON_FAILURE.

A little bit of “letsgoogleit” and voila: ntlm auth = ntlmv1-permitted saved me from configuring FTP for that lovely xerox.

Fix EFS dynamic provision on EKS

Probably it’s an obvious thing for people with more experience, but I spent an evening trying to figure out what’s wrong.

I have an EKS cluster configured with the terraform-aws-eks Terraform module, and IRSA configured like this:

module "efs_csi_irsa_role" {
  source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
 
  role_name             = "efs-csi"
  attach_efs_csi_policy = true
 
  oidc_providers = {
    ex = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:efs-csi-controller-sa"]
    }
  }
}

At some point it started to work with static provisioning, but when I tried dynamic provisioning it failed with the following errors in the efs-csi-controller pod:

I1204 23:55:08.556870       1 controller.go:61] CreateVolume: called with args {Name:pvc-f725e33d-b1e5-44ff-a400-1f9ff8388296 CapacityRange:required_bytes:5368709120  VolumeCapabilities:[mount:<> access_mode: ] Parameters:map[basePath:/dynamic_provisioning csi.storage.k8s.io/pv/name:pvc-f725e33d-b1e5-44ff-a400-1f9ff8388296 csi.storage.k8s.io/pvc/name:efs-claim2 csi.storage.k8s.io/pvc/namespace:kva-prod directoryPerms:700 fileSystemId:fs-031e4372b15a36d5a gidRangeEnd:2000 gidRangeStart:1000 provisioningMode:efs-ap] Secrets:map[] VolumeContentSource: AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1204 23:55:08.556934       1 cloud.go:238] Calling DescribeFileSystems with input: {
  FileSystemId: "fs-031e4372b15a36d5a"
}
E1204 23:55:08.597320       1 driver.go:103] GRPC error: rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied

And here is what I missed: the official documentation uses eksctl for IRSA, which also creates the service account with the role annotation:

eksctl create iamserviceaccount \
    --cluster my-cluster \
    --namespace kube-system \
    --name efs-csi-controller-sa \
    --attach-policy-arn arn:aws:iam::111122223333:policy/AmazonEKS_EFS_CSI_Driver_Policy \
    --approve \
    --region region-code

Because the service account already exists at that point, SA creation is disabled in the helm command:

helm upgrade -i aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
    --namespace kube-system \
    --set image.repository=602401143452.dkr.ecr.region-code.amazonaws.com/eks/aws-efs-csi-driver \
    --set controller.serviceAccount.create=false \
    --set controller.serviceAccount.name=efs-csi-controller-sa

So I missed the service account annotation. The thing that helped me figure out what was wrong (no, it wasn’t careful reading of the documentation) was CloudTrail:

    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "EKYQJEOBHPAS7L:i-deadbeede490d57b1",
        "arn": "arn:aws:sts::111122223333:assumed-role/default_node_group-eks-node-group-20220727213424437600000003/i-deadbeede490d57b1",
        "accountId": "111122223333",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "EKYQJEOBHPAS7L",
                "arn": "arn:aws:iam::111122223333:role/default_node_group-eks-node-group-20220727213424437600000003",
                "accountId": "111122223333",
                "userName": "default_node_group-eks-node-group-20220727213424437600000003"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2022-12-04T23:20:40Z",
                "mfaAuthenticated": "false"
            },
            "ec2RoleDelivery": "2.0"
        }
    },
    "errorMessage": "User: arn:aws:sts::111122223333:assumed-role/default_node_group-eks-node-group-20220727213424437600000003/i-deadbeede490d57b1 is not authorized to perform: elasticfilesystem:DescribeFileSystems on the specified resource",

Assuming the role of the node is definitely not what I expected.

If I had been more thoughtful, I might have asked myself what the comment “## Enable if EKS IAM for SA is used” was doing in aws-efs-csi-driver’s values.yaml, but I hadn’t.
Evening spent, lesson learned.

PS

It’s also worth remembering that updating the service account doesn’t make the AWS_WEB_IDENTITY_TOKEN_FILE env variable magically appear in an already running container; the pod has to be recreated.
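
A quick way to see whether a container actually got its IRSA credentials is to look for the two environment variables that EKS injects into pods whose service account carries the role annotation. Below is a minimal Python sketch. The helper name missing_irsa_env is made up, and it only checks that the variables are set, not that the role can be assumed.

```python
import os

# Variables injected into pods that use an annotated service account (IRSA).
IRSA_ENV_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def missing_irsa_env(environ=None):
    """Return the IRSA-related variables that are absent or empty."""
    env = os.environ if environ is None else environ
    return [name for name in IRSA_ENV_VARS if not env.get(name)]


missing = missing_irsa_env()
if missing:
    print("IRSA is not wired into this container, missing:", ", ".join(missing))
else:
    print("IRSA env variables are present")
```

If either variable is missing, the AWS SDK falls back to the node's instance role, which is exactly what the CloudTrail record above shows.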

PPS

Looks like static provisioning will work even with broken IRSA for EFS, since NFS, which is under the hood of EFS, is not bothered by the existence of IAM in any sense.