What no one tells you about Argo CD ApplicationSet and argocd-image-updater

I had a simple task: automatically deploy the newest available ‘latest’ Docker image in Kubernetes. Sounds simple, right?
Argo CD + argocd-image-updater and the task is solved, can I go drink coffee?
NO!

Almost every second howto says that if you want to automatically update an image to the newest build of a specific tag, you just need to set the image update-strategy to digest, and the job is done.
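In practice that advice boils down to a few annotations on the Argo CD Application. A minimal sketch, reusing the image path and alias from the logs below (the annotation names come from the argocd-image-updater docs; write-back-method: argocd is also the default):

metadata:
  annotations:
    argocd-image-updater.argoproj.io/image-list: app=registry.gitlab.com/example/code/app/app:latest
    argocd-image-updater.argoproj.io/app.update-strategy: digest
    argocd-image-updater.argoproj.io/write-back-method: argocd

When I followed that advice, here is what I observed: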
1. argocd-image-updater detects the new image and happily reports that it has been updated:

time="2024-11-22T01:22:09Z" level=info msg="Starting image update cycle, considering 2 annotated application(s) for update"
time="2024-11-22T01:22:10Z" level=info msg="Setting new image to registry.gitlab.com/example/code/app/app:latest@sha256:0ddfbecb19e71511a2c0f5ead7f8334de127816001adb3faa002ccbee713bfcc" alias=app application=dev-app image_name=example/code/app/app image_tag=dummy registry=registry.gitlab.com
time="2024-11-22T01:22:10Z" level=info msg="Successfully updated image 'registry.gitlab.com/example/code/app/app@dummy' to 'registry.gitlab.com/example/code/app/app:latest@sha256:0ddfbecb19e71511a2c0f5ead7f8334de127816001adb3faa002ccbee713bfcc', but pending spec update (dry run=false)" alias=app application=dev-app image_name=example/code/app/app image_tag=dummy registry=registry.gitlab.com
time="2024-11-22T01:22:10Z" level=info msg="Committing 1 parameter update(s) for application dev-app" application=dev-app
time="2024-11-22T01:22:10Z" level=info msg="Successfully updated the live application spec" application=dev-app
time="2024-11-22T01:22:10Z" level=info msg="Processing results: applications=2 images_considered=2 images_skipped=0 images_updated=1 errors=0"

2. Argo CD happily reports that everything is in sync.
3. The image is not actually updated.

I spent half a day searching for what I was doing wrong, without success.
The first clue I found was a new sync event in the Argo CD application every 2 minutes.

And after a while, if you check the Application resource in Kubernetes, you will see thousands of deploys in its status.history:

  - deployStartedAt: "2024-11-21T22:54:55Z"
    deployedAt: "2024-11-21T22:54:56Z"
    id: 1407
    initiatedBy:
      automated: true
    revision: 9ca777c6397102b7599fce31d05a6fe73f81954c
    source:
      helm:
        valueFiles:
        - dev-values.yaml
      path: .
      repoURL: https://gitlab.com/example/deploys/app.git
      targetRevision: dev
  - deployStartedAt: "2024-11-21T22:56:56Z"
    deployedAt: "2024-11-21T22:56:56Z"
    id: 1408
    initiatedBy:
      automated: true
    revision: 9ca777c6397102b7599fce31d05a6fe73f81954c
    source:
      helm:
        valueFiles:
        - dev-values.yaml
      path: .
      repoURL: https://gitlab.com/example/deploys/app.git
      targetRevision: dev

That made me start asking the right questions: something is being changed, but what? When image-updater updates an image, how does it do that? And how does the ApplicationSet controller work?

I will answer these questions starting from the end:
1. The ApplicationSet controller generates Application resources from templates. It automatically overwrites an Application resource if it doesn’t match the generated one (see the sketch after this list).
2. When write-back-method is set to argocd, argocd-image-updater modifies the Application resource directly.
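To make point 1 concrete, here is a minimal ApplicationSet sketch reconstructed from the deploy history above (the list generator, project and destination are illustrative; the source matches this post’s example):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: app
spec:
  generators:
  - list:
      elements:
      - env: dev
  template:
    metadata:
      name: '{{env}}-app'
    spec:
      project: default
      source:
        repoURL: https://gitlab.com/example/deploys/app.git
        targetRevision: dev
        path: .
        helm:
          valueFiles:
          - '{{env}}-values.yaml'
      destination:
        server: https://kubernetes.default.svc
        namespace: app

On every reconciliation the controller renders this template and stomps on any live Application that has drifted from it.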

Now it’s clear what happened: they just fight each other. image-updater sees a new image and updates the Application resource; the ApplicationSet controller sees that the Application resource differs from the generated one and overwrites it.

Ok, what to do next? ApplicationSet has an “elegant” solution called ignoreApplicationDifferences, which allows you to ignore differences between the actual Application and the generated one. But what exactly should be ignored?
That is the most complicated question. At the moment of writing I was unable to find the answer in the documentation, and I saw no easy way to find out what exactly image-updater changes in the Application and what the ApplicationSet controller reverts: there is no diff between the manifests, and the changes happen so quickly that I was unable to see the manifests themselves. I also found nothing in the logs (at least without enabling debug logging for Argo CD).
Thanks to this issue I learned about ApplicationSet controller policies, which provide a way to forbid the ApplicationSet controller from patching/updating Application resources.
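The sketch below uses the per-ApplicationSet syncPolicy field; depending on your Argo CD version, the same can be set cluster-wide with the controller’s --policy flag, which may take precedence:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  syncPolicy:
    # let the controller create Applications, but never update or delete them
    applicationsSync: create-only

Once the controller stopped reverting the Application, I was finally able to see the diff: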

...
spec:
  source:
    helm:
      parameters:
      - forceString: true
        name: image.name
        value: registry.gitlab.com/example/code/app/app
      - forceString: true
        name: image.tag
        value: latest@sha256:0ddfbecb19e71511a2c0f5ead7f8334de127816001adb3faa002ccbee713bfcc
      valueFiles:
      - dev-values.yaml
...

The answer to the remaining question is: the image updater adds Helm parameters to the Application spec (at least if you are using Helm).
The “elegant” solution, which I “really like,” is to ignore Helm parameters. Since I planned to decouple Helm variables from Argo CD applications and don’t want to use parameters at all, there’s not much harm in ignoring them. Nevertheless, each time I think about it, it annoys me how ugly this approach feels:

kind: ApplicationSet
spec:
  ignoreApplicationDifferences:
  - jqPathExpressions:
      - .spec.source.helm.parameters
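With this in place the ApplicationSet controller keeps managing everything else in the Application spec, while image-updater effectively owns .spec.source.helm.parameters.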

PS
To be fair, this issue does not affect those who use write-back-method: git. However, since I only need the newest image for the latest tag and don’t care which specific latest image it is, I don’t need to save its hash in git. Moreover, I don’t want a commit each time someone builds a new image.
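For completeness, switching to git write-back is again just annotations on the Application; a sketch (the git-branch annotation is optional):

metadata:
  annotations:
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: dev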

Fix EFS dynamic provisioning on EKS

Probably it’s an obvious thing for people with more experience, but I spent an evening trying to figure out what was wrong.

I have an EKS cluster configured with the Terraform module terraform-aws-eks, and IRSA configured like this:

module "efs_csi_irsa_role" {
  source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
 
  role_name             = "efs-csi"
  attach_efs_csi_policy = true
 
  oidc_providers = {
    ex = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:efs-csi-controller-sa"]
    }
  }
}

At some point it started working with static provisioning, but when I tried dynamic provisioning it failed with the following errors in the efs-csi-controller pod:

I1204 23:55:08.556870       1 controller.go:61] CreateVolume: called with args {Name:pvc-f725e33d-b1e5-44ff-a400-1f9ff8388296 CapacityRange:required_bytes:5368709120  VolumeCapabilities:[mount:<> access_mode: ] Parameters:map[basePath:/dynamic_provisioning csi.storage.k8s.io/pv/name:pvc-f725e33d-b1e5-44ff-a400-1f9ff8388296 csi.storage.k8s.io/pvc/name:efs-claim2 csi.storage.k8s.io/pvc/namespace:kva-prod directoryPerms:700 fileSystemId:fs-031e4372b15a36d5a gidRangeEnd:2000 gidRangeStart:1000 provisioningMode:efs-ap] Secrets:map[] VolumeContentSource: AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1204 23:55:08.556934       1 cloud.go:238] Calling DescribeFileSystems with input: {
  FileSystemId: "fs-031e4372b15a36d5a"
}
E1204 23:55:08.597320       1 driver.go:103] GRPC error: rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied

And here is what I missed: the official documentation uses eksctl for IRSA:

eksctl create iamserviceaccount \
    --cluster my-cluster \
    --namespace kube-system \
    --name efs-csi-controller-sa \
    --attach-policy-arn arn:aws:iam::111122223333:policy/AmazonEKS_EFS_CSI_Driver_Policy \
    --approve \
    --region region-code

and service account creation is disabled in the Helm install (eksctl has already created and annotated it):

helm upgrade -i aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
    --namespace kube-system \
    --set image.repository=602401143452.dkr.ecr.region-code.amazonaws.com/eks/aws-efs-csi-driver \
    --set controller.serviceAccount.create=false \
    --set controller.serviceAccount.name=efs-csi-controller-sa

So I had missed the service account annotation: eksctl annotates the service account with the IAM role ARN, while in my setup the role existed but nothing annotated the SA with it. The thing that helped me figure out what was wrong (no, it wasn’t careful reading of the documentation) was CloudTrail:

    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "EKYQJEOBHPAS7L:i-deadbeede490d57b1",
        "arn": "arn:aws:sts::111122223333:assumed-role/default_node_group-eks-node-group-20220727213424437600000003/i-deadbeede490d57b1",
        "accountId": "111122223333",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "EKYQJEOBHPAS7L",
                "arn": "arn:aws:iam::111122223333:role/default_node_group-eks-node-group-20220727213424437600000003",
                "accountId": "111122223333",
                "userName": "default_node_group-eks-node-group-20220727213424437600000003"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2022-12-04T23:20:40Z",
                "mfaAuthenticated": "false"
            },
            "ec2RoleDelivery": "2.0"
        }
    },
    "errorMessage": "User: arn:aws:sts::111122223333:assumed-role/default_node_group-eks-node-group-20220727213424437600000003/i-deadbeede490d57b1 is not authorized to perform: elasticfilesystem:DescribeFileSystems on the specified resource",

Assuming the node’s role is definitely not what I expected.

If I had been more thoughtful, I might have asked myself what the comment “## Enable if EKS IAM for SA is used” was doing in aws-efs-csi-driver’s values.yaml, but I hadn’t.
Evening spent, lesson learned.
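For the record, the fix boiled down to putting the IRSA role ARN annotation on the controller’s service account, which is exactly what that values.yaml comment is about. In Helm values form it looks roughly like this (the role name matches the efs-csi role from the Terraform snippet above; adjust the ARN to your account):

controller:
  serviceAccount:
    create: true
    name: efs-csi-controller-sa
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/efs-csi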

PS

And the fact that updating the service account doesn’t lead to the magical appearance of the AWS_WEB_IDENTITY_TOKEN_FILE env variable in already-running containers is worth remembering: the token is injected at pod creation, so the pods have to be restarted.

PPS

Looks like static provisioning will work even with broken IRSA for EFS, since NFS, which is what’s under the hood of EFS, is not bothered by IAM’s existence in any way.