The Mental Model

If you’ve operated Airflow, cron jobs, or Jenkins pipelines, Argo Workflows will feel familiar. Each step in a pipeline is a container. Dependencies between steps form a DAG. Argo schedules the steps, executes them, retries failures, and archives the results.

The key difference: Argo is Kubernetes-native. Each step is a Kubernetes Pod. That means you get all of Kubernetes — resource limits, GPU scheduling, node affinity, service accounts — for free.
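For instance, a training step can request a GPU and pin itself to GPU nodes with ordinary Pod-level fields on the template. A minimal sketch — the image, node label, and resource amounts are illustrative, not from any real cluster:

```yaml
- name: train-model
  nodeSelector:
    gpu: "true"                 # illustrative node label
  container:
    image: registry.example.com/train:latest   # placeholder image
    command: [python, train.py]
    resources:
      limits:
        nvidia.com/gpu: 1       # standard NVIDIA device-plugin resource
        memory: 16Gi
```

Because each step is just a Pod spec, anything you can express for a Deployment — tolerations, service accounts, affinity rules — carries over unchanged.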

A Simple Training Pipeline

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: train-model
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: preprocess
            template: preprocess-data
          - name: train
            template: train-model
            dependencies: [preprocess]
          - name: evaluate
            template: evaluate-model
            dependencies: [train]
          - name: register
            template: register-model
            dependencies: [evaluate]
            when: "{{tasks.evaluate.outputs.parameters.accuracy}} > 0.90"

The when condition on the register step is the interesting part — it implements a quality gate. Models that don’t hit the accuracy threshold never get registered, and therefore never get deployed.
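For the condition to resolve, the evaluate-model template has to export accuracy as an output parameter — the usual pattern is writing the metric to a file and surfacing it with valueFrom.path. A sketch, assuming a hypothetical evaluate.py that writes the score to /tmp/accuracy (image and paths are placeholders):

```yaml
- name: evaluate-model
  container:
    image: registry.example.com/evaluate:latest  # placeholder image
    command: [python, evaluate.py]   # assumed to write e.g. "0.93" to /tmp/accuracy
  outputs:
    parameters:
      - name: accuracy
        valueFrom:
          path: /tmp/accuracy
```

At runtime Argo substitutes the file's contents into the when expression, so the register task only runs when the comparison holds.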

Parameterized Workflows

For hyperparameter search, Argo’s withItems lets you fan out training runs in parallel:

- name: hyperparameter-search
  dag:
    tasks:
      - name: train
        template: train-model
        arguments:
          parameters:
            - name: learning-rate
              value: "{{item.lr}}"
            - name: batch-size
              value: "{{item.batch}}"
        withItems:
          - { lr: "0.001", batch: "32" }
          - { lr: "0.001", batch: "64" }
          - { lr: "0.0001", batch: "32" }
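
Argo expands the single task into one child node per item and names the children automatically, so the task name stays fixed. The train-model template it references would declare matching input parameters, roughly like this (image and script flags are placeholders):

```yaml
- name: train-model
  inputs:
    parameters:
      - name: learning-rate
      - name: batch-size
  container:
    image: registry.example.com/train:latest  # placeholder image
    command: [python, train.py]
    args: ["--lr", "{{inputs.parameters.learning-rate}}",
           "--batch-size", "{{inputs.parameters.batch-size}}"]
```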

Comparing to Airflow

Argo Workflows wins on being Kubernetes-native: no Python DAG code to maintain, no standalone scheduler process, no separate metadata database. The trade-off is ergonomics — deeply conditional or Python-heavy orchestration logic that is natural in Airflow can get verbose when expressed as Argo YAML.

For an MLOps platform that’s already Kubernetes-first, Argo is the right choice.