Two Kinds of Storage

ML systems need two fundamentally different storage patterns, and conflating them causes operational pain.

Operational storage — fast, ephemeral, close to compute. Think: training data loaded into a pipeline, intermediate tensors, temp files during feature engineering. You want low latency, high throughput, and you don’t need durability guarantees. If this storage dies, you just re-run.

Artifact storage — durable, versioned, auditable. Model weights, dataset snapshots, experiment metadata. You need this to survive cluster restarts, to reproduce experiments six months later, and to audit what model was serving when an incident occurred.

MinIO as the Artifact Layer

MinIO gives you S3-compatible object storage that you can self-host. Because it speaks the S3 API, it slots directly into tools that expect S3: MLflow can use an s3:// path as its artifact store, and DVC can treat a MinIO bucket as a dataset remote.
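As a sketch of the MLflow side, the tracking server only needs S3-style credentials plus an endpoint override so its S3 client talks to MinIO instead of AWS. The service DNS name, Secret name, and keys below are illustrative, not taken from the actual deployment:

```yaml
# Env for an MLflow tracking-server Deployment (names illustrative).
# MLFLOW_S3_ENDPOINT_URL redirects MLflow's S3 client to MinIO.
env:
  - name: MLFLOW_S3_ENDPOINT_URL
    value: http://minio.minio.svc.cluster.local:9000
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: minio-creds        # assumed Secret holding MinIO credentials
        key: root-user
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: minio-creds
        key: root-password
args:
  - server
  - --default-artifact-root=s3://mlflow-artifacts/
```

With this in place, clients logging to the tracking server never need MinIO credentials themselves; the server writes artifacts to the bucket on their behalf.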

# MinIO HelmRelease values (abbreviated)
values:
  rootUser: minio-admin
  rootPassword: ${MINIO_PASSWORD}  # injected from Vault
  buckets:
    - name: mlflow-artifacts   # MLflow experiment artifacts
      policy: none             # none = private, no anonymous access
    - name: datasets           # DVC-versioned dataset snapshots
      policy: none
    - name: model-registry     # promoted model weights
      policy: none
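
On the DVC side, a minimal remote config pointing at the `datasets` bucket might look like the fragment below; the in-cluster endpoint URL is an assumption, matching the hypothetical MinIO service name rather than the real deployment:

```ini
# .dvc/config — S3-compatible remote backed by MinIO (endpoint assumed)
['remote "minio"']
    url = s3://datasets
    endpointurl = http://minio.minio.svc.cluster.local:9000
[core]
    remote = minio
```

After that, `dvc push` and `dvc pull` move dataset versions through the MinIO bucket just as they would through real S3.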

The Secrets Problem

The first footgun: MinIO credentials. If you put them in a ConfigMap or hardcode them in a HelmRelease values file, you’ve just checked credentials into Git.

The right answer is Vault (or, at minimum, Kubernetes Secrets created from an external store by the External Secrets Operator, so the real values never land in Git). In Phase 3, I set up the credential injection pattern before standing up MinIO, so there was never a moment when credentials existed in plaintext in the repository.
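An ExternalSecret along these lines would pull the MinIO root credentials out of Vault and materialize them as an ordinary Kubernetes Secret. The store name and Vault path are assumptions for illustration:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: minio-creds
  namespace: minio
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # assumed ClusterSecretStore for Vault
    kind: ClusterSecretStore
  target:
    name: minio-creds            # the Kubernetes Secret that gets created
  data:
    - secretKey: root-user
      remoteRef:
        key: secret/data/minio   # assumed Vault path
        property: root-user
    - secretKey: root-password
      remoteRef:
        key: secret/data/minio
        property: root-password
```

The HelmRelease then references `minio-creds` rather than carrying the password itself, so Git only ever sees the pointer, not the value.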

Sizing for a Homelab

On a 16 GB node with limited disk, you have to be deliberate. I use a tiered approach: