The SRE Mental Model

When an SRE thinks about software releases, they think in terms of four questions: what changed, who approved it, what version is running now, and how do I roll back?

ML experiments need exactly the same discipline, yet most practitioners don't apply it.

MLflow's model registry maps onto this model almost one-to-one:

Software Release           MLflow Equivalent
----------------           -----------------
Git commit                 Run ID + logged parameters
Artifact (Docker image)    Model artifact (weights + metadata)
Release candidate          Model in "Staging"
Production deploy          Model in "Production"
Rollback                   Transition previous version back to "Production"
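In code, a promote and a rollback are the same registry call pointed at different versions. A minimal sketch, assuming the MlflowClient stage-transition API; the model name "churn-model" and the helper functions are hypothetical:

```python
def transition_args(name: str, version: str, stage: str = "Production") -> dict:
    # Keyword arguments for MlflowClient.transition_model_version_stage.
    # archive_existing_versions moves the current Production version aside
    # rather than deleting it, which is what makes rollback a pure stage
    # transition instead of a redeploy.
    return {
        "name": name,
        "version": version,
        "stage": stage,
        "archive_existing_versions": True,
    }

def promote(version: str) -> None:
    # Lazy import: needs mlflow installed and a tracking server configured.
    from mlflow.tracking import MlflowClient
    # "churn-model" is a placeholder registered-model name.
    MlflowClient().transition_model_version_stage(**transition_args("churn-model", version))

# Rollback is the same call aimed at the previous version:
#   promote("12")   # deploy v12
#   promote("11")   # something broke -> transition v11 back to Production
```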

Setting Up MLflow with PostgreSQL Backend

The default MLflow setup uses a SQLite file, which breaks the moment you want to run multi-node training or access experiment history from multiple pods.

# MLflow tracking URI pointing at PostgreSQL
MLFLOW_TRACKING_URI=postgresql://mlflow:${MLFLOW_DB_PASSWORD}@postgres:5432/mlflow
MLFLOW_ARTIFACT_ROOT=s3://mlflow-artifacts/  # MinIO via S3 API
MLFLOW_S3_ENDPOINT_URL=http://minio.mlops.svc:9000  # point the S3 client at MinIO instead of AWS
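The same tracking URI can be assembled in application code rather than hard-coded, pulling the password from the environment. A small sketch using the hostnames from the snippet above; the helper name is illustrative:

```python
import os

def tracking_uri() -> str:
    # Mirrors the env-var setup above: same user, host, port, and database
    # (postgres:5432/mlflow), with the password read from the environment
    # so it never lands in code or config files.
    password = os.environ["MLFLOW_DB_PASSWORD"]
    return f"postgresql://mlflow:{password}@postgres:5432/mlflow"

# Usage (requires mlflow installed):
#   import mlflow
#   mlflow.set_tracking_uri(tracking_uri())
```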

DVC for Dataset Versioning

MLflow tracks model artifacts. DVC tracks dataset versions. Together they give you full experiment reproducibility: you can recover exactly which dataset version, preprocessing code, and hyperparameters produced any given model.

# .dvc/config
[core]
    remote = minio
[remote "minio"]
    url = s3://datasets
    endpointurl = http://minio.mlops.svc:9000
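With the remote configured, "pinned to a hash" becomes concrete: a .dvc file records an md5 for the dataset, and a Git revision fixes that .dvc file. A sketch assuming DVC's Python API (dvc.api.read) and the remote name from the config above; both helper functions are illustrative:

```python
def pinned_md5(dvc_file_text: str) -> str:
    # Minimal parse of a .dvc file: return the first `md5:` entry,
    # i.e. the hash that pins this dataset version.
    for line in dvc_file_text.splitlines():
        line = line.strip().lstrip("- ").strip()
        if line.startswith("md5:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no md5 entry found")

def load_pinned(path: str, rev: str) -> bytes:
    # Fetch exactly the dataset bytes committed at Git revision `rev`,
    # resolved through the "minio" remote configured above.
    import dvc.api  # lazy import: needs the dvc package and a Git repo context
    return dvc.api.read(path, rev=rev, remote="minio", mode="rb")
```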

The Discipline Problem

The tooling is easy to set up. The hard part is the discipline: every training run must log its parameters, every dataset must be pinned to a DVC hash, and every model must be registered before it is deployed. That discipline comes from process, not tooling: code review for ML pipelines, and CI gates that fail when logging is missing.
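Such a CI gate can be a few lines. A sketch of the check, assuming runs are fetched with MlflowClient.get_run; the required parameter names are hypothetical and would be project-specific:

```python
# Hypothetical set of parameters every training run is required to log.
REQUIRED_PARAMS = {"learning_rate", "epochs", "dataset_rev"}

def missing_params(logged: dict) -> set:
    # Names from REQUIRED_PARAMS that the run never logged.
    # `logged` is the params dict of an MLflow run (run.data.params).
    return REQUIRED_PARAMS - set(logged)

# In CI (requires mlflow + a tracking server):
#   from mlflow.tracking import MlflowClient
#   run = MlflowClient().get_run(run_id)
#   missing = missing_params(run.data.params)
#   if missing:
#       sys.exit(f"run {run_id} is missing required params: {sorted(missing)}")
```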