## The SRE Mental Model
When an SRE thinks about a software release, they ask four questions: what changed? Who approved it? What version is running now? And how do I roll back?
ML experiments need exactly the same discipline, and most practitioners don’t apply it.
MLflow’s model registry maps almost perfectly:
| Software Release | MLflow Equivalent |
|---|---|
| Git commit | Run ID + logged parameters |
| Artifact (Docker image) | Model artifact (weights + metadata) |
| Release candidate | Model in “Staging” |
| Production deploy | Model in “Production” |
| Rollback | Transition previous version back to Production |
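The last row is the important one: in this model, a rollback is not an emergency redeploy but an ordinary stage transition. A toy sketch in plain Python of the stage-transition semantics (the class and method names are illustrative, not MLflow's actual API):

```python
# Toy model registry mimicking MLflow-style stage transitions.
# Illustrative only -- not MLflow's real API.

class ToyRegistry:
    def __init__(self):
        self.versions = {}  # version number -> stage

    def register(self, version):
        self.versions[version] = "None"

    def transition(self, version, stage):
        # Demote whichever version currently holds the target stage,
        # mirroring MLflow's "archive existing versions" behavior.
        for v, s in self.versions.items():
            if s == stage:
                self.versions[v] = "Archived"
        self.versions[version] = stage

reg = ToyRegistry()
reg.register(1)
reg.transition(1, "Production")   # v1 deployed
reg.register(2)
reg.transition(2, "Staging")      # v2 is the release candidate
reg.transition(2, "Production")   # promote v2; v1 is archived
reg.transition(1, "Production")   # rollback: v1 returns to Production
```

Because every transition is recorded against a registered version, "what is running now" and "how do I roll back" are both answerable from the registry alone.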
## Setting Up MLflow with PostgreSQL Backend
Out of the box, MLflow stores tracking data in local files (or a single SQLite file), which breaks the moment you want to run multi-node training or read experiment history from multiple pods.
```shell
# PostgreSQL as the backend store; artifacts on MinIO via the S3 API
export MLFLOW_TRACKING_URI=postgresql://mlflow:${MLFLOW_DB_PASSWORD}@postgres:5432/mlflow
export MLFLOW_ARTIFACT_ROOT=s3://mlflow-artifacts/

mlflow server --backend-store-uri "$MLFLOW_TRACKING_URI" --default-artifact-root "$MLFLOW_ARTIFACT_ROOT"
```
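Training pods pick up the tracking URI from the `MLFLOW_TRACKING_URI` environment variable. A minimal stdlib-only guard (the function name and default are my own, not MLflow's) that fails fast if a pod would silently fall back to a local, non-shared store:

```python
import os
from urllib.parse import urlparse

def assert_shared_backend(default="file:./mlruns"):
    """Fail fast if this pod would fall back to a local file store."""
    uri = os.environ.get("MLFLOW_TRACKING_URI", default)
    if urlparse(uri).scheme in ("", "file"):
        raise RuntimeError(
            f"Tracking URI {uri!r} is a local file store; "
            "multi-pod runs need a shared backend"
        )
    return uri

os.environ["MLFLOW_TRACKING_URI"] = "postgresql://mlflow:secret@postgres:5432/mlflow"
print(assert_shared_backend())
```

Running this check at the top of every training entrypoint turns a subtle "experiments went missing" bug into an immediate, loud failure.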
## DVC for Dataset Versioning
MLflow tracks model artifacts. DVC tracks dataset versions. Together they give you full experiment reproducibility: you can recover exactly which dataset version, preprocessing code, and hyperparameters produced any given model.
```ini
# .dvc/config
[core]
    remote = minio
[remote "minio"]
    url = s3://datasets
    endpointurl = http://minio.mlops.svc:9000
```
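The mechanism underneath is content addressing: DVC hashes the dataset, commits only the small hash file to git, and fetches the bytes from remote storage by hash. A toy illustration of the idea using MD5 (the hash DVC has historically used) — this mimics the concept, not DVC's actual file format:

```python
import hashlib

def dataset_hash(data: bytes) -> str:
    """Content-address a dataset: the hash, not the bytes, goes into git."""
    return hashlib.md5(data).hexdigest()

train_v1 = b"id,label\n1,cat\n2,dog\n"
train_v2 = b"id,label\n1,cat\n2,dog\n3,fox\n"

# Any change to the data changes the pin, so a model's lineage
# unambiguously names the exact bytes it was trained on.
print(dataset_hash(train_v1))
print(dataset_hash(train_v1) != dataset_hash(train_v2))  # True
```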
## The Discipline Problem
The tooling is easy to set up. The hard part is the discipline: every training run must log parameters, every dataset must be pinned to a DVC hash, every model must be registered before it’s deployed. That discipline comes from process — code review for ML pipelines, CI gates that fail if logging is missing.
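A CI gate like the one described can start very small: before a pipeline merges, check that its run metadata carries everything needed to reproduce and register the model. A hedged sketch (the field names are illustrative, not an MLflow schema):

```python
REQUIRED_FIELDS = ("params", "dataset_dvc_hash", "registered_model")

def gate(run_metadata: dict) -> list:
    """Return the missing fields; CI fails the build if any are absent."""
    return [f for f in REQUIRED_FIELDS if not run_metadata.get(f)]

good_run = {
    "params": {"lr": 1e-3},
    "dataset_dvc_hash": "a1b2c3",
    "registered_model": "churn-v7",
}
bad_run = {"params": {"lr": 1e-3}}  # forgot the dataset pin and registration

print(gate(good_run))  # []
print(gate(bad_run))   # ['dataset_dvc_hash', 'registered_model']
```

The point is not the check itself but where it runs: in CI, where an unlogged run blocks the merge instead of surfacing months later as an unreproducible model.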