Wednesday, April 22, 2026

MLOps on OCI: Applying a DBA Mindset to Machine Learning

MLOps is the discipline of treating machine learning systems the way seasoned DBAs treat database pipelines: with version control, monitoring, rollback capability, and zero tolerance for silent failures. This post walks through a production-grade MLOps architecture on Oracle Cloud Infrastructure (OCI), complete with working scripts, pipeline definitions, and the hard-won lessons that don't show up in vendor documentation.

The architecture covers three major phases:

  1. Data Preparation — exploration, cleansing, feature engineering in OCI Data Science
  2. Model Training — automated CI/CD pipeline via OCI DevOps, containerized jobs, MLFlow tracking
  3. Model Deployment — serving via OKE (Oracle Kubernetes Engine) with a live inference endpoint

Let's build it layer by layer.


Architecture Overview





GitHub ──(code push trigger)──► OCI DevOps Build & Deploy Pipeline
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   OCI Data Science            OCI Container Registry       OCI Object Storage
  ┌──────────────────┐      ┌──────────────────────┐      ┌──────────────────┐
  │ Data Access &    │      │ Training Images      │      │ Datasets         │
  │ Exploration      │      │ Inference Images     │      │ Model Artifacts  │
  │ Model Training   │      └──────────────────────┘      │ Model Backups    │
  │ Model Dev        │                                    └──────────────────┘
  └──────────────────┘
              │
              └──────────────────────────────────────────────────►
                                                              OKE / MLFlow
                                                    ┌─────────────────────────┐
                                                    │ Model Evaluation        │
                                                    │ Model Monitoring        │
                                                    │ Model Registry          │
                                                    │ Model Inference Endpoint│
                                                    └─────────────────────────┘


The bottom layer — OCI IAM, Logging, Monitoring, Secrets, VCN, Notifications — is your operational safety net that runs silently under everything else.
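
Most of that layer is consumed through resource principals. As a quick illustration, here is a minimal sketch of pulling a secret (say, the MLFlow tracking URI) out of OCI Vault from inside a Data Science job; the secret OCID is a placeholder you would supply from configuration:

# read_secret.py (illustrative sketch, not part of the pipeline above)
import base64
import oci

# Authenticate with the job's resource principal, no config file needed
signer = oci.auth.signers.get_resource_principals_signer()
secrets_client = oci.secrets.SecretsClient(config={}, signer=signer)

SECRET_OCID = "ocid1.vaultsecret.oc1..example"   # placeholder OCID
bundle = secrets_client.get_secret_bundle(secret_id=SECRET_OCID)

# Secret content arrives base64-encoded
mlflow_uri = base64.b64decode(
    bundle.data.secret_bundle_content.content
).decode("utf-8")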


Phase 1 — Data Preparation in OCI Data Science

1.1 Environment Setup

Every project starts with a reproducible conda environment. Don't skip this — environment drift is the #1 cause of "but it worked on my machine" incidents.

# environment.yml
name: mlops-oci
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
    - oracle-ads==2.9.1
    - scikit-learn==1.3.0
    - pandas==2.0.3
    - mlflow==2.8.1
    - oci==2.112.0
    - boto3
    - pyarrow
    - great-expectations==0.18.0

# Create and activate the environment
conda env create -f environment.yml
conda activate mlops-oci
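
Pinning versions is only half the battle; a job can still land on a stale image. A small guard like the following (a hypothetical helper, with pins mirroring environment.yml) fails fast at startup when the runtime does not match:

# verify_env.py (hypothetical guard; run at job startup)
from importlib.metadata import version

PINS = {"scikit-learn": "1.3.0", "pandas": "2.0.3", "mlflow": "2.8.1"}

for pkg, pinned in PINS.items():
    installed = version(pkg)
    if installed != pinned:
        raise RuntimeError(f"{pkg}: expected {pinned}, found {installed}")
print("✅ Environment matches pinned versions.")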

1.2 Data Access & Exploration (OCI ObjectStorage → Notebook)

# data_exploration.py
import ads
import pandas as pd
from ads.dataset.factory import DatasetFactory

# Authenticate using resource principal (preferred in OCI notebooks)
ads.set_auth(auth="resource_principal")

BUCKET       = "mlops-datasets"
NAMESPACE    = "your-tenancy-namespace"
OBJECT_NAME  = "churn_raw.parquet"

# Pull dataset from Object Storage
ds = DatasetFactory.open(
    f"oci://{BUCKET}@{NAMESPACE}/{OBJECT_NAME}",
    format="parquet",
    target="churn"
)

df = ds.to_pandas()
print(f"Shape: {df.shape}")
print(df.dtypes)
print(df.describe())

1.3 Data Quality Gate with Great Expectations

This is the DBA mindset applied to ML: validate your data before you trust it.

# data_quality.py
import great_expectations as gx

context = gx.get_context()

# Register an expectation suite, then attach a validator to it
context.add_expectation_suite("churn_suite")

validator = context.get_validator(
    batch_request=...,  # supply a batch request for your datasource
    expectation_suite_name="churn_suite"
)

# Column presence
validator.expect_table_columns_to_match_ordered_list(
    column_list=["customer_id","tenure","monthly_charges","total_charges","churn"]
)

# Nullability
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_not_be_null("churn")

# Value ranges
validator.expect_column_values_to_be_between("tenure", min_value=0, max_value=120)
validator.expect_column_values_to_be_between("monthly_charges", min_value=0, max_value=500)

# Categorical validity
validator.expect_column_values_to_be_in_set("churn", value_set=["Yes", "No"])

results = validator.validate()

if not results["success"]:
    raise ValueError(f"Data quality check FAILED:\n{results}")

print("✅ Data quality checks passed.")

1.4 Feature Engineering & Save to Object Storage

# feature_engineering.py
import pandas as pd
import ads

ads.set_auth(auth="resource_principal")

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Encode target
    df["churn_flag"] = (df["churn"] == "Yes").astype(int)

    # Derived features
    df["avg_monthly_spend"] = df["total_charges"] / (df["tenure"] + 1)
    df["is_long_term"]      = (df["tenure"] > 24).astype(int)
    df["high_spender"]      = (df["monthly_charges"] > 70).astype(int)

    # One-hot encode
    cat_cols = ["contract", "payment_method", "internet_service"]
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

    # Drop originals
    df.drop(columns=["churn", "customer_id"], inplace=True)

    return df

df_raw        = pd.read_parquet("churn_raw.parquet")
df_features   = engineer_features(df_raw)

# Save engineered features back to Object Storage
output_path   = "oci://mlops-datasets@your-tenancy-namespace/churn_features.parquet"
df_features.to_parquet(output_path)

print(f"✅ Feature set saved: {df_features.shape}")

Phase 2 — Model Training with OCI DevOps CI/CD

2.1 Project Structure

mlops-churn/
├── .github/                    # GitHub metadata (PR templates, etc.)
├── build_spec.yaml             # OCI DevOps build spec
├── deploy_spec.yaml            # OCI DevOps deploy spec
├── Dockerfile.train            # Training container
├── Dockerfile.inference        # Inference container
├── src/
│   ├── train.py                # Training entrypoint
│   ├── evaluate.py             # Evaluation logic
│   ├── serve.py                # FastAPI inference server
│   └── utils/
│       ├── data_quality.py
│       └── feature_engineering.py
├── k8s/
│   ├── deployment.yaml         # OKE deployment manifest
│   └── service.yaml            # OKE service manifest
├── environment.yml
└── requirements.txt

2.2 OCI DevOps Build Spec

# build_spec.yaml
version: 0.1
component: build
timeoutInSeconds: 1800
shell: bash

steps:
  - type: Command
    name: "Install dependencies"
    command: |
      pip install -r requirements.txt

  - type: Command
    name: "Run data quality checks"
    command: |
      python src/utils/data_quality.py

  - type: Command
    name: "Build Training Docker Image"
    command: |
      docker build \
        -t ${OCI_RESOURCE_PRINCIPAL_REGION}.ocir.io/${TENANCY_NAMESPACE}/mlops-train:${OCI_BUILD_RUN_ID} \
        -f Dockerfile.train .

  - type: Command
    name: "Push Training Image to OCIR"
    command: |
      docker push \
        ${OCI_RESOURCE_PRINCIPAL_REGION}.ocir.io/${TENANCY_NAMESPACE}/mlops-train:${OCI_BUILD_RUN_ID}

  - type: Command
    name: "Build Inference Docker Image"
    command: |
      docker build \
        -t ${OCI_RESOURCE_PRINCIPAL_REGION}.ocir.io/${TENANCY_NAMESPACE}/mlops-infer:${OCI_BUILD_RUN_ID} \
        -f Dockerfile.inference .

  - type: Command
    name: "Push Inference Image to OCIR"
    command: |
      docker push \
        ${OCI_RESOURCE_PRINCIPAL_REGION}.ocir.io/${TENANCY_NAMESPACE}/mlops-infer:${OCI_BUILD_RUN_ID}

outputArtifacts:
  - name: train_image_tag
    type: BINARY
    location: ${OCI_BUILD_RUN_ID}

2.3 Training Script with MLFlow Tracking

# src/train.py
import os
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    classification_report, roc_auc_score,
    precision_score, recall_score, f1_score
)
import ads
import joblib

# ── Config ────────────────────────────────────────────────────────────────────
ads.set_auth(auth="resource_principal")

MLFLOW_TRACKING_URI = os.environ["MLFLOW_TRACKING_URI"]
EXPERIMENT_NAME     = "churn-prediction-v2"
BUCKET              = "mlops-datasets"
NAMESPACE           = os.environ["OCI_TENANCY_NAMESPACE"]
MODEL_BUCKET        = "mlops-artifacts"

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment(EXPERIMENT_NAME)

# ── Load Features ─────────────────────────────────────────────────────────────
df = pd.read_parquet(
    f"oci://{BUCKET}@{NAMESPACE}/churn_features.parquet"
)

X = df.drop(columns=["churn_flag"])
y = df["churn_flag"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── Hyperparameters ───────────────────────────────────────────────────────────
params = {
    "n_estimators":   int(os.environ.get("N_ESTIMATORS", 200)),
    "max_depth":      int(os.environ.get("MAX_DEPTH",    5)),
    "learning_rate":  float(os.environ.get("LEARNING_RATE", 0.05)),
    "subsample":      float(os.environ.get("SUBSAMPLE",   0.8)),
    "random_state":   42,
}

# ── Train & Track ─────────────────────────────────────────────────────────────
with mlflow.start_run() as run:
    mlflow.log_params(params)
    mlflow.set_tag("model_type", "GradientBoosting")
    mlflow.set_tag("dataset_version", "churn_features_v2")

    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    # Metrics
    y_pred  = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    metrics = {
        "roc_auc":   roc_auc_score(y_test, y_proba),
        "precision": precision_score(y_test, y_pred),
        "recall":    recall_score(y_test, y_pred),
        "f1_score":  f1_score(y_test, y_pred),
    }

    # Cross-validation stability check
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    metrics["cv_roc_auc_mean"] = cv_scores.mean()
    metrics["cv_roc_auc_std"]  = cv_scores.std()

    mlflow.log_metrics(metrics)
    print(classification_report(y_test, y_pred))
    print(f"\nšŸ“Š Metrics: {metrics}")

    # ── Promotion Gate ─────────────────────────────────────────────────────────
    MIN_AUC = float(os.environ.get("MIN_AUC_THRESHOLD", 0.80))
    if metrics["roc_auc"] < MIN_AUC:
        raise ValueError(
            f"❌ Model ROC-AUC {metrics['roc_auc']:.4f} < threshold {MIN_AUC}. "
            "Halting promotion."
        )

    # ── Log Model & Register ───────────────────────────────────────────────────
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn-gbt",
        input_example=X_test.head(5),
        signature=mlflow.models.infer_signature(X_test, y_pred),
    )

    # Save artifact to Object Storage as backup
    local_path = "/tmp/model.joblib"
    joblib.dump(model, local_path)

    import oci
    # Use the same resource principal identity that ADS was configured with
    signer = oci.auth.signers.get_resource_principals_signer()
    object_storage = oci.object_storage.ObjectStorageClient(
        config={}, signer=signer
    )
    run_id = run.info.run_id
    with open(local_path, "rb") as f:
        object_storage.put_object(
            namespace_name=NAMESPACE,
            bucket_name=MODEL_BUCKET,
            object_name=f"churn-gbt/{run_id}/model.joblib",
            put_object_body=f,
        )

    print(f"\n✅ Run ID: {run_id}")
    print(f"✅ Model registered in MLFlow: churn-gbt")
    print(f"✅ Artifact backed up to OCI Object Storage")

2.4 Dockerfiles

# Dockerfile.train
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/
COPY environment.yml .

ENV PYTHONUNBUFFERED=1

ENTRYPOINT ["python", "src/train.py"]
# Dockerfile.inference
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt fastapi uvicorn

COPY src/serve.py .

EXPOSE 8080
ENTRYPOINT ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8080"]

Phase 3 — Model Serving on OKE with FastAPI

3.1 Inference Server

# src/serve.py
import os
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import List, Optional
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(
    title="Churn Prediction API",
    description="MLOps inference endpoint — OCI OKE",
    version="2.0.0"
)

# ── Model Loading ──────────────────────────────────────────────────────────────
MLFLOW_TRACKING_URI    = os.environ["MLFLOW_TRACKING_URI"]
MODEL_NAME             = os.environ.get("MODEL_NAME", "churn-gbt")
MODEL_STAGE            = os.environ.get("MODEL_STAGE", "Production")

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

logger.info(f"Loading model: {MODEL_NAME} @ stage={MODEL_STAGE}")
model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/{MODEL_STAGE}")
logger.info("✅ Model loaded successfully")

# ── Schemas ────────────────────────────────────────────────────────────────────
class CustomerFeatures(BaseModel):
    tenure:             float = Field(..., ge=0, le=120)
    monthly_charges:    float = Field(..., ge=0, le=500)
    avg_monthly_spend:  float
    is_long_term:       int   = Field(..., ge=0, le=1)
    high_spender:       int   = Field(..., ge=0, le=1)
    contract_one_year:  int   = Field(0, ge=0, le=1)
    contract_two_year:  int   = Field(0, ge=0, le=1)

class PredictionRequest(BaseModel):
    customers: List[CustomerFeatures]
    threshold: Optional[float] = 0.5

class PredictionResult(BaseModel):
    customer_index:   int
    churn_probability: float
    churn_prediction:  bool
    risk_tier:         str

# ── Endpoints ──────────────────────────────────────────────────────────────────
@app.get("/health")
def health():
    return {"status": "ok", "model": MODEL_NAME, "stage": MODEL_STAGE}

@app.post("/predict", response_model=List[PredictionResult])
def predict(request: PredictionRequest):
    try:
        df = pd.DataFrame([c.dict() for c in request.customers])
        probas = model.predict_proba(df)[:, 1]

        results = []
        for i, prob in enumerate(probas):
            tier = (
                "HIGH"   if prob >= 0.70 else
                "MEDIUM" if prob >= 0.40 else
                "LOW"
            )
            results.append(PredictionResult(
                customer_index    = i,
                churn_probability = round(float(prob), 4),
                churn_prediction  = bool(prob >= request.threshold),
                risk_tier         = tier,
            ))

        return results

    except Exception as e:
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/model/info")
def model_info():
    client = mlflow.tracking.MlflowClient()
    mv     = client.get_latest_versions(MODEL_NAME, stages=[MODEL_STAGE])[0]
    return {
        "name":        mv.name,
        "version":     mv.version,
        "stage":       mv.current_stage,
        "run_id":      mv.run_id,
        "description": mv.description,
    }

3.2 Kubernetes Manifests

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-inference
  namespace: mlops
  labels:
    app: churn-inference
    version: v2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: churn-inference
  template:
    metadata:
      labels:
        app: churn-inference
    spec:
      containers:
        - name: inference
          image: <region>.ocir.io/<namespace>/mlops-infer:<BUILD_TAG>
          ports:
            - containerPort: 8080
          env:
            - name: MLFLOW_TRACKING_URI
              valueFrom:
                secretKeyRef:
                  name: mlops-secrets
                  key: mlflow-uri
            - name: MODEL_NAME
              value: "churn-gbt"
            - name: MODEL_STAGE
              value: "Production"
          resources:
            requests:
              memory: "512Mi"
              cpu:    "250m"
            limits:
              memory: "1Gi"
              cpu:    "1000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 20
---
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: churn-inference-svc
  namespace: mlops
spec:
  selector:
    app: churn-inference
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: LoadBalancer

3.3 OKE Deploy Script

#!/bin/bash
# deploy_oke.sh
set -euo pipefail

BUILD_TAG="${1:?Usage: ./deploy_oke.sh <build_tag>}"
NAMESPACE="mlops"
REGION="us-ashburn-1"
OCIR_NAMESPACE="your-tenancy-namespace"

echo "šŸš€ Deploying churn-inference:${BUILD_TAG} to OKE..."

# Login to OCIR
docker login \
  ${REGION}.ocir.io \
  -u "${OCIR_NAMESPACE}/oracleidentitycloudservice/${OCI_USER_EMAIL}" \
  -p "${OCI_AUTH_TOKEN}"

# Substitute image coordinates in the manifest
sed -e "s/<BUILD_TAG>/${BUILD_TAG}/g" \
    -e "s/<region>/${REGION}/g" \
    -e "s/<namespace>/${OCIR_NAMESPACE}/g" \
    k8s/deployment.yaml | kubectl apply -n ${NAMESPACE} -f -

kubectl apply -n ${NAMESPACE} -f k8s/service.yaml

# Wait for rollout
kubectl rollout status deployment/churn-inference -n ${NAMESPACE} --timeout=5m

echo "✅ Deployment complete."
kubectl get pods -n ${NAMESPACE} -l app=churn-inference

Phase 4 — Monitoring & Governance

4.1 Model Monitoring — Drift Detection

# monitoring/drift_detector.py
"""
Scheduled job: compares production inference distributions
against training baseline. Fires OCI Notification on drift.
"""
import os
import pandas as pd
import numpy as np
from scipy.stats import ks_2samp
import oci
import json

DRIFT_THRESHOLD = 0.05   # p-value threshold for the KS drift test
BUCKET          = "mlops-artifacts"
NAMESPACE       = os.environ["OCI_TENANCY_NAMESPACE"]

def load_baseline() -> pd.DataFrame:
    return pd.read_parquet(
        f"oci://{BUCKET}@{NAMESPACE}/baselines/churn_train_dist.parquet"
    )

def load_production_window(days: int = 7) -> pd.DataFrame:
    """Load last N days of inference logs from Object Storage."""
    return pd.read_parquet(
        f"oci://{BUCKET}@{NAMESPACE}/inference_logs/recent_{days}d.parquet"
    )

def check_drift(baseline: pd.DataFrame, production: pd.DataFrame) -> dict:
    results = {}
    numeric_cols = baseline.select_dtypes(include=[np.number]).columns

    for col in numeric_cols:
        stat, p_value = ks_2samp(
            baseline[col].dropna(),
            production[col].dropna()
        )
        results[col] = {
            "ks_statistic": round(stat, 4),
            "p_value":      round(p_value, 4),
            "drifted":      p_value < DRIFT_THRESHOLD,
        }
    return results

def send_alert(drifted_features: list):
    config = oci.config.from_file()
    ons    = oci.ons.NotificationDataPlaneClient(config)

    topic_id = os.environ["OCI_NOTIFICATION_TOPIC_ID"]
    message  = json.dumps({
        "alert":    "DATA DRIFT DETECTED",
        "features": drifted_features,
        "action":   "Review and retrain churn-gbt model",
    }, indent=2)

    ons.publish_message(
        topic_id=topic_id,
        message_details=oci.ons.models.MessageDetails(
            title="MLOps Alert: Model Drift",
            body=message,
        )
    )
    print(f"🚨 Alert sent for features: {drifted_features}")

if __name__ == "__main__":
    baseline   = load_baseline()
    production = load_production_window(days=7)
    drift_report = check_drift(baseline, production)

    drifted = [f for f, v in drift_report.items() if v["drifted"]]

    if drifted:
        print(f"⚠️  Drift detected in: {drifted}")
        send_alert(drifted)
    else:
        print("✅ No significant drift detected.")

4.2 Automated Retraining Trigger

# monitoring/retrain_trigger.py
"""
If drift is detected or performance degrades below threshold,
automatically kick off a new OCI DataScience Job (retraining run).
"""
import os
import oci

def trigger_retraining_job(reason: str):
    config  = oci.config.from_file()
    ds      = oci.data_science.DataScienceClient(config)

    job_id  = os.environ["OCI_DATASCIENCE_JOB_ID"]

    run = ds.create_job_run(
        create_job_run_details=oci.data_science.models.CreateJobRunDetails(
            project_id=os.environ["OCI_PROJECT_ID"],
            compartment_id=os.environ["OCI_COMPARTMENT_ID"],
            job_id=job_id,
            display_name="auto-retrain-trigger",
            job_environment_configuration_override=None,
            job_configuration_override_details=oci.data_science.models.DefaultJobConfigurationDetails(
                environment_variables={
                    "RETRAIN_REASON": reason,
                    "N_ESTIMATORS":   "300",
                    "MIN_AUC_THRESHOLD": "0.82",
                }
            ),
        )
    )
    print(f"✅ Retraining job launched: {run.data.id}")
    return run.data.id
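
Wiring the two monitoring modules together is then a couple of lines in the drift job's main block; a sketch, assuming both files share a package path:

# in monitoring/drift_detector.py, __main__ block (sketch)
from retrain_trigger import trigger_retraining_job

if drifted:
    print(f"⚠️  Drift detected in: {drifted}")
    send_alert(drifted)
    trigger_retraining_job(reason=f"drift in features: {', '.join(drifted)}")
else:
    print("✅ No significant drift detected.")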

4.3 OCI IAM Policy — Least Privilege for MLOps

# iam_policies.tf (Terraform)

resource "oci_identity_policy" "mlops_datascience" {
  name           = "mlops-datascience-policy"
  description    = "Allows DataScience resources to read/write Object Storage and publish notifications"
  compartment_id = var.compartment_id

  statements = [
    "Allow dynamic-group mlops-dg to read object-family in compartment mlops-compartment",
    "Allow dynamic-group mlops-dg to manage objects in bucket 'mlops-datasets'",
    "Allow dynamic-group mlops-dg to manage objects in bucket 'mlops-artifacts'",
    "Allow dynamic-group mlops-dg to use ons-topics in compartment mlops-compartment",
    "Allow dynamic-group mlops-dg to manage data-science-jobs in compartment mlops-compartment",
    "Allow dynamic-group mlops-dg to manage data-science-job-runs in compartment mlops-compartment",
    "Allow dynamic-group mlops-dg to read secret-bundles in vault id ${var.vault_id}",
  ]
}

Phase 5 — End-to-End Integration Test

# tests/e2e_test.py
import requests
import json

ENDPOINT = "http://<OKE_LOAD_BALANCER_IP>/predict"

payload = {
    "threshold": 0.5,
    "customers": [
        {
            "tenure":             3,
            "monthly_charges":    85.0,
            "avg_monthly_spend":  21.25,
            "is_long_term":       0,
            "high_spender":       1,
            "contract_one_year":  0,
            "contract_two_year":  0,
        },
        {
            "tenure":             48,
            "monthly_charges":    45.0,
            "avg_monthly_spend":  44.1,
            "is_long_term":       1,
            "high_spender":       0,
            "contract_one_year":  0,
            "contract_two_year":  1,
        },
    ]
}

response = requests.post(ENDPOINT, json=payload)
response.raise_for_status()

results = response.json()
for r in results:
    print(
        f"Customer {r['customer_index']} | "
        f"P(churn)={r['churn_probability']:.2%} | "
        f"Tier={r['risk_tier']} | "
        f"Prediction={'CHURN' if r['churn_prediction'] else 'RETAIN'}"
    )

Expected output:

Customer 0 | P(churn)=78.32% | Tier=HIGH   | Prediction=CHURN
Customer 1 | P(churn)=12.14% | Tier=LOW    | Prediction=RETAIN

Key Lessons — The DBA Perspective

DBA Principle                    MLOps Translation
──────────────────────────────────────────────────────────────────────────
Always version your schema       Always version your features and training data
Never trust incoming data        Run Great Expectations before every training run
Keep backups                     Store every model artifact to Object Storage
Monitor query performance        Monitor model drift and prediction latency
Use least-privilege IAM          Scope resource principal policies tightly
Have a rollback plan             Keep previous MLFlow model versions in Staging
Automate your maintenance jobs   Automate retraining triggers on drift events

Deployment Checklist

  • [ ] OCI Vault secrets configured (MLFlow URI, OCIR auth token)
  • [ ] Dynamic group and IAM policies applied
  • [ ] GitHub repo connected to OCI DevOps project
  • [ ] Build pipeline validated with build_spec.yaml
  • [ ] Training job baseline run completed and logged to MLFlow
  • [ ] Model promoted to Production stage in MLFlow Registry
  • [ ] OKE namespace and secrets created (kubectl create secret)
  • [ ] Kubernetes manifests deployed and readiness probes passing
  • [ ] Drift monitoring job scheduled (OCI Scheduler or cron)
  • [ ] OCI Notification topic wired to ops team email/PagerDuty
  • [ ] End-to-end test passing against live inference endpoint

Quick Reference: DBA Concepts Mapped to MLOps

DBA Principle             MLOps Translation
──────────────────────────────────────────────────────────
Schema Integrity          Data Quality Gates (Great Expectations)
ACID Compliance           Versioned Models & Data (MLFlow / Object Storage)
Performance Tuning        Hyperparameter Optimization
Point-in-Time Recovery    Model Rollback via MLFlow Stages
Audit Logs                OCI Logging & MLFlow Run Tracking

Automated Database Security: DBSAT 3.1 meets Oracle EM


What DBSAT is: the Database Security Assessment Tool is a free Oracle utility that interrogates a running database's configuration, users, privileges, and audit settings — producing a scored risk report without requiring DBA-level access or taking the system offline.

The OEM integration adds four key capabilities:

The plug-in agent deploys DBSAT as a managed OEM target, so assessments can be scheduled (daily, weekly, on-demand) from the central console rather than run manually on each host.

The analyzer engine classifies every finding into a severity tier — critical, high, medium, or low — and groups them by domain: basic security controls, privilege analysis, fine-grained auditing, encryption, and database vault.

The compliance framework mapping automatically aligns findings to regulatory and hardening standards: CIS Oracle Benchmarks, DISA STIGs, GDPR, PCI-DSS, and HIPAA. This eliminates the manual cross-referencing that normally consumes audit preparation time.

The output layer surfaces results in three directions: detailed HTML/JSON reports for security teams, the OEM compliance dashboard for executive visibility and trend lines over time, and automated remediation jobs that can trigger corrective SQL scripts or escalation workflows directly from OEM.

Practical benefit: instead of point-in-time assessments run before audits, organizations get a continuous risk baseline across their entire Oracle fleet — with drift detection that alerts when a previously-passing control regresses.




Architecture: how the OEM plug-in works

Before touching a keyboard, internalize this data flow. It determines where credentials need to live and what firewall ports must be open.

OEM 13.5 Server ── EM Repository (OMS) ── Compliance Engine
                   Compliance framework mapping: CIS · STIG · PCI · GDPR
        │ dispatch task                     ▲ upload results
        ▼                                   │
EM Agent + DBSAT (collector.jar), one per host:
  Host A · DB19c   ──JDBC :1521──►  Oracle DB 19c
  Host B · DB21c   ──JDBC :1521──►  Oracle DB 21c
  Host C · DB23ai  ──JDBC :1521──►  Oracle DB 23ai

Key data-flow facts every DBA should memorize:

  • The OEM agent on each host runs dbsat collect locally — no separate DBSAT binary to maintain.
  • The collector output ZIP is uploaded back to the OMS over the standard HTTPS agent channel (port 4900). No new firewall rules needed in most environments.
  • The analyzer step runs inside the OMS, not on the DB host, so CPU impact on production is negligible.
  • Credentials for dbsat collect are stored in the OEM Named Credential store — not in clear-text config files.

Prerequisites and installation

Required privileges — the minimum viable user

The biggest DBA mistake I see is running DBSAT with a SYS or DBA account. DBSAT 3.1 ships with a dedicated privilege script. Use it.

SQL — run as SYSDBA on each target DB (dbsat_user_setup.sql)
-- Create a dedicated, locked-down collector account
CREATE USER c##dbsat_coll IDENTIFIED BY "<VaultSecret_01>"
  DEFAULT TABLESPACE users
  TEMPORARY TABLESPACE temp
  ACCOUNT LOCK;   -- unlock only during assessment windows

-- Minimum grants for collector (CDB-wide)
GRANT CREATE SESSION                     TO c##dbsat_coll CONTAINER=ALL;
GRANT SELECT_CATALOG_ROLE                TO c##dbsat_coll CONTAINER=ALL;
GRANT SELECT ON sys.registry$            TO c##dbsat_coll CONTAINER=ALL;
GRANT SELECT ON sys.dba_users_with_defpwd
                                         TO c##dbsat_coll CONTAINER=ALL;

-- DBSAT 3.1 new: SQL Firewall and Data Redaction views
GRANT SELECT ON sys.dbms_sql_firewall$   TO c##dbsat_coll CONTAINER=ALL;
GRANT SELECT ON dvsys.dba_dv_status      TO c##dbsat_coll CONTAINER=ALL;

-- Optional: privilege analysis (adds ~15% more findings).
-- Note: a database-wide capture takes no roles argument.
BEGIN
  dbms_privilege_capture.create_capture(
    name => 'DBSAT_PA_CAP',
    type => dbms_privilege_capture.g_database
  );
END;
/
Warning
Never grant DBA to the collector account. DBSAT only needs read access to data dictionary views. The privilege script above covers all 3.1 checks. If you see "ORA-01031 insufficient privileges" in the log, see the extended grants section in MOS Note 2710440.1.

Deploying the OEM plug-in

BASH — OEM admin on OMS host (plug-in deployment)
# 1. Download plug-in archive from MOS (Patch 36500021)
cd /u01/app/oracle/product/13.5.0/em_home/bin

# 2. Import plug-in into OMS software library
./emcli import_update \
  -omslocal \
  -file="/tmp/oracle.dbsat_3.1.0.0.0_2000_0.opar"

# 3. Deploy to OMS
./emcli deploy_plugin_on_server \
  -plugin="oracle.dbsat:3.1.0.0.0" \
  -sys_password="<oms_sys_pw>"

# 4. Deploy to all managed agents (parallel, 10 at a time)
./emcli deploy_plugin_on_agent \
  -plugin="oracle.dbsat:3.1.0.0.0" \
  -agent_names="*:3872" \
  -async_operation

# 5. Verify deployment status
./emcli get_plugin_deployment_status \
  -plugin_id="oracle.dbsat" | grep -E "AGENT|STATUS"

BASH — verify plug-in on agent (agent verification)
# On the target agent host
$AGENT_HOME/bin/emctl listplugins agent \
  -type all | grep dbsat

# Expected output:
# oracle.dbsat    3.1.0.0.0    DEPLOYED

# Confirm dbsat binary is accessible
ls -lh $AGENT_HOME/plugins/oracle.dbsat_3.1.0.0.0/bin/dbsat

Running your first assessment

Option A — from the OEM console (recommended)

  1. Navigate to the target database in OEM → Security → DBSAT Assessment.
     EM 13.5 adds a "DBSAT Assessment" menu item under the Database Security menu
     for any target that has the plug-in deployed. Click Run Assessment.

  2. Select a Named Credential for the collector.
     Choose the credential that resolves to c##dbsat_coll. If it's not listed,
     create it first under Setup → Security → Named Credentials.

  3. Choose the assessment scope.
     Select Full Assessment for the first run. For CDB targets, enable Include
     all open PDBs. The Privilege Analysis option adds ~4 minutes per database.

  4. Submit and monitor.
     The job appears in the EM Jobs page with real-time step progress. A typical
     19c database completes in 6–9 minutes.

Option B — command-line (scripted / headless)

BASH — on DB host via EM agent (manual collect + report)
DBSAT=$AGENT_HOME/plugins/oracle.dbsat_3.1.0.0.0/bin/dbsat
OUTPUT_DIR=/u01/dbsat_output/$(date +%Y%m%d)
mkdir -p $OUTPUT_DIR

# ── Step 1: Collect ──────────────────────────────────────────
# -n  suppress interactive password prompt
# -o  output prefix
$DBSAT collect \
  c##dbsat_coll/"<password>"@"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCPS)(HOST=dbhost01.example.com)(PORT=2484))(CONNECT_DATA=(SERVICE_NAME=PROD19C)))" \
  -n \
  -o $OUTPUT_DIR/prod19c

# Output: prod19c.zip (encrypted, requires reporter password)

# ── Step 2: Report ───────────────────────────────────────────
# -n  no password on report zip
# -a  all checks
# -f  DBSAT 3.1: dual format output
$DBSAT report \
  -n \
  -a \
  -f html,json \
  -o $OUTPUT_DIR/prod19c_report \
  $OUTPUT_DIR/prod19c.zip

# Output files:
#   prod19c_report.html  — human-readable
#   prod19c_report.json  — machine-readable (SIEM/ticketing)
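
The JSON report is what makes downstream automation possible. The exact schema varies across DBSAT releases, so treat the following as a sketch: it assumes a top-level findings list whose entries carry an id, a title, and a severity, and filters out what needs attention first.

# parse_findings.py (sketch; field names are assumptions, adjust to your schema)
import json

with open("prod19c_report.json") as f:
    report = json.load(f)

urgent = [
    finding for finding in report.get("findings", [])
    if finding.get("severity") in ("Critical", "High")
]
for finding in urgent:
    print(f"{finding.get('id')}  [{finding.get('severity')}]  {finding.get('title')}")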

CDB / multi-PDB sweep

BASH — CDB with all PDBs (multi-tenant assessment)
# Connect to CDB$ROOT with the common user.
# --pdb-scope ALL is the DBSAT 3.1 flag to enumerate all open PDBs.
$DBSAT collect \
  c##dbsat_coll@"(DESCRIPTION=...CDB_SERVICE...)" \
  -n \
  --pdb-scope ALL \
  -o $OUTPUT_DIR/cdb01

# Report will include a top-level CDB summary + per-PDB sections.
# --pdb-rollup is the DBSAT 3.1 flag for a consolidated risk matrix.
$DBSAT report \
  -n -a \
  -f html,json \
  --pdb-rollup \
  -o $OUTPUT_DIR/cdb01_report \
  $OUTPUT_DIR/cdb01.zip
Pro tip
Use --pdb-scope OPEN instead of ALL to skip PDBs in MOUNT or RESTRICTED state during scheduled jobs — avoids false-negative errors in the log.

Decoding the findings report

DBSAT categorizes every finding into one of five domains. Here's what a real output header looks like for a typical 19c production database, followed by what the score means:

═══════════════════════════════════════════════════════════════════
 Oracle Database Security Assessment Report
 Database: PROD19C · Version: 19.21.0.0 · OS: Linux x86-64
 Assessed: 2025-04-22 14:03:11 UTC | DBSAT 3.1.0.0.0
═══════════════════════════════════════════════════════════════════
 SUMMARY
 Domains: Basic Security Configuration · User Accounts ·
          Privilege and Role Analysis · Auditing and Logging ·
          Encryption and Data Protection
 ───────────────────────────────────────────────────────────────────
 TOTAL      Critical: 4 · High: 16 · Medium: 22 · Low: 14 · Advisory: 20
 Overall Risk Score: MEDIUM-HIGH (68/100)
═══════════════════════════════════════════════════════════════════

The risk score algorithm in DBSAT 3.1 weights findings as: Critical × 20, High × 6, Medium × 2, Low × 1 — normalized to 100. A score above 70 triggers an automatic OEM compliance alert.
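
To make the weighting concrete, here is the raw sum for the sample report above; the final normalization onto a 0-100 scale is internal to the tool, so it is only noted in a comment:

# risk_score.py (worked example of the 3.1 severity weighting)
WEIGHTS = {"critical": 20, "high": 6, "medium": 2, "low": 1}
counts  = {"critical": 4, "high": 16, "medium": 22, "low": 14}

raw = sum(WEIGHTS[s] * counts[s] for s in WEIGHTS)
print(raw)  # 4*20 + 16*6 + 22*2 + 14*1 = 234
# DBSAT then normalizes the raw sum onto 0-100 (constant is internal to the tool)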

  • Critical (4): immediate action required
  • High (16): address within the current sprint
  • Medium (22): schedule remediation
  • Low (14): harden in the next cycle

Top findings to always check first

  • DBV.0001 — Database Vault not enabled
    (Basic Security Configuration · Critical)
  • AUDIT.0003 — Unified Auditing not enforced (Mixed mode)
    (Auditing and Logging · Critical)
  • PRIV.0012 — 23 users with excessive system privileges and no logins for > 90 days
    (Privilege and Role Analysis · High)
  • ENC.0007 — Transparent Data Encryption not active on USERS and SYSAUX tablespaces
    (Encryption and Data Protection · High)
  • USER.0004 — 4 accounts using default Oracle passwords (SCOTT, OUTLN, DBSNMP…)
    (User Accounts · High)
  • CONF.0019 — SEC_CASE_SENSITIVE_LOGON = FALSE (allows case-insensitive passwords)
    (Basic Security Configuration · Medium)

Compliance framework mapping

DBSAT 3.1 ships with five built-in compliance mappings. OEM's compliance framework picks these up automatically and folds them into enterprise-level compliance dashboards.

DBSAT Check ID  Description                      Frameworks                                     Severity
────────────────────────────────────────────────────────────────────────────────────────────────────────
AUDIT.0001      Audit trail secure               CIS 4.1 · STIG V-236 · PCI 10.2                Critical
ENC.0001        TDE on sensitive tablespaces     PCI 3.4 · GDPR Art.32 · HIPAA §164             High
USER.0002       Default passwords changed        CIS 5.1 · STIG V-219 · PCI 2.1 · GDPR Art.25   High
PRIV.0001       PUBLIC has no dangerous grants   CIS 6.2 · STIG V-225                           Medium
NET.0003        sqlnet.ora encryption enforced   PCI 4.1 · GDPR Art.32 · STIG V-241             High
CONF.0011       Remote_OS_Authent = FALSE        CIS 2.2 · STIG V-220                           Critical
OEM tip
In OEM, navigate to Compliance → Library → Oracle Database Security Baseline to see all 147 DBSAT checks mapped to standards. You can customize the baseline: suppress checks that don't apply to your environment, and the customization persists across future assessments.

Scheduling and automation in OEM

Set up a weekly fleet-wide assessment job

BASH — emcli job scheduling (automated weekly sweep)
# Create a multi-target DBSAT job via emcli
./emcli create_job \
  -name="WEEKLY_DBSAT_FLEET" \
  -type="DBSATAssessment" \
  -target_type="oracle_database" \
  -target_list="PROD19C:oracle_database,PROD21C:oracle_database,DEVPDB:oracle_database" \
  -credential_set_name="DBCredsDBSAT" \
  -start_time="2025-04-27 02:00:00" \
  -repeat_units="Weeks" \
  -repeat_value="1" \
  -input_file="dbsat_job_params.properties"

# dbsat_job_params.properties:
#   assessment_scope=FULL
#   include_privilege_analysis=true
#   pdb_scope=OPEN
#   report_formats=html,json
#   upload_to_compliance=true
#   alert_on_score_delta=10   # alert if score worsens by ≥10 pts

Drift detection alert rule

SQL — OEM repository, run as SYSMAN (custom drift alert)
-- Query the DBSAT results table to find databases where the score
-- worsened by more than 5 points since the last assessment.
-- (LAG cannot appear in WHERE, so compute the delta in a subquery.)
WITH scored AS (
    SELECT
        t.target_name,
        r.assessment_date,
        r.overall_score,
        r.overall_score - LAG(r.overall_score)
            OVER (PARTITION BY t.target_name ORDER BY r.assessment_date) AS score_delta,
        r.critical_count,
        r.high_count
    FROM
        mgmt$dbsat_assessments r
        JOIN mgmt$target t ON t.target_guid = r.target_guid
    WHERE
        r.assessment_date >= SYSDATE - 14
)
SELECT target_name, assessment_date, overall_score, score_delta,
       critical_count, high_count
FROM   scored
WHERE  score_delta > 5
ORDER  BY score_delta DESC;

Remediation playbook

Here are the exact fixes for the most common Critical and High findings DBSAT surfaces.

Switch to Pure Unified Auditing

SQL + BASH — Oracle 19c/21c/23ai (AUDIT.0003 fix)
-- 1. Check current audit mode
SELECT value FROM v$option WHERE parameter = 'Unified Auditing';
-- If FALSE → mixed mode. If TRUE → pure unified (already good).

-- 2. Stop the instance and relink (mixed → pure unified)
-- On the OS, as oracle user:

BASH (relink for pure unified auditing)
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk uniaud_on ioracle ORACLE_HOME=$ORACLE_HOME

# Restart the instance
sqlplus / as sysdba <<EOF
SHUTDOWN IMMEDIATE;
STARTUP;
SELECT value FROM v\$option WHERE parameter = 'Unified Auditing';
EOF
# Verify output: TRUE

Enable TDE on user tablespaces

SQL — as SYSDBA (ENC.0007 fix)
-- 1. Create or open the wallet
ADMINISTER KEY MANAGEMENT
  SET KEYSTORE OPEN
  IDENTIFIED BY "<wallet_pw>"
  CONTAINER = ALL;

-- 2. Set master encryption key (first time only)
ADMINISTER KEY MANAGEMENT
  SET KEY
  USING TAG 'PROD19C_MEK_2025'
  IDENTIFIED BY "<wallet_pw>"
  WITH BACKUP
  CONTAINER = ALL;

-- 3. Encrypt existing tablespaces (online, no downtime on 12.2+)
ALTER TABLESPACE users     ENCRYPTION ONLINE USING 'AES256' ENCRYPT;
ALTER TABLESPACE app_data  ENCRYPTION ONLINE USING 'AES256' ENCRYPT;
ALTER TABLESPACE sysaux    ENCRYPTION ONLINE USING 'AES256' ENCRYPT;

-- 4. Verify
SELECT tablespace_name, encrypted
FROM   dba_tablespaces
WHERE  encrypted = 'YES';

Revoke dangerous PUBLIC grants

SQL — PRIV.0001 fix (tighten PUBLIC)
-- Find and revoke dangerous grants on PUBLIC
BEGIN
  FOR r IN (
    SELECT 'REVOKE ' || privilege || ' ON '
           || owner || '.' || table_name
           || ' FROM PUBLIC' AS stmt
    FROM   dba_tab_privs
    WHERE  grantee  = 'PUBLIC'
    AND    owner    = 'SYS'
    AND    privilege IN ('EXECUTE','SELECT')
    AND    table_name IN (
             'UTL_FILE', 'UTL_TCP', 'UTL_HTTP', 'UTL_SMTP',
             'DBMS_ADVISOR', 'DBMS_BACKUP_RESTORE',
             'DBMS_JAVA', 'DBMS_SYS_ERROR')
  ) LOOP
    EXECUTE IMMEDIATE r.stmt;
    DBMS_OUTPUT.PUT_LINE('Revoked: ' || r.stmt);
  END LOOP;
END;
/
Caution
Always test PUBLIC revokes in a lower environment first. Applications that relied on these implicit grants will throw ORA-01031 errors. Build a regression test for each UTL_* package before touching production.

Performance and operational impact

The most common DBA pushback: "Will this affect my production database?" Short answer — minimal if you follow the guidelines.

── DBSAT 3.1 Performance Profile (19c, 500GB, 200 active sessions) ──
Phase               Duration   CPU%   Active Sessions Impact
──────────────────────────────────────────────────────────────
Collect (no PA)     4m 12s     0.3%   None (reads dict only)
Collect (with PA)   9m 55s     1.8%   Minor (PA session sampling)
Report              1m 06s     0.1%   None (runs off-DB)
──────────────────────────────────────────────────────────────
Recommended window: 02:00–04:00 local | Off-peak batch window

The collector only reads data dictionary views — no DML, no DDL. The most I/O-intensive check is Privilege Analysis, which joins DBA_SYS_PRIVS and SESSION_PRIVS across all grantees. On a busy database, schedule this outside peak hours.

