DataCraves logo DataCraves
← All use cases
Operations

Pipeline health monitoring with auto-rerun and root-cause hints

Your data pipelines fail at 3am. Your on-call engineer wakes up, spends 40 minutes finding which DAG broke, another 30 figuring out why, and a final 20 reassuring downstream stakeholders. Repeat 3x a week.

⚠️ Charts, numbers, and "client" examples on this page use illustrative mock data. Real client deployments are confidential.

The Problem

Data pipelines fail at 3 AM. The on-call data engineer wakes up, scrolls Slack to figure out which DAG broke, opens Airflow, scrolls logs for 40 minutes, finds it was an OOM in a single Spark task, restarts the job, waits 25 minutes for it to finish, then writes a Slack post for the morning team explaining which dashboards are stale. By 5 AM they're back in bed. Repeat three times a week.

Most pipeline-monitoring tools just send "job X failed" Slack alerts. That's the easy 5% of the problem. The hard 95% is everything that happens after:

⚠ Examples below use synthetic pipeline data.

The DataCraves Approach

The Pipeline Agent watches every job — Airflow, dbt, Dagster, Prefect, custom — and on every failure: classifies, retries the transient ones, opens an incident only for the real ones, and tells the team exactly which downstream artefacts are now stale. Quality checks run on every successful job too: schema parity, freshness vs SLA, row-count variance, null-rate variance. Failures here block downstream promotion.

The classifier

We classify failures using a small model trained on your last 6 months of run history. Features include: exit code, log-line clustering, time-of-day, recent commit activity, upstream lag, prior failure rate of this DAG. Output is one of {transient, real, suspicious}.

The Architecture

Operations architecture: DAGs → watcher → classifier → retry/incident
Diagram 1 — Architecture. Watcher consumes job events, classifier routes between auto-retry and human-paged incident.
Incident sequence: failure → history check → classify → retry → success → Slack
Diagram 2 — Incident handling sequence. Engineer is never paged for transient failures.
Quality checks: schema, freshness, rowcount, null rate
Diagram 3 — Quality gates on every successful job. Failures block downstream promotion.

The Dashboard

DAGs monitored
214
+ 18 LTM
SLA hit rate
99.2%
▲ 1.4 pp
Pages avoided (30d)
147
▼ 78%
Median MTTR
4m 12s
▼ 32m
Failures classified (30d)
85% of failures are transient — retry handles them.
SLA hit rate over time
Steady climb as auto-retry catches more transient cases.
Top failing DAGs (30d)
events_etl is the noisiest — flagged for refactor.
Page volume — before vs after
On-call sleeps now.
Latency vs cost (per DAG)
Top-right cluster = optimisation candidates.
Data freshness gauge
96% of gold tables fresh within SLA right now.

The Math / The Logic

Failure classifier features.

# classifier.py
def classify_failure(run, history, recent_commits):
    feats = {
        "exit_code": run.exit_code,
        "oom_pattern": "MemoryError" in run.stderr or run.exit_code == 137,
        "timeout_pattern": run.exit_code == 124,
        "log_cluster": nearest_cluster(run.stderr, history),
        "committed_today": any(c.dag == run.dag for c in recent_commits),
        "upstream_fresh": all(t.fresh for t in run.upstream_tables),
        "historical_self_heal_rate": history.self_heal_rate(run.dag),
    }
    p_transient = model.predict_proba(feats)["transient"]
    if p_transient > 0.75: return "transient"
    if p_transient < 0.25: return "real"
    return "suspicious"   # queue for human, don't page

Quality checks. Each gold table has a contract:

# contract.yml
table: events.checkout_completed
freshness_sla_min: 60
row_count_z_threshold: 3     # vs trailing 30d daily mean
null_rate_z_threshold: 2
required_columns:
  - { name: user_id,    type: string, nullable: False }
  - { name: amount_cents, type: int,    nullable: False }
  - { name: ts,         type: timestamp, nullable: False }
on_fail: block_downstream

Sample Output / Insight

📨 Slack #data-ops — Pipeline Agent · 03:47 IST

events.checkout_completed missed its 03:00 SLA by 47 min.

Root-cause hint: upstream stripe_webhook_raw was 38 minutes late landing (Stripe API lag, not on our side).

Auto-retry succeeded at 03:42. 3 dashboards were stale (sales-intel, customer-health, finance-recon) — all refreshed automatically.

No page issued. Logging this for the morning standup.

Classifier: transient (p = 0.91). Confidence based on 4 prior similar Stripe-lag events in the last 90 days, all self-healed.

ROI Math

Assumption to challenge: the 78% page reduction assumes your historical failure mix is >70% transient (most teams are). If your DAGs are mostly real failures (e.g., flaky upstream APIs nobody owns), expect 40-50% reduction instead.

Common Pitfalls

⚠ Auto-retrying real failures.
Aggressive retries on a real failure burn money and delay the fix. Tune classifier threshold conservatively at first.
⚠ Quality contracts as comments.
If your contract is a comment in a notebook, it's not a contract. Enforce in CI.
⚠ Notifying everyone on every failure.
Channel fatigue is real. Page only the DAG owner; broadcast only the downstream impact.
⚠ No DAG ownership table.
Without owners, every page goes to whoever's on-call, who then plays Slack telephone. Owner-per-DAG is the bedrock.
⚠ Skipping the morning digest.
Even if no human was paged overnight, send a 2-line "what happened" summary at 9 AM. Trust comes from visibility.
📊 Mock data, real patterns.
📈
Expected ROI
~70% reduction in manual triage time on data incidents

Based on pilot with a 200-DAG Airflow deployment; assumes ~40% of failures are transient and auto-recoverable.

Run this on your own data?

A 30-minute demo shows the agents working against your warehouse — not a pre-baked sandbox.

Book a demo →

Mock data, real patterns. Every visualization is synthetic to preserve client confidentiality.