Monitoring Runbook

Uptime checks, error rates, key metrics, and escalation steps. Living document, updated after every incident.

System Status: Operational

Automatic check via Vercel. Manual check: test all endpoints listed below.

Health Checks

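| Endpoint              | Interval | Expected | Critical |
|-----------------------|----------|----------|----------|
| /                     | 5 min    | 200      | Yes      |
| /api/score            | 5 min    | 200      | Yes      |
| /api/contact          | 10 min   | 200      | Yes      |
| /api/analytics/event  | 15 min   | 200      | No       |
| /api/dashboard/report | 1 h      | 200      | No       |
| /sitemap.xml          | 1 h      | 200      | No       |
| /robots.txt           | 1 h      | 200      | No       |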
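
For the manual check, the table above can be encoded as data and walked in a loop. The following is a minimal sketch, not the project's actual monitoring code: the `HealthCheck` class, `fetch_status`, and `run_checks` are hypothetical names, the base URL is taken from the Quick Commands section, and plain GET requests are an assumption (the API routes may only answer POST, in which case the probe would need adjusting).

```python
from dataclasses import dataclass
from typing import Callable
from urllib import error, request

# Assumption: same host as used in the Quick Commands section.
BASE_URL = "https://spotlight-service.de"


@dataclass(frozen=True)
class HealthCheck:
    path: str
    interval_min: int  # how often the check should run, in minutes
    expected: int      # expected HTTP status code
    critical: bool     # value of the "Critical" column


# One entry per row of the Health Checks table.
CHECKS = [
    HealthCheck("/", 5, 200, True),
    HealthCheck("/api/score", 5, 200, True),
    HealthCheck("/api/contact", 10, 200, True),
    HealthCheck("/api/analytics/event", 15, 200, False),
    HealthCheck("/api/dashboard/report", 60, 200, False),
    HealthCheck("/sitemap.xml", 60, 200, False),
    HealthCheck("/robots.txt", 60, 200, False),
]


def fetch_status(path: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with request.urlopen(BASE_URL + path, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:  # 4xx/5xx responses still carry a status code
        return exc.code
    except (error.URLError, TimeoutError):
        return 0


def run_checks(fetch: Callable[[str], int] = fetch_status) -> list[tuple[HealthCheck, int]]:
    """Run every check once and return the failures as (check, actual_status)."""
    failures = []
    for check in CHECKS:
        status = fetch(check.path)
        if status != check.expected:
            failures.append((check, status))
    return failures
```

Passing the fetch function as a parameter keeps the comparison logic testable without network access; the `interval_min` field is only recorded here and would be consumed by whatever scheduler runs the checks.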

Error Thresholds

5xx Rate

⚠️ Warning: >1% | 🔴 Critical: >5%

Action: Check Vercel logs → redeploy if the errors persist

429 Rate (Rate Limit)

⚠️ Warning: >2% | 🔴 Critical: >10%

Action: Review rate-limit config → increase limits

Response Time (p95)

⚠️ Warning: >2s | 🔴 Critical: >5s

Action: Check bundle size → review heavy API calls

LCP

⚠️ Warning: >2.5s | 🔴 Critical: >4s

Action: Optimize images → check CWV dashboard

CLS

⚠️ Warning: >0.1 | 🔴 Critical: >0.25

Action: Review layout shifts → check dynamic content

Free-Check Completion

⚠️ Warning: <60% | 🔴 Critical: <40%

Action: Review wizard UX → check form errors

Contact Form Submit

⚠️ Warning: <30% | 🔴 Critical: <15%

Action: Check form validation → Turnstile issues
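
The thresholds above fall into two groups: metrics where higher values are worse (error rates, latency, LCP, CLS) and metrics where lower values are worse (completion and submit rates). A minimal sketch of how a value could be classified against them follows; the metric keys, the `Threshold` class, and the `classify` function are hypothetical names, and treating the comparisons as strict (`>` and `<`) is read directly from the notation above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float
    critical: float
    higher_is_worse: bool = True  # False for completion and submit rates


# Values copied from the Error Thresholds section; the keys are illustrative.
THRESHOLDS = {
    "5xx_rate_pct": Threshold(1, 5),
    "429_rate_pct": Threshold(2, 10),
    "p95_response_s": Threshold(2, 5),
    "lcp_s": Threshold(2.5, 4),
    "cls": Threshold(0.1, 0.25),
    "free_check_completion_pct": Threshold(60, 40, higher_is_worse=False),
    "contact_form_submit_pct": Threshold(30, 15, higher_is_worse=False),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a measured value."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warn:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warn:
            return "warning"
    return "ok"
```

For example, `classify("5xx_rate_pct", 3)` lands in the warning band, while `classify("free_check_completion_pct", 35)` is critical because completion rates are judged from below.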

Escalation Steps

L1: Automated Detection (0–5 min)
  1. Vercel monitoring triggers alert
  2. Error logged to console
  3. Automatic retry for transient failures

L2: Developer Review (5–15 min)
  1. Check Vercel deployment logs
  2. Review error stack trace
  3. Verify ENV variables
  4. Test API endpoint manually

L3: Hotfix (15–30 min)
  1. Fix code in main branch
  2. Push → Vercel auto-deploy
  3. Verify fix on production
  4. Monitor error rates for 15 min

L4: Rollback (30–45 min)
  1. Vercel: Promote previous deployment
  2. Verify site stability
  3. Root-cause analysis
  4. Schedule proper fix

L5: User Communication (45+ min)
  1. Status page update if applicable
  2. Respond to user reports
  3. Post-mortem within 24h
  4. Update runbook with learnings
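
The time windows of the escalation ladder can be expressed as a simple lookup from minutes since the first alert to the level that should currently be active. This is a sketch under one assumption the runbook does not state: the shared boundaries (5, 15, 30, 45 minutes) are assigned to the higher level. The names `ESCALATION_LEVELS` and `escalation_level` are hypothetical.

```python
# (start minute, level, name), taken from the Escalation Steps section.
ESCALATION_LEVELS = [
    (0, "L1", "Automated Detection"),
    (5, "L2", "Developer Review"),
    (15, "L3", "Hotfix"),
    (30, "L4", "Rollback"),
    (45, "L5", "User Communication"),
]


def escalation_level(minutes_since_alert: float) -> tuple[str, str]:
    """Return (level, name) for an incident that has been open this long.

    Assumption: a boundary value such as exactly 5 minutes belongs to the
    higher level (L2), since the runbook lists overlapping ranges.
    """
    if minutes_since_alert < 0:
        raise ValueError("minutes_since_alert must be non-negative")
    level, name = ESCALATION_LEVELS[0][1], ESCALATION_LEVELS[0][2]
    for start, candidate_level, candidate_name in ESCALATION_LEVELS:
        if minutes_since_alert >= start:
            level, name = candidate_level, candidate_name
    return level, name
```

An incident that has been open for 20 minutes would map to the Hotfix step, and anything past 45 minutes stays at User Communication.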

Key Metrics

Free-Checks / Day

~10 → target: 50+

Source: Analytics

Contact Form Submissions

~2 → target: 10+

Source: API logs

Average Profile Score

42 → target: 60+

Source: Free-Check data

Dashboard Active Users

~3 → target: 20+

Source: Vercel Analytics

Blog Organic Traffic

~50/mo → target: 500+/mo

Source: Vercel Analytics

Bounce Rate

~65% → target: <50%

Source: Vercel Analytics

Uptime (30d)

99.8% → target: 99.9%+

Source: Vercel
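
To make the uptime figures concrete, a percentage can be converted into the downtime it allows over the 30-day window. The helper below is an illustrative sketch (the function name is hypothetical); it assumes uptime is measured over the full window in minutes.

```python
def allowed_downtime_minutes(uptime_pct: float, window_days: int = 30) -> float:
    """Return the minutes of downtime that a given uptime percentage permits."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    total_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return total_minutes * (100 - uptime_pct) / 100


current = allowed_downtime_minutes(99.8)  # roughly 86.4 minutes
target = allowed_downtime_minutes(99.9)   # roughly 43.2 minutes
```

Moving from 99.8% to 99.9% therefore means roughly halving the downtime over 30 days, which is about the length of one incident that reaches the L5 step of the escalation ladder.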

Quick Commands

# Redeploy on Vercel
vercel --prod

# Check build locally
npx next build

# Verify sitemap
curl -s https://spotlight-service.de/sitemap.xml | head -20

# Test free-check API
curl -X POST https://spotlight-service.de/api/score -H "Content-Type: application/json" -d ''

Post-Mortem Template

## Post-Mortem: [Incident Title]

**Date**: YYYY-MM-DD
**Duration**: X min
**Impact**: [Users affected / Revenue impact]

### Timeline

- HH:MM - Alert triggered
- HH:MM - Developer notified
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Verified on production

### Root Cause

[Description]

### Resolution

[What was done]

### Action Items

- [ ] [Prevention measure]
- [ ] [Monitoring improvement]
- [ ] [Runbook update]