feat(blogs): add "Evaluating and Monitoring LLM Apps" post#4921

Closed
vfanucci wants to merge 3 commits into main from blog/evaluate-monitor-llm-apps-kestra

Conversation

vfanucci commented Jun 3, 2026

Contributor

Summary

  • New blog post: the LLM eval-and-monitoring loop in Kestra — scheduled offline eval against a golden dataset (LLM-as-judge), CI/CD deploy gating, trigger-based online eval on sampled production traffic, Slack drift alerting + Pause for human review, and the governance/cost angle.
  • Frames Kestra as the orchestration layer around existing tracking tools (MLflow, Langfuse, LangSmith), not a replacement.
  • Closes the three-part LLM series (RAG → agents → eval).

Notes for reviewers

  • Frontmatter adapted to the blogs schema. Author set to Will Russell (same block as 2024-11-25-kestra-vs-jenkins).
  • Hero image is a placeholder (reuse of rag-with-gemini-and-langchain4j/main.jpg) so the build/preview passes — design can swap in the final visual later.
  • Body contains two <!-- SCREENSHOT: ... --> placeholders (eval flow topology, no-code dashboard) and one <!-- BLUEPRINT_URL: ... --> placeholder to fill in before publishing.
  • Outbound links reference the first two posts in the series: /blogs/orchestrate-rag-pipeline-kestra (#4918, feat(blogs): add "How to Orchestrate a RAG Pipeline with Kestra") and /blogs/orchestrate-ai-agents-kestra (#4920, feat(blogs): add "Building Production-Ready AI Agents" post). Land those first.

Test plan

🤖 Generated with Claude Code

Closes the LLM series with the eval-and-monitoring loop: scheduled
offline eval against a golden dataset with LLM-as-judge, CI/CD deploy
gating, trigger-based online eval on sampled production traffic, drift
alerting via Slack + Pause, and the governance/cost angle. Frames
Kestra as the orchestration layer around existing eval/observability
tools (MLflow, Langfuse, LangSmith), not a replacement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vfanucci commented Jun 3, 2026

Contributor Author

Warning: Review the content, screenshots, and Kestra flows before publishing.

github-actions Bot commented Jun 3, 2026

Contributor

☁️ Cloudflare Worker Preview Deployed!

🔗 https://ks-blog-evaluate-monitor-llm-apps-docs.kestra-io.workers.dev
🔗 https://22eb66dd-docs.kestra-io.workers.dev

## 🔦 Lighthouse Benchmark

Tested: https://ks-blog-evaluate-monitor-llm-apps-docs.kestra-io.workers.dev on 2026-06-05 13:15 UTC
No baseline available — scores will appear after the first merge to main

Scores (0–100, higher is better)

| Page | Performance | Accessibility | Best Practices | SEO |
|---|---|---|---|---|
| Home | 77 | 82 | 56 | 85 |
| Pricing | 97 | 91 | 56 | 100 |
| Enterprise | 96 | 82 | 56 | 100 |
| Cloud | 88 | 86 | 56 | 100 |
| About Us | 86 | 91 | 56 | 100 |
| Docs Landing | 90 | 88 | 56 | 92 |
| Contribute to Kestra (simple docs) | 98 | 87 | 56 | 92 |
| Flow (full featured docs) | 93 | 90 | 56 | 92 |
| Blog Index | 60 | 90 | 56 | 100 |
| Blog Post (sample) | 90 | 87 | 56 | 100 |
| VS Page (sample) | 97 | 88 | 56 | 100 |
| Plugins Landing | 92 | 80 | 56 | 92 |
| Plugin Page (sample) | 95 | 87 | 56 | 100 |
| Plugin Debug Page (sample) | 96 | 87 | 56 | 100 |
| Plugin Debug Return Page (sample) | 93 | 87 | 59 | 100 |
| Blueprints Landing | 95 | 80 | 56 | 92 |
| Blueprint Audit Logs CSV Export | 66 | 86 | 56 | 100 |

Core Web Vitals (lower is better)

| Page | LCP | FCP | TBT | CLS | Speed Index |
|---|---|---|---|---|---|
| Home | 1.59 s | 0.74 s | 326 ms | 0.002 | 1.84 s |
| Pricing | 1.17 s | 0.59 s | 16 ms | 0.000 | 0.76 s |
| Enterprise | 1.41 s | 0.56 s | 16 ms | 0.000 | 0.83 s |
| Cloud | 2.23 s | 0.60 s | 15 ms | 0.000 | 0.98 s |
| About Us | 2.59 s | 0.64 s | 41 ms | 0.000 | 0.85 s |
| Docs Landing | 1.58 s | 0.50 s | 156 ms | 0.000 | 0.98 s |
| Contribute to Kestra (simple docs) | 1.04 s | 0.59 s | 15 ms | 0.000 | 0.83 s |
| Flow (full featured docs) | 1.39 s | 0.66 s | 124 ms | 0.000 | 1.22 s |
| Blog Index | 6.22 s | 1.41 s | 122 ms | 0.000 | 49.78 s |
| Blog Post (sample) | 2.09 s | 0.58 s | 32 ms | 0.000 | 0.81 s |
| VS Page (sample) | 1.20 s | 0.61 s | 10 ms | 0.000 | 0.79 s |
| Plugins Landing | 1.17 s | 0.59 s | 63 ms | 0.000 | 2.47 s |
| Plugin Page (sample) | 1.00 s | 0.50 s | 103 ms | 0.051 | 1.64 s |
| Plugin Debug Page (sample) | 0.91 s | 0.54 s | 86 ms | 0.001 | 1.56 s |
| Plugin Debug Return Page (sample) | 1.12 s | 0.61 s | 112 ms | 0.025 | 1.90 s |
| Blueprints Landing | 1.29 s | 0.72 s | 37 ms | 0.000 | 1.51 s |
| Blueprint Audit Logs CSV Export | 1.11 s | 0.63 s | 206 ms | 0.485 | 2.16 s |

Legend

🟢 improved  ·  🔻 regressed  ·  (blank) no significant change
Score threshold: ±10 pts  ·  Metric threshold: ±30% of baseline
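
As a rough illustration of how the legend's thresholds could be applied once a baseline exists, here is a minimal Python sketch. The function names, the inclusive comparisons, and the handling of a missing or zero baseline are assumptions made for illustration; the bot's actual implementation is not shown in this thread.

```python
from typing import Optional

# Thresholds taken from the legend above.
SCORE_THRESHOLD_PTS = 10.0    # Lighthouse scores: +/-10 points
METRIC_THRESHOLD_FRAC = 0.30  # Core Web Vitals: +/-30% of baseline


def classify_score(current: float, baseline: Optional[float]) -> str:
    """Classify a 0-100 score, where higher is better (hypothetical helper)."""
    if baseline is None:
        return ""  # no baseline yet, nothing to compare against
    delta = current - baseline
    if delta >= SCORE_THRESHOLD_PTS:
        return "improved"
    if delta <= -SCORE_THRESHOLD_PTS:
        return "regressed"
    return ""  # blank: no significant change


def classify_metric(current: float, baseline: Optional[float]) -> str:
    """Classify a timing or layout metric, where lower is better (hypothetical helper)."""
    if baseline is None or baseline == 0:
        return ""  # assumption: a relative change is undefined without a nonzero baseline
    change = (current - baseline) / baseline
    if change <= -METRIC_THRESHOLD_FRAC:
        return "improved"
    if change >= METRIC_THRESHOLD_FRAC:
        return "regressed"
    return ""
```

Under these assumptions, a score moving from 56 to 70 would be flagged as improved, while an LCP moving from 2.0 s to 2.1 s would stay blank.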

wrussell1999 self-requested a review June 4, 2026 13:00

wrussell1999 left a comment

Member

This blog post looks at modules after the Kestra one. We don't cover evaluating or monitoring with Kestra in it, so this might be a bit of a stretch. Do we want to pivot this to something else? @vfanucci

MilosPaunovic deleted the blog/evaluate-monitor-llm-apps-kestra branch June 5, 2026 17:40
3 participants