feat(blogs): add "Evaluating and Monitoring LLM Apps" post by vfanucci · Pull Request #4921 · kestra-io/docs

vfanucci · 2026-06-03T16:10:33Z

Summary

New blog post: the LLM eval-and-monitoring loop in Kestra — scheduled offline eval against a golden dataset (LLM-as-judge), CI/CD deploy gating, trigger-based online eval on sampled production traffic, Slack drift alerting + Pause for human review, and the governance/cost angle.
Frames Kestra as the orchestration layer around existing tracking tools (MLflow, Langfuse, LangSmith), not a replacement.
Closes the three-part LLM series (RAG → agents → eval).

Notes for reviewers

Frontmatter adapted to the blogs schema. Author set to Will Russell (same block as 2024-11-25-kestra-vs-jenkins).
Hero image is a placeholder (reuse of rag-with-gemini-and-langchain4j/main.jpg) so the build/preview passes — design can swap in the final visual later.
Body contains two  placeholders (eval flow topology, no-code dashboard) and one  placeholder to fill in before publishing.
Outbound links reference the first two posts in the series (/blogs/orchestrate-rag-pipeline-kestra from #4918 and /blogs/orchestrate-ai-agents-kestra from #4920), so land those first.

Test plan

npm run build succeeds locally
Cloudflare preview renders the post (author avatar, code blocks, mermaid loop diagram, the comparison table)
Replace the three placeholders before merging
Land after #4918 ("How to Orchestrate a RAG Pipeline with Kestra") and #4920 ("Building Production-Ready AI Agents"), in series order

🤖 Generated with Claude Code

Closes the LLM series with the eval-and-monitoring loop: scheduled offline eval against a golden dataset with LLM-as-judge, CI/CD deploy gating, trigger-based online eval on sampled production traffic, drift alerting via Slack + Pause, and the governance/cost angle. Frames Kestra as the orchestration layer around existing eval/observability tools (MLflow, Langfuse, LangSmith), not a replacement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vfanucci · 2026-06-03T16:13:14Z

Warning: Review content, screenshots, and Kestra flows before publishing.

github-actions · 2026-06-03T16:19:40Z

☁️ Cloudflare Worker Preview Deployed!

🔗 https://ks-blog-evaluate-monitor-llm-apps-docs.kestra-io.workers.dev
🔗 https://22eb66dd-docs.kestra-io.workers.dev

## 🔦 Lighthouse Benchmark

Tested: https://ks-blog-evaluate-monitor-llm-apps-docs.kestra-io.workers.dev on 2026-06-05 13:15 UTC
No baseline available — scores will appear after the first merge to main

Scores (0–100, higher is better)

| Page | Performance | Accessibility | Best Practices | SEO |
| --- | --- | --- | --- | --- |
| Home | 77 | 82 | 56 | 85 |
| Pricing | 97 | 91 | 56 | 100 |
| Enterprise | 96 | 82 | 56 | 100 |
| Cloud | 88 | 86 | 56 | 100 |
| About Us | 86 | 91 | 56 | 100 |
| Docs Landing | 90 | 88 | 56 | 92 |
| Contribute to Kestra (simple docs) | 98 | 87 | 56 | 92 |
| Flow (full featured docs) | 93 | 90 | 56 | 92 |
| Blog Index | 60 | 90 | 56 | 100 |
| Blog Post (sample) | 90 | 87 | 56 | 100 |
| VS Page (sample) | 97 | 88 | 56 | 100 |
| Plugins Landing | 92 | 80 | 56 | 92 |
| Plugin Page (sample) | 95 | 87 | 56 | 100 |
| Plugin Debug Page (sample) | 96 | 87 | 56 | 100 |
| Plugin Debug Return Page (sample) | 93 | 87 | 59 | 100 |
| Blueprints Landing | 95 | 80 | 56 | 92 |
| Blueprint Audit Logs CSV Export | 66 | 86 | 56 | 100 |

Core Web Vitals (lower is better)

| Page | LCP | FCP | TBT | CLS | Speed Index |
| --- | --- | --- | --- | --- | --- |
| Home | 1.59 s | 0.74 s | 326 ms | 0.002 | 1.84 s |
| Pricing | 1.17 s | 0.59 s | 16 ms | 0.000 | 0.76 s |
| Enterprise | 1.41 s | 0.56 s | 16 ms | 0.000 | 0.83 s |
| Cloud | 2.23 s | 0.60 s | 15 ms | 0.000 | 0.98 s |
| About Us | 2.59 s | 0.64 s | 41 ms | 0.000 | 0.85 s |
| Docs Landing | 1.58 s | 0.50 s | 156 ms | 0.000 | 0.98 s |
| Contribute to Kestra (simple docs) | 1.04 s | 0.59 s | 15 ms | 0.000 | 0.83 s |
| Flow (full featured docs) | 1.39 s | 0.66 s | 124 ms | 0.000 | 1.22 s |
| Blog Index | 6.22 s | 1.41 s | 122 ms | 0.000 | 49.78 s |
| Blog Post (sample) | 2.09 s | 0.58 s | 32 ms | 0.000 | 0.81 s |
| VS Page (sample) | 1.20 s | 0.61 s | 10 ms | 0.000 | 0.79 s |
| Plugins Landing | 1.17 s | 0.59 s | 63 ms | 0.000 | 2.47 s |
| Plugin Page (sample) | 1.00 s | 0.50 s | 103 ms | 0.051 | 1.64 s |
| Plugin Debug Page (sample) | 0.91 s | 0.54 s | 86 ms | 0.001 | 1.56 s |
| Plugin Debug Return Page (sample) | 1.12 s | 0.61 s | 112 ms | 0.025 | 1.90 s |
| Blueprints Landing | 1.29 s | 0.72 s | 37 ms | 0.000 | 1.51 s |
| Blueprint Audit Logs CSV Export | 1.11 s | 0.63 s | 206 ms | 0.485 | 2.16 s |

Legend

🟢 improved · 🔻 regressed · (blank) no significant change
Score threshold: ±10 pts · Metric threshold: ±30% of baseline

wrussell1999

This blog post looks at modules after the Kestra one. We don't cover evaluating or monitoring with Kestra in it, so this might be a bit of a stretch. Do we want to pivot this to something else? @vfanucci

vfanucci deployed to Preview June 3, 2026 16:10 with GitHub Actions

vfanucci assigned elliotgunn Jun 3, 2026

wrussell1999 self-requested a review June 4, 2026 13:00

Merge branch 'main' into blog/evaluate-monitor-llm-apps-kestra

c0ca49c

wrussell1999 deployed to Preview June 4, 2026 13:00 with GitHub Actions

initial changes

5664544

wrussell1999 deployed to Preview June 5, 2026 13:02 with GitHub Actions

wrussell1999 reviewed Jun 5, 2026


wrussell1999 closed this Jun 5, 2026

MilosPaunovic deleted the blog/evaluate-monitor-llm-apps-kestra branch June 5, 2026 17:40

vfanucci wants to merge 3 commits into main from blog/evaluate-monitor-llm-apps-kestra


3 participants
