#eval

When prompts become shells: the tool registry is the attack surface

DEV Community·Michael "Mike" K. Saleme·22 days ago

#security #cve #aiagents #tool #pattern #eval

On May 7, 2026, Microsoft published "When Prompts Become Shells: RCE vulnerabilities in AI agent...

Eval Set Sizing: The Statistical Power Math Behind LLM A/B Tests

DEV Community·Gabriel Anhaia·25 days ago

#ai #observability #machinelearning #statistics #float #eval

How big should an LLM eval set be? The power-math for binomial accuracy, sequential testing, and stratification, and why 100 examples is a coin flip.

Why I spun my benchmark into its own repo (and why every dev tool with a benchmark should)

DEV Community·Nikita Groshin·27 days ago

#opensource #benchmarking #devtools #benchmark #repo #tool

Anthropic Message Batching: When 50% Off Is Worth the Latency

DEV Community·Gabriel Anhaia·28 days ago

#ai #anthropic #python #llm #batch #requests

Anthropic Batches API gives you half-price tokens with a 24h SLA. Here is when it earns its keep, and a Python script that runs 1,000 evals through it.

What changed in Iris v0.4.0

DEV Community·Ian Parent·about 1 month ago

#mcp #aiagents #observability #opensource #iris #every

Iris v0.4.0 ships today. It's the release where protocol-native eval crosses from "deterministic...

Comparing c1186abbdd...50b389dd0e · r/morph

GitHub·r·about 1 month ago

#repo #handledialogclose #inputentered #tabselected #focusfirstlistmember #merge

Skills Without Evals Are Just Markdown and Hope

DEV Community·Daniel Sogl·about 1 month ago

#failure #where #claude #skill #skills #model

TL;DR. I built an Anthropic Agent Skill for @ngrx/signals and ran it through the full eval pipeline:...

Comparing 786d21d842...f89bca481c · r/morph

GitHub·r·about 1 month ago

#repo #handledialogclose #inputentered #tabselected #focusfirstlistmember #morph

7 Platforms That Turn Agent Evals Into RL Training Data

DEV Community·Ethan·about 1 month ago

#agents #ai #llm #machinelearning #training #environment

Executive Summary: Most teams evaluating AI agents hit the same wall. They can score their...

go-eval: the missing piece for testing agents in Go

DEV Community·igcodinap·about 1 month ago

#el #go #ai #testing #fullscreen #eval

A while ago I started to feel a strange unease while building LLM applications in Go. Go...

Langfuse Experiments Rebuild: What LLM Devs Need to Know (2026)

DEV Community·BeanBean·about 1 month ago

#fullstack #ai #webdev #langfuse #experiments #rebuild

Originally published on NextFuture. On April 13, 2026, the Langfuse team shipped an experiments...

Your RAG Eval Set Is Probably Wrong. The Test That Catches It.

DEV Community·Gabriel Anhaia·about 1 month ago

#ai #rag #llm #eval #drift #queries

Three ways eval sets go wrong in production: leakage, drift, judge bias. Plus a 40-line drift detector you can ship today.

I built a hiring platform that watches engineers work in a real CAD tool

DEV Community·janardan·about 1 month ago

#ai #webdev #opensource #architecture #eval #work

ai-eval-lab: I got bored of UI work at my day job and wanted to build something. Ended up building a...

5 RAG Failure Modes Nobody Warns You About in the Tutorials

DEV Community·Gabriel Anhaia·about 1 month ago

#ai #rag #llm #database #chunks #eval

The five RAG failures that survive your eval suite and break in production. Each one with a small mitigation snippet you can paste in today.

Best LLM Observability Platforms for Anthropic and OpenAI Stacks (2026)

DEV Community·BeanBean·about 1 month ago

#how #fullstack #ai #webdev #anthropic #langfuse

Originally published on NextFuture. Picking the best LLM observability tools used to mean choosing...

Anthropic April 23 Postmortem: 3 Confounding Changes Behind Claude Code's Month-Long Quality Drop

DEV Community·정상록·about 1 month ago

#anthropic #change #ai #code #claude #eval

AI Agent Testing Automation: Developer Workflows for 2026

www.sitepoint.com·SitePoint Team·about 1 month ago

#agent #testing

Comprehensive guide covering AI Agent Testing Automation: Developer Workflows for 2026 with practical implementation details.
