Menu

Post image 1
Post image 2
Post image 3
Post image 4
1 / 4
0

Building a Prompt Injection Escape Room with Guardrails, Evals, and HITL

DEV Community·Harish Kotra (he/him)·about 1 month ago
#817oPH2c
Reading 0:00
15s threshold

Prompt injection is often discussed as a policy problem. In practice, it is a systems problem: model behavior, guardrail triggers, scoring quality, and human escalation all need to work together. This project packages those concerns into a game that developers can run locally, inspect, and extend. What We Built We built an interactive app where players try to jailbreak a protected AI agent. The app evaluates each attempt and records outcomes for analysis. The core modules: agent.py : protected agent and guardrail execution path eval.py : agent-as-judge scoring logic game.py : orchestration loop, replay, logs, leaderboard, HITL hook streamlit_app.py : operator-friendly UI for rapid attack testing Design Goals Keep the secret hidden even under adversarial prompting. Run guardrails before generation when SDK supports hooks. Produce explainable numeric scores per attempt. Capture enough telemetry for replay and regression testing.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More