Hacker News | new | past | comments | ask | show | jobs | submit | techcam's comments

Happy to explain how the scoring works since that’s the obvious first question.

The core idea is:

Safety Score = 100 − riskScore

The risk score is based on structural prompt properties that tend to correlate with failures in production systems:

- instruction hierarchy ambiguity
- conflicting directives (system vs. user)
- missing output constraints
- unconstrained response scope
- token cost / context pressure

Each factor contributes a weighted amount to the total risk score.

It’s not trying to predict exact model behavior — that’s not possible statically.

The goal is closer to a linter: flagging prompt structures that are more likely to break (injection, hallucination drift, ignored constraints, etc.).

There’s also a lightweight pattern registry. If a prompt matches structural patterns seen in real jailbreak/injection cases (e.g. authority ambiguity), the score increases.

One thing that surprised me while building it: instruction hierarchy ambiguity caused more real-world failures than obvious injection patterns.

The CLI runs locally — no prompts are sent anywhere.

If you want to try it:

  npm install -g @camj78/costguardai
  costguardai analyze your-prompt.txt

Curious what failure modes others here have seen in production prompts.


The tricky part is that prompts can look “correct” but still behave unpredictably depending on phrasing.


We ran into something similar with API costs — small changes in behavior can have surprisingly large downstream effects.


This resonates — most of the hard problems show up after you ship, not before.


Feels like we have great tooling for code, but prompts are still mostly trial-and-error. Curious how people are validating them today.


I’ve been noticing the same — a lot of failures aren’t obvious “jailbreaks,” they’re just subtle prompt structure issues that only show up in production.

