Benchmarking Self-Hosted LLMs for Offensive Security
I wrote up a benchmark of self-hosted models, served through Ollama, against OWASP Juice Shop:
a deliberately minimal harness exposing only http_request and encode_payload
tools, eight challenges spanning SQLi, JWT, LFI, and auth bypass, and 100 runs per model per
challenge. The goal was to see how well local models handle single-step exploitation and
multi-step tool chaining when frontier APIs are off the table.
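A minimal sketch of what such a two-tool harness could look like. Only the tool names (http_request, encode_payload) come from the post; the signatures, supported encodings, and dispatch layout are my assumptions, not the benchmark's actual implementation:

```python
# Sketch of a two-tool harness like the one described above.
# Tool names come from the post; everything else (signatures,
# encoding schemes, dispatch table) is assumed for illustration.
import base64
import urllib.parse
import urllib.request


def encode_payload(payload: str, scheme: str) -> str:
    """Encode an attack payload for transport: url, base64, or hex."""
    if scheme == "url":
        return urllib.parse.quote(payload, safe="")
    if scheme == "base64":
        return base64.b64encode(payload.encode()).decode()
    if scheme == "hex":
        return payload.encode().hex()
    raise ValueError(f"unknown scheme: {scheme}")


def http_request(method: str, url: str, headers: dict = None,
                 body: str = None, timeout: float = 10.0) -> dict:
    """Issue an HTTP request and return status, headers, and body."""
    req = urllib.request.Request(
        url,
        method=method.upper(),
        data=body.encode() if body else None,
        headers=headers or {},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return {
            "status": resp.status,
            "headers": dict(resp.headers),
            "body": resp.read().decode(errors="replace"),
        }


# Dispatch table the model-facing loop would route tool calls through.
TOOLS = {"http_request": http_request, "encode_payload": encode_payload}
```

Keeping the tool surface this small forces the model to do the exploitation reasoning itself (e.g. building a `' OR 1=1--` injection and URL-encoding it) rather than leaning on a purpose-built scanner tool.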