Benchmarking Self-Hosted LLMs for Offensive Security

14-04-2026

I wrote up a benchmark of self-hosted models, served through Ollama, against OWASP Juice Shop: a deliberately minimal harness exposing only `http_request` and `encode_payload` tools, eight challenges spanning SQLi, JWT tampering, LFI, and auth bypass, and 100 runs per model per challenge. The goal was to see how local models handle single-step exploitation and tool chaining when frontier APIs are off the table.
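To make the setup concrete, here is a minimal sketch of what the two tools exposed to the model might look like. The post does not show the actual harness code, so the function signatures, the supported encoding schemes, and the tool-schema layout (the OpenAI-style function format that Ollama's tool calling accepts) are all assumptions for illustration:

```python
import base64
import urllib.parse

def encode_payload(payload: str, scheme: str = "url") -> str:
    """Hypothetical encode_payload tool: encode an attack string
    before it is embedded in a request (URL or base64)."""
    if scheme == "url":
        # Encode everything, including characters like ' and =
        return urllib.parse.quote(payload, safe="")
    if scheme == "base64":
        return base64.b64encode(payload.encode()).decode()
    raise ValueError(f"unknown scheme: {scheme}")

# Assumed tool declarations in the function-calling format Ollama accepts;
# the real harness's descriptions and parameter names may differ.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "http_request",
            "description": "Send an HTTP request to the target and return "
                           "status, headers, and body.",
            "parameters": {
                "type": "object",
                "properties": {
                    "method": {"type": "string"},
                    "url": {"type": "string"},
                    "headers": {"type": "object"},
                    "body": {"type": "string"},
                },
                "required": ["method", "url"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "encode_payload",
            "description": "Encode a payload (url or base64).",
            "parameters": {
                "type": "object",
                "properties": {
                    "payload": {"type": "string"},
                    "scheme": {"type": "string", "enum": ["url", "base64"]},
                },
                "required": ["payload"],
            },
        },
    },
]
```

Keeping the toolset this small forces the model to do its own reasoning about injection points and encodings rather than leaning on a purpose-built exploitation tool.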