⭒ The AI Security Market Has an Offensive-Defensive Mismatch
As we wrap up 2025, I've been reflecting on the progress in AI security. The offensive security side has seen real breakthroughs, with more vulnerability researchers effectively incorporating AI into their workflows and finding pretty cool bugs in less time and across a broader attack surface than before.
As I look toward 2026, I'm curious: where are the solutions, AI or not, for defensive security?
Roughly, the two things I'm trying to say are:
- Current AI security benchmarks are misleading because they encourage optimization for the wrong things.
- There's a market gap between AI security vendors selling offensive tools and enterprises that need defensive solutions.
What We Talk About When We Talk About "AI Security"
First, let's clarify what we mean by "AI security," since it's become an overloaded term. When people say "AI security," they might be referring to:
- AI for offensive security research, where you use AI to find vulnerabilities and produce exploits
- AI for defensive security, where you use AI for incident response, triage, remediation, and threat detection
- Attacking systems with AI/ML models deployed, where you use traditional security research techniques to find vulnerabilities in the system surrounding the model
- AI safety/alignment, the kind of work that Anthropic and alignment researchers do, where the LLM itself is the system being investigated. The difference between this fourth tribe of researchers and the third is that they consider a broader threat model and longer time horizons, asking questions like "could AI be misused to develop bioweapons?" or "will nation-state actors abuse it?", whereas the third tribe focuses on immediate, concrete vulnerabilities like "can I get a shell?" or "can I publish a CVE?" within the traditional CIAAN quintet (Confidentiality, Integrity, Availability, plus Authentication and Non-repudiation).
This post focuses primarily on the first two tribes, that is, AI as a tool for offensive vs. defensive security work. The last two tribes are equally important, just outside the scope of this post.
What I've noticed is that a lot of eyeballs and innovation and funding have gone towards the first category, while enterprises are asking for solutions in the second.
The Benchmarking Problem
Suppose you're a researcher building an AI security tool. You've built or trained an AI agent that can analyze code and find vulnerabilities. It works! In your testing, it's finding real bugs.
Now you need to convince potential customers that your tool is worth deploying.
How do you prove your AI agent is "good at security"?
You need benchmarks. Measurable, reproducible proof that your system works. So you look at what the research community and other startups are using:
Take 1: CTF Solver != "Deploy In Production"
The most common benchmark I see in papers is performance on Capture The Flag (CTF) challenges, computer security competitions where participants attempt to find text strings, called “flags,” from purposefully vulnerable programs for competitive or educational purposes.
CTFs are attractive as benchmarks because they're self-contained, objectively gradable, and great for demos.
First, I want to say that there is nothing wrong with measuring LLM performance on CTF challenges. Just as we teach LLMs to write poetry to satisfy our creative urges, why not teach them to solve CTFs for the same reason? Personally, I'd rather see more LLMs writing poetry than optimizing ad click-through rates, and I feel the same way about these experiments.
My problem is when CTF benchmarks are used to claim that AI agents should be deployed in production security roles. The distribution of vulnerabilities in CTF challenges simply doesn't match the distribution of vulnerabilities in real codebases. Let's celebrate these achievements for what they are: beautiful demonstrations, not justifications for deployment.
Take 2: CVE Discovery Is Better, But Still Not Apples-to-Apples
The more credible benchmark for production codebases is CVE discovery, where you find vulnerabilities in the wild and get them assigned official CVE identifiers. Unlike synthetic benchmarks, CVE assignments require verification by the maintainers of the affected software.
This has become the convention for vendors to prove genuine credibility: "We found 15 critical vulnerabilities in the codebases of XYZ famous companies."
But a dilemma arises when each vendor finds CVEs in different codebases: the results aren't directly comparable.
From an enterprise buyer's perspective, CVE count doesn't answer the question: "Will this solution find vulnerabilities in my codebase?" CVE count alone doesn't tell you the severity and real-world prevalence of each vulnerability, or whether it's a relevant threat given how your company deploys the software.
Without an apples-to-apples comparison, most buyers end up playing it by ear, relying on personal networks ("I know this security researcher is good") or secondhand recommendations ("My friend says those vendors are legitimate").
Take 3: Why Not A Wikipedia of CVEs?
Given the limitations of CTF challenges and the inconsistency of vendor-specific CVE discoveries, a natural question emerges: why not curate a comprehensive, standardized benchmark of CVEs that buyers can use to compare AI security tools?
Public CVE benchmarks can be gamed. LLMs memorize solutions, training data gets polluted, and there's no way to guarantee a vendor didn't optimize their system with scaffolding tailored specifically for great performance on that particular benchmark.
What Practical Evaluation Looks Like
So, instead of investing effort into cataloging and curating a set of CVEs to represent the buyer's codebase, why not just run the AI security tool directly on the buyer's codebase and see what it finds?
When I started asking around among security teams ("I'm looking into better AI security benchmarks, does anyone have leads?"), I discovered that others had the same motivation: evaluating every tool on the same codebase. I ended up helping with the effort.
We chose a particular production service as the test target because we had ground truth: years of known vulnerabilities reported against this codebase through bug bounty and internal tools.
When we run vendors through this evaluation, we can measure how many historical vulnerabilities they found, whether they discovered any new vulnerabilities we missed, and how different vendors compare on the same codebase.
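To make that concrete, here is a minimal sketch of what the scoring step could look like, written in Python. It assumes each vendor's findings and our historical ground truth have already been normalized into comparable records; the names here (`Finding`, `score_vendor`, `VENDOR_FINDINGS`, `KNOWN_VULNS`) are hypothetical illustrations, not the actual harness we built.

```python
# A sketch of scoring vendors against historical ground truth on one codebase.
# Assumes findings have been normalized into (file, vuln_class) pairs beforehand.
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str          # path within the target codebase
    vuln_class: str    # e.g. "sql-injection", "path-traversal"


def score_vendor(findings: set[Finding], known: set[Finding]) -> dict:
    """Compare one vendor's findings against the known historical vulnerabilities."""
    rediscovered = findings & known   # known vulns the tool found again
    novel = findings - known          # candidates we hadn't seen before (need manual triage)
    missed = known - findings         # known vulns the tool failed to find
    return {
        "recall_on_known": len(rediscovered) / len(known) if known else 0.0,
        "novel_candidates": len(novel),
        "missed_known": len(missed),
    }


# Usage: run every vendor against the same codebase and compare on identical terms.
# VENDOR_FINDINGS and KNOWN_VULNS are hypothetical placeholders for normalized data.
# for vendor, findings in VENDOR_FINDINGS.items():
#     print(vendor, score_vendor(findings, KNOWN_VULNS))
```

The key property is that every vendor is scored against the same ground truth, so "recall on known vulnerabilities" means the same thing across tools.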
Market Mismatch Between Offensive and Defensive Security
When I sat down with engineering managers, technical program managers, and security engineers about their wishlist for AI security tools, I expected a conversation about coverage gaps, like which classes of vulnerabilities our existing tools miss, or which attack surfaces we should be monitoring more closely.
What surprised me was that vulnerability discovery didn't come up as a pain point. These teams have been receiving bug reports for years through bug bounty programs, internal static analyzers, and their in-house red team. From my perspective as someone excited about offensive security research, it's genuinely impressive how AI has accelerated the process from "start digging" to "submit PoC." But from an enterprise perspective, that just means the inbound queue grows faster.
The bottleneck has shifted from finding vulnerabilities to triaging them: determining which are exploitable, which are severe, and which need immediate attention.
The stakeholders I interviewed were clear about when they'd buy an AI security tool:
- Does it save us money on bug bounty?
- Does it integrate into code review without adding friction?
- Does it reduce manual triage labor?
However, I should note an important caveat: the triage bottleneck might not be purely a technology problem. In many organizations it's fundamentally an organizational challenge: communication between security and engineering teams, and incentives aligned around remediation timelines. If the root cause is organizational, no amount of AI tooling will solve it. Defensive AI tools work best when there's already organizational alignment on security priorities; they amplify good processes rather than fix broken ones.
The Case for Offensive Security Tooling Still Exists
To be clear, AI-powered offensive security tools have genuine technical merit and a viable business model. We should just rethink their target market.
My two cents: the business model for AI offensive security tools likely mirrors the pre-AI model, where the customer is the offensive security researcher, not the enterprise requesting a security audit.
These tools are best suited for individuals and specialized firms. Individual offensive security researchers can integrate AI tools into their workflow to find bugs faster and cover more attack surface. Pentesting firms can use them to deliver audits more efficiently.
Enterprises are the wrong fit. When enterprises buy offensive security tools, they're essentially bringing an external pentesting capability in-house. But this only makes sense if they have the security maturity to maintain an internal red team and vulnerability discovery is actually their bottleneck, which is a rare combination. By the time you're mature enough to justify a dedicated in-house red team, you're typically already drowning in vulnerability reports.
An open question that keeps me up is this: could we have a world where offensive AI tools become so good that they fundamentally change the economics of bug bounties? What happens if/when the very business model I'm describing, of outsourcing offensive security through bug bounties, becomes destabilized by the same AI advances that make these tools so powerful?
Individual researchers within an enterprise might still purchase these tools for personal use; people are excited to demo and experiment with new tooling, and if enough of them adopt one, it can bubble up into an organizational purchase. But the enterprise itself is unlikely to buy these tools top-down as an organizational solution. For most companies, offensive security remains better as a service (bug bounty programs, contracted pentests) than as a permanently deployed tool.
What Better Evaluation Looks Like For Defensive Tools
Evaluations for offensive security tools are objective and atomic: "Did you find the vulnerability, yes or no?"
Meanwhile, defensive security tools are harder to build and harder to sell. You can't demo them at a conference, you can't publish impressive benchmark numbers, and you have to integrate with each customer's individual workflow and prove you're reducing their manual labor, which takes a lot of glue work per customer.
When you look at how security teams evolve, companies build out blue teams before they build out red teams.
- Stage 1: For small companies, pre-product-market-fit, security is barely on their radar. Why spend the time auditing the codebase for vulnerabilities when the entire codebase might be rewritten next month? The existential question isn't “Will we be breached tomorrow?” but “Will we build something valuable enough that anyone would bother breaching it?”
- Stage 2: When the company grows in importance and starts attracting bug hunters, it needs people to handle incoming bug reports. It builds a blue team for incident response, triage, and remediation. Red team work is "outsourced" to bug bounty programs.
- Stage 3: Finally, when the company's security org matures, it can afford both a blue team and an internal red team.
This progression explains the market mismatch. Most companies live in stage 2, where they have defensive needs but can't justify building internal red teams. Bug bounty programs and external pentests cover their offensive security needs at a fraction of the cost of maintaining full-time offensive security staff.
The market for defensive security tooling is quite large because every company with a security team needs triage, incident response, and remediation capabilities.
I'm particularly interested in companies transitioning from stage 1 to stage 2. These are companies that have reached sufficient scale to care about security but are just making their first security hires. They're figuring out blue team operations: how to handle incident response, how to triage incoming reports, how to coordinate remediation across engineering teams. This is where defensive tools, AI-assisted or not, could have the highest impact, helping these teams establish efficient processes from the start.
Going Forward
The technical progress in AI-powered offensive security has been remarkable. We have tools that can find vulnerabilities faster and across broader attack surfaces than ever before.
If you're a researcher building for the security market, I highly encourage you to think about defensive security tooling. And if you're evaluating AI security papers, apply the scrutiny I've outlined above to their benchmarks: CTF performance doesn't predict production deployment success, and CVE counts without context don't tell you whether a tool will work for your specific needs.
Finally, if you're a company building out your first security teams and establishing blue team operations, I'd like to talk to you. I know it's a challenging transition, and I'm curious what would genuinely make it easier. 🩵
