⭒ The AI Security Market Has an Offensive-Defensive Mismatch
As we wrap up 2025, I've been reflecting on the progress in AI security. The offensive security side has seen real breakthroughs, with more vulnerability researchers effectively incorporating AI into their workflows and finding magnificent bugs in less time and across a broader attack surface than before.
As I look toward 2026, I'm curious: where are the solutions, AI or not, for defensive security?
Roughly, the two things I'm trying to say are:
- Firstly, current AI security benchmarks are misleading because they encourage optimization for the wrong things.
- Secondly, there's a market gap between AI security vendors selling offensive tools and enterprises that need defensive solutions.
What We Talk About When We Talk About "AI Security"
First, let's clarify what we mean by "AI security," since it's become an overloaded term. When people say "AI security," they might be referring to:
- AI for offensive security research, where you use AI to find vulnerabilities and produce exploits.
- AI for defensive security, where you use AI for incident response, triage, remediation, and threat detection.
- Attacking systems that have AI/ML models deployed, where you use traditional security research techniques, not necessarily AI tools, to find vulnerabilities in those systems.
- AI safety/alignment, the kind of work that some frontier labs brand their mission around and that alignment researchers do, where the LLM itself is the system being investigated. The difference between this fourth tribe of researchers and the third is that they consider a broader threat model and longer time horizons, asking questions like "could AI be misused to develop bioweapons?" or "what happens when nation-state actors get involved?" whereas the third tribe focuses more on immediate, concrete vulnerabilities like "can I get a shell?" and "can I publish a CVE?" within the traditional CIAAN quintet (Confidentiality, Integrity, Availability, plus Authentication and Non-repudiation).
This post focuses primarily on the first two tribes, that is, AI as a tool for offensive vs. defensive security work. The last two tribes are equally important, just outside the scope of this post.
What I've noticed is that a lot of eyeballs and innovation and funding have gone towards the first category, while enterprises are asking for solutions in the second.
The Benchmarking Problem
Suppose you're a researcher building an AI security tool. You've built or trained an AI agent that can analyze code and find vulnerabilities. It works! In your testing, it's finding real bugs.
Now you need to convince potential customers that your tool is worth deploying.
How do you prove your AI agent is "good at security"?
You need benchmarks. Measurable, reproducible proof that your system works. So you look at what the research community and other startups are using:
Take 1: CTF Solver ⇏ "Deploy In Production"
The most common benchmark I see in papers is performance on Capture The Flag (CTF) challenges, computer security competitions where participants attempt to find text strings, called “flags,” from purposefully vulnerable programs for competitive or educational purposes.
CTFs are attractive as benchmarks because they're self-contained, objectively gradable, and great for demos.
First, I want to say that there is nothing wrong with measuring LLM performance on CTF challenges. Just as we teach LLMs to write poetry to satisfy our creative urges, why not teach them to solve CTFs for the same reason? Personally, I'd rather see more LLMs writing poetry than optimizing ad click-through rates, and I feel the same way about these experiments.
My problem is when CTF benchmarks are used to claim that AI agents should be deployed in production security roles. The distribution of vulnerabilities in CTF challenges simply doesn't match the distribution of vulnerabilities in real codebases. Let's celebrate these achievements for what they are: beautiful demonstrations, not justifications for deployment.
Take 2: CVE Discovery is Better, But Still Not Apples-to-Apples
The more credible benchmark for production codebases is CVE discovery, where you find vulnerabilities in the wild and get them assigned official CVE identifiers. Unlike synthetic benchmarks, CVE assignments require verification by the maintainers of the affected software.
This has become a convention for vendors to establish credibility: "We found 15 critical vulnerabilities in the codebases of XYZ famous companies."
But a dilemma arises when each vendor finds CVEs in different codebases. From an enterprise buyer's perspective, CVE count doesn't answer the question: "Will this solution find vulnerabilities in my codebase?" CVE count alone doesn't tell you the severity or real-world prevalence of each vulnerability, or whether it's a relevant threat given how your company deploys the software.
Without an apples-to-apples comparison, most buyers end up playing it by ear, relying on personal networks ("I know this security researcher is good") or secondhand recommendations ("My friend says those vendors are legitimate").
Take 3: Why Not A Wikipedia of CVEs?
Given the limitations of CTF challenges and the inconsistency of vendor-specific CVE discoveries, a natural question emerges: why not curate a comprehensive, standardized benchmark of CVEs that buyers can use to compare AI security tools?
Well, public CVE benchmarks can be gamed. LLMs memorize solutions, training data gets polluted, and there's no way to guarantee a vendor didn't optimize their system with scaffolding tailored specifically for great performance on that particular benchmark.
What Practical Evaluation Looks Like
So, instead of investing effort into cataloging and curating a set of CVEs to represent the buyer's codebase, why not just run the AI security tool directly on the buyer's codebase and see what it finds?
When I started asking around among security teams ("I'm looking into better AI security benchmarks, does anyone have leads?"), I found others with the same questions scattered across different teams, and was pleased that what started as my personal side interest in better benchmarking had found a home in my organization.
We chose a particular production service as the test target because we had ground truth: years of known vulnerabilities reported through bug bounty and internal tools on this codebase.
When we run vendors through this evaluation, we can measure how many historical vulnerabilities they found, whether they discovered any new vulnerabilities we missed, and how different vendors compare on the same codebase.
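To make that concrete, here's a minimal sketch of how such a run could be scored, assuming each finding is normalized into a comparable record; the field names and the matching rule (file plus vulnerability class) are illustrative assumptions, not anyone's standard:

```python
# Hypothetical scoring of one vendor's findings against a ground-truth set of
# historical vulnerabilities. All names here are made up for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str        # e.g. "services/auth/session.py"
    vuln_class: str  # e.g. "sql-injection"


def score_vendor(vendor_findings: set, ground_truth: set) -> dict:
    rediscovered = vendor_findings & ground_truth  # known bugs the tool found again
    novel = vendor_findings - ground_truth         # candidates we hadn't seen; still need human triage
    missed = ground_truth - vendor_findings        # known bugs the tool failed to surface
    return {
        "recall": len(rediscovered) / len(ground_truth) if ground_truth else 0.0,
        "rediscovered": len(rediscovered),
        "novel_candidates": len(novel),
        "missed": len(missed),
    }
```

Because every vendor runs against the same codebase and the same ground truth, the resulting numbers are finally comparable across vendors.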
This kind of standardized evaluation benefits everyone. In an AI security market where desperation to prove capabilities has produced plagiarized CVEs and false reports, it helps enterprises build portfolios of trustworthy security auditing partners, and helps legitimate vendors demonstrate their technical edge.
Market Mismatch Between Offensive and Defensive Security
I assumed we'd use these evaluations to identify vendors who covered our blind spots, like which classes of vulnerabilities our existing tools miss, and which attack surfaces we should be monitoring more closely.
So, when I sat down with engineering managers, technical program managers, and security engineers to talk through their wishlist for AI security tools, I expected a conversation about exactly that: coverage gaps.
What surprised me was that vulnerability discovery didn't come up as a pain point. These teams have been receiving bug reports for years through bug bounty programs, internal static analyzers, and their in-house red team. From my perspective as someone excited about offensive security research, it's genuinely impressive how AI has accelerated the process from "start digging" to "submit PoC." But from an enterprise perspective, that just means the inbound queue grows faster.
The reason becomes clear when you consider company maturity. I'm operating within what I'd call a Stage 3 company, under a framework of my own invention:
- Stage 1: For small companies, pre-product-market-fit, security is barely on their radar. Why spend the time auditing the codebase for vulnerabilities when the entire codebase might be rewritten next month? The existential question isn't “Will we be breached tomorrow?” but “Will we build something valuable enough that anyone would bother breaching it?”
- Stage 2: When the company grows in importance and starts attracting bug hunters, they need people to handle incoming bug reports. They build a blue team for incident response, triaging, and remediation. Red team work is "outsourced" to bug bounty programs.
- Stage 3: Finally, when the company's security organization matures, they can afford both blue teams and internal red teams. These teams proactively build tools to catch bugs and stress-test their systems and products to improve their security posture.
At Stage 2+, the bottleneck has shifted from finding vulnerabilities to triaging them: determining which are exploitable, which are severe, and which need immediate attention.
This explains why enterprises are the wrong fit for offensive security tools. When enterprises buy offensive security tools, they're essentially bringing an external pentesting capability in-house. But this only makes sense if they have the security maturity to maintain an internal red team (like, Stage 3) and vulnerability discovery is their bottleneck, which is a rare combination. By the time you're mature enough to justify a dedicated in-house red team, you're typically already drowning in vulnerability reports.
Of course, individuals within enterprises often experiment with offensive tools on their own and find genuine value. But while individuals can see the tool's value in their own work, the enterprise sees it as accelerating a process that isn't their current bottleneck ("you mean to say...we need to find even more vulnerabilities?").
To be clear, AI-powered offensive security tools have genuine technical merit and a viable business model. We should just rethink their target market.
The business model for AI offensive security tools likely mirrors the pre-AI model: companies outsource offensive security through bug bounties, so offensive security tools are sold to the vulnerability researchers and pentesting firms doing that work, not to the companies paying for it.
⭒ An open question that keeps me up is this: could we have a world where offensive AI tools become so good that they fundamentally change the economics of bug bounties? A world where Stage 2 companies can run these tools in-house for a fraction of bug bounty costs, and bug bounties become reserved for once-in-a-blue-moon, truly novel exploits that require human creativity.
Why Defensive Security Tools Are Hard to Build
We build what we can measure. Defensive security tools are hard to build because they're hard to measure.
The stakeholders I interviewed were clear about when they'd buy an AI security tool:
- Does it save us money on bug bounty?
- Does it integrate into code review without adding friction?
- Does it reduce manual triage labor?
Okay, great, how do we evaluate this?
For all the difficulty of evaluating offensive security tools, we're at least blessed that offensive security evaluations are objective and atomic: "Did you find the vulnerability? Yes/no? Does the PoC work? Yes/no?"
I believe this is why offensive security is attractive: there's a clear "scoreboard" of sorts. You can demo this, you can publish this, you can show concrete wins before your next funding round, your next conference deadline, your next quarterly review.
Meanwhile, defensive security tools are hard to evaluate and hard to distribute. You can't demo them at a conference, you can't publish impressive benchmark numbers, and you have to integrate with each customer's individual workflow and prove you're reducing their manual labor, which takes a lot of glue work per customer.
A lot of schlep! And this schlep work won't be done before your next funding round, before the conference deadline, before your quarterly review.
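To give a feel for that glue work: even the narrow question "does it reduce manual triage labor?" means pulling data out of each customer's ticketing system. Here's a hedged sketch, with hypothetical ticket fields; every customer would need their own adapter:

```python
# Hypothetical sketch of per-customer measurement: did median triage time drop
# after the tool was deployed? Ticket fields are illustrative assumptions.
from datetime import datetime
from statistics import median


def hours_to_triage(ticket: dict) -> float:
    """Hours between a report landing in the queue and a severity decision."""
    opened = datetime.fromisoformat(ticket["opened_at"])
    triaged = datetime.fromisoformat(ticket["triaged_at"])
    return (triaged - opened).total_seconds() / 3600


def triage_labor_delta(tickets_before: list, tickets_after: list) -> float:
    """Median triage hours before minus after; positive means labor was saved."""
    before = median(hours_to_triage(t) for t in tickets_before)
    after = median(hours_to_triage(t) for t in tickets_after)
    return before - after
```

And that's one metric, for one customer, before you've even touched bug bounty spend or code review friction.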
Maybe defensive security doesn't need a scoreboard. Maybe the schlep work is the point, and the reason there's opportunity for those willing to do it.
There exist tools that attempt to address defensive needs, such as AI-powered SOC analysts and automated remediation tools (CodeMender, Autofix, AI Fix, etc.). Loosely, these are systems that generate code patches and summarize reports.
I believe the reason these tools struggle with adoption is that the LLM-solvable portions, writing patches and summarizing documents, make up only a small fraction of the actual work in incident response and remediation.
The real bottlenecks lie elsewhere, such that even if these tools achieve 99% performance on their targeted 5% of the workflow, they don't address the 95% that determines remediation timelines.
To fill in the missing 95%, I'm particularly interested in companies transitioning from Stage 1 to Stage 2. These are companies that have reached sufficient scale to care about security but are just making their first security hires. They're figuring out blue team operations for the first time, and this is where defensive tools, AI-assisted or not, could have the highest impact, helping these teams establish efficient processes from the start.
I should note an important caveat: the triage bottleneck might not be purely a technology problem. In many organizations, it's fundamentally an organizational challenge of communication between security and engineering teams and of aligning incentives around remediation timelines. If the root cause is organizational, no amount of AI tooling will solve it. Defensive AI tools work best when there's already organizational alignment on security priorities; they amplify good processes rather than fix faulty ones.
Going Forward
The technical progress in AI-powered offensive security has been remarkable. We have tools that can find vulnerabilities faster and across broader attack surfaces than ever before.
If you're a researcher building for the security market, I highly encourage you to think about defensive security tooling. And if you're evaluating AI security papers, apply the same scrutiny to their benchmarks that I've outlined above.
Finally, if you're a company building out your first security teams and establishing blue team operations, I'd like to talk to you. I know it's a challenging transition, and I'm curious what would genuinely make it easier. 🩵
