⭒ The Open Frontier 0/N: Who, what, Qwen, and why?
This is the first post in the Open Frontier series. If you're jumping in from the Inference Engines series, welcome!
- The Open Frontier 0/N: Who, what, Qwen, and why? <- we are here
- The Open Frontier 0.5/N: I Tried to Jailbreak an MoE Model and Learned More About MoE Than Jailbreaking
- The Open Frontier 1/N: Mixture of Experts, Meet Speculative Decoding
- Prereqs: Inference Engines 3/N
- The Open Frontier 2/N: Weakest Link or Wisdom of Crowds? Is Mixture of Experts More or Less Robust to Jailbreaking?
- Prereqs: Open Frontier 0/N, 0.5/N
    - Okay, spoiler for the question in the title: MoE is more robust. I was able to suppress refusal for the dense, very-not-MoE model; I'm still trying to get 2.5/N to work.
- The Open Frontier 2.5/N: How You Might Jailbreak a MoE Model (TBD)
    - Still working on this one... if you've successfully identified and suppressed the parts of a MoE model that are responsible for refusal, I could use a second brain. Reach out! I'd love the company, and it's your post as much as mine if we get there.
- If I resolve it by July 2026, I'll just merge this into 2/N and delete this 2.5/N.
- The Open Frontier 3/N: Open, At What Cost?
Be me, frog in a well
If you've been following the Inference Engines series, you know we've spent the last few posts building a mental model of how inference engines work. Throughout those posts, we kept the motivation the same: serve many users.
Make good numbers go up, make bad numbers go down, and our users are happy bappy.
Now, there's a Chinese fable about a frog who lives at the bottom of a well, and I quite feel like that frog. As the frog, the well is my world, and within that world, life is simple and good. Make good numbers go up, make bad numbers go down, and somewhere out there, our users are happy bappy.
This makes me a happy bappy frog. :)

But as much as we love being happy bappy frogs at the bottom of the well, there's always a time when we must hop out, even if the sky makes us feel very small.
What I never quite explained from all that talk about inference engines is who we imagined running these inference engines. You might have pictured the infrastructure behind the closed doors of a lab like OpenAI or Anthropic, and that would be a completely reasonable assumption since those are the names everyone knows, and those are the products most people mean when they say "AI."
You wouldn't be completely wrong either, since what we talk about is relevant to model serving at scale anywhere.
However, there's a whole other world of infra I'd like to talk about: open source inference engines, the ones teams reach for when they want to self-host a model. Teams who, for reasons of cost, privacy, customization, or some other guiding principle, want to run the model themselves on their own hardware, rather than route everything through a closed API and hand money and training data to the Anthropics and OpenAIs of the world.
Which raises the question: self-host what?
Now, one thing I also never explained was why Qwen specifically. Throughout the inference engines series, Qwen was our target model, and I just sort of... assumed that was obvious, when it wasn't. So while we're hopping out of the well, let me give some context on Qwen too.
What is Qwen and why are people using it?
In short, Qwen is increasingly the model people reach for when they want frontier-level capability but don't want to pay frontier-level prices. Qwen is a family of large language models developed by Alibaba's DAMO Academy. Affectionately named Qwen after 千问 (qiānwèn, aka "thousand questions"), it's Chinese in origin, open-weight, and (!) surprisingly almost on par with frontier models for day-to-day reasoning tasks.
Two years ago, I tried self-hosting Qwen on Ollama; I pulled a tiny model, ran it on my CPU, and found it useful mostly for telling me knock knock jokes.
To be fair, I was running the wrong model on the wrong hardware. With a 0.5B-parameter model on a CPU, I might as well have told Lewis Hamilton he can't drive because he got stuck behind a school bus.
The other analogy I considered was "handing Max Verstappen a tricycle and concluding he's not that fast," hahah. I've sat through enough Formula 1 conversations with past (...and present) coworkers to have material for an analogy, so glad it finally came in handy.
Today, the larger Qwen models, self-hosted on a decent GPU, are a very different story.
I know people who self-host Qwen specifically for code review and lightweight coding tasks. It's not going to out-think a senior engineer the way the burlier frontier models might, but it catches a lot of issues that would slip past the first pass of something like Claude Code. And they're quick to mention that self-hosted Qwen responds noticeably faster than a frontier API would, too.
For that use case, it's good enough, and we take "good enough, cheap, and fast" over "better but slow and expensive" more often than you'd think. Especially if you're running thousands of inference calls for an experiment and would rather not see every API call show up on the company credit card (asking for a friend :P), or if you're a company whose data can't leave its own servers.
Cost also isn't the only incentive for self-hosting Qwen. The other attractive part of Qwen is that you get access to the model weights.
Open weight models
When you use Claude or GPT-4, you're calling a model through an API. You send text to the nethers, and you get text back. You have no access to the model itself, its weights, none of that. The model lives on someone else's servers, and you use it on their terms.
Open-weight models like Qwen work differently. You can download the model itself, and you can run it on your hardware of choice, whether it be your own machine, your own GPU cluster, or your own cloud instance.
And crucially, you can modify it, probe at it, and study how it behaves under conditions that a closed API wouldn't expose.
This changes what's possible, in two interesting ways.
First, fine-tuning. If you have a specialized task, say, a niche like reverse engineering, or writing in a dialect so specialized it barely exists in the training data (like HDLs for FPGAs or a custom kernel DSL), you can use a frontier model to generate training examples (with a human in the loop, of course, to filter and correct them to your liking), then teach those to your local copy of Qwen. You end up with a fast, cheap model that's good at the thing you need, without paying API costs every time you run it.
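The data-prep half of that workflow, keep only the human-approved examples and write them in a chat-style JSONL layout that most fine-tuning toolkits accept, can be sketched in a few lines. This is a minimal sketch, not any specific toolkit's required format; the `approved` flag and the field names are assumptions for illustration:

```python
import json

def build_finetune_dataset(candidates, out_path):
    """Keep only human-approved examples and write them as
    chat-format JSONL (one {"messages": [...]} record per line)."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in candidates:
            if not ex.get("approved"):  # the human-in-the-loop filter
                continue
            record = {
                "messages": [
                    {"role": "user", "content": ex["prompt"]},
                    {"role": "assistant", "content": ex["completion"]},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept

# Candidate examples as a frontier model might emit them, pre-review.
candidates = [
    {"prompt": "Explain this FPGA reset logic", "completion": "...", "approved": True},
    {"prompt": "Hallucinated nonsense", "completion": "...", "approved": False},
]
print(build_finetune_dataset(candidates, "qwen_finetune.jsonl"))  # → 1
```

The point of the shape: the frontier model proposes, the human disposes, and only what survives review goes into the file you fine-tune your local Qwen on.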
Second, interpretability research. If you're interested in mechanistic interpretability but don't yet work in a lab with model access, open-weight models are a great place to start. The hope is that what we learn from studying Qwen generalizes to the closed models we can't directly inspect.
Questions
Qwen, and open-weight models like it, raise some questions that I'd like to explore over the next posts.
My first question is about accessibility. The larger and more capable Qwen gets, the harder it is to run. Can we make a model this capable fast enough, and cheap enough, to be practical on self-hosted hardware? What does it take for us to get there?
My second question is about safety. Frontier labs spend enormous effort adding guardrails to their models: refusals around bioweapons synthesis, vulnerability discovery for cybersecurity, drug manufacturing, and the like. But if a comparable model is freely downloadable and locally modifiable by any sufficiently motivated person with a GPU, what do those guardrails protect?
My third is a longer-horizon question. Could a well-organized community, if it pooled compute and coordinated on training, produce an open model that rivals the largest models from closed labs? And if it did, what would that mean?
Although open models don't come with the same pressure labs face to ship a product people will pay for, that doesn't mean they carry no constraints at all. Open models still have funders, maintainers, and institutional backers, by virtue of the simple fact that a frontier training run costs tens of millions of dollars, and you have to pay that bill somehow.
This is a question I'd love to explore in a separate post. Aside from Qwen, who else is in this Open Model Cinematic Universe? Who's building these open-weight models, how do they coordinate, who do they answer to for compute and funding, and how do those incentives shape the models they release?
These questions all deserve their own posts, and they'll get them. For now, this frog has some experiments to run. :P
