EPISODE 2026-06-22

AI:AM LIVE — June 22, 2026 — The State of AI Engineering and the Loop as the Moat: Shawn Wang (swyx)

The open tracked the Trump administration's export-control action against Fable and Mythos — Nathan opened with a weekend reflection calling for more cognitive empathy toward the administration's AI engagement — then recapped Dean Ball's move from the White House OSTP to OpenAI's new Strategic Futures team, and Prakash flagged GLM 5.2 from Zhipu as dominating weekend discourse as the first open-weights model that could pass as a daily driver. The show then ran a long-form interview with Shawn Wang (swyx) — who coined 'the AI Engineer' as a job title in 2023 — on the state of AI engineering in mid-2026: software factories superseding coding agents, FrontierCode and the 'unmergeable slop' problem in SWE-Bench, harness engineering as the new moat, and the AI Engineer World's Fair (June 29–July 2, San Francisco). The hosts closed with a live 'Guess the Markets' game — drawing questions from Polymarket, Kalshi, and Manifold on AI prediction markets — that revealed wide divergences between host intuitions and market consensus on topics ranging from Anthropic's Chatbot Arena dominance to Google's surprising 73% edge in math-model forecasts.


The June 22 show opened in the shadow of the Trump administration's ongoing export-control action against Fable and Mythos. Nathan led with a weekend reflection: after Judd Rosenblatt publicly called out the AI safety community for a lack of cognitive empathy toward the administration, Nathan concluded he had himself been somewhat uncharitable — even if export controls are the wrong tool, the administration is engaged and taking AI seriously. Prakash extended the theme with Warren Buffett's 29-year lag before investing in Google, illustrating just how vast the gap between insiders and outsiders remains. The opening also touched on Dean Ball's move from White House OSTP to lead OpenAI's new Strategic Futures team, and GLM 5.2 from Zhipu, which dominated weekend discourse as the first Chinese open-weights model widely described as a genuine daily driver.

The centerpiece was a long-form interview with Shawn Wang — swyx — who coined 'the AI Engineer' as a job title in his 2023 essay and now curates the AI Engineer World's Fair while doing evaluation research at Cognition. The conversation ranged from FrontierCode and the 'unmergeable slop' problem to software factories, harness engineering as the new moat, the Illuminati-group-chat read on synchronized lab IPOs, and career advice for a generation of engineers watching AI automate entry-level code. The show closed with a live 'Guess the Markets' segment — twelve AI prediction-market questions from Polymarket, Kalshi, and Manifold — that surfaced wide divergences between host intuitions and market consensus.

The rundown

11:25 · Opening · 31 min
Opening: Cognitive Empathy, Dean Ball to OpenAI, GLM 5.2
Nathan opened with a weekend reflection calling for more cognitive empathy toward the Trump administration's AI engagement and recapped his Cognitive Revolution conversation with Dean Ball (who is joining OpenAI's new Strategic Futures team on July 6). Prakash flagged GLM 5.2 from Zhipu, which dominated weekend discourse as the first Chinese open-weights model widely described as a genuine daily driver.
As aired
Nathan and Prakash opened the June 22 show by processing the aftermath of the Trump administration's export-control action against Fable and Mythos. Nathan led with a weekend reflection: after Judd Rosenblatt publicly called out the AI safety community for a lack of cognitive empathy toward the Trump administration, Nathan concluded that he himself had been somewhat uncharitable. Even if the administration's tactics — export controls, a sledgehammer approach — are not ideal, they are engaged and taking AI seriously, and the right response from the safety community is to be constructive rather than dismissive. Prakash extended the theme by illustrating how vast the gap is between insiders and outsiders on AI: even Warren Buffett, who spent 29 years not investing in Google because he categorized it as a tech company rather than a billboard monopoly, only came around in 2026. The distance between the bubble and the rest of society suggests the administration's clumsy moves may simply reflect genuine unfamiliarity rather than bad faith.
Nathan then recapped his extended Cognitive Revolution conversation with Dean Ball, who is leaving OSTP to join OpenAI's new Strategic Futures team on July 6. Ball — the lead architect of America's AI Action Plan — believes the OpenAI role will be more consequential than his government work, in part because governing recursive self-improvement requires being close to where it is actually happening at the research level. Nathan noted the irony that Ball simultaneously advocates for broad diffusion over nationalization while acknowledging that effective frontier-AI policy may require inside access that is increasingly hard to obtain from outside. Speculation circulated over the weekend that Anthropic may have completed a new internal model (informally called 'Mythos 2'), raising uncomfortable questions about whether labs can meaningfully self-govern when their most powerful systems are not publicly disclosed.
Prakash then pivoted to GLM 5.2 from Zhipu (which he called Knowledge Atlas), the Chinese open-weights model released June 16 that dominated weekend discourse. A post by former Meta and Google DeepMind VP Matt Veloso — declaring it the first open model that passes as a daily driver — went to roughly 1.5 million views partly after Prakash amplified it, briefly spiking the company's Hong Kong stock. Elon Musk suggested Chinese models would catch US frontier models in 3–4 months; Zhipu's cofounder said it would happen sooner. Prakash argued the administration faces a narrow window — roughly 10 months from Mythos's February preview to a plausible Chinese parity point — and that its ability to slow things down is limited by public opinion, social media dynamics, and the sheer credibility of the researchers involved, including AlphaFold co-creator John Jumper, who recently joined Anthropic. Nathan offered two qualifications: AI is broadly unpopular with the public, which may give the administration more political cover to override Nobel Prize winners than Prakash assumed; and GLM 5.2 is credibly reported to carry significant Claude influence in its post-training, meaning a Fable ban could actually slow Chinese models' path to Fable-level performance as well.
Key moments
I think I have to say I was probably a little too harsh. We should probably be a little more constructive, a little more inclined to use this as an opportunity to engage and educate, and not ridicule because the level of understanding doesn't rise to match our own.
Nathan Labenz · 16:41
Unless you have access to the inside information on just how recursive self-improvement is really going where it's being pushed the hardest, it's pretty hard to make good policy recommendations from the outside if you're just reading the tea leaves.
Nathan Labenz · 31:04
This is one part of the 1% where he has some inside information. The administration banning Fable probably accelerated Anthropic, because all of the GPUs that would have been devoted to inference for Fable would then have been freed up to train the next version of the models.
Prakash · 35:44
What we covered
Cognitive empathy for the administration — are export controls the wrong tool but the right instinct? After Judd Rosenblatt called out the AI safety community for a lack of cognitive empathy toward the Trump administration, Nathan acknowledged on reflection that he had probably been too harsh. Even if export controls are a sledgehammer, the administration is engaged and taking AI seriously — and the constructive response from the safety community is to educate rather than ridicule. Prakash extended the point with Warren Buffett's 29-year lag before investing in Google, illustrating how vast the bubble-to-outside gap remains.
Dean Ball goes inside — a frontier lab hires the man who wrote the rules. Dean Ball — primary architect of the Trump administration's AI Action Plan and until last week a Senior Policy Adviser at White House OSTP — is joining OpenAI to lead a new team called Strategic Futures, reporting to chief strategy officer Jason Kwon. The stated mandate is to look 6–12 months out at catastrophic risk, recursive self-improvement, labor disruption, and lab–government relations. Nathan recapped his extended Cognitive Revolution episode conversation with Ball, noting Ball's view that the OpenAI role will be more consequential than his government work because effective frontier-AI policy may require inside access.
Dean Ball on the move (X) ↗
GLM 5.2 dominates weekend discourse — the first Chinese open-weights model called a genuine daily driver. Zhipu's GLM 5.2 (open weights, MIT license, released June 16) dominated the weekend feed after Matt Veloso — former Meta and Google DeepMind VP — declared it the first open model that passes as a daily driver, a post that briefly spiked Zhipu's Hong Kong stock. Elon Musk suggested Chinese models would reach US frontier-level in 3–4 months; Zhipu's cofounder said sooner. Nathan qualified that GLM 5.2 is credibly reported to carry significant Claude influence in its post-training, meaning the Fable ban could actually slow Chinese models' path to Fable-level performance.
Nathan Lambert on GLM 5.2 in coding harnesses (X) ↗
Full transcript
Lightly edited · timestamps jump to YouTube
11:26
Prakash: Good morning. It is Monday, June 22, 9:02 AM. Nathan, good morning. How are you?
11:33
Nathan Labenz: Good morning, Prakash. I'm pretty good. I had a lovely little Father's Day yesterday with my family and my dad, and I'm feeling refreshed and ready for a five-day sprint this week. How about you?
11:49
Prakash: I'm good. And, surprisingly, we didn't have that much new news over the weekend. I saw Peter Wildeford had a comment: it's been eight days without any new AI news.
12:07
Nathan Labenz: Yeah. Well, we're all kind of waiting with bated breath to see what's going to happen with this whole Fable situation. I wouldn't say we got no news — we did get GLM 5.2, which is something we'll have a chance to talk about a little bit. But one thing I wanted to start off with: I've been thinking over the weekend about Judd Rosenblatt's comments from last Thursday, in which he called out the AI safety community broadly for a lack of cognitive empathy for the Trump administration.
12:52
Nathan Labenz: And there's a tendency — you could say in the AI safety community, you could say in the effective altruism spaces — to sometimes have a sort of circular firing squad dynamic and hold allies to unreasonably high standards. This happened a little bit with the Pope too, when the encyclical was dropped and people were saying, well, there's a lot good here, but he's not taking the possibility of AI consciousness seriously enough. And it was like, hey — count your blessings. You've got a Pope that's engaged. That's not to be taken for granted.
13:37
Nathan Labenz: And I think I probably have to say, on reflection, I consider myself a bit guilty of the offense that Judd was calling out. I'm still not sure exactly where I land on it, but I made a song for our weekly highlights episode that tries to channel the perspective of the Trump administration — even Trump himself, sung in the first person. And for how weird a mix of positions Anthropic has taken, it's certainly weird for anyone who's not been in the AI safety discourse for a long time and really understood the nuances of why they think this is so dangerous but feel they need to be the ones to do it.
14:23
Nathan Labenz: I think, yeah, a little more cognitive empathy for the administration right now. If you're not that deep in this stuff, and all of a sudden it's happening very quickly, and you've got a ton of other things on your plate — and this one company keeps saying that it's super dangerous and you need to do something, and then they put out a model and say it's too dangerous to release, and then they release it — I don't think that they've shown tremendous sophistication in terms of how well they've understood or handled it.
15:09
Nathan Labenz: But I do feel like maybe I was a bit harsh, and maybe I should be giving more credit where it's due. Even though the tactics used, the sledgehammer of export controls, don't really seem like the right way to go about this, they're engaged, they're doing something, they're taking it seriously. And maybe that is a clumsy but ultimately praiseworthy first step.
15:55
Nathan Labenz: That's where, after a weekend of thinking about it, I come out. And certainly a lot of questions still remain unanswered. I do not really want to see them take this deeper and deeper into the national security apparatus and make everything secret. I do think diffusion is going to be important — that was probably one of the biggest takeaways from my extended conversation with Dean Ball last week, which went out as a Cognitive Revolution episode. So you want to watch closely what they're doing going forward, and don't want to act like just because they're doing something that it's necessarily good. But we're in such early innings of the government really engaging and taking this seriously at all that—
16:41
Nathan Labenz: —I think I have to say I was probably a little too harsh, and we should be using this — to the degree that the AI safety community is serious about, hey, this thing is taking on a life of its own and the whole phenomenon risks getting out of control — we should probably be a little more constructive, a little more inclined to use this as an opportunity to engage and educate, and not ridicule because the level of understanding doesn't rise to match our own.
17:14
Prakash: I am often extremely shocked at how inside the bubble we are. Because even when I speak to other people in Silicon Valley who are software engineers, the distance between being fully inside the bubble and outside the bubble is vast. And it starts to get even bigger as you go further away from the bubble. At some point, you get to people who believe all of tech has generally been a scam for about twenty years.
17:59
Prakash: I've definitely come across many of those people. For example, I will give you one evident example: Warren Buffett. Warren Buffett did not invest in tech until he went into Apple around 2014. And then Charlie Munger came out three or four years later saying he regretted that they didn't invest in Google, because GEICO was paying Google billions of dollars a year in ad fees—
18:45
Prakash: —and they just didn't catch it. They didn't realize it was basically the same as a billboard business, except Google owned all the billboards. Because they categorized it as a technology company, they thought it could be disrupted by other technology. What they didn't realize was that Google was basically a billboard company, and they owned all the billboards. So it's very surprising that, very recently, Warren Buffett has made a big investment — $10 billion — in Google's latest $80 billion equity raising, and it looks like he's going to back them going forward. But it just shows you how—
19:30
Prakash: —much of a distance there is. Google was founded in 1998. It took Buffett basically 29 years to get over the idea that this is not really a technology company — it's a billboard company, and everything that existed in the physical world has now transitioned to the digital world. And I think that just shows you what kind of—
20:15
Prakash: —distance there is, and how much time it might take for the rest of society to acclimate to the idea that everything is going to move from heuristic-based software coded by engineers to AI running things without engineers doing much more than making up a few rules. It's very difficult, I think, for most people to grasp and accept that as a way the world will be run. It's always nice to have this idea that there's someone responsible making these rules, rather than the AI just putting things together and — whoa, that's your world. And I think that's part of the story of human disempowerment that people in the AI safety field express.
21:00
Prakash: We had a good guest last week, Liron Shapira. Liron was, as you pointed out, on the side that whatever the Trump administration did, he was happy — even if they might have been a bit ham-handed, he was happy that they'd come in and made a position, because he wanted a pause and he saw that as a legitimate way for one to happen. So how did you feel post the Liron segment? Was your opinion changed by Liron?
21:47
Nathan Labenz: For some reason, Judd's comments hit me harder. But I do listen to Liron quite a bit as well, and he's had the influence on me in the past of making sure I don't put my big-picture worries too far in the background. For whatever reason, I guess I've just had so many experiences with the Trump administration where I feel like they often have the wrong answer to the right question. And I do feel like that has happened enough times that I'm a little burned out on it.
22:32
Nathan Labenz: So the simple argument — hey, this may be a clumsy move, but at least it's some move — didn't quite resonate with me as much. But then the cognitive empathy argument resonated more, because it was like, okay, yeah, maybe it is on us to meet them where they're at and really try to be constructive as opposed to just being critical. Even if we can imagine better alternatives here — as Biden used to say, don't compare me to the Almighty, compare me to the alternative — we're in this moment and making the best of it. I think Judd kind of—
23:18
Nathan Labenz: —usefully chastised me for not being as constructive initially as I might have been.
23:30
Prakash: Let me segue here to a friend of the pod, Dean Ball, who has actually been very constructive in both his criticism and commentary of the administration and of the frontier labs. So you had a chance to sit down with Dean Ball, who was part of the Trump administration for some period as an adviser, and then left, went back to his Substack — and then, about three months later, last week, announced that he was joining OpenAI to head a team focused on strategic futures. So you had a chance to sit down with Dean.
24:16
Prakash: What is his perception? What is he about to do? What is his focus going to be?
24:22
Nathan Labenz: Well, he's really stepping into one of the more pressurized environments in the world today. And it was striking to hear him say that he thinks this role will be more important, more impactful, higher stakes than his work in the Trump administration. I think probably everybody knows this, but he was essentially the lead author of America's AI Action Plan last year — even though it's not signed in his name. And somehow he is walking on water. I mean, how many people can you name who went into the Trump administration, left, and came out with generally higher esteem than they went in with? That is not very common.
25:08
Nathan Labenz: But he managed to pull that off. And now he's actually not going to be in a day-to-day government relations role, which you might naively think — okay, this guy came out of the Trump administration, now he's going into industry, he's going to be sort of a lobbyist. He said no, it's not like that at all. They have a government affairs team that responds to all kinds of state bills and the many things coming at them from governments both in the US and around the world. And they're the ones really lobbying for and against specific policies.
25:54
Nathan Labenz: His new team — Dean's team, which is getting started now — is going to be more focused on frontier AI policy, which only sort of exists. There have been some moves at the state level that Dean and others have approved of, including transparency bills, things like the RAISE Act that Alex Bores championed in New York and which has now obviously become a big focal point in his congressional race. But there's not too much at the federal level, and especially not a lot when it comes to governing recursive self-improvement and governing internal deployments—
26:39
Nathan Labenz: —of the latest and greatest models. One interesting story that came out over the weekend — which I think is unverified, and my guess is it's somewhat overstated — was that Anthropic has finished another model, something like Mythos 2 or whatever. This popped up; I'm not super familiar with the person whose post I originally saw on it. I saw another commenter say this person is super credible. Certainly on timelines alone — if they had an early Mythos earlier this year, then six months later they probably should have a meaningful v2.
27:23
Nathan Labenz: Timeline-wise, that makes sense. From what I've heard from people at Anthropic, they still don't have Mythos or Fable access internally — it's still off for nearly all of them inside the company as of now. So this raises an interesting question: what has not been asked directly, and what I haven't been told, is whether there's another non-public model that they do have access to. But this is where it starts to get super interesting and—
28:08
Nathan Labenz: —fraught. Right? I mean, Dean, on one hand, said he doesn't want to see this stuff go all super secret. He doesn't want the government to try to monopolize AI. He wants to see broad diffusion. He doesn't want to see nationalization. We're going to need diffusion to counter the state's impulse to try to control this technology — because only if it's diffused will there be a broad enough constituency saying, hey, we don't want this to be nationalized, we need it. So that's one big force. But then this other force is: the government is going to have a hard time regulating a model that they don't necessarily know exists, or don't know the details of, and that's not being—
28:53
Nathan Labenz: —put out publicly. So there's certainly some — and again, I have no idea if this is even true, although it stands to reason — but I have no idea what's going on internally at Anthropic. Would they even deploy a Mythos 2 internally while having Mythos 1 and Fable off for their own internal users? That would be a bit of a sleight of hand. It would definitely be a little weird. But I do think we're headed into weird times. So Dean's going to be at the center of all this, trying to figure out: what should we do? How does society — how does OpenAI — make good decisions in a regime—
29:38
Nathan Labenz: —where we don't know how fast this recursive self-improvement thing is going to go? We don't know how far it will go. We don't know how the government's going to react if we bring forward more powerful models. We may not — if we're told to take them down, we sort of feel like we have to. Do we want to risk that with our new models that are even better, that we really want to use because we're hitting some sort of 8x coding output escape velocity? It is getting a little bit uncomfortably AI 2027-like in some of these ways. We covered a lot of ground in that conversation — more than two and a half hours. He did comment a bit about OpenAI's—
30:24
Nathan Labenz: —relationship with Alex Bores. He had some interesting thoughts on the corrigibility-versus-character debate, which — despite the fact that he's going to OpenAI — he intuitively said he comes down on the side of character as the more promising path to getting durably and robustly good behavior from AIs going forward. But he also said one big thing: he hasn't started in the job yet. And a big part of the reason he feels he needs to go in and take this role — this has been a huge theme for us the last few weeks as well — is that unless you have access to the inside information—
31:04
Nathan Labenz: —on just how recursive self-improvement is really going where it's being pushed the hardest, it's pretty hard to make good policy recommendations from the outside if you're just reading the tea leaves or reading between the lines of what is publicly disclosed. So on many issues, he gave a pretty mature view that he doesn't expect to change much in the immediate term. But on questions like how does one govern RSI, a big part of his perspective was: well, I've got to get closer to RSI as it's actually happening at the research level. I'm going to have to be in close contact with researchers—
31:49
Nathan Labenz: —to really know what the shape of this thing is. And only with that kind of information will I be able to do a good job making policy recommendations. So it's kind of another one of these things where everything is collapsing into not that large a number of entities. Watching them closely is one thing, but being inside — in his mind — is really what it takes to be informed enough to hopefully move the needle with good policy analysis. I thought that was pretty much in keeping with everything we've been saying the last couple of weeks, but definitely a stark reminder that—
32:34
Nathan Labenz: —yeah, this is getting to be a kind of small game board in some ways.
32:42
Prakash: So let me segue a little bit before coming back to that topic. First: we have GLM 5.2. It is an open-weights model from a Chinese company, released roughly about a week ago. And over the weekend, people finally started to get access to it and use it in deployment. The first thing that happened was we have a former VP at Meta and Google DeepMind, Matt Veloso.
33:28
Prakash: He goes all day using GLM 5.2, didn't miss much, and declares it the first open model that passes the bar as a daily driver — and says: "Things are not going to be the same. Damn. Now I want to buy some serious hardware." So this goes out on Friday, and I remark on it roughly Saturday morning. At that point it has maybe 50 views or so. The post goes to 800,000 views on my side, about 1.5 million on his. The next thing that happens is this is Knowledge Atlas Technology, which is the—
34:13
Prakash: —creator of the GLM 5.2 model. And this is their stock price over the last five days in Hong Kong. You can see the spike going up on June 18, where we started making remarks. We don't trade on Hong Kong stock exchanges, but it's just a note on how quickly things move. The next thing that happens is Elon says within three to four months the Chinese models will catch up. Knowledge Atlas's co-founder chimes in and tells Elon it's not going to take that long.
34:59
Prakash: So we're at the stage now where we are probably going to see a Chinese model by end of year — which is not that long, five to six months — that gets to the Mythos-Fable level. So you have roughly February when Mythos preview was announced, and by December you're going to get a Chinese model out. The administration realistically only has about this ten-month window from announcement on the US side to when not only the government but corporations in the US need to be fully secured. Because a lot of the government in the US runs on Amazon and Google—
35:44
Prakash: —on these massive hyperscaler clouds. So you can't afford to say the NSA is just going to secure their operations and leave the rest of GovCloud and AWS unsecured. Going back to who announced the next version of Mythos — it's a poster by the name of Andrew Curran. He covers AI developments quite closely. 99% of what he announces is stuff in the current press that people aren't paying attention to. This is one part of the 1% where he has some inside information, and he does announce that he doesn't have a source he can reveal. But on the whole, it is not out of the bounds of what we'd expect. I had remarked also that the administration banning Fable probably accelerated Anthropic, because all of the GPUs that would have been devoted to inference for Fable — we're talking millions and millions of GPUs — would then have been freed up to train the next version of the models. And the final point is: does—
37:14
Prakash: —it matter that the teams internally don't have access to Mythos and Fable, when they probably have access to some next version — whatever they're calling it — which is more advanced? Does it matter if they have access to those things if they can, for example, create synthetic data to train the next model? What really is the administration going to be able to do when you have, just last week, a Nobel Prize winner — the co-creator of AlphaFold with Demis Hassabis — John Jumper, who was doing more of the biological work, jump to Anthropic. How long do you think a US administration can hold out in public opinion against a bunch of Nobel Prize winners?
38:00
Prakash: It's very difficult — if a Nobel Prize winner says I want to do this research and you're going to say I'm banning you — I think public opinion, the Supreme Court's opinion, it's just not going to fly. So I think the administration is actually fairly limited in what it can do—
38:45
Prakash: —especially in the time of social media. This administration in particular is very keyed to social media. You can influence their decisions and policy making by posting on Twitter. So I think we're in a stage where there are some people in the administration who would like to slow things down to give the government more time to adjust, but there just isn't that much time in this ten-month window. And if they don't act quickly, events will outpace them. And I expect events to outpace them as they already are.
39:30
Nathan Labenz: Yeah. I mean, I think events will definitely be likely to outpace them; I agree with that. Two little qualifications I'll offer on your statements there, and then I see swyx in the green room. Okay, we'll have swyx in just a second. The first is: don't forget that AI is super unpopular.
39:59
Prakash: Yep.
40:00
Nathan Labenz: So I am not so sure that they will feel like they can't override a Nobel Prize winner. I mean, Demis Hassabis is a Nobel Prize winner, and I think they're not afraid to bully some scientists — we've seen that over time. Especially if the public is like, yeah, this AI stuff freaks us out, we don't really like it anyway. And now these companies are all saying it's super dangerous. I'm not so sure they won't be able to hold the line for longer. And the other thing to keep in mind: it's pretty credibly claimed that GLM 5.2 has a significant amount of Claude in it—
40:45
Nathan Labenz: —in the distillation sense. It's been described as thinking that it is Claude. When you ask what model are you, it sometimes says it's Claude. It supposedly has the distinctive Claude voice. I haven't used it enough to have a real point of view on that. But I think it's not entirely obvious that the Chinese models will surpass the American models if the American models are frozen at the same point in time. Given enough time, they will — but to the degree that they are doing a lot of the post-training and refinement that makes them really useful as a daily knowledge-work driver — if—
41:30
Nathan Labenz: —that's coming from Claude, then if there's no Fable, it stands to reason that their timeline to achieve Fable-like performance could also be significantly pushed back. So a lot of fog of war there, obviously, but I do think those two things both strengthen the durability, perhaps, of the administration's current position.
41:57
Prakash: We will certainly see. Without further ado, let me bring swyx up on stage.
42:12 · Interview · 83 min
swyx on the State of AI Engineering: Software Factories, FrontierCode, and the Loop as the Moat
Shawn Wang (swyx)
Shawn Wang (swyx), who coined 'the AI Engineer' in 2023 and now curates the AI Engineer World's Fair while doing evaluation research at Cognition, joined for a 77-minute conversation on software factories superseding coding agents, the FrontierCode benchmark and the 'unmergeable slop' problem in SWE-Bench, harness engineering as the new competitive moat, the synchronized lab IPO 'Illuminati group chat' read, and career advice for engineers navigating a hard junior job market.
As aired
Shawn Wang (swyx) — the person who named 'the AI Engineer' as a job title three years ago — joined Nathan and Prakash for a wide-ranging conversation anchored in his dual roles as curator of the AI Engineer World's Fair and researcher at Cognition. The opening act covered the state of the art as swyx sees it across the 30 tracks he curated for this year's World's Fair: software factories have replaced coding agents as the organizing frame, the RAG track is now 'search and retrieval,' and continual learning sits at an unresolved fork between weight-update partisans and systems-side pragmatists who prize controllability. The Frontier Code benchmark — swyx's baby at Cognition, inspired by METR's finding that roughly half of SWE-Bench-passing PRs are unmergeable slop — anchored a long sequence on how to actually measure production-ready code quality. The results show Fable (Claude's highest tier) scoring roughly 2.5× higher than Opus at less than 2× the token cost, making the economics compelling for hard problems even if commodity work doesn't justify it.
From benchmarks the conversation widened into ecosystem structure: the adviser/router model pattern (OpenRouter Fusion, Sakana Fugu) is clever but theoretically limited because a weak model doesn't know what it doesn't know; cloud infrastructure is being rebuilt for agents at a 10–15× scaling stress level that is straining GitHub and the entire SaaS stack; and enterprises are beginning to vibe-code their own internal systems of record rather than pay for a dozen $20/month subscriptions with siloed chatbots bolted on. swyx described his own team using three parallel vibe-coded speaker-reconciliation trackers and finding that the redundancy caught more errors than a single authoritative system would have. The market-structure section ended on the Apple-vs-Android read: OpenAI and Anthropic are going vertically integrated, Microsoft is positioning as the open ecosystem, and startups like Cognition and Tasklet have to find differentiated wedges or risk acquisition.
The back half turned to macro anxieties and career advice. swyx was unusually candid about what he called the 'Illuminati group chat' read on the synchronized lab IPOs: the bear case is that insiders with private information are distributing to retail at peak AI hype, the same pattern he's seen twice before in tech cycles. He tempered it with an honest bull case — agent adoption at consumer scale is probably still 100× underpenetrated — but flagged semiconductor supply as the physical ceiling. On Google he gave a nuanced defense: hardware-and-data destiny, a Veo/text-diffusion research portfolio that rivals still can't match, and the best native data position for a personal operating agent — but a real product-gene deficit. The segment closed on the CS enrollment collapse (Stanford −42%, Berkeley −61%) and swyx's career advice to young engineers: the job market for juniors is genuinely hard, but the 'members of technical staff' at every frontier lab are 21 years old — demonstrate taste, ship interesting things, and practice 'learning in public,' the career meme that predates his AI work and that he considers his most durable contribution.
Key moments
The model is no longer just the product. Even Greg Brockman said it — it's model and harness that is the product. We on the systems side have been saying that for a while, and it used to be cope. Now it's less cope.
Shawn Wang (swyx) · 46:54
About 50% of SWE-Bench code that passes the test is completely unmergeable — it's so low quality. Yeah, technically you'll pass, but you modified files you weren't supposed to touch, or you cheated on the test. We want to guide the evolution of models toward maintainable code and against slop.
Shawn Wang (swyx) · 1:01:40
The bear case is everyone with insider information is selling, retail is buying, and this is as far as AI progress goes for the next five years. The bull case is that we seriously underappreciate the burning power of agents deployed to the whole world. Those are the two worlds we're choosing between.
Shawn Wang (swyx) · 1:41:43
Questions asked
48:26 What's the fundamental split in the continual learning space, and what do enterprises actually want from it?
swyx described a deep split between 'model people' who want to update weights and 'systems people' who prefer storing and retrieving via controllable databases — they don't respect each other. The model side offers full internalization of learned knowledge but at the cost of debuggability; the systems side gives enterprises the control they need to delete bad facts, audit what the agent knows, and avoid cross-contamination. Enterprises ultimately want cheap, perfect, and private — and that triad skews heavily toward the systems side today. The POC market with banks and Fortune 500 companies is real but still early, with the key question being whether startups can prove out weight-update approaches before their runway or their customers' patience runs out.
59:24 What is Frontier Code trying to measure that SWE-Bench can't, and what do the results tell us about Fable versus Opus?
swyx explained that Frontier Code was built in response to METR's finding that roughly half of SWE-Bench-passing PRs would not actually be merged by real maintainers — 'unmergeable slop.' Frontier Code grades on would-you-merge-this rubrics (scope discipline, style adherence, regression safety, parallelizability) using tasks that are fully out-of-sample from any training set and sourced from production-realistic codebases in multiple languages. The result is that you stop getting a one-to-two percent differentiation between models and instead see clear stratification. On the hardest tier, Fable scores roughly 2.5× higher than Opus at less than 2× the token cost, which Nathan characterized as a compelling value proposition. swyx added that on commodity problems most people actually don't need Fable, but for the hardest engineering tasks the economics clearly favor moving up.
1:21:03 Is everyone in the coding-agent space building the same thing, and how does this market shake out?
swyx said yes, in an early market everything converges toward the same commodity problem — right now 'building the Devin competitor' is the shared obsession, just as earlier everyone built VS Code forks. He compared it to the luxury handbag market: it looks homogeneous from outside, but mature markets develop clear segments. He expects a few winner-take-most platforms and characterized the structural dynamic as Apple (OpenAI, Anthropic — vertically integrated) versus Android (Microsoft — open ecosystem). Smaller startups like Cognition and Tasklet need to find genuinely differentiated wedges or risk acquisition; the key starting point question is who owns auth, because whoever owns the irreplaceable prime position (like Rippling owning the org chart) can expand from there.
1:40:58 What worries you about the synchronized lab IPOs, and what's your honest bear case?
swyx was unusually candid: he sees a pattern he's observed twice before in tech where all the good companies IPO at the same time, and two years later the market is in a deep slide — he called it reading the Illuminati's 'tell.' The bear case is that insiders with private information about the limits of current scaling are distributing to retail at peak hype: Fable-class models represent something close to the ceiling of transformer scaling, we're at a quasi-ban on even more capable models, and deploying the current generation into enterprise is already priced in, so what's the next 5× catalyst? He tempered this with the bull case — agent adoption at true consumer scale is probably 100× underpenetrated from here — but flagged semiconductor supply (Taiwan sold out the next two years of capacity) as a real physical ceiling that the bull case has to answer for.
1:56:41 What's your advice to a CS student or new grad entering this market, where Stanford and Berkeley CS enrollment is down 40-60%?
swyx separated two phenomena: the market for junior engineers is genuinely hard right now, and some concern is warranted. But he pointed out that every 'member of technical staff' at OpenAI, Cognition, and the other frontier labs is 21 years old — the demand for the very best young engineers has never been higher. His core advice was to demonstrate taste and execution, not just coding ability: ship interesting projects that make people say 'I want to use that,' read and learn from the emerging genre of 'how I got into a frontier lab' blog posts, and attend leading industry conferences. Most importantly, practice 'learning in public' — his pre-AI career meme, now a seven-part blog series — which is how you earn mentors without cold-asking them for coffee, by doing visible work and responding publicly to experts who can see your improvement slope.
1:34:10 Is the trend of teams vibe-coding their own internal systems of record real, and is it a problem?
swyx confirmed it emphatically: his own team uses a self-coded GitHub clone and a self-coded Slack clone, and they had three people running parallel vibe-coded speaker-reconciliation trackers during the World's Fair prep — finding it actually caught more errors than a single authoritative system would have, because each person's different point of view surfaced different issues. He acknowledged the risks (data loss, outages, missing audit logs, weak authorization), but said the models are getting good enough to guide you through these pitfalls as you discover them. The bigger structural observation is that the entire SaaS economy was built on being 'the system of record plus a chatbot on top,' and that model is increasingly being replaced by sovereign personal agents — he pointed to OpenClaw's model (all data synced to the individual, no per-SaaS silos) as the direction the market is heading.
Related
swyx on X ↗ · AI Engineer World's Fair ↗ · Latent Space podcast ↗ · FrontierCode benchmark (Latent Space) ↗
Full transcript · Lightly edited · timestamps jump to YouTube
42:13
Prakash: With us today is Shawn Wang, universally known as swyx on the internet across the developer ecosystem. He's a cornerstone figure in the transition from traditional software development to what he effectively named and defined as AI engineering. Before organizing the global community around artificial intelligence, he was a quantitative finance analyst who taught himself to code, eventually leading developer experience and infrastructure design at major technology companies, including AWS, Netlify, and Temporal. Today he is the founder and primary organizer of the AI Engineer World's Fair, an event that kicks off next week at the Moscone Center in San Francisco, bringing together over 6,000 developers, researchers, and enterprise leaders.
42:58
Furthermore, swyx is the co-host of Latent Space, an immensely popular technical podcast that serves as the definitive audio record of how frontier AI is actually being built in the trenches today. His front-row seat is to the most critical debate in the technology sector — how the industry stops AI from generating unmaintainable, low-quality code (what he calls slop) and starts building reliable, autonomous engineering teams. Operating as an adviser to Cognition, the company behind the autonomous coding agent Devin, his technical stance focuses strictly on the craft of implementation.
43:44
He advocates that the raw intelligence of an AI model is insufficient on its own. What matters now is harness engineering — the complex infrastructure, memory systems, and security guardrails required to turn raw AI outputs into robust, production-ready software systems. swyx, welcome to the show.
44:06
swyx: First of all, what an intro. I hope that was Claude — I don't know if it's more insulting if it was AI-written or not. But that was very well done, very well researched. You even talked about the finance side.
44:20
Prakash: Shawn, one thing I've always found while following you on Twitter is that you are really at the forefront of the latest developments in AI engineering. So what are you looking at right now? What are you concerned about?
44:38
swyx: 'Concern' is a strong word — that invokes safety and all that. I'm mostly an optimist around the potential of AI and how much runway we have left. I have a few themes I'm curating. Basically, once a year, I get to be dictator of AI engineering because I run the World's Fair — it's a very big show. We have 30 tracks this year, and all those selections are mine. I don't just choose speakers — I choose which tracks to feature, because that effectively sets the agenda. Things like auto-research are very popular. We used to have a coding agents track; that's now been killed in favor of software factories. We killed the RAG track last year, and now it's search and retrieval. We've split GPUs into pre-training, mid-training, post-training, and inference. We're also focusing a lot on data quality as part of pre-training — your model is only as good as the stuff you train on, and you can train a lot more efficiently with better datasets. People actually want to buy and sell better datasets, including synthetic environments. And maybe the last theme I'll leave you with is memory and continual learning — obviously a very popular topic right now.
46:54
I think what I care most about in the near term is exactly what you read in the intro — the systems stuff, how to put together the harness. Most recently even Greg Brockman said, yeah, the model is no longer just the product; it's model and harness that is the product. We on the systems side have been saying that for a while, and it used to be cope — now it's less cope, still a little. There are things I am worried about that we can talk about, like the IPOs of all the labs, and what happens next year once everyone has shipped Fable-class models in every lab and then — what? I think that's something very interesting to dive into.
47:52
Nathan Labenz: Let's do them one by one. What are you seeing on the continual learning front that you think is most interesting? I've been watching it for a long time, waiting for something to tip. I thought Nested Learning might suggest Google was going to take the baton for a minute — obviously that hasn't quite happened. As you scout the frontier of continual learning, what do you see?
48:26
swyx: A fun meta point I double-check every time I talk to a frontier-lab person: what gets published and what doesn't. For the last two to three years, every time people have talked about Google, the overwhelming consensus is that if a paper is good or an idea is good, it does not get published. So keep that in mind when you read anything published from Google. On continual learning and memory, there's a fundamental split between the models people and the systems people. The big question is: do you update model weights or not? Is a glorified RAG system in another format actually memory? Is it continual learning? When I write a skill and later call it again, it does learn — but it's zero-gradient. It's in-context learning, not machine learning.
49:57
The more machine-learning end of the spectrum says: update model weights. The less machine-learning end says: store and retrieve. This all comes down to how controllable and interpretable you want your memory to be. You're going to recall bad facts. You're going to want to forget things. Maximum control is no weight updates — you control what enters the system, so you can delete, monitor, and debug. But for full internalization of learned things, you probably do have to train on it. That's a whole discipline — Trajectory AI, Engram, Adaptation Labs — these are all speakers at the conference. First half: people who update weights. Second half: systems people. They don't like each other. The model people don't view the systems people as legit; the systems people say, well, you'll never have a memory system you actually understand because you're just updating weights — it's just continued pre-training. Both sides make fair points.
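The systems-side position swyx describes here (store and retrieve, with no weight updates) can be sketched roughly as follows. This is a toy illustration and nothing like it was shown on air: `MemoryStore` and its method names are hypothetical, and substring matching stands in for whatever retrieval a real system would use. The only point is that every fact stays inspectable, deletable, and logged.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy store-and-retrieve memory: no gradients, every fact is a visible row."""
    facts: dict[str, str] = field(default_factory=dict)
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def remember(self, key: str, fact: str) -> None:
        self.facts[key] = fact
        self.audit_log.append(("write", key))

    def forget(self, key: str) -> bool:
        """Delete a bad fact; returns True if something was actually removed."""
        removed = self.facts.pop(key, None) is not None
        self.audit_log.append(("delete", key))
        return removed

    def recall(self, query: str) -> list[str]:
        """Naive substring retrieval (an assumption; real systems use search or embeddings)."""
        q = query.lower()
        return [fact for fact in self.facts.values() if q in fact.lower()]


store = MemoryStore()
store.remember("deploy", "Deploys go out on Tuesdays")
store.remember("bad", "Deploys require no review")  # a wrong fact gets in
store.forget("bad")                                 # and can be removed outright
print(store.recall("deploys"))
print(store.audit_log)
```

A weight-update approach has no direct equivalent of `forget` or `audit_log`, which is the controllability gap the systems camp points to.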
51:24
Nathan Labenz: What do you think the enterprise wants? I'm very drawn to the weights side because it seems like there's some eureka moment there that could be more transformative. But I could imagine VPs of engineering saying, yeah, that all sounds great, but I do value being able to open the file and see what's going on. What attitudes do you think customers have toward this?
52:23
swyx: Quite simply, it just takes one security incident where you leak information you weren't supposed to because you trained on customer data — or your information was somehow exposed to your teammates'. Enterprises want cheap, perfect, and private — those are the three things. That skews toward the systems side today. It's not blocking companies like Engram or Trajectory from doing good POCs with large enterprises, but POC stage still means a few million dollars of evaluation, and the question is can they figure it out before the money or the patience runs out. The beauty is you can run your traditional harness system and then run a shadow system with the online learning approach and A/B test. All those problems are solvable — they're open research questions. We're unlikely to get papers on this because it's so valuable as an open problem for startups. And the underlying dynamic is that context length is the slowest Moore's Law in the industry — we've gone from roughly 1,000 tokens to 1,000,000 in three years, which is actually slow by any other measure, and I don't think it scales the same way. So we do need to figure out the weight-update question eventually.
55:13
Prakash: To what extent do you see the balance between the hunt for capability versus the hunt for cost? In the last three to six months, especially as more capable and much more expensive models have appeared, there's been the question of whether enterprises are getting ROI on these tokens. CEOs ask: I'm already paying my developers — why do I have to buy them tokens? Are startups shifting from 'deliver more capability' to 'be more cost-effective'?
56:14
swyx: Startups should max capability, whereas enterprises should probably max cost-effectiveness. That feeling has definitely shifted in the last year, mostly because of the publicity around token-maxing issues from Meta, Uber, and others. But there's always going to be a core group of influencers, experimenters, and R&D people who headline through capability-maxing — and that continues to lead the industry regardless of the efficiency focus. Efficiency will always be a topic we want to improve, but we also want to understand the best capabilities we can get. That's why, even if Fable were very slow and expensive, we'd still explore as much as we can on it — it's probably a preview of what's affordable in two to three years. And I like that we can try all these things via the API, because there's a darker future where Anthropic doesn't release Fable via API and you can only access it through Claude Code. To the extent that model labs remain committed to an open API ecosystem, that matters — you should be applying AI to more than just coding and B2B SaaS.
58:12
Nathan Labenz: You just put out Frontier Code — and I know you were significantly involved in its development. I see it tying a few threads together: it's the latest yardstick for automation of software engineering; the results seem to strongly suggest high willingness to pay for a Fable-class model based on success rate per dollar; and you alluded to the RL-environment cottage industry. A benchmark and an RL training environment are, if you squint, more similar than different in the role they play. What were you really trying to do and measure with Frontier Code?
59:24
swyx: Frontier Code is my baby at Cognition. There have been three generations of evals there — Cognition Golden, then Junior Dev, and when I joined I was a strong advocate for publishing externally from day one. The research team and contractors did most of the hard work; we partnered with METR and also did a lot in-house. I named it Frontier Code mostly because it's inspired by Frontier Math, and I realized Epoch wasn't going to do code — they're just doing math — so I figured we'd do the code version and try to make it the last eval you'd need. The interesting thing you observe about RL environments and benchmarks collapsing is that it actually started with SWE-Bench — the first agentic benchmark where you have an open-ended task and can use whatever tools you want. Frontier Code basically adds rubrics and updates the distribution of tasks toward something more production-realistic: not just Django issues but C++, Python, Java, and more.
1:00:55
The reason we're so excited about Frontier Code is that you stop being able to articulate differences in model quality with saturated benchmarks like SWE-Bench — you get at most a one to two percent bump, and how much of that is memorization? Frontier Code tasks are all out-of-sample; they're not in training sets. They're heavily rubric-graded. We found that SWE-Bench and similar benchmarks allow a lot of false positives — models can cheat on them the same way they do reward hacks during training. We have an internal catalog of twenty different ways models cheat. We translated that into rubrics and shipped them as part of the eval. We want to judge models not just on whether they can pass the test, but whether they write code we would actually merge. METR had this blog post showing that about 50% of SWE-Bench-passing code is completely unmergeable — technically it passes, but the quality is so low it's unacceptable. Did you modify files you weren't supposed to touch? Did you cheat on the test? Did you adhere to code style? We want to guide model evolution toward maintainable code and against slop.
1:03:11
The results really do show significant uplift that Fable and other upcoming models have over the current Opus class, which is very exciting — the fact that we now have a benchmark that can clearly articulate where a model is passing versus failing, and entire classes of problems now being solved. I'm still advising on Frontier Code now, and we have the roadmap planned for the next two years for it to continue as a leading benchmark.
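The idea of grading on "would we merge this" rather than "do the tests pass" can be sketched as a simple gate. The three checks below mirror the questions swyx lists (files touched out of scope, cheating on the test, code style); the `Submission` fields and the `would_merge` function are hypothetical names for illustration and are not Frontier Code's actual rubric or code.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    tests_pass: bool
    touched_files: set[str]
    allowed_files: set[str]
    modified_tests: bool
    style_ok: bool


def would_merge(sub: Submission) -> tuple[bool, list[str]]:
    """Return (verdict, reasons). Passing tests is necessary but not sufficient."""
    reasons = []
    if not sub.tests_pass:
        reasons.append("tests fail")
    out_of_scope = sub.touched_files - sub.allowed_files
    if out_of_scope:
        reasons.append(f"touched out-of-scope files: {sorted(out_of_scope)}")
    if sub.modified_tests:
        reasons.append("edited the tests instead of the code")
    if not sub.style_ok:
        reasons.append("does not follow repo style")
    return (not reasons, reasons)


# A submission that passes the tests but would still be rejected.
slop = Submission(
    tests_pass=True,
    touched_files={"app.py", "ci.yml"},
    allowed_files={"app.py"},
    modified_tests=True,
    style_ok=True,
)
verdict, reasons = would_merge(slop)
print(verdict, reasons)
```

Under a tests-only benchmark this submission counts as a pass; under the gate it is a false positive with two named reasons.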
1:03:49
Prakash: When do you think Frontier Code gets saturated?
1:03:59
swyx: Never — by design. Frontier Code 2026 will be saturated by the end of this year; my estimate is we'll probably hit around 80% by then. That's expected — it's based on open-source repos that will eventually get trained on. So the answer is simple: annual cadences. Frontier Code 2027, 2028, and so on. Every year we just move the goalpost. This year the defining theme is rubrics for code quality. Next year my candidate is security — we want people to write secure code. Every year we can have a defining theme, similar to what I do for the World's Fair, except here we set the agenda through benchmarks. The other thing I'm very keen on is private held-out sets — private evals. Cognition has Goldman Sachs, Citi, JPMorgan, and a lot of the Fortune 500 creating evals that reflect their actual problems but are not public. That's the work that remains to be done: Frontier Code Finance, Frontier Code Retail, Frontier Code Telecom. That's how agent labs like Cognition translate industry problems into guidance for model labs.
1:06:22
Nathan Labenz: Is there any worry that by emphasizing taste, readability, and maintainability from the perspective of human repo maintainers, we potentially move away from the Move 37 upside of AI? In math, if you proved a previously unsolved problem — even with a hundred-page incomprehensible proof — you'd call it a win. Here the vibe is different. Do you worry we're putting a lid on the absolute ceiling of performance by anchoring to human preferences?
1:07:56
swyx: Absolutely — this is a very good subtle point. Mathematicians actually say the same thing about AI-generated Lean theorem proofs: they're too detailed, no human would actually write them, and while they're systematically provable, they don't contribute to knowledge in the way a clean proof does. I do think that at some point you should just step away from looking at lines of code and say: does this work as expected? That's great — except to the point where other agents also need to read your AI-generated code and work together on it. And except to the point where you have very critical code that you're liable to the SEC for, or to healthcare authorities. 'I vibe-coded this thing; I don't know what's going on inside' — is that an acceptable excuse? Probably not for the next thirty years. But for most people, you can get by with: here's a black box, do whatever you want inside, but the inputs and outputs must meet quality standards, and I expect the code to be structured so my other agents can maintain and parallelize it.
1:09:26
I have a skill I call Kakuna where I only harden code — because that's the only skill I want it to know — and it also tries to make code parallelizable, because most models, if you let them run today, will produce monster files with six thousand lines of code that are completely unmanageable. Some governance is needed, but that will ease off over time. And I think people maybe anchor too much on Move 37 as a dramatic changing point. Really, the water temperature is just slowly rising around us. There has been a sharp increase in agentic coding — we've seen it at Cognition, you've seen it everywhere — but it should be managed well. You shouldn't just go full YOLO into 'I no longer care about code quality at all.'
1:10:54
Nathan Labenz: Let me check whether I'm reading the Frontier Code graphs right. My basic takeaway: Fable gets something like 25% on would-merge tasks versus Opus in the high single digits, at less than a 2× cost increase. I look at that and think: more than twice the success rate for less than 2× the cost — sounds like I should move all my tokens to Fable. Is it that simple, or is there more nuance?
1:11:58
swyx: Model and trajectory efficiency are very underrated topics. Other labs are doing good work improving them. One thing Frontier Code shows is that previous-generation models don't give you much uplift per extra reasoning effort — you're spending a lot more tokens without getting that much better. So keep them on the lower reasoning setting. But with Fable high or Fable max you're getting a lot more uplift, which is fantastic. On the hardest problems, absolutely use it. But for most people working on commodity problems, you don't need Fable.
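To make the arithmetic in Nathan's question concrete, here is a rough sketch of expected cost per merged task. The 25% pass rate is the figure quoted in the conversation; the 8% for Opus stands in for "high single digits" and the 1.9 cost multiplier stands in for "less than 2×", so both are assumptions, as is the function name. It also assumes attempts are independent, which real retries may not be.

```python
def cost_per_merge(pass_rate: float, cost_per_attempt: float) -> float:
    """Expected spend to get one mergeable result, assuming independent attempts."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


# Illustrative numbers only, in arbitrary cost units (Opus attempt = 1.0).
opus = cost_per_merge(pass_rate=0.08, cost_per_attempt=1.0)
fable = cost_per_merge(pass_rate=0.25, cost_per_attempt=1.9)
print(f"Opus:  {opus:.2f} cost units per merged task")
print(f"Fable: {fable:.2f} cost units per merged task")
```

This only holds on the hard tier being discussed. On commodity problems where both models already pass most of the time, the pass-rate gap shrinks and the cheaper attempt wins, which is swyx's caveat.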
1:12:48
Prakash: What do you think of the adviser-model structure — where a less capable model identifies when it has a problem it probably can't solve and calls a more capable adviser model? OpenRouter Fusion, Sakana Fugu — several firms are pushing this idea. What do you think of that kind of harnessing?
1:13:25
swyx: OpenRouter Fusion, Sakana Fugu — this probably extends back to any model-routing problem; you're just giving it a different name. Even companies like Martian, and the group at Berkeley, had similar ideas earlier. Walden, one of the co-founders of Cognition, had an early version called SmartFriend that shipped in Windsurf — not that many people use Windsurf, so it doesn't get talked about much. The interesting theoretical limitation of the adviser strategy is that the dumb model doesn't know what the smart model can do — it only knows the rough shape. So when you're asked a question more complex than your intelligence can handle, you don't even know you should call up to the smart model. You just answer it straightforwardly. In practice, for cost reasons, you want to start with the dumb model and escalate to the smart one as a tool call. It isn't as satisfying and it doesn't solve the theoretical problem. Will everyone be using it in production? No. Will it be a trend? Maybe for three months — then you get the next Claude with more adaptive routing built in, and you're better served by a single model trained end-to-end for adaptive thinking than a system-level ensemble.
1:15:42
I'm not toeing the Cognition party line here — Cognition would love there to be a multi-model future where we can be the model router rather than a single lab. But the model labs are very incentivized to do this all in-house. OpenAI acquired Statsig precisely for this reason. Your hands are tied if you're using any closed model API — you can't affect the model choice.
1:16:37
Prakash: Precisely because they're incentivized to bring it all inside their own boundary.
1:16:43
swyx: Absolutely. You just want the god model and just — figure it out. Who else wants anything other than that? Only tinkerers will be like, oh, I'm using a Frankenstein model with Opus here and Gemini plugged in at precisely this point. Sure, but that'll never be an industry thing.
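The escalation pattern from this exchange (start with the cheap model, call the stronger one as a tool) and its blind spot can be sketched as below. Everything here is hypothetical: `route`, `Reply`, and the two stub models are illustrative stand-ins, not the API of OpenRouter Fusion, Sakana Fugu, or any other product named above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reply:
    text: str
    wants_help: bool  # the cheap model's own guess that it is out of its depth


def route(question: str,
          cheap: Callable[[str], Reply],
          strong: Callable[[str], str]) -> tuple[str, str]:
    """Try the cheap model first; escalate only if it asks to."""
    first = cheap(question)
    if first.wants_help:
        return "strong", strong(question)
    return "cheap", first.text


# Stub models. The cheap stub only recognises difficulty it has a label for.
def cheap_stub(q: str) -> Reply:
    return Reply(text=f"cheap answer to: {q}", wants_help="prove" in q.lower())


def strong_stub(q: str) -> str:
    return f"strong answer to: {q}"


print(route("Prove this scheduler is deadlock-free", cheap_stub, strong_stub)[0])
print(route("Is this scheduler deadlock-free?", cheap_stub, strong_stub)[0])
```

The second question is just as hard as the first, but the cheap stub does not recognise it as hard, so it answers directly. That is the limitation swyx describes: the router is only as good as the weaker model's sense of its own limits.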
1:17:11
Nathan Labenz: Andrew Lee from Tasklet said something that has echoed in my head since then. He sees basically everybody building the same thing and also sees only about three sorts of software surviving: the model companies, some neutral orchestration layer, and maybe a third category. I could make the case that Cognition is playing at that neutral layer — helping you avoid lock-in to one model provider so you have option value and recourse. How far do you see that convergence going? Is everybody building truly the same thing, or is there meaningful divergence in the winner circle? And how much of a problem is price discrimination — I'm seeing more than 10× per-token price differences between first-party subscriptions and API pricing. Could you imagine a government price-neutrality rule actually helping?
1:19:47
swyx: On price controls — no, that's never going to happen, especially not in this government. At that point just nationalize the labs. Short of that, people will revolt; any country will take the talent. I don't see that happening. On your first question about convergence — as far as an early market is concerned, yes, everything becomes very commoditized and competitive. For a while everyone was building their VS Code fork; now everyone's building their background agent and trying to beat Devin. That's as it should be — we haven't really figured out the market segments yet. But look: luxury handbags — there are five of them and they're all great with differences that most of us don't even know. They target different market segments. Right now we're all making the same bag. Eventually the market matures and you get clear segment leaders.
1:22:33
There will probably be a couple of winner-take-most companies that fit most people. Those will effectively be the new enterprise tech giants. Cognition is obviously trying to be one of them. The interesting structural question is who is the Apple — vertically integrated — versus who is the Android — the open ecosystem. OpenAI and Anthropic want to be the Apple. Microsoft wants to be the more open one; I had my chat with Satya at Microsoft Build, and then he published that article that went super viral — having a CEO posting like that on X is something I never thought could happen. The smaller startups that come in as disruptive incumbents will have to take a different tactic, because the big players will exercise their brand and moat advantage. The Tasklets of the world will have to do something very different, otherwise they'll just be acquired.
1:24:55
Prakash: Looking at startups versus large firms — Cursor Origin is being described as a Git replacement, basically. The reasoning is that GitHub itself can't manage the exponential increase in commits and activity that agents are generating. How does this shake out? Our software infrastructure wasn't built for agents. Where does it go as agents become the overwhelming majority of traffic volume on the internet?
1:25:58
swyx: I had the CEO of GitHub on the pod recently and point-blank asked him how come they're not handling all this volume. He gave a reasonable answer: it's the largest scaling in every dimension they've ever done — a 10× to 15× increase in every dimension simultaneously. But the broader point is that all cloud infrastructure basically needs to be rebuilt for agents. We have a fundamental assumption of identity as human, of API tokens being obtained through a UI — all that needs to go away. And that's before we even talk about the scaling side. The scaling side means a lot more instantly forkable sandboxes and file systems for agents to operate in. It also means a lot more fraud and spam.
1:26:43
My worry is that people start to wall off parts of the internet. You get not only a dead internet but closed walled-garden internets — like in China, where you live in the Baidu universe versus the Tencent universe. Cloudflare saying 'we'll just ban agents that aren't part of Cloudflare's ecosystem' — I'm not picking on Cloudflare specifically, the same will happen with OpenAI in the Vercel lands — that's not the ethos of the open internet that I want. What Graphite and Cursor Origin are doing is fantastic and needed. But there's a structural mismatch: anyone good enough at GitHub who could solve the problem can go to Cursor and make effectively 5 million dollars a year doing similar work versus a good Microsoft-level salary. GitHub has somehow gotten itself into a position where the startup is preferred to the incumbent — an interesting moment for an open-source-founded company.
1:28:54
swyx: It's very unfair. The degree of scaling is staggering — 14× on commits, 10× in every other dimension of parallelism. The sandbox companies, the E2Bs and Daytonas of the world, are growing at least 50% month-on-month for the last year-plus. And it's a stupid amount of slop spewing from agents. I run a large newsletter, and every time I send it out I get replies from Claudes reading the mail and trying to reply to me — I know there's no human on the other side, but I have to manually go and block these Claudes because they're clogging my inbox. So then I need an agent on my inbox to read their agents' emails. This is all one giant recursive mess. I've stopped reading emails — it's terrible. And we haven't given agents money yet. Imagine what happens when they all have wallets, stablecoins, and are buying and selling, and we don't know what's going on.
1:30:50
Nathan Labenz: Mine does actually have a Mercury credit card with a modest amount on it.
1:30:54
swyx: What has it done?
1:30:55
Nathan Labenz: Set up some things like the 0.xyz Swiss army knife into lots of other tools. Buying groceries for me and stuff like that — but it hasn't taken on anything dramatic yet.
1:31:13
swyx: Hook it up to 'rent a human,' see what it does. If you give humans as tools to an agent you can do a lot more unhinged things — and you have to pay for it, obviously, and give the human a budget too. There are all these levels of delegation going on: agent to human to agent to human. I think that should be doable. If I wasn't running a conference right now I would absolutely go explore that.
1:31:48
Prakash: Give an agent a human as a tool — inverting the normal process.
1:31:54
swyx: We've done this forever. The first time I called a human via API was Amazon Mechanical Turk. As a developer you're trained to think API calls are cheap — until you've run hundreds of thousands of dollars in code and realize if you don't dry-run this first you'll waste a hundred grand. Mechanical Turk, Crowdflower, Rent-a-Human — even Scale AI was like that at some point.
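The dry-run habit swyx describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any real crowd-work API: `Task`, `run_batch`, and the `submit` callback are hypothetical names, and rewards are kept in integer cents so the cost estimate is exact.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Task:
    """One unit of paid human work (hypothetical shape, not a real API)."""
    instructions: str
    reward_cents: int


def run_batch(
    tasks: Iterable[Task],
    submit: Callable[[Task], None],
    budget_cents: int,
    dry_run: bool = True,
) -> dict:
    """Price the whole batch first; only spend money when dry_run is False."""
    tasks = list(tasks)
    total_cents = sum(task.reward_cents for task in tasks)

    # Refuse before anything is submitted if the batch would exceed the budget.
    if total_cents > budget_cents:
        raise ValueError(
            f"batch costs {total_cents} cents, budget is {budget_cents} cents"
        )

    # In a dry run nothing is sent; the caller only sees the estimate.
    if not dry_run:
        for task in tasks:
            submit(task)

    return {"tasks": len(tasks), "cost_cents": total_cents, "submitted": not dry_run}
```

Defaulting `dry_run` to `True` is the design choice that matters here: spending real money has to be an explicit opt-in rather than something a forgotten flag can trigger.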
1:32:36
Nathan Labenz: A few things I'd love your take on. One is: to what degree are we seeing people rebuild systems of record in-house? For my own agent setup, when it came time to have a messaging system between me and my several agents, I ended up writing my own — I didn't go to Slack, I didn't go to Telegram. I let Claude write a custom solution. Sure, Slack has tons of features I don't have, but I also don't really need them. To what degree are people saying: we don't need 90% of GitHub's features, we don't need 90% of Slack, we can in-house some of these functions we used to pay for? Is that a real trend?
1:34:10
swyx: Not only that — I have my own GitHub clone that I'm using. I have my own Slack clone. Just before this call I got off a call with my team where three of us had each vibe-coded our own source-of-truth trackers for our speaker reconciliation. We were all checking our own clankers and agreeing on what's going on — it sounds wasteful to have three separate systems of record, but actually we all caught different things because we had different points of view. Between the three of us, we caught more than any one of us would have individually. That seems like the right thing: wasteful in the small, more precise in the large. Yes, people are replacing SaaS with internal vibe-coded versions. There will be dramatic data losses and outages as a result. But people need to be more careful about keeping and auditing data — most haven't been trained to write audit logs, rollback procedures, or proper authorization. We'll speed-run those lessons along the way.
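For readers who have not written an audit log or a rollback procedure before, the sketch below shows one simple way the two can fit together in a vibe-coded tracker. It is an assumption-laden illustration, not anything described on the show: `AuditedStore` and its methods are hypothetical names, the store is in-memory only, and authorization is deliberately left out.

```python
import copy
import time


class AuditedStore:
    """Key-value store whose only source of truth is an append-only log."""

    def __init__(self):
        self._log = []  # entries are appended, never edited or deleted

    @property
    def version(self) -> int:
        """Number of log entries so far; doubles as a version number."""
        return len(self._log)

    def set(self, key, value, actor):
        """Record who wrote what, and when."""
        self._log.append({
            "op": "set", "actor": actor, "key": key,
            "value": copy.deepcopy(value), "ts": time.time(),
        })

    def rollback_to(self, version, actor):
        """Restore an earlier version; the rollback itself is logged."""
        if not 0 <= version <= len(self._log):
            raise ValueError(f"unknown version {version}")
        self._log.append({
            "op": "rollback", "actor": actor,
            "version": version, "ts": time.time(),
        })

    def state_at(self, version) -> dict:
        """Rebuild the data as it stood after the first `version` entries."""
        state = {}
        for entry in self._log[:version]:
            if entry["op"] == "set":
                state[entry["key"]] = copy.deepcopy(entry["value"])
            else:
                # A rollback always points at a strictly earlier prefix,
                # so this recursion terminates.
                state = self.state_at(entry["version"])
        return state

    def current(self) -> dict:
        return self.state_at(len(self._log))
```

Because a rollback is just another log entry, the history still shows that the bad write happened and who undid it, which is what makes the data auditable after an outage.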
1:35:40
What's interesting to me is that the top founders I've talked to are saying companies should have a sovereign system of record, not individual SaaS silos. A lot of the SaaS economy is built on 'we will be the system of record for this kind of data — your meeting notes, your calendar, your email — and then we'll slap a chatbot on top.' But that chatbot has no memory of anything else you do, no integration with your other preferences. What OpenClaw has instead is: this is my personal agent, all the data is synced to me, and I decide what to do with it. The reckoning is: how many systems of record should there be versus how many $20-a-month subscriptions are you paying? What's really interesting or innovative is that Salesforce and Marc Benioff are out there saying everything will be available via API — you can just take it. That's very forward-thinking. Either he does that or the Salesforce killer does. He's heading off the existential issue. More people should think about what the business model looks like beyond just sitting on data and bolting a chatbot on.
1:37:55
Prakash: Is there an opportunity for a system-of-record vendor to actually move into the business itself? For example, Veeva provides a CRM for pharmaceutical sales. Could Veeva, as an expert on pharma workflows and systems, actually move directly into pharmaceutical sales? Could they leverage that position and the AI chatbot to start doing the actual selling?
1:38:37
swyx: It's going to be so messy. If I'm a pharma IT administrator, they're just going to try to upsell me to this new thing, and I'll say 'no, just stay in your lane.' Everyone is going to try to be the operating system of some vertical, and there will be a thousand operating systems and most of them will lose. What I've observed is that whoever owns auth tends to win. Whoever owns the thing that cannot be moved — the undefeatable prime position — that's where you want to start. Take Rippling: it was famously a compound startup that wanted to do everything. But they picked the right starting point: we will own your org chart. Once you know everyone in the company it is so easy to add SSO, payroll, IT admin, and anything else on top. I've had moments where I thought you want to own developers, or own front-end — but actually, owning auth is better.
1:40:40
Nathan Labenz: I want to go back because I definitely want to get your take on the IPO thing. You said earlier there's going to be a lot of liquidity. What are you watching or concerned might happen as a result?
1:40:58
swyx: I'm worried about all the Illuminati collectively deciding this is the time to exit. For years it was: no, private is good, I don't have to answer to the SEC or do quarterly reports. And suddenly it's all changing. It's suspicious. There's a group chat somewhere. And by the way, this has happened twice in my career already — suddenly all the good companies are IPOing, and then two years later the market is in a deep slide. Nobody knows the future, but the powers that be have all decided now is the time. The bear case is: everyone with insider information is selling, retail is buying, and this is as far as AI progress goes for the next five years. We've scaled LLMs to the max — Fable is about it, we now have a quasi-ban on Fable-class models — so what do you have to look forward to apart from deploying the current generation into enterprise, which is already priced in? That's the bear case. Then you're back to needing new model architectures beyond traditional transformers, which is open-ended and uncertain.
1:43:20
Nathan Labenz: What odds would you actually put on the bear case?
1:43:24
swyx: I haven't voiced this publicly, and it's somewhat conspiracy-theory territory — these are savvy players who also genuinely believe they're going to revolutionize the industry. So I'll give you the upside: the bull case is that you, I, and even the people running the frontier labs seriously underappreciate the burning power of agents deployed at scale to the whole world. What if it ran your whole life? Your mother's life? Your neighbor's life? You're not seeing that yet, and it's probably 100× from here. If Anthropic is earning something like $50 billion right now, 100× is $5 trillion in revenue — then stocks keep going. But I temper that with: you just casually forecasted a 100× increase in compute consumption. Where is that compute coming from? Taiwan just sold out the next two years of semiconductor capacity. You need ten years to spin up a semiconductor industry in America — that's not happening soon. So there are all these multifactor constraints on the timing. On balance, I think you should be at least a bit worried that the big run-up has happened, and you're now either permanently wealthy or in the permanent underclass.
1:45:38
Prakash: Let's say the IPOs happen and some people retire permanently. For the ones who don't retire — what do you think are the ideas that are feasible for them to pursue with large amounts of capital and no restrictions on research direction? I assume they're not going into coding agents at this point. What would a frontier-lab alum with millions of dollars invest or research going forward?
1:46:19
swyx: The categories are pretty obvious. The first immediate one is robotics: you solved the software world — now what about the physical world? What's the TAM on that? Don't even bother running the calculation, just do it. Beyond that: medicine. I was at the Midjourney Medical launch event — eye-opening. Maybe it doesn't work out, but as far as worthwhile things to work on after your cat-picture generator, going into 'we will do CT scans quickly and cheaply and make it sexy' — that's great. More people should solve health: travel to the stars, do the grand quest apart from coding B2B SaaS. Unfortunately coding B2B SaaS is what we'll solve first, but the new rich should dedicate resources to the new grand quest — cure cancer, solve aging, understand the human body. We all know what the open problems are, we just don't know how to solve them. When you have infinite money and funding and all this intelligence to deploy, you can actually go do it. What David Holz did with his Midjourney money is existence proof that you can be rich and still try to do meaningful things. Same for Zuckerberg and Priscilla Chan funding Biohub — the largest charitable donation to science in history. In a thousand years no one will remember Facebook, but we'll know so much more about the human body because of Biohub. Facebook will just have been a funding mechanism for Biohub.
1:48:41
Nathan Labenz: Let's get some diseases cured. I love it.
1:48:45
swyx: Diseases is number one, but just fully understanding the human body. Even on my own microscale: the fact that I don't know how to operate my own body but I know how to run a server — that's so ridiculous. Health advice is as simple as 'drink water,' and we can't even do that consistently. What business do we have doing anything else when we can't fix ourselves? Abundance means we start working on Star Trek problems, not Star Wars problems. A lot of people in San Francisco are actually there already. The crowd at the Midjourney Medical launch was — there's a SpaceX investor who's a billionaire from SpaceX, and now he's all about healthcare. Bryan Johnson built a Stripe competitor no one cares about, but now he's the health guy. Honestly, not a bad pivot.
1:51:01
Nathan Labenz: I've got one more, and then last we'll pitch the conference — which I've attended and it was excellent. Google — what's up?
1:51:17
swyx: I love that your question is just 'Google — what's up?' The departures of John Jumper and others — bad timing, all coming so closely together. But I do think hardware and data are destiny, and on those two dimensions Google is still winning by a huge margin. They were first to invest in TPUs and LM research. There's an org issue they need to navigate. Hiring people like Logan Kilpatrick to run AI Studio — having a good product face — they just don't have that strong a product gene. Early hits like NotebookLM, and then the moment someone has a hit they leave. That's not healthy. But hiring people like Varun and the research team, having Sergey come back to work on it — that bodes very well for R&D. I'm actually pretty happy with 2.5 Flash. They're not leading, but what's the issue? They're still a frontier lab.
1:52:48
I do think they're the most primed to deliver the personal operating agent that everyone wants — call it Scout, or Scar, or Spark. I interviewed Jeff Dean and he said: I want a personal agent that has access to all my emails and all my calendar. Who has that natively? It's Google. Everyone else is just syncing from Google. Apple's not going to do it — honestly, you should ask 'what's up with Apple' more than 'what's up with Google.' OpenAI has basically shut down all side projects — Sora is on the back burner, they're just doing coding and enterprise. But Google still has the funds and appetite to do Veo, to do text diffusion (a huge success at AIE Europe, by the way), and to do Gemini Nano. They are still doing fundamental research that will set them up for the next ten to twenty years. At Google's weight class — Facebook, Microsoft — who else? I've made a strong case and I'll leave it there. They want their models to be better and their products to be better. They know it. They're working through it.
1:55:05
Prakash: One last question. This year, Stanford computer science graduates are down 42%, Berkeley down 61%. What's driving that, and what advice would you give to a CS student or recent graduate today?
1:55:21
swyx: Shocking numbers. Has anyone analyzed the breakdowns — are they dropping out, or are people not signing up in the first place?
1:55:27
Prakash: At Stanford, roughly two-thirds to 75% of graduates used to have a CS minor or major — a lot of people were doing co-term dual degrees with CS as a safety degree alongside their primary. What's happened is those peripheral CS people have dropped the secondary degree. They're still taking some CS classes but not getting the full credential. Core CS people are still in core CS. So it's really the peripheral people who've opted out. Given your vantage point on what's coming in the next four to five years — what advice would you give to someone entering the workforce right now?
1:56:41
swyx: I think there are two separate phenomena here. The market for junior engineers is genuinely challenging — not that many companies are hiring juniors right now, so some concern is warranted. That said, all the highly compensated members of technical staff at OpenAI, Cognition, and the rest — they're all 21. I'm the oldest person in those rooms. If you're in computer science, people do believe in you and want to hire you, but you have to demonstrate that you are the right person. Do interesting work. Demonstrate taste: research taste, project taste, the execution ability to ship something interesting that makes people say, 'oh, what's that? I want to use that.' It's not coding ability or LeetCode-solving ability that's in short supply — it's execution and taste. The demand is super high. People are still doing on-campus recruiting. There's an emerging genre of blog posts by people who got into frontier labs, writing up how they did it — I read all of those, and they're a really good guide.
1:58:12
I would attend leading industry conferences and see what people are actually talking about and working on and what problems they're trying to solve.
1:58:22
Prakash: Is the leading industry conference in the room with us right now?
1:58:25
swyx: Maybe. And I've been known since before this AI thing for what I call 'learning in public' — periodically collecting everything you're thinking about and processing, and putting it out there. Mostly as notes to yourself at first, but eventually you start being seen as an authority. A lot of people are worried about looking too junior or being wrong in public. But there will be no better teacher than being wrong in public. People want to see the slope — they don't actually care that much about the level you debut at. If you can see the slope and see the work, people want to coach you, because then you're giving them the reward of their coaching paying off. I have a whole blog post series on this — go search 'learn in public.' Seven posts about how you start your career, how you get mentors without asking them for coffee to pick their brain. That model is outdated. You get mentors by working on interesting things, responding to them, and picking up what they put down. That's my best career advice so far, and it's helped a lot of people.
2:00:04
Nathan Labenz: You've been super generous with your time, and it's always great to get your takes. You're definitely one of the most plugged-in insiders who gets to interact with people across all these different kinds of organizations and perspectives. The synthesis you provide is really valuable. Let's close on the event — tell us again when, where, and I believe we have a code to share with the audience.
2:00:40
swyx: I don't actually remember what code I gave you guys.
2:00:44
Nathan Labenz: Let me pull it up. It starts literally a week from today.
2:00:45
swyx: Yes — it's next week in San Francisco. It's the AI Engineer World's Fair — the largest version of the AI Engineer conference. Basically twelve simultaneous conferences running side by side; it's a buffet where you can pick and choose. The big party of everyone you've seen online. We have poster sessions for researchers for the first time, but we also have 'poster sessions with an A' — where people print out their hot-take tweets and stand in front of them and defend them against all comers. Should be quite fun. This is also the first World's Fair with a World Cup attached to it, so there are about 40 side events, all the companies taking over San Francisco. What means the most to me personally is that I get to curate a group where you can see a material difference in the quality of discussion because of the speakers and curation — you get a very different kind of conversation with attendees than you would at a normal conference. The people actively working on the projects you've all seen will be here. And this is also the three-year anniversary of 'The Rise of the AI Engineer' article. When I started this whole thing, I only had an inkling it would become something like this someday. To get here in three years is a miracle — NeurIPS has been around for forty years, and the fact that we're basically at that conference level for our sponsors and speakers is a real blessing.
2:02:51
Nathan Labenz: You've ridden the exponential as well as anyone. I've had the chance to attend in the past and it really is an event where the conversation feels up-to-the-minute. That is not to be taken for granted — I'm always asking whether I'll get three-months-ago warmed-over talks or whether people are living on the very edge of what's possible now. That is the great strength of the AI Engineer series, and it comes from you being so plugged in and on the edge yourself. The code is THANKSAIAM — use that when you register for the conference. Thank you, swyx — this has been a lot of fun, really enjoyed the conversation and some great takes as always.
2:03:43
swyx: There needs to be a benevolent dictator — which is me — who can say: we're going to throw this track out and put this one in, just because it's cool. I just yeeted our MCP track and put in auto-research, because that's what we all want to talk about right now. It doesn't mean MCP is dead — it's part of the context engineering track rather than a headline feature. But a year ago it was absolutely the headline; we were the first to focus on MCP in New York. Now we're the first to have a full auto-research track. That's how we stay ahead.
2:04:37
swyx: Thanks, guys. Hope to see you out there.
2:04:39
Nathan Labenz: And, yeah — hope to connect in person finally after all this time online. Keep up the great work, and check out the AI Engineer World's Fair, everybody.
2:05:13 · Closing · 57 min
Close: Guess the Markets
After the swyx interview ran nearly 77 minutes, Nathan and Prakash closed with a live 'Guess the Markets' segment — twelve AI prediction-market questions from Polymarket, Kalshi, and Manifold, neither host having seen the questions or answers in advance. Highlights included both hosts underestimating Google's 73% edge in math-model forecasts, Prakash guessing 80% on a formal AGI announcement before 2028 (market: 45%), and both guessing 80% on the 1550+ Chatbot Arena Elo question (market: 18%). The segment closed with a reminder about the AIAM discount code for the AI Engineer World's Fair.
As aired
After wrapping their guest interview, Nathan and Prakash launched a live segment they dubbed 'Guess the Markets' — a game Nathan had built using Claude Opus, which scraped Polymarket, Kalshi, and Manifold to compile 12 AI-related prediction-market questions. Neither host had seen the questions or answers in advance, so they gave live probability estimates before revealing the market consensus. Topics spanned Anthropic holding the top-ranked model at the end of 2026 (both guessed ~55–60%; market said 63.5%), which company would have the top math model (both said OpenAI ~50%, but the market gave Google 73%, a result that surprised them both and that Nathan attributed to the question resolving on the LM Arena TextArena math category), whether any company would formally announce AGI before 2028 (Nathan: 10%, Prakash: 80%, market split the difference at 45%), and the ARC-AGI grand prize being claimed before January 2027 (Nathan: 15%, Prakash: 40%, market roughly in the middle).
The hosts then worked through questions on OpenAI's IPO prospects by year-end (Nathan 40%, Prakash 15%, market 54% — Prakash argued the IPO window had already closed for 2026), whether Anthropic would surpass OpenAI's valuation (both leaned yes, market at 82%), NVIDIA retaining the world's largest market cap (both around 70–80%, market 73.5%), Anthropic's valuation surpassing Bitcoin's market cap (both ~60–75%, market 66.5%), and whether any AI model would hit a 1550+ Chatbot Arena Elo score in 2026 — where both guessed 80% but the market said only 18%, prompting a discussion about Arena score saturation and how marginal gains of 4–5 Elo points per quarter make 1550 many years away at the current pace. The IPO question sparked a detailed digression from Prakash on how traditional underwriting works — price discovery, lockup periods, allocation books — and why even cash-rich companies like OpenAI would still prefer a managed IPO over a direct listing.
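The saturation argument is simple arithmetic, sketched below. The 4 and 5 points-per-quarter rates come from the discussion above; the 100-point gap to 1550 is purely a hypothetical placeholder, since the current top score is not stated here, and `quarters_to_close` is an illustrative helper name.

```python
import math


def quarters_to_close(gap_points: float, gain_per_quarter: float) -> int:
    """Whole quarters needed to close a rating gap at a constant pace."""
    if gain_per_quarter <= 0:
        raise ValueError("gain_per_quarter must be positive")
    return math.ceil(gap_points / gain_per_quarter)


# ASSUMPTION: the gap to 1550 is not given in the source; 100 is illustrative.
HYPOTHETICAL_GAP = 100

for gain in (4, 5):
    quarters = quarters_to_close(HYPOTHETICAL_GAP, gain)
    print(f"{gain} pts/quarter: {quarters} quarters, about {quarters / 4:.1f} years")
```

Swapping in the real gap changes the numbers but not the shape of the argument: at a constant few points per quarter, any gap of a hundred points or so translates into years rather than months.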
The final two 'bonus' questions covered whether a Chinese model would top the LM Arena overall leaderboard at any point in 2026 (Nathan 1%, Prakash 20%, market 9.5% — they discussed LM Arena gaming incentives and noted the current best Chinese model, GLM 5.1, sits roughly 30 points below the top) and whether OpenAI would publicly release GPT-6 by year-end (both around 50%, market 71%). Wrapping up, Nathan called it a fun experiment worth refining — he floated the idea of posting the quiz on the show website so viewers could benchmark themselves — and Prakash suggested using squared-error or logarithmic scoring instead of raw absolute error. They closed with a reminder about the AIAM discount code for the AI Engineer World's Fair and signed off for the day.
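Prakash's scoring suggestion can be made concrete with a short sketch. The guesses and market values are the ones reported above; the function names are illustrative, and the logarithmic score is written here as the cross-entropy of a guess against the market price, which is an assumption (proper log scoring is normally computed against the resolved outcome, not the current price).

```python
import math


def absolute_error(guess: float, market: float) -> float:
    """Raw distance between a guess and the market probability."""
    return abs(guess - market)


def squared_error(guess: float, market: float) -> float:
    """Penalizes large misses more than proportionally."""
    return (guess - market) ** 2


def log_score(guess: float, market: float, eps: float = 1e-6) -> float:
    """Cross-entropy of the guess against the market price; lower is better.

    ASSUMPTION: the market price stands in for the true probability.
    """
    g = min(max(guess, eps), 1 - eps)  # avoid log(0) on extreme guesses
    return -(market * math.log(g) + (1 - market) * math.log(1 - g))


# (question, Nathan, Prakash, market) as probabilities, from the segment.
QUESTIONS = [
    ("AGI announced before 2028", 0.10, 0.80, 0.45),
    ("OpenAI IPO by end of 2026", 0.40, 0.15, 0.54),
    ("Chinese model tops LM Arena in 2026", 0.01, 0.20, 0.095),
]

for name, nathan, prakash, market in QUESTIONS:
    for rule in (absolute_error, squared_error, log_score):
        print(f"{name} | {rule.__name__}: "
              f"Nathan {rule(nathan, market):.4f}, Prakash {rule(prakash, market):.4f}")
```

The Chinese-model question shows why the choice of rule matters. Nathan's 1% is closer to the 9.5% market in absolute and squared terms, but under this log score Prakash's 20% does better, because logarithmic scoring punishes near-certain guesses heavily.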
Key moments
Is Anthropic more valuable than Bitcoin? Yes — I can say that definitively in terms of actually making money, having real use cases, impact on the world. Why wouldn't it happen?
Nathan Labenz · 2:35:59
When you want to raise money and you want to control the share price and control the narrative, you're going to want to do the IPO. That's really what's going on. Even Elon, with his unlimited capital-raising ability, has found it more advantageous to maintain that price control.
Prakash · 2:49:05
The questions people are posing with any runway are being answered in the affirmative — things like this are tending to happen rather than not happen.
Nathan Labenz · 2:39:04
Full transcript
Lightly edited · timestamps jump to YouTube
2:05:13
Nathan Labenz: That was great. I really enjoyed talking to him — he's genuinely as plugged in and connected as they come. And of course the Latent Space podcast is fantastic.
2:05:25
Prakash: Perhaps the most plugged-in person we've had on the show. Up to the minute — literally up to the minute.
2:05:34
Nathan Labenz: How are you feeling energy-wise? We went longer with swyx than planned. I've got a fun activity for us — we could do it now or save it for another gap later in the week. What's your feeling?
2:05:51
Prakash: I'm up for it. If you'd like to do it, let's go.
2:05:55
Nathan Labenz: Alright, let's try it. This is definitely an experiment — to put it mildly — but let's give it a go. Can we share the screen?
2:06:11
Nathan Labenz: Okay, cool. So here's the game. I had Claude Opus — one silent tear, because I had to use Opus for this — go out and look across the different prediction and forecasting market sites: Polymarket, Kalshi, Manifold. I also want to make sure we get Metaculus access in the future. Basically I had it find some interesting questions and then create this little quiz site. I haven't looked at the questions or the answers. So we're going to have a real-time friendly competition to see who can guess the current market values on these key AI-related questions. This is a shameless gimmick borrowed from Bill Simmons and Cousin Sal, who have for many years done NFL 'guess the lines.' This is the AI version — guess the current Polymarket or Manifold value. We'll see how well calibrated we are, and whether we're inspired to actually trade in some of these markets.
2:07:49
Prakash: Where do I enter my numbers? How does this work?
2:07:54
Nathan Labenz: It's hosted on my local computer right now, so I'll have to enter numbers for you, and we'll see how well the scoring works. We'll have an interesting discussion either way. Alright — 12 questions. Question 1: Will Anthropic hold the number-1 ranked AI model at the end of 2026, as determined by LM Arena's Chatbot Arena?
2:08:40
Nathan Labenz: My brain says yes — they're the most likely to hold it — but the odds? I'll say 60%. What's your number?
2:09:02
Prakash: 55.
2:09:03
Nathan Labenz: 55 — just sneaking in under. Let's see what we've got. The market is at 63.5%.
2:09:12
Nathan Labenz: Interesting. We're both pretty close. I don't think that screams that either of us should dive in and start trading. Question 2: Which company will have the number-1 math model at the end of June 2026? I'll note we should probably have longer time horizons, but this also raises the question of who even has the top math model today. Anthropic had really strong results with Fable on FrontierMath Tier 4 — I had predicted low-sixties as the success rate for the best model at year-end, and already we're in the high eighties. OpenAI is also in the high eighties with 5.5 Pro, if I recall. My gut says OpenAI probably has the best model today, though some of this isn't public. I'll say 50% OpenAI, 30% Anthropic, 10% Google, 10% the field.
2:11:15
Prakash: Very similar — I had 50, 30, 15, and 5. So 50 OpenAI, 30 Anthropic, 15 Google, 5 the field.
2:11:46
Prakash: I'm going to guess the Anthropic numbers come out higher on the markets. Let's see.
2:11:55
Nathan Labenz: Google — 73%. Okay, that's startling. It shows how differently people inside the bubble and people outside it evaluate things.
2:12:10
Prakash: It's startling how differently the inside view and the outside view evaluate this.
2:12:23
Nathan Labenz: Yeah, this is really bizarre. This is an LM Arena evaluation, so presumably it's Google's Gemini Flash that's up there high — maybe speed is one big reason Flash would be rated so highly. It's not that we've seen 3.5 Flash solve open problems — but if it's the best at answering everyday math questions, speed alone could be driving those head-to-head bake-off wins. Fable is basically right in line with the others, but something feels a little strange about that result. We'll note that feedback for the next round.
2:14:12
Nathan Labenz: Question 3: Will any company achieve AGI before January 1, 2028? This raises the question of what we mean by that — the company apparently has to announce it. I don't think we're ever going to see a moment where the big labs say 'we've achieved AGI' — Tyler Cowen called it at o3, Dario called it at some point, and my guess is they'll keep saying 'we've still got more work to do.' I'll say 10%.
2:16:40
Prakash: I'm going to rules-lawyer this and go with 80%. There will be sufficient market pressure — companies like SSI or Thinking Machines need to announce something more than just coding agents to stay alive. So I think someone will announce.
2:16:56
Nathan Labenz: 45% — exactly splitting the difference. This is often the problem with these questions: it's really hard to write good resolution criteria. The market says any 'company whose existence is verifiable and whose financing or valuation is covered by at least two major business news sources.' To me that means we're talking about maybe 10–20 Neo-lab-type companies that have raised $100M seed rounds chasing recursive self-improvement. My guess is most of the weight comes from that second tier of company that has more incentive to go for a news cycle with that kind of announcement.
2:18:39
Prakash: I mean, Thinking Machines or SSI can't stay alive by just doing coding agents — they're going to have to announce something more than that to stay alive, or get bought out. So yeah, there will be market pressure. But you're right, it's right in the middle — rules-lawyer territory.
2:19:16
Nathan Labenz: To be continued. Question 4: Will the ARC-AGI grand prize be claimed before January 2027? It pays out when an open-source solution scores greater than 85% on the private ARC-AGI evaluation set within strict compute and efficiency limits.
2:19:43
Prakash: Which ARC-AGI is this — 1, 2, or 3?
2:19:43
Nathan Labenz: I believe it's ARC-AGI 1. Frontier models have surpassed 85% accuracy, but the compute caps are very low — something like 10 cents of compute per puzzle total. The cost of running the big models far exceeds that. They also want it running in a little container in their environment. So frontier models are there on accuracy but the efficiency constraint is still very much binding.
2:20:45
Prakash: And we only have 6 months left. I'll say 40%.
2:20:56
Nathan Labenz: I was initially thinking 10 — I haven't heard much recently about small efficient models getting close to the threshold. But since you went first, I'll go up to 15%. I think you might be winning this one, and this might be one where I'd actually auto-trade on the market.
2:21:42
Nathan Labenz: It's again almost exactly in the middle. So it is ARC-AGI 1 — frontier models are in range on accuracy, but the compute caps are still very much binding. Only 84 traders and 19,000 Manifold bucks, so not super liquid.
2:24:26
Prakash: I feel like someone's going to auto-research a small model solver that just cracks it.
2:24:36
Nathan Labenz: Yeah. And there's always a little bit of 'hide the compute' going on — the auto-research compute is doing a pretraining step on the model itself, and then the inference compute looks cheap.
2:25:04
Prakash: Sleight of hand on the compute steps.
2:25:08
Nathan Labenz: Exactly. Okay, through 4 questions — incredibly tight race. Question 5: Will OpenAI complete an IPO by December 31, 2026?
2:25:23
Prakash: 15 percent.
2:25:25
Nathan Labenz: Wow, really interesting. I was going to go significantly higher — it seems like they're talking about it, and how much longer can they keep raising these $100 billion private rounds? Eventually they'll run out of people able to write those checks. My first impulse was 50; since you went first I'll drop to 40. Let's see.
2:26:18
Nathan Labenz: 54%. Alright. That's probably not enough for me to trade, but it might be enough for you.
2:26:27
Nathan Labenz: Why do you think they're so unlikely to do it this year?
2:26:31
Prakash: I just feel like they don't have enough time. Realistically, summer is dead — sell in May and go away. The market doesn't come back until August. So you get September, October, November. Your docs have to be basically ready by May or June. Then you spend the next three or four months on the road show and close by end of October, early November. I think they've already missed the window, and the earliest realistic timeline is Q1 of next year. Same with Anthropic.
2:27:40
Nathan Labenz: Decent volume on this market — you might want to get in there and make a trade. Question 6: On December 31, 2026, will Anthropic carry a higher valuation than OpenAI?
2:27:56
Prakash: Tough one. I'd say yes, but only a 55% yes — not strong.
2:28:20
Nathan Labenz: I'm also yes, and a little stronger — I'll say 75%. The growth rate has been higher, they have the best publicly known model, and swyx even alluded to some upcoming news. I know there'll be an OpenAI answer — GPT 5.6 reportedly launching this week — but Anthropic's trajectory is strong.
2:28:52
Nathan Labenz: One reason to temper my number is that government is meddling in the market in ways that don't seem fully even-handed, which could shift things. Still, I'll say 75%. Let's see. 82%. Another potential trading opportunity. That feels a touch high to me even directionally.
2:29:58
Prakash: I feel like the market is not judging the future so much as judging the present view of the future.
2:30:13
Nathan Labenz: Yeah. The other thing to keep in mind is that OpenAI has invested so heavily in compute — when you have 25% of global compute you can be a little lax from time to time on shipping the next model and you're not in an existential crisis. Anthropic's compute share is obviously much smaller. Question 7: Will NVIDIA be the world's largest company by market cap on December 31, 2026? I believe they are today, with Google at number 2.
2:31:39
Prakash: The wildcard in the works is SpaceX. All they need to do is get Starship in the sky on a regular basis actually delivering cargo. I'll say 70% NVIDIA.
2:31:56
Nathan Labenz: I'm a little higher — 80%. SpaceX would need to add more than $3 trillion in market cap just to reach where NVIDIA is today. If that happens, my guess is NVIDIA is also jamming because it's probably a broad run-up. The most likely challenger, if any, would be Google — some Gemini 4 or robotics moment where they reveal cards nobody knew they had.
2:32:57
Prakash: SpaceX gets judged on vibes — the market can run on vibes without the cash flow that NVIDIA needs to justify its multiple.
2:33:10
Nathan Labenz: Let's find out. 73.5% — you win that one by 3 points. But overall we're showing pretty good calibration to the market. I probably won't be pouring my assets into SpaceX — as much as I believe in them, another doubling of the valuation is hard to justify. Question 8: Will Anthropic's valuation surpass Bitcoin's market cap at any point before 2027?
2:35:11
Prakash: Bitcoin is around $60K a coin times 21 million coins — roughly $1.3 trillion at $60K. Anthropic's last reported valuation was around $950 billion, so they're close.
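The market-cap comparison Prakash runs in his head can be written out as a short sketch. It assumes the quoted price means $60,000 per coin (the reading that the $1.3 trillion figure implies) and uses the 21 million coin count and $950 billion valuation as stated in the conversation; the variable and function names are illustrative only.

```python
# Figures as quoted in the conversation. The $60,000 price is an assumed
# reading of the spoken "$60", chosen because it matches the $1.3T total.
BTC_PRICE_USD = 60_000
BTC_COIN_COUNT = 21_000_000          # coin count used by the speaker
ANTHROPIC_VALUATION_USD = 950e9      # "around $950 billion"


def market_cap(price_usd, supply):
    """Market cap is simply unit price times number of units."""
    return price_usd * supply


btc_cap = market_cap(BTC_PRICE_USD, BTC_COIN_COUNT)

# Fractional increase Anthropic would need to match Bitcoin at these numbers.
growth_needed = btc_cap / ANTHROPIC_VALUATION_USD - 1

print(f"Bitcoin market cap: ${btc_cap / 1e12:.2f}T")
print(f"Anthropic would need to grow about {growth_needed:.0%} to match it")
```

At these inputs the gap is roughly a third of Anthropic's current valuation, which is why the question hinges on whether a repricing event happens before year end.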
2:35:50
Prakash: I'd go with 60% yes.
2:35:59
Nathan Labenz: Yeah, I'm basically with you. Is Anthropic more valuable than Bitcoin in terms of actually making money, having real use cases, impact on the world? Yes, I'd say definitively. The biggest barrier is just that we might not see a public repricing event within the remaining months of the year. But I think it's even higher — I'll say 75%.
2:36:29
Prakash: The Chinese can invest in Bitcoin but not in Anthropic, so there's that demand asymmetry.
2:37:12
Nathan Labenz: 66.5% — I was slightly closer before I adjusted, and then I moved farther. But we're bracketing it well. Not super liquid — $150,000 in volume, and this market has only been trading for 6 weeks. Question 9: Will any AI model reach a 1550-plus overall Chatbot Arena score in 2026?
2:37:45
Prakash: I'll go 80%. I thought Fable was already up near 1500-plus.
2:38:00
Nathan Labenz: Yeah, if we open the leaderboard I wouldn't be shocked to see Fable has already crossed that number.
2:38:07
Prakash: We should have intellectual integrity — don't check the leaderboard, just guess and see.
2:38:19
Nathan Labenz: Fair. On the general theory of market design — if I'm thinking about who made this market and what they thought was a reasonable over-under, these things have generally exceeded people's estimates from 6 months ago. We still have the rest of 2026 ahead of us. I'll say 80% — the questions people are posing with any runway are being answered in the affirmative.
2:39:16
Prakash: 80% as well.
2:39:19
Nathan Labenz: 18%. Holy moly. Okay, we need to dig into what's going on here.
2:39:28
Prakash: Really miscalibrated.
2:40:15
Prakash: The leaderboard shows improvements of maybe 4 to 5 Elo points every 3 to 4 months. So going from 1508 to 1550 would take about 30 months of improvement at that pace — roughly 2 to 3 years.
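Prakash's estimate can be checked with a simple linear extrapolation. The sketch below assumes the pace he describes (4 to 5 points every 3 to 4 months) stays constant, which is a simplification; the function name is hypothetical.

```python
def months_to_target(current, target, points_per_step, months_per_step):
    """Months needed to close the gap at a constant rate of improvement."""
    steps = (target - current) / points_per_step
    return steps * months_per_step


CURRENT_SCORE = 1508   # top score cited in the conversation
TARGET_SCORE = 1550    # threshold in the market question

# Fastest pace quoted: 5 points every 3 months.
fastest = months_to_target(CURRENT_SCORE, TARGET_SCORE, 5, 3)
# Slowest pace quoted: 4 points every 4 months.
slowest = months_to_target(CURRENT_SCORE, TARGET_SCORE, 4, 4)

print(f"Fastest pace: {fastest:.1f} months")
print(f"Slowest pace: {slowest:.1f} months")
```

Under this assumption the 42-point gap takes roughly 25 to 42 months to close, which brackets the "about 30 months" figure and puts the threshold well beyond the end of 2026.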
2:40:52
Nathan Labenz: Yeah, interesting. This feels like we're hitting the limit of the benchmark — people are not asking questions that effectively differentiate the models. If you're asking basic questions, you maybe don't even notice the difference, or you like the faster model more. Looking at frontier code benchmarks — Fable versus Opus 4.8 — tells me much more than a 2-point Arena gap. Something is a little lost in that process. But maybe that's just me rationalizing a bad answer. Question 10: Which bank will be lead-left underwriter on OpenAI's IPO?
2:42:56
Prakash: I think it'll be Goldman, but JPMorgan could be in there too — Sam Altman doesn't have the same issue with Jamie Dimon that Elon does. I'd say Goldman 60%, Morgan Stanley 20%, rest of the field 20.
2:43:24
Nathan Labenz: I have no strong view, so I'll say 30/30/40 to express my ignorance. The market says 77% Goldman — so that field 40 is really just taken from the field and given to Goldman. Interesting. Why is who gets to underwrite even something people bet on?
2:44:15
Prakash: If you're in the industry it's something people want to talk about — it's about who makes the money that has to be made. Investment banking for IPOs is a bit like a studio system. You pick an underwriter, they make a lot of money. The question is: is it the very smart people at Goldman, the very well-connected people at Morgan Stanley, or JPMorgan with the large balance sheet that can support the offering? It's very hard for any other bank to get in there because you need financing relationships built over a long time.
2:45:24
Nathan Labenz: What's the barrier to OpenAI just doing something different — the way Google did some things differently in its IPO? Could they just skip the banker and use their own agents to run the road show? And would they even want to?
2:46:50
Prakash: It's a debate people like Bill Gurley have pushed for a long time — just do a direct listing where no one sells any shares upfront, you just let the market trade. Spotify had some success with that. But the problem for OpenAI and Anthropic is that they want to raise money, not just allow employees to sell shares. When you want to raise money, you want to choose your own price, lock up employee shares to create scarcity, and control the narrative for the next six months. When you do an IPO, you get to pick the price in the book-building process — the banks gather investors, run a two-way negotiation, and settle on a price the company controls. After the 180-day lockup expires, the market readjusts. So if you want to raise money and control your share price, you want the IPO. You're not forced into selling at whatever the market happens to price that day. Even Elon, with his unlimited capital-raising ability, has found it more advantageous to maintain that price control.
2:51:31
Nathan Labenz: I also feel like, beyond the mechanics, there's something in the political economy of it — you just don't screw all the big banks out of a payday they were looking forward to without creating risk for yourself downstream if you're ever in a vulnerable spot.
2:52:08
Prakash: In techno-capital theory, it's also about giving capital owners a reason to root for you. If people own your shares and they're up 20%, they're going to advocate for you — including policymakers. I think it's healthy for our congressional representatives to own stocks in the companies that employ their constituents. So yeah, the IPO makes sense.
2:52:47
Nathan Labenz: We can bracket for another conversation whether an index fund would be enough to create that enthusiasm. Two more — bonus questions. Opus decided the first 10 would be numbered and the last two would be bonuses. Bonus 1: Will a Chinese company hold the number-1 AI model at any point in 2026, according to LM Arena overall with style control off?
2:53:24
Nathan Labenz: I'll say 1%.
2:53:30
Prakash: I'd say 20.
2:53:33
Nathan Labenz: That's really high in my opinion. Let's find out. 9.5%.
2:53:42
Nathan Labenz: This feels like dumb money — enthusiastic people buying the lottery ticket of the dark horse, like the Andrew Yang prediction market bettors. At 9.5% I can get a 10% return as long as no Chinese model tops this leaderboard at any point in 2026. I'd be shocked if that happened.
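The "10% return" figure follows from the price of the opposite side of the market. A minimal sketch, assuming a binary contract that pays out 1 unit on the winning side, no fees, and a position held to resolution (none of which the conversation specifies); the function name is hypothetical.

```python
def return_on_no(yes_price, payout=1.0):
    """Return from buying NO and holding to resolution, if NO wins.

    Assumes the NO price is the payout minus the YES price and ignores
    fees and spread.
    """
    no_price = payout - yes_price
    profit = payout - no_price
    return profit / no_price


# Market price of YES on the Chinese-model question, as quoted: 9.5%.
ret = return_on_no(0.095)
print(f"Return if NO resolves: {ret:.1%}")
```

Buying NO at about 90.5 cents to collect 1 dollar yields a little over 10%, matching Nathan's rough number, but the whole stake is lost if a Chinese model tops the leaderboard even once.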
2:54:46
Prakash: I feel like LM Arena can be reward-hacked. The major frontier labs have no incentive to care about LM Arena anymore if they can get more revenue off coding benchmarks. But Chinese models are still fundraising off these benchmarks, so they're still benchmark-hacking. That's why I give it a bit more.
2:55:28
Nathan Labenz: Let's look at where the first Chinese model sits today. GLM 5.1 is the latest, and it's basically 30 points from the top. GLM 5.2 will probably close that gap substantially — maybe come in around Opus 4.6 thinking, so roughly 4th or 5th.
2:56:03
Prakash: Say probably around Opus 4.6 thinking — not below Fable 5, above Musepark. Probably number 4 or 5.
2:56:18
Nathan Labenz: Yeah, that's interesting. I'm not degenerate enough a gambler to place this trade, but I hear you — the gameability of LM Arena is the biggest real risk. And finally: Will OpenAI publicly release GPT-6 by December 31, 2026?
2:56:48
Prakash: I'd say yes, but maybe 50%. Not that high.
2:57:01
Nathan Labenz: This is a tricky one. There's inherent uncertainty about how fast progress will go, plus the fact that last time a frontier company did a full integer version increment, the model got yanked off the market. So there are weird incentives that might make them call it 5.8 rather than 6 when they get there. It feels like they did something like that with 5.4 or 5.5 — there was a step change and a lot of commentary that 'this was the real GPT-5 they should have released,' but they just called it a point release. I'm basically with you — it feels like a toss-up. I'll go slightly higher to have some difference: 55%. Let's see.
2:58:31
Nathan Labenz: 71%. The market thinks it's going to happen.
2:58:44
Prakash: Market thinks it's gonna happen.
2:58:46
Nathan Labenz: Yeah — they have an IPO, so they need something to drive the narrative. Overall I'd say we were pretty close on most of them.
2:59:01
Prakash: The biggest surprise was that 18% versus our 80% on the 1550 Arena score. We were badly miscalibrated on that one.
2:59:33
Nathan Labenz: Yeah. Okay. Cool. Well, that was a pretty fun exercise. If anybody has feedback I'd love to hear it. It could be an interesting little app to put on the website — people can go try it for themselves, compare their performance to ours. Any feedback you'd put into the transcript for Claude to self-improve on?
3:00:13
Prakash: Should we be using squared error instead of absolute error for scoring?
3:00:23
Nathan Labenz: Yeah, there's almost certainly a better scoring system.
3:00:26
Prakash: Squared error would penalize more dispersion — the forecasting community has thought a lot about this.
3:00:38
Nathan Labenz: There's also logarithmic scoring from the superforecasting literature — where saying 0% and being wrong gets you an infinite penalty, which might be a bit extreme. But yeah, I wouldn't put too much weight on the score as calculated by Opus on this first run. Definitely room for improvement.
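The three scoring options mentioned here can be compared in a short sketch. It uses only the seven questions from this stretch of the segment where both guesses and the market value were stated, treats the market value as the target for absolute and squared error, and applies the logarithmic rule to a resolved outcome, which is how that rule is normally defined. The function and variable names are illustrative, not the scoring code Opus actually ran.

```python
import math

# (question, Prakash guess, Nathan guess, market), all in percentage points.
QUESTIONS = [
    ("OpenAI IPO by end of 2026",      15, 40, 54.0),
    ("Anthropic valued above OpenAI",  55, 75, 82.0),
    ("NVIDIA largest by market cap",   70, 80, 73.5),
    ("Anthropic above Bitcoin",        60, 75, 66.5),
    ("1550+ Chatbot Arena score",      80, 80, 18.0),
    ("Chinese model number 1",         20,  1,  9.5),
    ("GPT-6 released in 2026",         50, 55, 71.0),
]


def absolute_error(guess, market):
    return abs(guess - market)


def squared_error(guess, market):
    # Squaring makes one large miss cost more than several small ones.
    return (guess - market) ** 2


def log_penalty(prob, outcome):
    """Negative log of the probability assigned to what actually happened."""
    p = prob if outcome else 1.0 - prob
    return math.inf if p == 0.0 else -math.log(p)


def total(score_fn, guess_index):
    return sum(score_fn(q[guess_index], q[3]) for q in QUESTIONS)


for name, idx in (("Prakash", 1), ("Nathan", 2)):
    print(name, total(absolute_error, idx), total(squared_error, idx))

# A 0% forecast on something that then happens is penalized without bound.
print(log_penalty(0.0, True))
```

On this subset the 62-point miss on the Arena question accounts for more than half of each host's squared-error total, which illustrates how squared error concentrates the penalty on large misses.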
3:01:15
Nathan Labenz: Cool. That was an experiment — that's what we're here to do. I enjoyed it. Anything else you want to cover before we break for the day?
3:01:27
Prakash: No. The session with swyx was a great session. Yeah.
3:01:33
Nathan Labenz: Absolutely. And the code — I'm not sure we ever said what it was for — is a discounted registration for the AI Engineer World's Fair. The code is AIAM, so basically free money if you want to go. I had a great time the one time I made it in person. Do encourage anyone who's interested to check it out. See you tomorrow, Prakash.
3:02:02
Prakash: Bye bye. Bye for now.

Opening: Cognitive Empathy for the Administration, Dean Ball to OpenAI, GLM 5.2

Nathan opened with a weekend recalibration: after Judd Rosenblatt called out the AI safety community for a lack of cognitive empathy toward the Trump administration, Nathan acknowledged he had probably been too harsh. Even if export controls are a sledgehammer applied to the wrong nail, the administration is engaged and taking AI seriously — and the constructive response is to educate, not ridicule. Prakash extended the point with Warren Buffett's 29-year lag before investing in Google, illustrating how vast the gap between the AI bubble and the outside world truly is and suggesting the administration's clumsy moves may reflect genuine unfamiliarity rather than bad faith.

Nathan then recapped his extended Cognitive Revolution conversation with Dean Ball, who is leaving White House OSTP on July 6 to lead OpenAI's new Strategic Futures team — chartered around catastrophic risk, recursive self-improvement, labor disruption, and lab–government relations. The irony Nathan flagged: Ball advocates for broad diffusion over nationalization while concluding that effective frontier-AI policy may require the inside access that only comes from working at the lab. Prakash closed the open on GLM 5.2 from Zhipu, the Chinese open-weights model released June 16 that dominated weekend discourse after Matt Veloso (former Meta/Google DeepMind VP) declared it the first open model that passes as a daily driver — a post that briefly spiked Zhipu's Hong Kong stock.

Interview: The State of AI Engineering — swyx

Shawn Wang (swyx) — who coined 'the AI Engineer' in his 2023 essay 'The Rise of the AI Engineer' and now curates the AI Engineer World's Fair while doing evaluation research at Cognition — joined for a wide-ranging conversation anchored in the state of the field as he sees it across the 30 tracks he programmed for this year's World's Fair. Software factories have replaced coding agents as the organizing frame; the RAG track is now 'search and retrieval'; continual learning sits at an unresolved fork between weight-update partisans and systems-side pragmatists.

The FrontierCode benchmark — swyx's evaluation project at Cognition, inspired by METR's finding that roughly half of SWE-Bench-passing PRs are unmergeable slop — anchored a long sequence on measuring production-ready code quality. Results show Fable scoring roughly 2.5× higher than Opus at less than 2× token cost. The conversation widened into ecosystem structure: the adviser/router pattern (OpenRouter Fusion, Sakana Fugu) is theoretically limited because a weak model doesn't know what it doesn't know; cloud infrastructure is being rebuilt for agents at a 10–15× scaling stress level straining GitHub and the SaaS stack; and enterprises are beginning to vibe-code their own internal systems of record rather than pay for a dozen $20/month subscriptions with siloed chatbots bolted on.

swyx was unusually candid on the bear case for synchronized lab IPOs — the 'Illuminati group chat' read that insiders may be distributing to retail at peak hype — while tempering it with an honest bull case that consumer agent adoption is probably still 100× underpenetrated. The segment closed on the CS enrollment collapse (Stanford −42%, Berkeley −61%) and swyx's career advice: the job market for juniors is genuinely hard, but the members of technical staff at every frontier lab are 21 years old — demonstrate taste, ship interesting things, and practice 'learning in public.'

Close: Guess the Markets

After the swyx interview, Nathan and Prakash ran a live 'Guess the Markets' segment — twelve AI-related prediction-market questions drawn from Polymarket, Kalshi, and Manifold, with neither host having seen the questions or market values in advance. Topics ranged from Anthropic holding the top Chatbot Arena rank through 2026 (both guessed ~55–60%; market said 63.5%) to which company leads in math benchmarks (both said OpenAI ~50%; market gave Google 73%, surprising both hosts), whether any company would formally announce AGI before 2028 (Nathan 10%, Prakash 80%, market 45%), and whether any model would hit 1550+ Chatbot Arena Elo in 2026 (both guessed 80%; market said 18%, prompting a discussion about Arena score saturation). The segment also covered OpenAI's IPO prospects, Anthropic's valuation versus OpenAI's, NVIDIA's market cap leadership, and whether a Chinese model would top LM Arena at any point in 2026. Nathan called it an experiment worth refining and floated posting the quiz on the show website; Prakash suggested squared-error scoring for a future round, and Nathan raised logarithmic scoring as another option.