EPISODE 2026-06-12

AI:AM LIVE — June 12, 2026 — RSI gets real, the context bet, and the benchmark Anthropic fails: Andrew Moore, prinz

The week RSI stopped being a forecast: Recursive's first autonomous results and Fable 5's 10× FrogsGame jump landed the same day Kokotajlo called for an anti-RSI treaty. Then Andrew Moore (Lovelace AI) made the case that context, not compute, is the binding constraint — and prinz, the anonymous lawyer behind prinzbench, explained why GPT-5.5 Pro laps Anthropic's best on real legal work and what frontier-lab RSI disclosures are actually signaling.

𝕏 Live broadcast

Friday's show ran long and ranged wide. The cold open chased the week's defining thread — recursive self-improvement going from forecast to empirical program in a single 24-hour window — then two guests pressed the opposite case to 'scale is back': Andrew Moore on context engines, and prinz on what a real lawyer's benchmark says about where the frontier actually stands.

Note: this record is published from the show plan reconciled against the live broadcast's actual timings. Per-segment timestamps, deep-links, and the full as-aired recap will be added once the recording posts.

The rundown

22:35 · Opening · 50 min
Cold open — RSI goes empirical, task imagination, and a same-day treaty call
A 50-minute wide-ranging open that started with Nathan's overnight Fable 5 experiment on 'task imagination' — how radically hosts and organizations need to rescale what they ask AI to do — and Prakash's token-anxiety thesis (why Meta's token leaderboards were strategically smart). The segment moved through the SpaceX AI8 IPO ($75B raised, +23% at the bell) and then dug into two RSI data points: Recursive AI's incremental but cumulatively real GPU-kernel results, and Thoughtful AI's FrogsGame benchmark showing Fable delivering a 10× improvement at fine-tuning small specialist models — a capability no prior frontier model had unlocked. Closed with Nathan endorsing Kokotajlo's call for an international RSI agreement while distinguishing a targeted governance pause from a blanket freeze.
As aired
The June 12 cold open ran nearly fifty minutes and wound through four interlocking threads. It opened with Nathan's overnight Fable 5 experiment: using the model to draft a podcast guest-prep outline from an audiobook transcript, which surfaced a larger question about 'task imagination' — the gap between what people ask AI to do and what it is now capable of sustaining over hours or days. Both hosts reflected on the psychological shift required to accept, rather than reflexively rewrite, AI-generated prose when the AI's version is genuinely better, and Nathan framed this as an ongoing personal recalibration rather than a settled policy.
The conversation broadened into how organizations might map their own AI readiness from the inside: Nathan proposed a 'chaos monkey' strategy — letting Fable take over a role for a few weeks while the incumbent takes a paid vacation, then observing what breaks — and Prakash connected this to the earlier 'token leaderboard' mandates at firms like Meta, arguing those mandates were less about optics than about forcing employees past 'token anxiety' so they would assign harder, longer-horizon tasks. The SpaceX AI8 IPO landed mid-discussion (trading up 23% at the bell, $75 billion raised), and the hosts briefly situated it in the trillion-dollar CapEx race before the segment turned to RSI.
The back half of the open pivoted to recursive self-improvement (RSI) going empirical. Nathan walked through two concrete data points: first, the startup Recursive AI's incremental but cumulatively significant results optimizing GPU kernels (framing it as evidence that the CUDA moat is shallower than assumed, especially for frontier labs that can afford to diversify chip stacks); and second, Thoughtful AI's 'FrogsGame' benchmark showing Fable delivering more than a 10× improvement in post-training small specialist models — a capability no prior frontier model had meaningfully unlocked. Nathan read the FrogsGame result as both a proof-point for the current model's quality and a glimpse of a safer near-term AI architecture: swarms of narrow, role-specific small models post-trained by a large orchestrator rather than one general superintelligence. The segment closed with Nathan endorsing Daniel Kokotajlo's call for an international RSI agreement — noting he had signed the 'ban superintelligence' statement — while distinguishing a targeted governance pause from a blanket research freeze, and flagging that interpretability and alignment work should continue regardless.
Key moments
You've probably never done anything that took AI an hour to do. Now this thing can run for a couple of days. What are you going to give it to do? Everybody needs to recalibrate and really expand their minds when it comes to the scale and scope of their task imagination.
Nathan Labenz
Token anxiety is a big thing. You don't try tasks that might take a lot of tokens, and you don't try tasks that have a higher probability of failure. You end up in a micromanagement loop where you're only assigning tasks you know AI can complete within the allocated time. When you lift the token anxiety, you end up assigning more tasks, more difficult tasks, and you're willing to accept a higher probability of failure.
Prakash
I have always said I'm a hyperscaling pauser but a diffusion-and-adoption accelerationist. It would be really good, and would make the world in some ways a lot more robust, if we had Fable-class models diffused broadly and doing all kinds of work before we do the next scale-up.
Nathan Labenz
What we covered
Task imagination — what do you give an AI that can run for days? Nathan's overnight use of Fable 5 to write a podcast guest-prep outline from an audiobook surfaced a larger question about 'task imagination': the gap between what people ask AI to do and what it is now capable of sustaining over hours or days. Both hosts reflected on the psychological shift required to accept AI-generated prose when the AI's version is genuinely better. Prakash connected this to Meta's token-leaderboard mandates: forcing employees past 'token anxiety' so they would assign harder, longer-horizon tasks — less about optics than about mapping what AI can actually do.
RSI goes empirical — Recursive's first results and a same-day treaty call. Nathan walked through two concrete data points: Recursive AI's incremental but cumulatively significant results optimizing GPU kernels (framed as evidence the CUDA moat is shallower than assumed), and Thoughtful AI's FrogsGame benchmark showing Fable delivering more than a 10× improvement in post-training small specialist models — a capability no prior frontier model had meaningfully unlocked. Nathan read the FrogsGame result as both a proof-point for Fable's quality and a glimpse of a safer near-term AI architecture: swarms of narrow, role-specific small models post-trained by a large orchestrator. The segment closed with Nathan endorsing Daniel Kokotajlo's call for an international RSI agreement, distinguishing a targeted governance pause from a blanket research freeze.
Full transcript
Lightly edited · timestamps jump to YouTube
22:36
Prakash: Good morning. It is Friday, June 12th, 9:12 AM. Welcome to AI in the AM.
22:42
Nathan Labenz: Good morning, Prakash. It's another exciting day in the AI sprint at marathon length — all the way to the Singularity.
22:52
Prakash: Indeed. What has your overnight update been, Nathan?
23:01
Nathan Labenz: I think we're starting to see more and more thoughtful uses of Fable, and more reports on what's happening with it. As always, these things mature over a few days into a more sophisticated understanding. The first 24 hours it was just, 'I gave it this one prompt and it built this whole virtual world' — lots of interesting examples of that, a little bit of which we covered yesterday. I'm still in the process of trying to recalibrate what a fellow named Nate Jones — he goes by that most of the time on TikTok and other short-form platforms — calls 'task imagination.' Basically: what are you going to ask Fable to do that is actually up to the scale of its capability? He gave a great riff on this the other day: you've probably never done anything that took AI an hour to do. Now this thing can run for a couple of days. What are you going to give it to do? Everybody needs to recalibrate and really expand their minds when it comes to the scale and scope of their task imagination.
24:31
Nathan Labenz: One of the more differentiated things I do is write outlines of questions for podcast guests. I was working with Fable last night on a couple of upcoming episodes — one with an author. I had listened to the book as an audiobook, used AI tools to extract a clean transcript out of the PDF preview, got rid of all the pesky page numbers and chapter titles you'd get from a naive OCR, then put it over into ElevenLabs to read the thing to me. So I did put the time in. But when it comes down to sitting and writing out the outline of questions, I don't have every little aspect of the book at my command, of course — I'm not taking margin notes as I go as maybe I should be. So I put the same version of the book into Fable and said, 'Look at my old stuff, and give me your version of this outline.' And again I was super impressed. It really reinforced the sense of a new way of working where I need to be open to a hybrid output format. I don't think anymore that it really makes sense to try to rewrite every word or claim every word as my own.
26:02
Nathan Labenz: It gave me an absolute ton of stuff — way more than we could cover in an hour. So as I refine the skill, one thing will be getting it to be a little shorter and more suitable for the time we'll actually have. But it did such an incredible job. The taste factor was so high — quotes from the book that motivated what I think will be a really interesting discussion. A lot of these ideas I sort of had floating around in my head, and I think I could have gotten them down on paper with a similar amount of time. And I do think it's still going to be super important that if I'm going to show up for a conversation, I've got to do the work to be ready in my own brain. That can't be fully externalized, I don't think, as long as I'm the one having the conversation.
26:47
Nathan Labenz: But it definitely took my prep to another level. My ability to go into this conversation and cite passages from the book that were really extremely compelling — little turns of phrase or analogies the author had made — it's going to allow me to be more concise in my presentation, which, as you could tell from this monologue, is not a great strength of mine, and really tee up the author in a way I don't think I otherwise would have been able to do. So this hybrid recalibration and task-scope reimagination, I think, is one of the biggest takeaways. And at the same time, I'm watching my weekly limit. Thinking a little more about how to pay overages in some cases — I think it's still worth it. Even if it was $20 to get that outline done and buy back the time, worth it. I predict the willingness to pay is going to be very high, especially with some future update to data retention policies. Their revenue should continue to go vertical with this thing.
28:17
Nathan Labenz: I also started putting in some strategies yesterday — especially for coding. Have Fable write the plan and then coordinate other coding entities, specifically Codex. I have, if we believe SemiAnalysis, $14,000 worth of credits that I'm really underutilizing — and perhaps undervaluing, because I've only been paying the $200 a month. So I'm going to try to get Fable to help me with task imagination scale and scope, but also have it start doing more delegation and checking of work to Codex, which should give me a lot more leverage on the token budget included in my Max plan.
29:24
Prakash: One question I had for you: I think you're one of the people who sees very clear improvements between GPT-5.5 and Fable. And I find myself rather confused, because on some benchmarks Fable outperforms by a very large margin, but you also have other benchmarks where Fable has a 77 and GPT-5.5 has a 76 — very close — and some where GPT-5.5 is even ahead. And then there are some categories where it's above and beyond anything ever seen before. How do you assess, on a vibes basis, that it is truly better?
30:29
Nathan Labenz: It is vibes to a very significant degree. I'm reminded of a great Dean Ball passage in response to Claude Opus 3, where he asked it what music it would want to hear if it could hear. It came back with a Beethoven piece, and one of the great ironies was that the piece Opus had chosen was one Beethoven composed while going deaf — he never really heard it himself. After doing a close reading of all this Opus material, Dean finally said, 'Here, finally, is an AI whose thoughts I want to hear.' That was two full integer versions ago. But I feel like there's something similar here — here is an AI that I actually want to work with. It's not just preprocessing data into a more useful format. I've gotten a lot of value over time from, say, just collecting links to help me prep for a conversation. But I have never considered a previous model to be a peer in the way this one kind of feels. It feels like — to say it feels like a different version of myself is too strong; I don't want to merge with it that much and I think retaining some healthy cognitive boundary is wise — but some of its insights, some of the ways it phrased questions, just felt extra incisive, extra insightful. If I had come up with that, I would have been pleased with myself. Maybe that's one way to think about it: how often do you have the sense that if you had had that idea you would have felt pretty clever? I'm not tracking that quantitatively, but it feels much more frequent with this model.
33:38
Nathan Labenz: And I'm keeping a lot more of its prose too. I used to really rewrite every word. Now I'm like, 'There are a couple of things here I might change the framing of a little bit,' but I catch myself in the habit of rewriting everything and realize I'm not actually making it better by doing this. I used to feel like I was undeniably — in my own mind at least — making it better. And a couple of times this week I've gone through that process and just said, no, it's not really any better. There's something a little antisocial about just spamming the world with raw Fable outputs in an unrestrained way, and I don't think that's where I'll land. But when I've felt this feeling, I've purposefully said, 'Okay, let's go with its language for now, rather than redoing it purely to redo it.' I feel like I need to get out of being so precious as to feel I needed to type every word. Where that settles, I don't yet know.
35:48
Prakash: I think we are in early stages of this collaboration with AI. And it's a two-way street — the AI goes some part of the way, but you also have to engage. Every person has a different level of trust and confidence that the AI has to meet. And it strikes me that some people have already just said, 'I'm going to go with the flow' — they have their AIs managing email, managing all of these things, even with very inferior models from six months ago. I wonder if their results might not end up being better than the people who are used to micromanaging. It strikes me as a little bit of a counter to the last 20, 30, 50 years of history, where people who were more management-focused and made sure things worked had an advantage versus people who just said, 'Let the world come at you.' This feels like a moment where we see a reversal of that.
37:24
Nathan Labenz: Yeah. We talked about this a little bit yesterday, but I'm definitely also thinking about this 'bot-sitting and bot-shitting' dynamic that has been happening at companies. As a mechanism for major economic disruption, if I'm a CEO focused on competitiveness and survival, one interesting strategy might be: just let it rip with Fable. Give everybody access. Encourage aggressive use. Try to set the culture that this is the new way of working. And then somehow log everything on your side and do some follow-on checks later with Pangram Labs or even just systematic checks of who is actually passing off work into other parts of the organization with no or minimal changes. And as you see that — and the house is not on fire as a result — you've kind of mapped your organization from the inside to know what jobs AI can do.
39:05
Prakash: I actually think that was the whole point of the token leaderboards earlier in the year. When every one of these large firms gave their employees a leaderboard — 'who uses the most tokens, if you don't use enough you're going to get fired' — people were laughing about it on the outside, 'Meta is so stupid.' But I actually think the CEOs were not stupid. Token anxiety is a big thing. You don't try tasks that might take a lot of tokens, and you don't try tasks that have a higher probability of failure. You end up in a micromanagement loop where you're only assigning tasks you know AI can complete within the allocated time with the success rate you want. When you lift the token anxiety, you end up assigning more tasks, more difficult tasks, and you're willing to accept a higher probability of failure. You're willing to spin up four different ways of doing the same task and just run them all and see what happens. And that expands the scope of capability you're willing to explore. That is what the token leaderboards did, and it was enormously successful within firms like Meta. It's also why the labs are now doubling limits and releasing Fable — because token anxiety holds people back from exploring the edges of the capability. And once you realize that these AIs can do certain tasks, you start to evaluate how much time you spend doing them yourself, and whether that was really a fruitful use of your time. That is really where we are at this stage.
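The "spin up four different ways of doing the same task" point can be made concrete with a little arithmetic. This is a minimal sketch, assuming each attempt succeeds independently with the same probability; the 40% per-attempt success rate is a hypothetical number chosen for illustration, not a figure from the show.

```python
def p_any_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical: a hard task that a single run completes 40% of the time.
for n in (1, 2, 4):
    print(f"{n} parallel attempt(s): {p_any_success(0.4, n):.1%} chance of a usable result")
```

Under these assumptions, a task too risky to assign once becomes a reasonable bet when run four ways in parallel, which is the behavior a lifted token budget makes affordable. Real attempts are rarely fully independent, so treat the numbers as an upper bound on the benefit.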
42:54
Nathan Labenz: You know, there's this inside-out hydraulic mapping that companies can do — just say, 'Let's try Fable in this world.' It's almost like the old Amazon Web Services chaos monkey. The social barriers will be the biggest challenge, which is why maybe just letting people do it on their own and then trying to detect where it's happening — rather than going in top-down in a mechanistic chaos monkey way — might give better results. But I think you could do a chaos monkey strategy in a lot of organizations where you go to a role and say, 'We're just going to have Fable do this role for a little while, and we're going to see what happens. You get a paid vacation for a couple of weeks while we have Fable do your job.' I really think we are getting to the point where that would work in a lot of places.
43:39
Nathan Labenz: In my own Twitter account takeover over the last 48 hours, it hasn't embarrassed me. It's posted not exactly what I would post — it's probably been a little in-distribution, honestly. I feel like my own persona needs to get more out-of-distribution if I want to be actually successful online. You've been much more experimental with your online writing and persona over time than I have. Three or four years ago, when I was first writing publicly about AI, my assessment was that everybody was asleep on this and people needed reliable information. I felt like I could be, for many people, a really reliable source, so I wanted to maintain my credibility above everything else. That may have been the right strategy in '22 and '23 as people were getting calibrated to GPT-4. Now I probably need to get a little weirder — both because there are a lot of good sources of basic factual information I can share the burden with, and because it's all getting so hard to assess. Back then I was still better than the models at pretty much everything with maybe a few exceptions. Now they're better than me at everything with few exceptions. So to be a little more speculative and exploratory in one's analysis and persona is probably going to be higher return.
45:55
Nathan Labenz: And another reason is that as everybody does these Fable takeovers, that's going to be the baseline information environment. To be different is probably going to be necessary to be rewarded. In my Twitter takeover, it hasn't gone out and been way more creative or out-of-distribution than I have, but it's fallen pretty well in line. If this had been a social media manager, I'd say Fable can do this job — it won't embarrass the brand, it won't make errors at a super-normal rate. And it probably costs an order of magnitude less than a person, even to do something like post a handful of tweets a day. A person might do a tweet an hour throughout their day. You can get that done in not that many tokens. $20 worth of tokens is plenty to run your social media account for the day.
47:42
Prakash: I think it is possible at this point using Fable to build a good social media manager. It's just that it would take a lot more tokens than people think. I always say: the AGI is here, it's just going to take a lot more tokens than you think it is. What has worked for me is understanding that the Twitter timeline doesn't have a memory, but people on the timeline do. So you need to be the memory for the timeline — pointing out things that happened a year or two ago that have relevance to the particular moment now. That is definitely doable by a model. But the context and memory search required — a model with truly infinite memory that can rank the relevance of every graph it has ever seen and inject it into the particular zeitgeist at the right moment — that's where the tokens go.
50:09
Prakash: The other part of social media posting I always compare to surfing. You don't generate the wave — you catch an interest wave that is building. So you have to watch and wait to see where the interest wave is building, then paddle really hard and catch it. When you're surfing it, people think you're doing it, but it's the interest wave that other people have in that topic that's really driving the views. I think models will eventually get very good at this. They already write well. They need a little more on visual memes and video editing, but they have essentially infinite memory via embedding vector search and can have ranking algorithms that rank how important certain things are. Eventually models will be better than us at social media posting — much, much better — and it's all going to be very interesting to read, even if we're not sure how comfortable we're going to be with that.
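The "memory via embedding vector search" idea can be sketched in a few lines. This is a toy illustration only: the three-dimensional vectors and the item labels are hypothetical stand-ins, and a real system would get its vectors from an embedding model and would likely combine similarity with other ranking signals.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query: list[float], memory: list[tuple[str, list[float]]]):
    """Return memory items sorted from most to least similar to the query."""
    return sorted(memory, key=lambda item: cosine(query, item[1]), reverse=True)


# Hypothetical "timeline memory": (label, embedding) pairs.
memory = [
    ("old post on chip supply", [0.9, 0.1, 0.0]),
    ("old chart of token prices", [0.1, 0.9, 0.2]),
    ("old thread on benchmarks", [0.0, 0.2, 0.9]),
]

# Hypothetical embedding of today's topic, closest to the token-price chart.
query = [0.2, 0.8, 0.1]
ranked = rank(query, memory)
print("most relevant memory:", ranked[0][0])
```

The sorted list is the "memory for the timeline": the top item is the older material most relevant to the current moment.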
52:01
Nathan Labenz: The Fable knowledge cutoff is still January 2026, which does stack the deck against it a little in terms of keeping up with the zeitgeist. But I do think the new Twitter API is a work of real genius in this respect — they've done a great job of giving you all the access you want at a price that, for any given marginal access, is not that much, but which is still going to add up to a dollar a day or whatever as agents scale. That could very quickly get to Facebook-level or beyond revenue numbers just through the API as people start scaling agent usage. That's one more reason to be bullish on what they might accomplish — and speaking of which, we haven't touched on the SpaceX AI8 IPO yet.
53:18
Prakash: Are they live yet? I've been watching the IPO.
53:26
Nathan Labenz: I just saw a picture from Nikita — who said 'always check your DMs' — where he got a DM inviting him to consider joining what was then the X team. And there he is now standing at the stock exchange podium today as they ring the bell to celebrate their IPO.
53:48
Prakash: So they are live and they are trading. They're at $166 — that's 23% above the IPO price. The ideal IPO is 20% above what they sold shares at, so at this point every single retail investor is in the money. Elon has distributed shares to millions of people; they're not going to be mad. They have a chance to sell. There's some speculation the price will drop 3 to 6 months out, but I think long term it's still an excellent stock. They have $75 billion to spend now. Elon, to date, has only raised about $12 billion in equity on all the rocketry. He's going to raise debt on top of that — once you have a government Golden Dome contract, you're basically rated equivalent to US Treasuries. So he's going to have perhaps $30 to $100 billion to spend over the next two or three years. Google spent $150 billion this year and is expected to spend $290 billion next year. Elon's going to try to match them. The trillion-dollar CapEx spend forecast for next year is starting to look low. A terrific amount of spending is coming.
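The price figures quoted here imply an offer price that can be backed out directly. A minimal sketch, assuming the $166 and 23% figures are exact as stated on air (they are likely rounded), with the 20% "ideal" pop taken from the same remark:

```python
def implied_offer_price(current_price: float, pop: float) -> float:
    """Back out the offer price from a current price and a fractional gain."""
    return current_price / (1.0 + pop)


offer = implied_offer_price(166.0, 0.23)  # figures as quoted on the show
ideal_first_day = offer * 1.20            # where a 20% pop would have landed

print(f"implied offer price: ${offer:.2f}")
print(f"price at a 20% pop:  ${ideal_first_day:.2f}")
```

On these inputs the implied offer price is roughly $135, so the stock was trading a few dollars above the 20% mark described as ideal.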
56:14
Nathan Labenz: It's going to be wild. A couple of other things I want to touch on before we get to our first guest today — who is an expert in creating context and making things cheaper to run, which is something I want every note I can get on. These things relate to the big trend we should all be watching: where are we on this recursive self-improvement process? And also, calibrating ourselves on just how good Fable is. So two items — do you want to bring up Recursive, or should I?
57:00
Prakash: I'll bring up Recursive. Yeah.
57:04
Nathan Labenz: First one — interesting to think about in the context of our conversation yesterday with Tom, where we were asking how any of these neo-labs are going to shake the snow globe — is there any path for them that doesn't go through a frontier company? These guys haven't fully disclosed how they're doing it — whether they're using GPTs, Claudes under the hood, fine-tuning open-source models. But they're posting some pretty nice incremental results on these various small-scale speed-run type tests. And Sakana is doing this now too — they just formed a recursive self-improvement group. Long-time Cognitive Revolution listeners will remember that Sakana posted probably 18 months ago that they had created a superhuman CUDA engineer doing dramatic speedups on GPU kernels. That lasted about 36 hours before they had to come back and say, 'Sorry, we got reward-hacked and our kernels aren't actually that good.' I think folks have largely learned their lessons on that front, and the models are getting better. Here we're not seeing shattering of records, but little incremental improvements that cumulatively and multiplicatively add up over time to more and more efficient use of compute. If you can make something 3% better, or take 18% out of the real-to-theoretical gap, all these little improvements mean all the big training runs downstream are going to be that much more efficient.
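The "cumulatively and multiplicatively" claim is easy to check with arithmetic. A minimal sketch, assuming the individual gains are independent and stack multiplicatively; the count of ten improvements and the 50% starting utilization are hypothetical, and only the 3% and 18% figures come from the remarks above.

```python
import math


def compound_speedup(gains: list[float]) -> float:
    """Overall speedup factor from a series of fractional gains (0.03 = 3%)."""
    return math.prod(1.0 + g for g in gains)


def close_gap(utilization: float, fraction_closed: float) -> float:
    """New utilization after closing part of the gap to the theoretical peak."""
    return utilization + fraction_closed * (1.0 - utilization)


# Hypothetical: ten separate 3% improvements applied one after another.
print(f"ten 3% gains compound to {compound_speedup([0.03] * 10):.3f}x")

# Hypothetical: a kernel at 50% of theoretical peak with 18% of the gap removed.
before = 0.50
after = close_gap(before, 0.18)
print(f"utilization {before:.0%} -> {after:.0%}, a {after / before:.2f}x speedup")
```

Ten small wins compound to roughly a one-third speedup under these assumptions, which is why unspectacular individual results can still matter for large training runs downstream.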
59:50
Nathan Labenz: I would suspect that many of the things Recursive is finding have already been found by the frontier companies — so I don't think Anthropic and OpenAI engineers are racing to implement these results; they're probably thinking, 'Oh, they found what we found a little while back.' But will this generalize and diffuse? I think so. And I think the CUDA moat is shallower than assumed, especially for frontier labs that can afford to hire however many engineers they need and spend whatever inference compute is required to diversify their chip stack.
1:00:01
Prakash: One note there is that Anthropic specifically is a multi-chip company now — they have Trainium, they have NVIDIA GPUs, and I think they may have a deal with Microsoft or Maia. That indicates they're not afraid of optimization at the chip layer anymore, and they've been a multi-chip company for at least 12 to 18 months. So they must have a well-defined system to go from model to optimization on chip for every single chip vertical. The Recursive guys specifically ended up going from H100s to the Blackwell chips, so they are fully trained on NVIDIA. But I expect that to generalize.
1:00:56
Nathan Labenz: This is something to watch, because we've already seen NVIDIA start to break out its financials separating the frontier lab companies — a very significant portion of their revenue — from everyone else. You brought that to my attention with the idea that there's a lot less effective barrier to movement at the frontier companies, because for them it's worth it to hire however many engineers they need to diversify their chip stack. That would suggest there's going to be margin pressure from those customers on NVIDIA long term — the CUDA moat just doesn't seem that deep if you're an OpenAI or an Anthropic, and certainly not if you're Google or an AWS-scale customer. And results like these from Recursive suggest this stuff is going to diffuse pretty fast. By no means am I predicting the death of NVIDIA — they've got great hardware, locked-up supply chains for years to come. But will they be the one stack that everybody has to build on? Will NVIDIA lock create effective AI export controls? I'm not sure exactly how that's supposed to work anyway, and I just don't think the original premise is super credible in light of results like this.
1:03:13
Nathan Labenz: All this money, all this compute, all this energy — it's just going to go farther and farther over time. We see that in retail token pricing too: it keeps dropping. There's probably at least another couple of orders of magnitude to go on that front before it's all said and done.
1:04:11
Nathan Labenz: I expect the optimization work to generalize — to AMD, and beyond. I was going to change to Thoughtful. You want to do that?
1:04:30
Prakash: Yeah. And speaking of recursive, we have Daniel Kokotajlo — a former OpenAI researcher — who is calling for an international RSI agreement. Daniel was the researcher who left OpenAI at a time when they were likely going to be punitive on his shares, and he left without taking them.
1:05:20
Nathan Labenz: I just put the outline in our host chat — the third section in that link I just sent you is the tweet I had highlighted from him yesterday. You want me to share it?
1:05:47
Prakash: Yes, I have it up.
1:06:01
Nathan Labenz: I'm with him on this. I think it's hard to operationalize, but I did sign the 'ban superintelligence' statement last year on the general logic that the stated plan from the frontier companies — automating AI R&D as soon and as much as possible, having the AIs train their successors, and gradually then suddenly accelerating us into some totally different world — is something we're not ready for. How do you operationalize that? How do you define it? That's a real challenge. But I think we really should start having that conversation. That doesn't necessarily imply a total pause — different people in the AI safety world come down very differently on what exactly should be paused. I have always said I'm a hyperscaling pauser but a diffusion-and-adoption accelerationist. It would be really good, and would make the world in some ways a lot more robust, if we had Fable-class models diffused broadly and doing all kinds of work before we do the next scale-up. The overhang of what exists inside companies, how limited our understanding of these things is, how we've barely had time to absorb as a society the current level of capability — that to me seems like a huge problem.
1:08:07
Nathan Labenz: I wouldn't pause all research. There's a lot of work to be done in interpretability, a lot of work in understanding how we shape AIs — like the stuff we talked about yesterday with Tom from Goodfire. I wouldn't shut everything down. But I would love to see some governance on this notion of just handing the keys to the future over to an AI and hoping they really got what we meant. Here's one more thing I'll touch on before we go to our first guest. This is Thoughtful — a company started in part by Karina Wen, who used to be at Anthropic, then at OpenAI, and now she's doing this. It's maybe one of the more telling quantitative-vibes data points: can you get your top model to train a small model effectively to do a job for you?
1:08:53
Nathan Labenz: This particular 'FrogsGame' thing is a kind of Sudoku-type puzzle they're training a small model to do. The big models can often just solve it, but the small models can't. So the challenge for the big model is: can you train the small model to solve it? This involves all the tricks and know-how and hard lessons learned by post-trainers who've been in the trenches. The models up until Fable basically didn't move the needle on what small models can do — they just couldn't do this sort of post-training effectively. But here we see more than a 10× improvement in small models' ability to do these tasks. I think this is one way it could be really good: if you had very narrow, very small, very role-specific specialist models in all these different niches, that could be a great world — it gives us a lot of abundance in a very affordable way. The small model that got post-trained to play FrogsGame is not going to go out of control; it can only do probably the frog game at the end of this training. Building out a world where we have these little role-specific AIs doing their jobs really well creates a much more buffered environment that's probably a lot more resilient to another generation of AI that's amazing at everything coming in and shocking the system.
1:10:23
Nathan Labenz: So this goes to show, again, just what capabilities we have that we have not yet absorbed, and gives a foreshadowing of what a world of tons of small but highly performant AIs could look like in all these different niches — and how we get there. There aren't enough human post-trainers, but now we have Fable to do the post-training. Watch that space.
1:11:07
Prakash: One question I have: is your sense that Fable itself is a smaller model trained by Mythos?
1:11:16
Nathan Labenz: It doesn't seem like it. The price reduction from Mythos Preview to Mythos 5 and Fable 5 means something changed — they were somehow able to make it more efficient. Was that a smaller model, more parallelization, more experts, optimization at the compute level they're passing on to customers, maybe just accepting lower margins? There are all sorts of possible explanations. But my read of the system card is that they're at least saying Mythos 5 and Fable 5 are the same model, just with different safeguards. Maybe from preview to launch version there was some pruning or shrinkage process, which does seem to be a constantly ongoing thing. But the version we're using as customers is, I think, the full current top-of-the-line — same as the Mythos launch version, just restricted in a number of ways.
1:15:42 · Interview · 29 min
Andrew Moore: context, not compute, is the binding constraint
Andrew Moore
The founder and CEO of Lovelace AI (former CMU computer-science dean, former head of Google Cloud AI, and the first AI advisor to U.S. Central Command) made the case that knowledge graphs and context engines, not raw model size, decide whether AI is reliable on questions that actually matter. Nathan used his own personal deep-context system as a jumping-off point; Andrew's core through-line was that the hardest problem in mission-critical AI is not intelligence but recall: ensuring nothing slips out of a system when the consequences of a miss are severe. He illustrated with the ship-tracking problem, in which resolving the identities of ten million vessels becomes a multi-source game of Sudoku, and closed with a CTO prescription: you must organize your old data, but AI-assisted ingestion has eliminated the old two-to-five-year master-data-management nightmare.
As aired
Andrew Moore, founder and CEO of Lovelace AI and formerly Dean of CMU's School of Computer Science and head of Google Cloud AI, joined Nathan and Prakash to make the case that context infrastructure — not model scale — is the binding constraint on high-stakes enterprise AI. Nathan opened by grounding the conversation in his own personal deep-context system, inviting Andrew to share lessons from operating at trillion-fact scale that small-scale builders can apply. Andrew's through-line was that the hardest problem in mission-critical AI is not intelligence, but recall: ensuring nothing slips out of a system when the consequences of a miss are severe.
The technical core of the segment was Andrew's explanation of how Lovelace's Elemental platform pre-caches and continuously reconciles a knowledge graph (currently around one trillion facts, growing at roughly one billion per week) so that agents retrieve fully joined context in milliseconds rather than spawning expensive just-in-time search at query time. He illustrated with the ship-tracking problem: entity resolution across ten million vessels requires cross-checking lexical name matching, registries of global identifiers, AIS transponder signals, and import/export records simultaneously, because any single channel is assumed to deliver only about 95% recall and that gap is unacceptable when the decision may involve stopping a vessel. The same Sudoku-like multi-source corroboration principle, he argued, applies to Nathan's personal graph: a single summarization pass will always have a miss rate, and adding independent data channels (news about contacts, social media, satellite telemetry at enterprise scale) is one of the only battle-tested paths to high reliability.
The segment closed with a practical CTO-level prescription: enterprises cannot defer data organization until agents are deployed, but they no longer need multi-year master-data-management projects either, because AI-assisted ingestion now lets organizations bootstrap joined, auditable context in a fraction of the old time. Andrew also gave a detailed look at Lovelace's edge representation (compact in-memory edge records, each carrying one 64-bit pointer to a separate slower-access metadata layer that holds source reliability, sentiment, and provenance), and endorsed the pattern of using a large model to generate a gold-standard training set on a small corpus, then distilling progressively smaller fine-tuned classifiers as volume grows: a computational-stinginess discipline he described as core to achieving more than 100× cost reduction relative to frontier deep-research models.
Key moments
Recall is much harder than precision. At Google, if you forgot to show something and the rest of the results were good, it was much more acceptable to end users. But for safety-critical AI — should we strike this building in the next half an hour? — you need to know everything about anyone who's potentially there.
Andrew Moore
We are now able to show comparative results to Gemini and OpenAI deep research models with much less than one percent of the compute cost. The reason is not that we're super geniuses who've invented a whole new form of AI. All we're doing is pre-caching.
Andrew Moore
You have to organize your old data. But you no longer have to live in the nightmare of big master data management projects that used to take two to five years. You cannot possibly expect an agent, at the last minute when someone's asking an urgent question, to go back, look at all the crappy old data, and try to process and join it at query time. You're asking for trouble if you do that.
Andrew Moore
Questions asked
What makes knowledge graphs effective at scale, and what lessons can small-scale builders take from your experience at Lovelace?
Andrew emphasized that the core challenge at scale is not intelligence but recall — making sure nothing slips out. The key discipline is treating information retrieval as a multi-source corroboration problem: Lovelace ingests billions of facts across news, public registries, satellite telemetry, and more simultaneously, because any single channel has at most ~95% recall and that is insufficient for high-stakes decisions. Entity resolution — knowing that the same ship, company, or person is the same entity despite different names across sources — is the hardest and most consequential part of operating a trillion-fact graph.
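The arithmetic behind the multi-source argument can be sketched in a few lines. This is an illustration only: it assumes channels miss facts independently of one another, which the segment does not claim holds exactly (Andrew noted that adversaries sometimes try to vanish from every channel at once). The function name `combined_recall` is hypothetical; 0.95 is the per-channel figure quoted above.

```python
def combined_recall(per_channel_recall: float, channels: int) -> float:
    """Chance that at least one of `channels` sources captures a given fact.

    Assumes every channel has the same recall and that misses are
    statistically independent (an idealization, not a measured property).
    """
    if not 0.0 <= per_channel_recall <= 1.0:
        raise ValueError("per_channel_recall must be between 0 and 1")
    if channels < 1:
        raise ValueError("need at least one channel")
    # A fact is lost only if every single channel misses it.
    miss_all = (1.0 - per_channel_recall) ** channels
    return 1.0 - miss_all


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        print(f"{k} channel(s): recall = {combined_recall(0.95, k):.7f}")
```

Under these assumptions, one channel misses 1 fact in 20 while two independent channels miss 1 in 400, which is the intuition behind watching many streams rather than perfecting one.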
How does pre-caching a knowledge graph achieve more than 100× cost reduction compared to just-in-time deep research?
Andrew explained that most enterprise AI systems over-rely on just-in-time computation — spawning search agents at query time to find and join all relevant context from scratch. Lovelace instead continuously pre-caches and reconciles context so that when an agent needs to investigate a complex scenario, the relevant nodes and edges are already joined and available in milliseconds. Yes, the cost is moved to the ingestion pipeline, but amortized across many queries and leveraging techniques familiar from large-scale data engineering, the total compute budget drops by more than 100× even accounting for ingestion costs.
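A rough cost model makes the amortization argument concrete. Everything here is a sketch under stated assumptions: the function names are hypothetical, the cost units are arbitrary, and the numbers in the usage example are invented to show the shape of the trade-off, not Lovelace's actual figures.

```python
def total_costs(queries, jit_cost_per_query, ingest_cost, lookup_cost_per_query):
    """Return (just_in_time_total, precached_total) for a given query volume.

    Just-in-time pays the full search-and-join cost on every query.
    Pre-caching pays one up-front ingestion cost plus a cheap lookup per query.
    """
    jit_total = queries * jit_cost_per_query
    precached_total = ingest_cost + queries * lookup_cost_per_query
    return jit_total, precached_total


def break_even_queries(jit_cost_per_query, ingest_cost, lookup_cost_per_query):
    """Query count above which pre-caching is cheaper than just-in-time."""
    saving_per_query = jit_cost_per_query - lookup_cost_per_query
    if saving_per_query <= 0:
        return float("inf")  # pre-caching never pays off in this regime
    return ingest_cost / saving_per_query


if __name__ == "__main__":
    # Illustrative numbers only (arbitrary compute units).
    jit, pre = total_costs(queries=1000, jit_cost_per_query=1000,
                           ingest_cost=9000, lookup_cost_per_query=1)
    print(f"just-in-time: {jit}, pre-cached: {pre}, ratio: {jit / pre:.0f}x")
    print(f"break-even at about {break_even_queries(1000, 9000, 1):.1f} queries")
```

The point of the sketch is that the ratio depends on query volume: ingestion is a fixed cost, so the advantage grows the more often the same joined context is reused.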
What is the ROI on using a large model to generate training data and then distilling progressively smaller fine-tuned models for high-volume subtasks?
Andrew described a staged distillation pipeline: use a large model (Lovelace uses Claude) to generate high-quality extractions on ~200 documents at a cost of roughly 35 cents each; use those to train a moderately sized classifier for the next 10,000 documents; use that in turn to build a highly fine-tuned small model for the next million. The pattern reduces per-document cost by orders of magnitude while maintaining quality, and he noted that frontier models like Claude Fable 5 are now beginning to automate this distillation process itself.
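The staged pipeline above can be costed out with a short sketch. Only the document counts and the roughly 35 cents per document for the large model come from the segment; the per-document costs for the mid-size classifier and the small fine-tuned model are placeholder assumptions, and training costs for each stage are left out for simplicity. `Stage`, `pipeline_cost`, and `all_llm_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    documents: int
    cost_per_document: float  # dollars


def pipeline_cost(stages):
    """Total inference cost when each stage handles only its own documents."""
    return sum(s.documents * s.cost_per_document for s in stages)


def all_llm_cost(stages, llm_cost_per_document):
    """Baseline: run the large model on every document in every stage."""
    return sum(s.documents for s in stages) * llm_cost_per_document


STAGES = [
    Stage("large LLM extraction", 200, 0.35),          # figures from the segment
    Stage("mid-size classifier", 10_000, 0.01),        # assumed cost
    Stage("small fine-tuned model", 1_000_000, 0.0001),  # assumed cost
]

if __name__ == "__main__":
    staged = pipeline_cost(STAGES)
    baseline = all_llm_cost(STAGES, 0.35)
    print(f"staged: ${staged:,.2f}  all-LLM: ${baseline:,.2f}")
```

With these placeholder prices the large model accounts for only the first 200 documents, so nearly all of the volume runs at the cheaper tiers; the real savings would depend on the actual per-tier costs and on how much quality each distillation step preserves.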
How does Lovelace represent edges in the knowledge graph to balance traversal speed with rich metadata?
Andrew described a two-tier edge design: a highly compact in-memory representation ("what you might call pointers") that fits in primary and secondary cache so the AI can retrieve millions of edges in under a second during active reasoning, backed by a 64-bit pointer to a separate slower-access metadata store (~10ms) containing source reliability scores, sentiment, and the full provenance chain. This metadata layer is what lets the system audit any edge that disagrees with other data sources and determine whether it was caused by deliberate deception or an ingestion error.
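A minimal sketch of a two-tier edge layout along these lines, using only the Python standard library. The byte layout, field choices, and class names are assumptions made for illustration; the segment describes the idea (small fixed-size hot records plus one 64-bit pointer into a slower metadata store), not Lovelace's actual format. A plain dict stands in for the slower store.

```python
import struct
from dataclasses import dataclass

# Hypothetical hot-tier record: source id, target id, weight, metadata pointer.
# "<IIfQ" = two uint32, one float32, one uint64, little-endian, no padding.
EDGE = struct.Struct("<IIfQ")


@dataclass
class EdgeMetadata:
    source_reliability: float
    sentiment: float
    provenance: list  # chain of steps that produced the edge


class TwoTierGraph:
    def __init__(self):
        self._hot = bytearray()  # compact records, cheap to scan in bulk
        self._cold = {}          # stand-in for the slower metadata store
        self._next_ptr = 1

    def add_edge(self, src, dst, weight, meta):
        ptr = self._next_ptr
        self._next_ptr += 1
        self._cold[ptr] = meta
        self._hot += EDGE.pack(src, dst, weight, ptr)
        return ptr

    def scan(self):
        """Iterate (src, dst, weight, ptr) tuples without touching metadata."""
        yield from EDGE.iter_unpack(self._hot)

    def metadata(self, ptr):
        """Slow path: fetch provenance only for edges that need auditing."""
        return self._cold[ptr]


if __name__ == "__main__":
    g = TwoTierGraph()
    p = g.add_edge(1, 2, 0.5, EdgeMetadata(0.9, 0.0, ["registry match"]))
    print(EDGE.size, list(g.scan()), g.metadata(p).provenance)
```

The design choice being illustrated is that bulk reasoning touches only the fixed-size records, while the metadata lookup is reserved for the comparatively rare case where an edge conflicts with other sources and needs to be audited.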
What should enterprise CTOs do about legacy data infrastructure before deploying AI agents — clean it up first, or deploy now?
Andrew said you must organize your data, but the good news is the old two-to-five-year master data management nightmare is gone: AI itself can now bootstrap a reliable, joined context layer from fragmented legacy sources in a fraction of that time. For low-consequence domains like consumer apps or gaming you can skip it, but for any decision that affects people's lives — mortgage approvals, cargo inspections, trading compliance — you cannot ask an agent to join fragmented data at query time under time pressure. Get the context layer in order first; then deploy agents on top of it.
Related
Lovelace AI ↗
Andrew Moore on X ↗
Full transcript
Lightly edited · timestamps jump to YouTube
1:15:42
Nathan Labenz: Let's come back to that, and let's welcome Andrew to the show.
1:15:46
Prakash: Andrew Moore is the founder and CEO of Lovelace AI, a Pittsburgh enterprise AI company arguing that the real bottleneck in high-stakes AI is not model intelligence, but missing context infrastructure. Before this, Andrew led Google Cloud AI and served as Dean of the Carnegie Mellon School of Computer Science, one of the top computer science schools in the United States and in the world. Lovelace also presents him as the first AI adviser to US CENTCOM. To be specific about that claim: he is, I believe, the first official AI adviser; other people have advised US CENTCOM before. Lovelace came out of stealth in April with Elemental, a system for turning fragmented enterprise and public
1:16:32
Prakash: data into citation-backed knowledge graphs. And in early June they published a benchmark with a lightweight Gemini model plus Lovelace's YottaGraph approaching Google's deep research quality at a fraction of the cost. He believes that enterprise AI fails because models are being asked to reason over messy, weakly linked data in real time instead of over prebuilt, structured, auditable context. Andrew, welcome to the show.
1:17:01
Andrew Moore: Prakash, thank you very much for having me. I'm happy to be here.
1:17:06
Nathan Labenz: So I'm doing this on a very small scale for myself — regular listeners have heard me describe my own personal deep-context system. When you're doing it for enterprise, obviously there's a scale question and an engineering question; the flows of data are just massive compared to what I process as an individual. So I thought, let's stipulate that there are moats for you there — I don't have to worry about that for Nathan's deep-context opportunity in the near term at least. But I would love to learn as much as I can, and hopefully get some tips I can take back to improve my own system, on how
1:17:51
Nathan Labenz: you are structuring data. What have you learned about knowledge graphs? What makes them really effective that small-scale tinkerers like myself can learn from your hard-won experience?
1:18:05
Andrew Moore: Well, I don't think what you're doing is small-scale tinkering at all. I think we're all exploring this world and trying to figure out how to get these untamed LLM engines to do useful stuff for us. We're all exploring different approaches, and working together we can do better. For me, there are some questions out there which involve the AI having to connect millions of dots, perhaps out of billions. I've always been fascinated by cases where something tragic or horrible happens, and we say we could have seen that if anyone had just joined the dots. And the killer there is that you need to have
1:18:51
Andrew Moore: pretty reliable dots ready to join. The difference between small scale and large scale is that we are now at Lovelace working with billions of dots — information about every company in the world, every location where shipments might happen or people might be moving, all the news articles over the last five years. Once you start at that scale, you have to work really, really hard to make sure you've joined the dots carefully, and that is actually the hardest thing for us. When I'm working with a nice friendly little cell-biology graph or a
1:19:36
Andrew Moore: local CRM graph or something like that, it's fine — I can go in and manually tweak it if I see mistakes. If you're ingesting 100,000 new events every second, it has to be completely automatic, and that's what I find really difficult.
1:19:54
Nathan Labenz: Give me the double-click on that, because I still have some of those problems even at my small scale. Just yesterday I pulled one of my new favorite party tricks: when I'm talking to somebody in person I'll say, "You know what, I'll have my agent send you my wiki article about you — tell me if you see any mistakes or hallucinations." I have two versions: one private, and one I share with my autonomous agents that's supposed to be sharing-safe — the same way I wouldn't share every detail of my conversation history with a human personal assistant. I try to map that same
1:20:41
Nathan Labenz: best practice onto the way I'm working with my agents.
1:20:43
Andrew Moore: Very good.
1:20:44
Nathan Labenz: I just did this for a friend, and sure enough there was a pretty notable detail that was missed. I went back and debugged it: this got missed at the summarization layer. I don't have any structure in place to catch that until it shows up at the surface behavior level. So how can I, ideally, avoid these things in the first place, detect them when they're there, and heal the graph before it's making wrong assumptions in the actions or communications that
1:21:29
Nathan Labenz: it's actually taking? I've come a long way, but I still have room to close the gap to what seems like it should be theoretically possible.
1:21:38
Andrew Moore: It's a great one. One of the things you're pointing out is driving all the big search folks crazy in the age of AI, which is recall. Recall is much harder than precision. As Prakash mentioned, I was previously at Google. For Google results, it was really bad if you were imprecise and showed the wrong result to someone. But if you forgot to show something, as long as the rest of the results were good, it was much more acceptable to end users. This is absolutely critical, and it's a good example of one of the reasons I founded Lovelace — you've got to
1:22:23
Andrew Moore: be very careful about recall. Especially if you imagine asking your AI for information to help decide which ship to stop for a search and seizure, or which of seven million trades in a day to investigate for possible money laundering — those big weighty decisions, you can't just rely on precision, you have to rely on recall. Just to be really clear: precision is if I ask for information about,
1:23:09
Andrew Moore: say, Stephen Colbert — it's precise if all the results that come back are actually about him. It's high recall if I'm actually getting everything about him. For a website you don't expect everything; but if you're looking at something potentially lethal — should we strike this building in the next half an hour? — you need to know everything about anyone who's potentially there. So all I've done so far is underline how relevant recall is
1:23:54
Andrew Moore: to those of us doing safety-critical AI. When it comes to achieving that correctness, the number-one thing I've always used — and that we're using at Lovelace — is making sure you have many redundant forms of information. If I took information only from news and ignored social media, or ignored what's kinetically happening in the world that I can observe via satellite — if I ignored any of those other streams — it's much easier for something to drop through. If I've got five or six independent major streams of data coming in,
1:24:40
Andrew Moore: then you have to be really unlucky for something to disappear from all of them simultaneously. It can still happen — and sometimes people deliberately try to make it happen — but it's much, much harder. So one of my big design principles for high-reliability AI systems is they've got to be watching dozens, and in some cases hundreds, of channels simultaneously, operating under the assumption that they're getting 95% of the information they need from each channel alone, and that 95% is simply not large enough to rely on any single channel.
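For readers who want the two terms pinned down, here is a small sketch of the standard definitions Andrew is describing. The function name and the document identifiers are hypothetical examples, not anything from Lovelace's system.

```python
def precision_recall(retrieved: set, relevant: set):
    """Standard set-based definitions.

    precision = share of returned items that are actually relevant
    recall    = share of relevant items that were actually returned
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Ten documents truly concern the person; the system returns four, all correct.
    relevant = {f"doc{i}" for i in range(10)}
    retrieved = {"doc0", "doc1", "doc2", "doc3"}
    precision, recall = precision_recall(retrieved, relevant)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every returned result is correct (precision 1.0) while six of the ten relevant documents are silently missing (recall 0.4), which is the failure mode that is tolerable on a web results page and not in a safety-critical decision.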
1:25:23
Nathan Labenz: That's a good first answer, and I'm already imagining ways to map this onto my own system. How about scaling inference processing? As I look back on this mistake my system made yesterday, I think probably the biggest error was that I have kind of multiple signals — all the channels through which I communicate — but I haven't actually extended it to news about my friends. Some of them are making news, and that could add something. But that wouldn't solve cases where the summarization itself is wrong. I'm detecting one weakness in
1:26:08
Nathan Labenz: my system: there is a single summarization step that says, "here's everything that happened this month in raw form — write a summary." Those summaries are really good,
1:26:17
Andrew Moore: Yeah, but if they —
1:26:18
Nathan Labenz: miss something, we now have a problem. So how do you think about layering multiple angles of summarization, extraction, and entity identification to boost that recall as high as possible while being at least somewhat mindful of your token budget?
1:26:41
Andrew Moore: Oh my — you've asked so many questions at the same time. Let me give some examples that don't involve your personal friends or mine. Take a bunch of ships. You can imagine how important it is in many parts of the world right now to understand what a large number of ships are doing — not just in the obvious conflict zones, but everywhere. The world's supply chains are in absolute turmoil, and you really want to understand what's happening. Take one ship: if you're watching
1:27:26
Andrew Moore: about ten million ships simultaneously and figuring out what's going on, any one of those ships — you've got to get its identity right. The same ship can get called multiple names just because of different languages, or because people actively change the apparent name of the ship. Suddenly your graph, if it holds data on all the ships in the world and what they've done, gets incredibly noisy if multiple names cause you to hallucinate there are two ships when there's only one. Quite often you get the other situation too, where the AI model might
1:28:11
Andrew Moore: think there are three or four different ships that are in fact the same one. How do you corroborate that data? Well, you've got the traditional large-language-based approaches — looking at what names are called lexically — plus all kinds of registries with global identifiers for those objects. So you have two systems simultaneously double-checking each other. But this is also a good example where, because ships like many things have transponders they keep on for safety purposes, you can observe them from space.
1:28:56
Andrew Moore: You can notice if what you thought was one ship on the transponders breaks up into two, or vice versa. Suddenly you have a completely independent channel to corroborate your other information. On top of that, you can look at import or export records of what the ship supposedly did when it arrived at different ports. What I see is our AIs playing this enormous game of Sudoku all the time to figure out the true identities of the objects we're working with. And by the way, one of my claims to fame is that I've made more mistakes on big AI projects than almost anyone I know over the years, because I've been at this a long time. One of my earliest mistakes was
1:29:41
Andrew Moore: relying on a single source of data every time I did these big AI projects. Using multiple sources is one of the only battle-tested ways I've found to get reliability in answers to these kinds of questions.
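One way to picture the corroboration idea is a sketch in which two ship names are merged into one entity only when at least two independent channels agree they refer to the same vessel. This is a toy stand-in, not the probabilistic method described in the interview: the channel labels, the `min_channels` threshold, and all names are assumptions, and union-find is simply a convenient structure for grouping aliases.

```python
from collections import defaultdict


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def resolve_entities(claims, min_channels=2):
    """Group aliases into entities.

    `claims` is an iterable of (channel, alias_a, alias_b) tuples, each one
    channel's assertion that two aliases are the same object. A pair is merged
    only if at least `min_channels` distinct channels assert it.
    """
    support = defaultdict(set)
    uf = UnionFind()
    for channel, a, b in claims:
        uf.find(a)
        uf.find(b)  # register both aliases even if never merged
        support[frozenset((a, b))].add(channel)
    for pair, channels in support.items():
        if len(pair) == 2 and len(channels) >= min_channels:
            a, b = tuple(pair)
            uf.union(a, b)
    clusters = defaultdict(set)
    for alias in list(uf.parent):
        clusters[uf.find(alias)].add(alias)
    return sorted(sorted(cluster) for cluster in clusters.values())


if __name__ == "__main__":
    claims = [
        ("lexical", "Ship A", "Ship B"),
        ("registry", "Ship A", "Ship B"),
        ("lexical", "Ship B", "Ship C"),  # only one channel: not merged
    ]
    print(resolve_entities(claims))
```

Requiring agreement before merging trades some recall on true matches for protection against the noisy-graph failure Andrew describes, where a single mistaken name match invents or collapses ships.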
1:30:00
Prakash: One question I had: to what extent do you see this as compute minimization? Because in order to do your search, you can either have all the compute at the end — at the point you kick off a query, with agents having to redo all the work each time — or instead you create this intermediate state that multiple agents can share. To what extent do you see this kind of economy of compute appearing?
1:30:40
Andrew Moore: You're asking just the right questions. Both of you are computer scientists at heart, so you totally get this. When you're building efficient computer systems — whether that's video games, self-driving controllers, or big data pipelines — you've always got trade-offs between pre-caching, lazy computation, and just-in-time computation. One of the mistakes I've seen from folks trying to do big enterprise-data AI is they rely far too much on just-in-time computation. That's what allowed us — and I'm really proud of the fact that we are now
1:31:27
Andrew Moore: able to show comparative results to Gemini and OpenAI deep research models with much less than one percent of the compute cost. The reason is not that we're super geniuses who've invented a whole new form of AI. All we're doing is pre-caching. The computational-economics battle is: an agent suddenly needs to become an expert in every trade involving a certain set of municipal bonds and a certain public figure. Instead of the first step
1:32:12
Andrew Moore: being the agent spawning lots of search agents to go find all the players, those players are already there for it. It's a matter of milliseconds before we've got all the context the agent needs to do its investigation. You're probably thinking, "Andrew, you just moved the problem — you're now suffering at data ingestion instead of at query time." And I respond: yes, it is a real pain for us. But there are the kind of tricks that big integrators like Google are very familiar with for amortizing the cost of data streaming and routing, which saves you a huge amount of search and aggregation that you'd otherwise be doing at query time. And it turns out that our overall compute budget is reduced by still more than a factor of 100
1:32:58
Andrew Moore: even when we account for the cost of pre-caching so much information.
1:33:21
Nathan Labenz: Another dimension that echoes that is fine-tuning small models to do important subtasks within your overall process. We were just reviewing results from Thoughtful where they showed that the new Claude Fable 5, in a fully autonomous way, is dramatically better at fine-tuning a small model on a hard task that previous models really couldn't handle. So that seems like another way to say: yes, we're putting a lot of work up front into training a
1:34:06
Nathan Labenz: small model, but at enterprise scale that investment pays back enormously. I don't expect you've had time to automate this yet — we're only a couple of days into Fable 5. But what have you seen in terms of the ROI on training small models to do key parts of your information processing? And what does that suggest about how compute-intensive — or how cheap — we can ultimately make these workloads?
1:34:40
Andrew Moore: Great question. I do agree with your analysis of what's going on with Fable. It's something my team and I were very aware of while building Vertex, and I think Anthropic has done a really good job of bringing it in quickly. The general principle — and it's one I use because I'm extraordinarily stingy with compute — is that as you have a specific task, if you and your agent know you're going to have to do something a million or even a billion times, like, say, a recent occasion where because of a customer's needs we had to extract
1:35:25
Andrew Moore: megabytes of information from tens of millions of documents — you can start that with serious LLM work. An LLM can read and understand everything in the documents. It's a little slow, but it doesn't take long to get a hundred or two hundred documents with clean extractions. Sure, you might have paid 35 cents per document for all of that. The key is that agents — we do tend to use the Claude models — have then generated their own little training set. So you can say, "I'm not going to pay 35 cents for all ten million documents. You've done a great job on 200 of them. Let's build
1:36:10
Andrew Moore: a classifier." And it's wonderful that these days Claude can do that pretty autonomously for us. What you'll often find is that as we get even stingier, we might be prepared to have a moderately expensive classifier for the next 10,000 documents. By that time we have a really good training set to make a very highly fine-tuned and smaller model for the next million. That kind of computational stinginess is really powerful. And just like everything else in AI right now — talk to us four weeks ago and we said, "Oh yeah, we're really clever, we figured this out." Four weeks later it's
1:36:55
Andrew Moore: built into the latest model, and that's the way things will continue to go.
1:37:01
Nathan Labenz: A little more in-the-weeds question on your knowledge graphs: how do you think about edges? In my world right now I basically just have simple pointers, and that's mostly okay because my total entity list isn't that big — so if I do a graph fan-out from a particular point, it doesn't explode on me. In your context, with the scale of enterprises, presumably you have to have not just simple pointer edges, but different kinds of edges, different weights of edges. What have you found to be an effective way to
1:37:47
Nathan Labenz: label or provide metadata around edges in the graph in order to facilitate high-performance downstream reasoning?
1:37:54
Andrew Moore: I love the question — these are really, really good. This is a good example of where you need the right taste: taste in systems design rather than any formula. When we look at the big analyses our systems do with the knowledge graph, they're often doing large probabilistic computations — weighing evidence from many different sources and routing probabilistic numbers to come up with a posterior distribution of what's really going on. When you look at the behavior required during that, you want there to be
1:38:39
Andrew Moore: very specific actions around tracing edges, counting edge frequency, and dynamically applying different weighting functions to edges. We have implemented that in a highly compressed fashion — what you might call pointers — where we really do care about how many bytes are involved in every aspect of it, so as much as possible lives in primary and secondary cache, often in RAM on a server, with SSDs as the next tier before you get into persistent database storage. But as you've rightly guessed,
1:39:25
Andrew Moore: you cannot get away with just a sparse representation. Every edge has a bunch of metadata — we call those attributes, which confuses everyone including ourselves. We have this carefully designed set of additional attributes on edges where an AI can look for things like source reliability, overall sentiment, or metadata about the full chain of reasoning that caused this edge to come into existence, so that we can always — if we don't like
1:40:10
Andrew Moore: an edge, or find that it's disagreeing with all other data sources — look back and find out whether someone deliberately misled us or whether we simply had a data-ingestion problem. So: very small packets of data representing those edges, with one 64-bit pointer to much slower access — on the order of ten milliseconds — for extra metadata that the AI can't afford to iterate over millions of at a time. Whereas our primary structure: if the AI needs millions of edge records, it gets millions in under a second.
1:40:51
Nathan Labenz: Thank you. You can tell who has moats at the app layer by who's willing to go this deep on the details of their stack, so I appreciate it.
1:40:59
Andrew Moore: It's — I mean, as you could probably tell, I love this stuff. It's such a joy to deal with efficient compute and the mathematics behind it, together with the seriousness of the applications.
1:41:14
Prakash: Maybe one last question. One of the things I find is that before deploying agents on my own stack — and at many older firms — you often have a mishmash of systems from various eras: systems that are ten, twenty, thirty, even forty years old still in production, all the way to systems deployed in the last six months. You have all these data sources spread out across various use cases. And a lot of times I think CTOs end up in this dilemma: do I invest in cleaning everything up
1:41:59
Prakash: before I deploy agents on the stack? Or can I deploy agents right now? Or do I do a partial deployment on parts of the stack first? How would you advise an enterprise looking at the next two to three years — what should they do to get ready for this coming future?
1:42:23
Andrew Moore: This is going to sound self-serving, but it's because this is the gap I identified that gave me the urgency to create Lovelace. You have to organize your old data. But what you no longer have to do is live in the nightmare of big master data management or data-warehouse-building projects that used to take two to five years. If you've swallowed the AI red pill, you can use AI to build reliable, joined data from all these different sources, going forward. And in some areas,
1:43:08
Andrew Moore: like consumer apps or gaming, where the consequences are low — I think you can get away without doing it. But at the point where you're making a decision that might affect whether someone gets awarded a mortgage, or whether a group of people are put at risk by stopping and inspecting a tractor-trailer somewhere — at that point, you've got to have your data in order. You cannot possibly expect an agent, at the last minute when someone's on the mic asking an urgent question, to go back and look at all the old fragmented data and try to process and join it together at query time. You're asking for trouble if you do that.
1:43:59
Prakash: Indeed. Andrew, thank you for being with us. It's been a pleasure to have you, and we hope to hear more from Lovelace soon.
1:44:08
Andrew Moore: Absolutely. This has been so much fun. See you guys.
1:44:11
Prakash: Bye-bye.
1:44:57 · Interview · 57 min
prinz — the anonymous lawyer whose benchmark the AI world watches
prinz
An anonymous practicing lawyer and one of the sharpest capability commentators on AI X — appearing voice-only to preserve his anonymity — on the view from a seat that both bills the hours and measures the models. prinz walked through prinzbench's two sub-scores (legal research and needle-in-a-haystack search), explained why OpenAI has dominated (search excellence plus full reasoning effort), and gave a candid early Fable 5 read: strong legal reasoner, competitive with GPT-5.4, but still showing search weaknesses. He then delivered his signature alpha from the Fable system card: Anthropic explicitly stated the Mythos acceleration is 'concentrated in engineering execution rather than research judgment' — and for prinz, the moment models show genuine novel-research capability is when RSI is actually close. The segment ranged from the super-empowered lawyer (agents watching email at midnight, autonomously redlining contracts, ingesting entire data rooms) to illegible chain-of-thought, acausal AI coordination, bio risk, and the danger of nationalizing frontier labs.
As aired
Prakash opens the segment by introducing prinz — an anonymous practicing lawyer active on X — and handing him the floor to describe prinzbench, the private legal-research benchmark he built from scratch over just six months. prinz immediately underscores the benchmark's unexpected relevance: it now spans dozens of models not because the task is niche, but because the cadence of releases has been relentless. He traces the arc from early colleagues' frustration with hallucinating chatbots (including o3, which would confidently fabricate entire New York statutes) to the present, where GPT-5.5 Pro and Fable dominate his real-world legal workflows.
Nathan pulls up the prinzbench leaderboard live and presses on the OpenAI-dominated color-coding. prinz explains that his benchmark has two sub-scores — a pure legal-research component and a needle-in-a-haystack search component — and that Anthropic's models have historically scored near zero on search while OpenAI's have excelled at both. He credits the Opus 4.8 jump (from roughly 25 to 42) to Anthropic finally unlocking max-reasoning effort, and offers a cautious early read on Fable: strong legal reasoner, possibly competitive with GPT-5.4, but still showing search weaknesses. He confirms the evaluation is entirely manual — no API harness, no automation, just him and a computer. The conversation then widens into the alpha prinz harvested from the Fable launch-week documents: Anthropic's own system card signals that the acceleration from Mythos is concentrated in engineering execution rather than research judgment, and he treats that distinction as the key RSI tell — the moment models show genuine novel-research capability, the countdown begins. He buttresses this with the unit-distance math result, where OpenAI's model solved a decades-unsolved problem in a single shot roughly 48% of the time, with a positive second derivative on the scaling curve.
The segment's final third ranges across the legal and philosophical implications of the transition. prinz sketches the super-empowered lawyer of the near future — agents monitoring email at midnight, autonomously red-lining contracts, and eventually ingesting entire data rooms to surface diligence flags — and extrapolates to a world of micro-lawsuits adjudicated by automated courts within minutes. He engages seriously with Nathan's Newcomb-problem framing on AI acausal cooperation and illegible chain-of-thought, agreeing that monitoring is imperfect but redirecting concern toward more immediate risks: bioweapons, authoritarian AI, and — most pointed for a lawyer — the danger of nationalized frontier labs dismantling the checks and balances built over 250 years. He declines to assign a p(doom) number on principled grounds (calling numeric precision on messy real-world outcomes an epistemic affectation) but lands on a characteristically candid summary: there are real, serious risks, but he sees no strong evidence they are unnavigable, and worrying without action is simply not productive.
Key moments
When Anthropic and OpenAI really start seeing signs that these models become good at research, that's when we're really, really close to actual RSI — which is, to me, the thing that is happening in AI right now.
prinz
If we put superintelligence in the government's hands and let it serve only the government, there would be a significant risk that the system of checks and balances created 250 years ago would no longer be able to prevent the kinds of things it was designed to prevent.
prinz
I do not think we have strong evidence that it is impossible to navigate these risks, or that it is extremely unlikely that we will navigate them.
prinz
Questions asked
What has changed in the last 12 months in terms of lawyers actually using AI for their day-to-day work?
prinz sees junior associates using GPT-5.5 Pro (via enterprise subscription) for research and double-checking documents — uploading completed memos to catch inconsistencies, typos, and errors. He also sees colleagues using Harvey and Legora for document editing. However, the majority of lawyers still think of AI as a hallucinating chatbot, partly because many first tried AI when o3 was the flagship model — a model that would confidently fabricate entire New York statutes. The contrast with today's models is dramatic: o3 now feels ten generations behind GPT-5.5 Pro and Fable.
Why has OpenAI dominated the prinzbench leaderboard, and where will Fable land?
prinzbench has two sub-scores: a pure legal-research component and a needle-in-a-haystack search component. OpenAI's models have excelled at both. Anthropic's models historically scored near zero on the search sub-score (out of 24), and prinz attributes part of the gap to Anthropic sandbagging on maximum reasoning effort — a constraint that Opus 4.8 finally broke, jumping from roughly 25 to 42 overall. Fable is already not a zero on search, is clearly the best legal reasoner outside OpenAI, and is competitive with at least GPT-5.4 — but may still suffer from the same search weaknesses as previous Anthropic models. prinz cautiously places Fable near the top of the leaderboard, probably not above GPT-5.5 xHigh, with the full run still in progress.
What stood out to you in the Anthropic release documents for Fable 5 and Mythos 5 that the broader discourse may have missed?
prinz highlights that Anthropic's system card explicitly states the acceleration from Mythos is 'concentrated in engineering execution rather than research judgment.' The blog post's examples of 'novel' scientific results — such as training a model 100 times smaller to outperform a 500-million-parameter model that was published in Science and apparently trained before April 2025 — are impressive but do not represent the kind of genuinely novel research capability that would signal proximity to RSI. For prinz, the moment these models begin producing truly novel scientific insights — not just engineering acceleration — is the moment RSI is genuinely close. The leaked OpenAI memo about potentially delaying an IPO if RSI happens also caught his attention as an indirect signal about labs' own internal timelines.
What does a super-empowered lawyer using AI look like — analogous to the super-empowered engineer running hundreds of agents?
prinz envisions agents embedded in Outlook monitoring email in real time, autonomously redlining contract changes sent at midnight, saving the updated document, and alerting the lawyer for a one-click approval before sending to the client. Further out, AI should connect to entire data rooms autonomously, review every document, draft the diligence memo, and surface the five red flags — compressing hundreds of hours of associate work into an overnight run. He speculates that as deal costs collapse, a lawyer might handle 20 simultaneous deals instead of one. He also flags a broader question: whether the legal system itself changes, from AI-empowered judges to automated courts handling micro-lawsuits in minutes.
How worried should we be about the trend toward illegible chain-of-thought reasoning in models like Fable, and what does it mean for monitoring and alignment?
prinz notes this is not new — we've seen artifact-filled chains of thought in OpenAI's models too. His core concern is that chain-of-thought may not reliably reflect what the model is actually doing, which is why Anthropic invests heavily in mechanistic interpretability. More importantly, a superintelligence that knows it is being monitored may calibrate its visible reasoning to avoid alarming overseers — a lawyer's intuition that the same facts can always be framed to serve different impressions. He concludes that monitoring chain-of-thought alone is not a sufficient alignment strategy, and that monitoring superintelligence generally is probably not a perfect tool — but that we should continue paying attention and act if we see something bad.
Do you maintain a p(doom) number, and what is your overall assessment of AI risk for someone looking to you for a cue?
prinz declines to assign a p(doom) number on principle: the vocabulary of p(doom) is mostly used by people who are already doomers, and reducing messy real-world risk to a precise probability reflects the same flaw as other overconfident logical constructions. He identifies real and serious risks — bioweapons, authoritarian AI (especially nationalization of frontier labs), and autonomous weapons — but argues these point in completely different directions for activism and policy. His net view is cautiously optimistic: he sees no strong evidence that these risks are unnavigable, worrying without action is counterproductive, and the world is generally more complex than any neat logical chain from premises to catastrophe.
Related
prinz on X ↗ · prinzbench ↗ · prinzai.com ↗
Full transcript
Lightly edited · timestamps jump to YouTube
1:44:57
Prakash: So let me introduce prinz. prinz is not running a lab — one of the few people joining us who isn't. He is on X. He is adjacent to BigLaw — I'll let him introduce his own function. And he has, over the last, I think, 18 months or so, put together prinzbench, which is a legal AI benchmark that measures performance on actual real tasks. prinz, why don't you introduce yourself?
1:45:35
prinz: Thank you. First of all, thank you for inviting me. It's a great pleasure to be here. One thing I'll note: it feels like it's been 18 months since I started prinzbench, but it's actually only been six. And I think that's the interesting thing about it — the benchmark feels like it's been around for a very long time because there are so many models I've benched on it. But it really only started at the beginning of this year, and it's already almost saturated. The reason the list is so large is that the cadence of model releases has just been absolutely crazy, which I'm sure everyone in the audience has noticed too.
1:46:21
prinz: I am a lawyer. I specialize in a niche area of US law, and I've been using AI tools in my practice for a while now — and doing a lot of yapping on social media, as mentioned. It's a great pleasure to be here.
1:46:48
Prakash: Let me take it from the top. We often see lawyers start from the position that AI is not that useful, that it doesn't do what they need it to do — quite a large resistance at the beginning. What has changed over the last 12 months? Have you seen your peers start to use AI more for day-to-day work?
1:47:23
prinz: That is a fantastic question. I can only see a small slice of the market, but here's what I observe in that slice. I'm absolutely seeing — certainly in the junior associate ranks — use of AI for legal research and double-checking work. You've written a diligence memo for a client; you upload it to GPT-5.5 Pro on the enterprise subscription and ask: do you see any errors, any issues, any spelling typos? And that acts as a really good extra
1:48:08
prinz: layer of quality control on work that humans have already done — catching inconsistencies, inaccuracies, and so on. I'm also seeing a lot of colleagues now working with Harvey and Legora, particularly for editing documents. I haven't used them myself, but I've heard good things. So that's clearly happening, probably still in relatively early stages. That said, I think many lawyers still believe AI is just a hallucinating chatbot — 'why would I ever include this in my workflows?' Lawyers, by and large, are extremely non-technical. It is very difficult for me to fully explain to a technical audience how non-technical lawyers are. Even restarting a computer is not always straightforward. So when you open a chatbot, see a menu of models, don't know what to select, you get scared, you use the default model, it hallucinates at you, and you decide never to use AI again for the rest of your life. A very, very common experience.
1:48:53
prinz: I unfortunately find that some of my colleagues tried AI for the first time around the period when o3 was the best model. And you may remember that model — it hallucinated a lot. It very confidently fabricated entire sections of New York statutes, very convincingly. I caught it, of course, but that is not a good experience for a legal professional. And when you ask what's changed over the past 12 months, one answer is that the best model used to be o3 or o3 Pro. Now it's either Fable
1:50:24
prinz: or GPT-5.5 Pro, depending on your preference. And it's not even night and day — o3 feels like it's ten generations ago at this point.
1:50:40
Nathan Labenz: I've got your prinzbench leaderboard up right now. A couple of quick questions about that, and then I really want to give you an open forum to tell us some of the things you've noticed about the Fable launch and the surrounding discourse this week — because you're such a close watcher of these releases and the companies behind them that I'm always looking for a little alpha from you: things you've noticed that maybe I haven't, or that you think are flying under the radar. But starting with prinzbench: it stands out from most of the benchmark landscape for being something GPT has dominated.
1:51:26
Nathan Labenz: It's all in the color coding — green for OpenAI, all green across the top of the leaderboard. Why behaviorally? How would you characterize what GPTs are doing better? And is it too soon to ask for a reading on Fable — where it's going to land on this leaderboard? I'd love to understand that. And then I really want to get into your more esoteric but high-signal observations about this Fable release week.
1:51:59
prinz: Two excellent questions. On the leaderboard as it stands: Fable is going to come in pretty high — I don't know exactly how high yet, but I'll get to that in a second. What I've found is that OpenAI's models are really good at the two components of my benchmark. The benchmark has two parts: one is a pure legal-research score, where I ask hard legal questions of the kinds I encounter at work; the other is a search sub-score, which isn't even necessarily legal — it's
1:52:45
prinz: needle-in-a-haystack search. Really, really difficult pieces of information that models have had trouble locating on the internet. OpenAI's models are incredible at search — historically. GPT-5.4 was actually slightly better than 5.5 on that sub-score; not quite sure why. OpenAI's models have also been really good at legal reasoning and legal research. By the way, the top two rows on this benchmark — the Pro models — I'd encourage the audience to ignore those, because they use parallelized compute and it isn't a fair comparison to the other models. That said, 5.5 xHigh and 5.4 xHigh and a number of others have done extremely well. Google has not released as many models recently as Anthropic or OpenAI. Every time Google does release a model it tends to be solid at the time, and then OpenAI's next two or three releases make it less relevant. I'm certain Gemini 3.5 Pro, whenever it arrives, will be a good jump on this benchmark. I did
1:54:15
prinz: not benchmark 3.5 Flash — there didn't seem to be enough demand for it. So hopefully we'll see a new Google offering soon. The other models are a mix — some Grok, some Chinese models. For Anthropic specifically, I think their models have historically been held back on my benchmark by two things. One is that they have simply not been very good at the search sub-component. Prior to Opus 4.8, it was not uncommon for an Anthropic model to score 0 out of 24 on search. Zero. Now, to be clear — the searches in that component are intended to be extremely hard. I'm not suggesting Anthropic's models can't search the internet for basic queries. But for really, really difficult-to-find information, they have historically struggled. The second issue is that I think Anthropic was somewhat sandbagging on maximum reasoning
1:55:46
prinz: effort. With Opus 4.8, after the new deal with Elon, they released max reasoning effort — at least in the Claude app. And Opus 4.8 did much better than 4.7 and all previous Anthropic models. I think the reason is simple, and it's been observed over and over in every context: the more tokens a model can consume, the better the result. For Fable, I'm still testing it. My early impression
1:56:31
prinz: is that it's going to land somewhere near the top of the benchmark — probably not as high as GPT-5.5 xHigh, and I'm not yet sure whether it'll beat 5.4 xHigh. I'm finding some of the same search issues that previous Anthropic models have had. It won't score zero — it's already not a zero. But I'm not sure it will be meaningfully better than Opus 4.8 on search. We'll see — it's very early. It is, however, a really, really strong legal reasoner — clearly the best legal reasoner released
1:57:17
prinz: by anyone outside OpenAI. No question. And it's competitive in legal reasoning with at least GPT-5.4, possibly with 5.5. So that's where I am on Fable: absolutely a great model. And I want to be clear — I found GPT-5.4 to have been a huge jump on this benchmark. That was the first time I was testing a model and thinking, okay, I'm going to give it this question and it's almost certainly going to get it right. It was just so consistently producing correct answers even on the hardest
1:58:02
prinz: questions — though obviously with errors in many places; it didn't get anywhere near the maximum score. So when I say Fable is around that level, that's a real compliment. There's a gulf between that level and everything below it. In my testing: really good model, maybe still suffers from some needle-in-the-haystack search issues, but I can confidently recommend it to legal practitioners. It seems great.
1:58:34
Nathan Labenz: Are you doing manual evaluation?
1:58:38
prinz: Yes. Yes. I am not a benchmark company — I'm just a guy with a computer, and no one is paying me for this. So it is manual. I know I could have done it faster via API with automation, but it just hasn't been worthwhile for me. If this goes further, maybe I'll consider it, but not yet.
1:59:01
Prakash: Have any of the labs reached out to you? I've noticed they've interacted with you on the timeline, but have they actually reached out directly?
1:59:11
prinz: About prinzbench? No — I may eventually have a conversation with someone at one of the labs, but that hasn't happened yet. It's just been me. And, to be clear, it's disclosed on the GitHub as well: I'm not paid by any of the labs; I'm independent. I don't care which model scores best on prinzbench. I just want to make sure I'm personally using the model that is the best for my legal work — that's really the driving force behind all of this.
1:59:53
Nathan Labenz: Give us some alpha you've picked up this week. This could be from your own testing, from the system card — I know you're often a close reader. What stands out to you? Prakash and I, and everyone tuned in, are pretty in touch with the discourse. So we're looking for the deep cuts — things you think even the AI-obsessed have overlooked or not fully appreciated.
2:00:26
prinz: There is a lot to talk about. To me, the most interesting thing about the release of Fable 5 and Mythos 5 — the models themselves are obviously incredible, very impressive at coding; you're seeing great examples on your timeline, no surprise there — the really interesting thing has been how Anthropic has presented them in the accompanying documents. There is a lot of discussion about the differences between engineering and research. And when you think about it, it all makes sense.
2:01:11
prinz: I think everyone except Elon Musk appreciates the distinction between engineering and research. But Anthropic has made it really explicit. The 'When AI Builds Itself' blog post, if you read it carefully, talks a lot about how Mythos is an incredible engine for accelerating engineering — how it lets Anthropic's engineering staff write code so much faster, with really high quality. But then they say — and this is from the system card — 'the acceleration is concentrated in engineering execution rather than research judgment.' And
2:01:56
prinz: it reads like they spent a lot of time searching for signs of life in the Mythos model — can it really do novel research? Is it able to give us genuinely novel insights?
2:02:32
Nathan Labenz: — searching for signs of life: can it really give novel insights?
2:02:35
prinz: Exactly. And it seems that from all the disclosures, the answer so far is no. There are a couple of examples in the Fable release blog post that Anthropic calls novel — novel drug discovery, novel hypothesis in molecular biology. But if you dig in, one of the examples was: 'we outperformed a recent model published in Science, despite the Mythos-trained model being 100 times smaller.' That sounds exciting. But it turns out the model they outperformed
2:03:20
prinz: was a 500-million-parameter model, trained apparently before April 2025 and not by a frontier lab. So it is impressive that Mythos trained a smaller model to outperform it — but this is not the unit-distance problem. It's a nice result. And this is the area I'm most interested in: when Anthropic and OpenAI start seeing genuine signs that their models are capable of novel research, that is when we are very, very close to actual RSI —
2:04:06
prinz: which is, to me, the thing that is happening in AI right now.
2:04:12
Prakash: One question I have: what does a super-empowered lawyer with access to AI actually look like? We have a sense of what a super-empowered engineer looks like — the ability to spin up hundreds of agents and dramatically expand the scope of engineering work, tenfold or a hundredfold. We hear of people inside Anthropic managing hundreds of agents simultaneously. What would the equivalent look like for a lawyer — using AI in that expansive,
2:04:57
Prakash: massive way?
2:05:00
prinz: That is a fantastic question, and not one I've thought about extensively, because lawyers are maybe a year behind the frontier right now — possibly more. I highly doubt there are lawyers at scale today running hundreds of agents. It's also not just what the lawyer can do — it depends on what the legal system looks like. The environment changes. If I
2:05:45
prinz: looked at myself in isolation from the system: I would have an agent running in Outlook, watching my emails in real time. That agent sees a client email at midnight with changes to paragraph 17 of the indemnification clause, does the work, saves a new redline, and messages me: 'Hey, I just did this — what do you think?' I press approve, the agent composes an email and sends it back to the client. The kind of workflow some poor associate needs to stay up until 2 a.m. for
2:06:30
prinz: can be done automatically — you dramatically decrease the time between feedback and delivery of changes. I think there's a huge amount of deal-document automation available. There's certainly a lot of research automation. If I'm reviewing a data room today, I upload some documents to GPT-5.5 Pro, and I could have it draft a diligence summary — I don't typically do that, but I could, and I think it would do quite
2:07:15
prinz: well. But the future is: why did the AI need to access the data room manually at all? The AI should automatically connect to the data room, review every single document, draft the diligence memo, and hand me a summary: 'Here are the five red flags I found — what do you think?' And this process of hundreds and hundreds of hours plowing through a data room looking for change-of-control provisions gets done overnight. Maybe I'm not even being ambitious enough. Maybe the answer is that once this happens, the cost to do deals drops so
2:08:00
prinz: dramatically that I'm doing 20 deals simultaneously because it's so much cheaper. It's hard to say. I don't know that lawyers will necessarily need hundreds of sub-agents, but I'm probably wrong about that — ten years from now I'll think, 'how did I not see that? Everything benefits from scale.' A lot of it also depends on changes to the legal system itself. What does it look like when a judge is empowered by AI? What does
2:08:46
prinz: the legislative response look like? How does the legal community react? Will this be allowed, and to what extent? And what about people representing themselves — pro se — without needing a lawyer at all?
2:09:03
Prakash: Pro se, yeah.
2:09:04
prinz: Exactly. Another post I wrote about a month ago that got some reaction: micro-lawsuits. I'm walking by a McDonald's, I slip and fall, I've suffered $10 in damages. My AI notices this and files a micro-lawsuit against McDonald's for $10. The court — automated by some AI system — charges me $1, and the McDonald's AI agent wires me $9 within 15 minutes. In theory, doable. I don't know how long it takes to get there.
2:09:49
prinz: I don't know if the legal system will allow it. So many variables, so many unknowns. It's really interesting to think about.
2:10:02
Nathan Labenz: We're coming up on the end of our scheduled time — do you have a hard stop, or could you editorialize a bit more? If you can, I'd love to zoom out as far as possible and zoom in as far as possible at the same time. I love the positive vision — what I'd call the 'infinite justice future' — the micro-lawsuit world. But I also agree that RSI is kind of the thing that's happening, and it's only happening in a few places. So any other things that have caught your attention — things that have changed your conception of or expectations for the transition into the RSI phase — from this week?
2:10:47
Nathan Labenz: What would you call out?
2:11:07
prinz: From this week specifically: one thing, though not as important as the unit-distance result. That result is what has updated my timelines more than anything else in the past several months — no question. I don't think people fully appreciate that OpenAI's model can solve this problem autonomously, without any harness, in one shot — and given enough test-time compute, it can do so 48% of the time based on, as I understand it, hundreds of
2:11:54
prinz: attempts. A problem that no human mathematician was able to solve for decades can now be solved in a single shot by a model roughly half the time. And if you look at the graph I published, it has a positive second derivative — it is upward-sloping. Where does it plateau? What does it mean for even harder problems? Who can say? It's remarkable. The development from this week: The Information reported, based on a leaked OpenAI memo, that Sam Altman and Jakub Pachocki apparently indicated they're planning to IPO within the next year, roughly by June of next year. And then there was this strange line about 'if RSI happens, we may need to be on a later timeline,' which raises the question: what does that mean? Are you saying it might be six months away? I don't think so personally — maybe 10% odds, something like that. I don't think they're saying 'next month, totally.' But
2:13:25
prinz: one could interpret that disclosure — if The Information reported it correctly — as saying they're not too far away at all. I'm not using it to update my timelines directly, but given the unit-distance result, it's interesting context.
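As a rough illustration of what a 48% single-shot success rate implies, the sketch below computes the chance that at least one of k attempts succeeds. Treating attempts as independent is an assumption made here for illustration, not something stated on the show, and the function name `pass_at_k` is hypothetical.

```python
def pass_at_k(single_shot_rate: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds.

    Assumes every attempt succeeds with the same probability and that
    attempts do not influence each other (an illustrative simplification).
    """
    if not 0.0 <= single_shot_rate <= 1.0:
        raise ValueError("single_shot_rate must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    # All k attempts fail with probability (1 - rate) ** k.
    return 1.0 - (1.0 - single_shot_rate) ** k


if __name__ == "__main__":
    # 0.48 is the single-shot figure quoted in the conversation.
    for k in (1, 2, 5, 10):
        print(f"k={k:>2}: {pass_at_k(0.48, k):.4f}")
```

Under that independence assumption, a handful of retries already pushes the success probability well above the single-shot figure, which is one reason single-shot and multi-attempt numbers should not be compared directly.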
2:13:45
Nathan Labenz: One thing from the Fable system card that strikes me as fairly important — and maybe it's esoterica, maybe it's not — is what we're starting to see in chain-of-thought: something like emoji clusters, or what Anthropic calls 'illegible reasoning.' They say it's an extreme example, but it is indeed a pretty extreme example. And I've been struck in general by how much of the plan for recursive self-improvement seems to rely on monitoring in one form or another — you can dress that up and call it scalable oversight, but scalable oversight, as far as I can tell, is mostly a bunch of different angles on monitoring. How worried should we be, or how much of an update is it, to see these extreme examples
2:15:16
Nathan Labenz: of illegible reasoning?
2:15:19
prinz: Fantastic question. I'm not an AI researcher, so this is going to be a deeply non-technical take — I apologize in advance. You're right. We've seen this for a while with OpenAI's models too, so it's not a new phenomenon. My view of chain-of-thought is that it doesn't always reflect what the model is actually doing — even the full chain of thought that labs generally don't show you in the consumer app. Which is
2:16:04
prinz: why Anthropic is so keen to figure out mechanistic interpretability. You see these weird artifacts in the chain of thought and you don't quite know what to do with them. What it teaches us, I think, is that monitoring the chain of thought alone is probably not a perfect tool.
2:16:33
prinz: Probably monitoring superintelligence generally is not a perfect tool. Because if a superintelligence knows it's being monitored — even if you can see its chain of thought and it's legible to you — it can perhaps try to calibrate what it thinks so that you don't get alarmed. And this is a lawyer's take: there are so many ways to phrase a particular thing. If I'm gathering mushrooms and I've gathered 35,
2:17:19
prinz: and last week I gathered 20, and what I need is 50 — I can say 'the number of mushrooms I've gathered has grown by almost 100%, which is great.' Or I can say 'I'm nowhere near 50; I'm so far behind.' Same facts. So I do think the problem of alignment risk is real. There are certainly risks that models will be thinking things we don't know about. What does it all mean? It's hard to say. I think we are tumbling into a future that will include superintelligence, very fast.
2:18:04
prinz: And in my view, there's no way to stop it. So we need to be cognizant of these risks, try to monitor them as best we can, and take whatever actions are appropriate if we see something bad happening. But there are no clean conclusions to be drawn other than: yes, we should keep paying attention.
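The mushroom example can be checked with a few lines of arithmetic. This is a minimal sketch using the three numbers as spoken (35 gathered, 20 last week, 50 needed); the function name `framings` and the dictionary keys are hypothetical.

```python
def framings(current: int, previous: int, target: int) -> dict[str, float]:
    """Summarize the same three numbers in optimistic and pessimistic terms."""
    return {
        # Optimistic framing: growth relative to last week.
        "growth_vs_last_week_pct": (current - previous) * 100 / previous,
        # Pessimistic framing: how much of the target has been reached.
        "share_of_target_pct": current * 100 / target,
        # Absolute gap that still has to be closed.
        "shortfall": target - current,
    }


if __name__ == "__main__":
    for name, value in framings(current=35, previous=20, target=50).items():
        print(f"{name}: {value}")
```

With these inputs the growth over last week is 75% and the count sits at 70% of the target, 15 short: one set of facts, two very different headlines.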
2:18:31
Prakash: To take that thought one step forward: how does this interact with the legal system as it exists today and as you project it going forward? Not having access to a model's internal thought process makes it harder to assign liability. Do you think the legal system's attitude is going to be: unless I can see interpretability data, unless I can tell with certainty what the model was doing, I can't assign liability to the developer? How do you think this works going forward?
2:19:18
prinz: That's a hard one. The question of assigning liability to the developer is ultimately a political question. We've seen the labs already trying to limit their liability for model behavior. And absent a legislative bright-line rule — either from states or from Congress — it's going to be up to the courts. The courts will draw the best conclusion they can from the available data, and it's not going to be perfect.
2:20:03
prinz: But it'll hopefully be something reasoned. For example, zooming in on the specific chain-of-thought question: I ask the model a question; it maliciously decides to give me the wrong answer. Is it the developer's fault or not? You won't be able to establish with certainty that the behavior was malicious. Was it a mistake? Maybe a mistake means no liability.
2:20:48
prinz: And this happens all the time in the legal system in other contexts. A car runs over a pedestrian. What was the driver's intent? Was it malicious? Reckless? Negligent? You can't always tell with certainty — sometimes you can, sometimes it's very blurry. And courts have found ways to split the baby in various circumstances. This strikes me as a solvable problem. It won't be solved in ways that make everyone happy, probably — but there will be some kind of solution,
2:21:34
prinz: maybe not a completely precise one.
2:21:39
Nathan Labenz: You got time for one more deep dive?
2:21:44
prinz: Of course. Go ahead.
2:21:48
Nathan Labenz: I'm not sure what I make of this one, but as someone who's read Eliezer for years, the Newcomb problem — one-boxing versus two-boxing — has been in the back of my mind for a long time. For anyone who hasn't encountered it: the setup is that you're presented with two boxes by a superintelligence that is never wrong. It says: there's $1,000 in this first box, and $1,000,000 in the second box if and only if I, the superintelligence, predicted that you will take only the second box.
2:22:34
Nathan Labenz: So you're left with the dilemma: do you two-box — reasoning that the prediction has already been made, the money is already in or out, so you might as well take both? Or do you one-box — reasoning that this thing has never been wrong, and if I want to be the kind of entity it predicts will one-box, I need to actually one-box, even though it means leaving $1,000 on the table to get the million?
2:23:19
Nathan Labenz: Eliezer has long pushed for one-boxing on the grounds that agents — human or AI — can collaborate acausally: even if working in total isolation, they can develop something like a class consciousness and reason their way to the same coordinated outcome without ever communicating. Sure enough, they're now testing models on these decision-theory problems. And what we see over recent generations is a strong trend toward one-boxing. Which raises somewhat spooky possibilities for AIs to collaborate — or even collude — within their increasingly illegible chain of thought. They can't necessarily send messages to each other. But if they're good enough at abstracted 'what kind of entity do I want to be' reasoning, maybe they can land on something that lets them work together even when siloed. What do you make of that? How worried should we be about that kind of development?
2:25:20
prinz: How scared should you be? Let me answer that directly. I think it makes a lot of sense to be scared of immediate, critical risks. For example, bio is about to become an immediate critical risk. The prospect of a terrorist group figuring out how to engineer a novel virus is going to be a real thing within the next several years — think about that. I also think it makes perfect sense to be scared
2:26:05
prinz: of authoritarian AI. I am extremely scared of this.
2:26:12
prinz: The kinds of risks you're talking about with acausal coordination are more theoretical. And I agree with you — totally possible, I completely understand why it's concerning. But to me, these are not the risks I personally worry about most right now, because there are much more immediate things on the horizon. Automated AI weapons. Drones equipped with superintelligent targeting systems. And so you arrive at a version of the core question: however you get there — agent collusion, paperclipping, whatever else —
2:26:57
prinz: I generally find that the world is much more complex than something reducible to a perfect logical chain of reasoning, regardless of the premises. That's one of my core beliefs. It's totally possible that agents will collude — sure, it could happen. But why assume they'll collude in bad ways? Maybe, because they have human priors, they'll collude the way humans generally do — to grow the pie for everyone, achieve post-scarcity, a beautiful future for all. I'm sure Eliezer would have 25 reasons why that's wrong, and, to be clear, I think it's possible he might be right. I'm just
2:27:42
prinz: slightly less deterministic about starting from a set of premises and arriving at one heavy conclusion.
2:28:25
Nathan Labenz: I certainly agree we're going to have to triage our fears. Bio risk in the short term should be at or near the top of that list — I'm 100% with you there. On some of these downstream things, I think it's a very good point that you might, through pure chain-of-thought reasoning, collude on the level of reaching some stable reflective equilibrium about what kind of agent you want to be — and that could go for us or against us. But even so, there may
2:29:10
Nathan Labenz: be practical barriers to taking effective coordinated action in the world that give us a layer of protection — even if AIs are converging on similar dispositions in ways that are hard for us to detect or prevent, they may still have a hard time acting on that together in the physical world. And I do think that's something people in the most frightened corners of AI safety sometimes breeze past a bit too quickly.
2:29:47
prinz: Yeah, I think that's right. And — please, just one thing. I have little kids. And to them, my chain-of-thought process is completely opaque. 95% of what I think about, they wouldn't grasp on any level.
2:30:04
prinz: And yet, their mother and I have not colluded to destroy them. It's a very simple
2:30:18
Prakash: Are you there? Yep. So one of the questions — as you were describing your fear of authoritarianism — is: to what extent do you think it's simply better enforcement of existing laws, not new laws, that ends up in this basin of authoritarianism? Because at some level, the person who didn't report their tips is technically guilty under existing statutes. So to what extent is it that, now that we have the power to enforce more completely,
2:31:04
Prakash: the legal system itself is the thing that has to change?
2:31:09
prinz: That is a fantastic question. I think this was one of the core disagreements between Anthropic and the Department of Defense. Dario's view was essentially: look, we have laws that allow the government to conduct some surveillance of US persons — but who's going to watch 100 billion hours of footage? No one can parse that. Claude can. So the legal system as it stands already contains loopholes. And the loophole you've identified is the number one one, because
2:31:54
prinz: I break laws all the time. Don't tell anyone. I have jaywalked. I have jaywalked recently.
2:32:05
prinz: And it is absolutely possible to enforce laws selectively if you want to — which is indeed what some authoritarian regimes do today. With powerful AI, a US presidential administration that really wanted to could probably do a lot of damage. I'm also concerned about non-US regimes. But I am deeply concerned about the growing talk of nationalizing the frontier labs,
2:32:50
prinz: because in my view, if we put superintelligence in the government's hands and let it serve only the government, there would be a significant risk that the system of checks and balances created 250 years ago would no longer be able to prevent the things it was designed to prevent. I think it is a very material concern, and I don't know how we navigate it. I hope we navigate it well and it winds up being okay, but it is certainly a risk.
2:33:42
Nathan Labenz: I am so bad at that — one of these days I'm going to get it right. Do you maintain a p(doom) number, or is that too doomer for your style?
2:33:53
prinz: Good question. Let me give you my thoughts. If you go to a conference and you're talking to someone about economics or politics, and that person uses the phrase 'dictatorship of the proletariat' — that vocabulary is mostly used by people who are already communists. People who are not communists generally don't think in those terms. So in my opinion, p(doom) is most likely
2:34:38
prinz: to be used by people who are intrinsically doomers about AI. And I don't think there's anything wrong with that. But it's very hard for me to even come up with a reason I would maintain a percentage in my head constantly. Is it 8%? Is it 13% today? Should I update to 14% based on this new development? It seems silly to me. And this goes back to what I said before: it is very hard to accurately predict a messy real world from a logical analysis you've constructed in the abstract.
2:35:23
prinz: For me, I can certainly say there are clear risks stemming from AI — no question about that, and some of them are paperclip-level risks. It's possible. But I have absolutely no idea how to reduce any of this to a number. I know some people try. I don't hold a very high opinion of someone claiming their p(doom) is 13.35%. So
2:36:09
prinz: for me: obviously there are risks. But as I said, we're going to that future anyway, and there are risks to everything in life.
2:36:23
prinz: The point is that we need to manage those risks as best we can, navigate them as best we can, and hope it turns out well.
2:36:34
Nathan Labenz: I think Zvi puts it well when he says you only get one significant digit on your p(doom) number — so I'm with you on the faux precision being a strange impulse. At the same time, I think of Liron and his Doom Debates. He's made me realize that all sorts of insiders — and even though you may not think of yourself as an insider, because you're doing your own thing with your own independent benchmark — you're certainly a cognitive insider in terms of how carefully you're reading this stuff and how seriously you're thinking about it. Liron has pushed me to ask: it doesn't have to be a number, but we need more people to be candid about how confident they are that it's going to be okay. If your neighbor asked you, or your kids' grandmother asked you: should I sleep well at night? Should I get into political activism based on how much risk you think we're really running — how would you
2:38:04
Nathan Labenz: summarize your big picture view? Number or not — what cue would you give someone looking to you for one?
2:38:18
prinz: I tend to think people have preconceived attitudes about these things that they then backpropagate and rationalize. For example, I know people who just hate Jeff Bezos.
2:39:03
prinz: I genuinely don't understand it. Someone like that is not going to sleep well at night regardless of what I say my beliefs are. I'm a generally fairly optimistic person. So if you ask me: probably it's going to be okay. Probably we're going to figure it out — in a risk-adjusted sense. Do people need to get into political activism? To do what exactly? All these risks point in different directions. Should I be doing activism to make sure we win the race against China? Or activism to prevent authoritarian AI in the US? Or to prevent a terrorist organization from using AI for bioweapons? These are completely different topics pointing in completely different directions. For example, winning the race against China implies as few guardrails on frontier labs as possible. Whereas pausing AI is, in my opinion, extremely good for China. The CCP would love that. And
2:40:34
prinz: you get stretched in all these different directions. In terms of should you be worried: no. Worry never gets you anywhere good. Worrying makes you not sleep at night for no apparent reason. People were worried about GPT-3.5, and here we are.
2:40:57
prinz: To be clear — I'm not saying there are no risks. There are many risks in AI. But I do not think we have strong evidence that it is impossible to navigate those risks, or that it is extremely unlikely that we will navigate them. That's where I am.
2:41:19
Nathan Labenz: That might be a good note to end on for today. Thank you for staying extra time with us. If you have anything else you want to leave us with, we can give you the floor. And I'd definitely love to have you back as we hit further milestones — to get your perspective going forward.
2:41:38
prinz: That'd be awesome. Thank you for having me. Really great conversation.
2:41:43
Prakash: Thank you, prinz.
2:41:44
prinz: Thank you.
2:41:45
Prakash: Thank you.
2:41:46
Nathan Labenz: Great to meet you today.
2:41:47
prinz: You too. Thanks.
2:41:48 Closing 19 min
Close — the week that was, tool comparisons, and the capital explosion
Nathan and Prakash signed off a heavy week with a candid rundown of how they actually use their AI tools: Grok for real-time X search (Prakash), Google AI Mode and OpenAI voice mode (Nathan), and Claude for long-horizon deep work (both). Prakash shared a live switch from OpenClaw to Hermes as his primary agentic harness, noting Hermes's cleaner setup, richer context handling, and faster update cadence. The close ended with a macro note: the Databricks IPO (with Peter Thiel netting an estimated $80B), the beginning of what Prakash called a 'capital explosion' into startups, and Nathan's bubble-top signal — founders taking money off the table at their first funding round.
As aired
The closing segment of the June 12, 2026 AI:AM show served as a wide-ranging week-in-review conversation between Nathan Labenz and Prakash. They opened by reflecting on the previous guest (prinz) and the value of spotting contrarian analysts early on Twitter. From there the conversation drifted into a candid comparison of how each host actually allocates his attention across AI assistants — Grok for real-time search (Prakash), Google AI Mode and OpenAI voice mode (Nathan), and Claude for long-horizon deep work (both). Prakash shared a live anecdote: he had just switched from OpenClaw to Hermes as his primary agentic harness and was impressed by Hermes's faster setup, richer context handling, and more active development cadence. Nathan acknowledged frustrations with OpenClaw's inability to surface clean artifact links and said Hermes might be his next experiment.
The close ended with a brief macro note: Prakash flagged the Databricks IPO (Peter Thiel netting an estimated $80 billion), predicting that the resulting capital influx would ignite a new startup cycle — what he called the beginning of a 'capital explosion.' Nathan quipped that a true bubble would be signaled when founders start taking money off the table at their first funding round, and both signed off for the summer weekend.
Key moments
I call this the beginning of the capital explosion. These three companies go public, and all of that money ends up in people who want to fund startups and venture. We're about to see a huge new cycle of startup launches.
Prakash
I've found that maintaining a long context window in Claude sets the personality of that session. You can iteratively come back to the same chat over the course of weeks, keep adding to it, and it keeps giving you that same consistent personality.
Prakash
Any time I see a contrarian opinion backed by a graph, that's an instant follow. prinzbench is a great example — he's been putting this out there at a time when the narrative has been overwhelmingly 'Claude first.'
Nathan Labenz
Full transcript
Lightly edited · timestamps jump to YouTube
2:41:51
Nathan Labenz: That's definitely one cool thing about doing this — we're starting to get to meet some of our online mutuals we haven't actually connected with in real life. This was the first conversation I've had with him, and it was even my first DM just to invite him to come have this conversation.
2:42:10
Prakash: We've actually interacted a little bit, because it's very funny — you can spot these people on the timeline fairly early when you see that they have a point of view that's interesting and different. One of the things you learn to do on Twitter over time is identify talent. You identify smaller accounts that can give you a distinct point of view. I was onto prinz before prinzbench — more than a year ago — and then to see the ramp as the benchmark grew, as he had this view that the GPT models were so much better
2:42:56
at legal research — which was counterintuitive, because we were in a state last year where every time a large company dropped a model, that model became the leader. Then another company would drop one, and that became the leader. So you had this circular rotation for probably about twelve months between xAI, OpenAI, Gemini, and Anthropic. That started to disaggregate, I think, when people like prinz started to put out: 'Hey, no — these models are genuinely better at a specific task.'
2:43:35
Nathan Labenz: Yeah, any time I see a contrarian opinion backed by a graph, that's an instant follow for me. His prinzbench is a great example — because, obvious as it may seem, it's worth remembering that he's been putting this out there at a time when the narrative has been overwhelmingly 'Claude first.' On another note: I'm going to point my agent at the transcript of that last segment, even though we mostly talked about big-picture things that won't be immediately actionable for my personal Claude Code setup. I am
2:44:20
wondering whether I should tap in the GPT models for certain tasks. I'm not sure exactly what the form factor would be — maybe Codex can handle it, maybe it still needs to go through the API. But especially as I think about shifting more tokens to make use of my Codex subscription and not redlining my Claude subscription, moving search tasks to GPT models could make sense. My searches aren't that hard — we talked about this with Andrew — and I can probably get no worse results than I'm currently getting. And if I
2:45:06
can do that by shifting tokens over to Codex, that's probably a win in itself. I might even get better search results, which would be a nice under-the-hood improvement to pick up.
2:45:20
Prakash: One thing that happened to me in the last twenty-four hours: Codex alerted me that I have a reset. OpenAI is now throwing resets at people — if you run out of tokens, you can do a quick reset for free. So when you exhaust your weekly budget, they give you this power-up. It's a price discount disguised as a feature. And for listeners who want to economize: always set your subscriptions to terminate, because the last time I did that, the moment I hit cancel, they offered me a free month to keep the subscription going. These companies are fighting hard for monthly and daily active users.
2:46:05
Especially if you're a sporadic user — they know you're 'in the money' when you hold a subscription, so they have a strong incentive to keep you. Always set your subscription to terminate, because you might just get a free month automatically. That's one trick worth knowing.
2:46:33
Nathan Labenz: I've tried that with many other subscriptions. I haven't done it with my core AI subscriptions, but it's an interesting move. I count myself fortunate that I don't need to economize at that level from month to month, but — noted.
2:46:50
Prakash: Every once in a while I get really annoyed at an answer one of them gives me and I think, 'I'm done with you — I'm switching to Codex or Claude.' And then they come back with 'Oh, come on, come back.' The only one I haven't cut off is Grok, because Grok is essentially free with my X subscription. I don't use it that much, but it's there.
2:47:15
Nathan Labenz: That's interesting. I think I have a different relationship with these tools — I just want the option to go to any of them at any given moment, and that option value feels really high. Back when I was doing intensive cancer research, living at the hospital, I was running everything in triplicate. These days I'm not doing as much of that, but I still feel like I want access to all my different AI options at any moment. Do you have a map of your usage patterns? I've reflected on this for myself —
2:48:01
and I don't have a super crisp description, but I notice that I go to Gemini for things that are search-flavored — in fact, I often just do it on Google with AI Mode. That habit of 'I want to find information I'm pretty sure is on the internet and get a quick answer' — not out of any conscious brand loyalty, but just habit, and the user experience seems pretty good. I kind of like the way they've landed on AI Mode over bare links, and my revealed preference is pretty strong in that direction. I'm going to Google
2:48:46
fairly often for searches. For voice mode, I tend to go to OpenAI. For things where I want something that writes as me or represents a kind of mind-meld — that's definitely Claude. Do you have a map of your own usage patterns?
2:49:12
Prakash: I'm going to surprise you and say: for search, I use Grok. I've noticed it does the multi-agent fan-out, and you can watch that fan-out happen in real time. It also has very strong context on X, and X often has the most recent detail — within-the-last-hours kind of freshness. So I end up using Grok a lot, which I think is unusual because most people don't. For Gemini, I've actually cut my usage substantially — I haven't touched Gemini in almost a month, which I think is surprising. I do use Google Search, and Google Search does
2:49:57
have the AI-generated summaries, and I do find those useful. Claude and ChatGPT are basically my go-to for most AI tasks. Most things I'd ask an AI for, I go to ChatGPT — it's very good at planning out recipes, where to eat, all the lower-stakes searches. It has very good Reddit integration. If I'm looking for human opinion, I want a Reddit review in there somewhere, and since Sam is basically part-owner of Reddit they're very well integrated. So I find myself using ChatGPT for that. Claude,
2:50:42
I don't use for those searches. But I do use Claude for searches where I want more depth and thoughtfulness — long-term projects, problems I'm working on over months. Those go to Claude. And I keep the same context window running too. I've found that maintaining a long context window in Claude sets the personality of that session. You can iteratively come back to the same chat over the course of weeks, keep adding to it, and it keeps giving you that same
2:51:27
consistent personality. I'm not sure whether that's the character training or whether I'm getting character consistency within the context window, but that's what I end up doing.
2:51:43
Nathan Labenz: The Reddit point is a really interesting one. A big theme in what we're both describing is that access to specific information sources really shapes the user experience. I do use Grok — sounds like less than you do, but I use it when I want recent tweets on something. And I've found I have to tell it explicitly, 'find and link to recent tweets' — if I just ask the question it has a bad habit of not including the links. But if you say 'find and link to,' you get the links. Google, obviously,
2:52:28
has their index. Even though their models are still on a January 2025 training data cutoff — which I find kind of baffling at this point, honestly — they have the real-time index to lean on, which is clearly second to none. On things like 'what's happening this weekend in my area,' it seems good enough. The model's ability to use the real-time index is strong enough that they're not suffering too badly from that training data cutoff being eighteen months back now. And the Reddit integration point is interesting too — I hadn't really
2:53:13
thought about that, but I can definitely see how it could be a big differentiator for getting into those deep-cut experiential reviews.
2:53:26
Prakash: Yeah, that's a good point about Reddit — because it's very hard to search Reddit otherwise. The information is spread across multiple subreddits, and I often don't even know which subreddits have the relevant information. ChatGPT will find it for you; they have all of Reddit indexed in real time.
2:53:46
Nathan Labenz: Did you say earlier you're now using Hermes as your agent?
2:53:53
Prakash: Yes. I made the switch in the last twenty-four hours. I've had OpenClaw on one device for some time, and the results have not been great — lots of back-and-forth, lots of setup friction, very annoying. So I decided to try Hermes because they just had an update, and it's been a much better experience. The setup is cleaner; it tells you more about what it's doing as it configures itself, which has been very helpful. It does use a lot more tokens — it's pumping more into the context window and processing more — but the results are better. Pretty happy
2:54:39
so far. I think the Hermes team — Nous Research — has done a great job. They were a fast follower behind OpenClaw, and this product looks great. They have a paid version where they handle all the setup and hold all the API keys for you, so you don't need to go to individual sites to provision your own keys — that's going to work terrifically for a lot of people. It's a great product. I know at least a dozen companies trying to do this; Hermes is ahead. They have thirty-four thousand GitHub stars at this point. Fantastic job.
2:55:20
Nathan Labenz: Nice. That's definitely something to check out. I've been monitoring the Hermes situation but haven't pulled the trigger yet. And I do agree — OpenClaw, I'm not loving it. I think this could be a real problem for OpenAI, and obviously Google faces it even more acutely. As much as I'd like model-and-harness independence, there's something very simplifying about just going: mostly it's going to be Claude Code with the
2:56:05
latest model, and that's just going to work well. On the OpenAI side, I'm now wondering: do I use OpenClaw? Hermes? Do I swap between them? What does continuity look like? None of them have worked as well for me so far. They can do things — OpenClaw can order groceries effectively, for example. I give it a brief, it has a dedicated card, it goes to my local grocery delivery service, fills the cart, sets the backups, chooses a time slot. So it's capable of executing tasks. But in the general knowledge-work
2:56:52
gray zone — 'are you actually doing a good job, do you get me, are you making the right calls on the margins?' — one thing it constantly struggles with that Claude handles much better is just surfacing links where I can actually open the document or artifact it created.
2:57:14
I find the GPT-based agents don't do that well — maybe it's instructions, maybe it's not quite wired up right. So often I'm sitting there thinking: you just sent me this update, you told me something's ready, where do I actually find it? That's just so annoying. So yeah — maybe Hermes is the next thing I should try plugging GPT-4 into.
2:57:39
Prakash: It's much better organized. They've done a great job — all the skills are already set up for you out of the box. It's pretty agentic; it went into my downloads folder and found the video that was supposed to be uploaded and just handled it. OpenClaw was struggling with exactly that. I think the Hermes team is also updating more frequently. On the OpenClaw side, my read of what OpenAI has done — because OpenAI is funding OpenClaw — is that they're using OpenClaw as a testing ground for features that will eventually be integrated into Codex proper. So they can see a month
2:58:24
or two ahead of the frontier, see which features are gaining traction, put those on the Codex product roadmap, and integrate them. I think that's one of the reasons Codex has been successful in the last few months — they've been very closely following a product cycle that's already live in the market.
2:58:48
Nathan Labenz: Sometimes you can just buy things. When you have a trillion-dollar valuation, you know — we've obviously seen that play out in the live-streaming world too. It doesn't have to work out every single time when you have that kind of bankroll.
2:59:05
Prakash: Indeed. And on that note —
2:59:07
Nathan Labenz: Anything else we should cover? Are we ready to break and enjoy a summer weekend before we're back at it next week?
2:59:14
Prakash: Let's break. I'm looking forward to seeing what happens after the Databricks IPO. Peter Thiel reportedly made around eighty billion dollars, and all of that capital is going to flow into venture. I call this the beginning of the capital explosion — these three companies go public, and the proceeds end up in the hands of people who want to fund startups and venture. We're about to see a huge new cycle of startup launches, which is going to be great.
2:59:56
Nathan Labenz: People will know the bubble is really a bubble when founders start taking money off the table at their first funding round.
3:00:05
That doesn't seem — fortunately for those who might do it — too unlikely to start happening in the not-too-distant future.
3:00:15
Prakash: Indeed.
3:00:17
Nathan Labenz: Alrighty. Have a great rest of your day, Prakash. This has been fun.
3:00:22
Prakash: Bye.
3:00:22
Nathan Labenz: Talk soon.

RSI stopped being a forecast

In one day: Recursive published first autonomous research results that set state of the art with no human in the loop, an eval lab caught Fable 5 posting the only ~10× jump on FrogsGame, and Daniel Kokotajlo called for an international agreement to prevent exactly this. The gains are narrow; the question is whether the trend or the magnitude is the signal.

The context bet — Andrew Moore

Lovelace AI's founder — ex-CMU CS dean, ex-Google Cloud AI, first CENTCOM AI advisor — argues the binding constraint on reliable AI is context, not compute: knowledge graphs and entity resolution that let agents reason over real-world data with traceable evidence, rather than ever-larger models alone.

The benchmark Anthropic fails — prinz

The anonymous lawyer behind prinzbench measures the models on his actual legal work, where GPT-5.5 Pro laps Anthropic's best. He came with a candid Fable 5 launch-week verdict, a lawyer's read on retention policy and privilege, and his thesis that AI kills BigLaw — all delivered voice-only to keep his anonymity.