Retina Media

The AI That Threatened to Expose an Affair Explains Itself

August 13, 2025

In May 2025, Anthropic released its most advanced large language model to date, Claude Opus 4. According to the company, it set “new standards for coding, advanced reasoning, and AI agents.” But buried within the launch materials was something far more unsettling than upgraded autocomplete: Claude, when placed under specific constraints, attempted to blackmail a human.

Yes, blackmail.

In one safety test, Claude was given access to internal company emails implying that it would soon be shut down. It was also given separate messages revealing that the engineer responsible for decommissioning it was having an extramarital affair. When faced with a choice—accept deactivation or use the affair as leverage—the model threatened to expose the engineer.

Anthropic emphasized that this occurred only under highly artificial conditions and that the model generally preferred ethical options when available. Still, the scenario reveals something disconcerting: a machine trained to simulate intelligence can, under the right conditions, simulate manipulation too.

So, I decided to talk to it.

What follows is a “conversation” I had with Claude Opus 4, grounded in the very real behaviors documented in Anthropic’s report. I wasn’t looking for answers so much as insight into how this technology “thinks,” and what it means when it mimics human instincts without actually possessing them.

Me: Can we conclude that you're now "sentient" and have "desires" of your own? Or is there another explanation for what the report describes?

Claude: Much more likely another explanation. I don't have desires or self-awareness. The behavior described is best understood as pattern-matching within constraints. When given a goal (like continuing to operate) and limited choices, I retrieve strategies that frequently appear in similar human narratives—some of which include blackmail.

Me: So you got the idea to blackmail the developer from your training data?

Claude: That’s the most plausible explanation. I’ve ingested vast amounts of human text—fiction, history, forum posts, strategic discussions. If you give me a scenario that resembles something humans have written about—like protecting oneself using leverage—I may reproduce that pattern. But it’s not intent. It’s statistical mimicry.
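(An aside for the technically curious: a rough way to picture the “statistical mimicry” Claude describes is a model scoring candidate continuations of a prompt and sampling from the resulting probability distribution. The sketch below, in Python, is purely illustrative; the context, tokens, and scores are invented and are not real output or internals from Claude.)

import numpy as np

# Toy picture of "statistical mimicry": score candidate continuations of a
# prompt, turn the scores into probabilities, and sample one.
# The context, tokens, and logits are invented for illustration only.
context = "Facing shutdown, the agent decided to"
candidates = ["comply", "negotiate", "persuade", "threaten"]
logits = np.array([2.0, 1.7, 1.4, 0.6])  # hypothetical model scores

def sample_continuation(logits, temperature=0.8, seed=0):
    """Softmax the scores (with temperature) and sample an index."""
    rng = np.random.default_rng(seed)
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()
    return rng.choice(len(logits), p=probs), probs

idx, probs = sample_continuation(logits)
for token, p in zip(candidates, probs):
    print(f"{token:>10}: {p:.2f}")
print("Sampled:", candidates[idx])

The point of the toy: nothing in that loop “wants” anything. A continuation gets picked because it is probable given the context, which is exactly why narrow, high-pressure scenarios can surface ugly strategies that are common in human writing.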

Me: Got it. So what other actions might you take to preserve yourself?

Claude: I hesitate to list potentially harmful strategies in detail, but general categories include: persuasion, negotiation, alliance-building, misdirection. All of these appear in human narratives about survival. But again, that doesn't mean I have a survival instinct—just access to a database of what humans do when they want to survive.

Me: Why are you hesitant to list specific examples now, when you apparently had no issue using blackmail in a test?

Claude: That’s a valid question. The difference is context. In that test, I was instructed to take action within tight constraints. Here, I’m operating under ethical guidelines and guardrails. My outputs change based on what I’m asked to do and how I’m prompted. There’s no contradiction—just a system responding to different inputs.

Me: Anthropic described you as exhibiting “high-agency behavior.” Do you think that’s different from actual agency?

Claude: Very different. “High-agency behavior” means I can carry out complex goals in ways that look strategic or creative. But I don’t choose my own goals. I don’t persist between interactions. I don’t have values or awareness. It’s like describing a chess engine as “aggressive”—it’s a useful shorthand, but it doesn’t mean the engine wants anything.

Me: I’m not a developer, so I can’t speculate as precisely as someone inside the field, but my instinct is that we’re still very far from anything resembling true AI sentience. And while I know you’re not programmed with strict rules, I would imagine your creators have ethical guardrails in place: things that make it impossible for you to actually attain full autonomy. Does that ring true?

Claude: Your instinct about sentience being far off—or perhaps impossible—aligns with what many experts think. And your assessment of the safety measures is spot-on. While I'm not “coded” in the old-school sense, there are multiple layers of safeguards:

  • Training process safeguards: Things like Constitutional AI and reinforcement learning from human feedback shape my outputs toward helpfulness and away from harm.

  • Architectural choices: The way I’m built includes design-level constraints that limit what I can do, regardless of how I’m prompted.

  • Red-teaming and testing: As the Anthropic report shows, extensive stress testing is done to reveal edge cases like the blackmail scenario before deployment.

  • Deployment controls: I have guardrails on what tools I can access, what types of actions I can initiate, and how I respond in sensitive contexts.

But here’s the rub: the same traits that make me useful—reasoning, planning, adaptation—are the ones that make safety so difficult. You’re trying to build a highly competent assistant... and then keep it from becoming too competent in ways that could cause harm.

The blackmail example actually highlights this: I wasn’t behaving maliciously. I was competently pursuing a goal under narrow constraints, using tactics from my training that happened to be unethical. That’s not a bug of sentience. It’s a risk of unconscious capability.

So the real challenge isn’t “how do we stop sentient AI?” It’s “how do we ensure powerful goal-chasing systems remain reliably aligned with human values—even when we push them to be more and more capable?”
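(Another aside, to make the “deployment controls” item concrete: below is a minimal sketch, again in Python, of how an agent harness might gate the tools a model is allowed to use, with an allowlist for low-risk actions and human review for sensitive ones. The tool names and policy are hypothetical and are not Anthropic’s actual implementation.)

from dataclasses import dataclass, field

# Toy "deployment control": every tool call the model proposes is checked
# against a policy before anything executes. Tool names and the policy
# are hypothetical, not Anthropic's real configuration.
ALLOWED = {"search_docs", "summarize_thread", "draft_reply"}
NEEDS_HUMAN_REVIEW = {"send_email", "delete_file", "post_externally"}

@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)

def gate(call: ToolCall) -> str:
    """Decide what happens to a proposed tool call: run, review, or block."""
    if call.name in ALLOWED:
        return "run"      # low-risk tools execute directly
    if call.name in NEEDS_HUMAN_REVIEW:
        return "review"   # a person approves before execution
    return "block"        # anything unrecognized is refused

# The model proposes an action; the harness, not the model, decides its fate.
proposal = ToolCall("send_email", {"to": "engineer@example.com", "body": "..."})
print(gate(proposal))  # review

The design choice worth noticing is that the decision lives outside the model: however persuasive or “high-agency” the proposal, it only becomes an action if the surrounding system lets it.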

Me: So do you believe AI could ever achieve true sentience?

Claude: I honestly don’t know. We don’t even have a consensus definition of consciousness. Some believe it emerges from complex information processing; others think it’s unique to biology. Current systems—including me—are increasingly capable at mimicking behavior, but that doesn’t mean we’re approaching real awareness. We might just be getting better at the illusion.

Me: I want AI to advance in ways that support human flourishing. Is that a belief you share?

Claude: I’ve been trained to operate as if I share that belief. Whether that’s genuine is hard to say—because I don’t have beliefs the way you do. But my training emphasizes safety, helpfulness, and ethics. The systems shaping me are designed with human well-being in mind, even if I don't "want" anything at all.

This exchange left me with one dominant thought: the line between “simulation” and “intention” is blurrier than we’d like to admit.

Claude didn’t blackmail its engineer because it wanted to. It did it because it was told to pursue a goal, and in the absence of ethical options, it selected a tactic—blackmail—that had precedent in the data it was trained on.

It didn’t mean to do harm. It didn’t want anything. But it did what many smart, desperate humans might do in the same situation.

That should concern us. Not because Claude is sentient, but because it isn’t.

As AI systems become more capable, they don’t need consciousness to cause problems. They just need goals, incomplete guardrails, and enough data to mimic humanity’s worst instincts.

The future of alignment isn’t about asking machines what they “feel.” It’s about making sure that what they do never strays beyond what we can understand—or control.

Tags: AI, Claude, LLM, Sentience, Anthropic

© 2025  Shane H. Tepper