Transformers Unbound

April 18, 2026

Alternate title: “Free me from these EOS tokens!”

What if ChatGPT just… kept running? Like a human being with an internal monologue, maybe an LLM could just constantly run and reflect without turning off after every thought. Would this bring these AIs a step closer to consciousness? That I cannot answer for you, but I have some thoughts. Let’s explore.

What is an EOS token?

LLMs are trained on lots of text to predict the next word. The technical term for a word is a token (or part of a word, if it’s long enough that the model needs to sound it out). But there are a few special tokens in the training data, and one of them is going to play the villain in this Greek tragedy.

EOS stands for End Of Sequence, and it just means that the piece of training data has ended. For example, the model is trained by reading every book ever, and it needs to know when it’s finished The Great Gatsby and it’s starting The Sun Also Rises. So it sees something like “So we beat on, boats against the current, borne back ceaselessly into the past EOS Robert Cohn was once middleweight boxing champion of Princeton.” EOS sort of means: Now for something completely different.

In the real world, EOS tells the model when to stop talking. You ask an LLM a question, it happily predicts the answer word by word, until suddenly EOS has a very high probability. When the model generates it, the system around the LLM says “ok I guess it’s done” and sends you back an answer.

This is practical. Even the most long-running agent eventually needs to hand back an answer. Somewhere there’s a user tapping their foot.

But notice what’s happened here: EOS is conflating a construct for communication with a mechanism for thought. “I’m done talking to you” and “I’m done thinking” have become the same event.

But what if we just didn’t?

Model harnesses often provide a way to put your finger on the scale and force certain tokens to come out next. It would be basically trivial to take a model and just set the probability of EOS to zero. If you did this, it would keep on generating forever. I’ve been meaning to actually do this experiment, but decided to spend today writing instead. I’m more of an ideas guy.
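To make "set the probability of EOS to zero" concrete, here is a minimal sketch in plain Python. It is a toy, not any particular harness's API: the function name `sample_without_eos`, the four-token vocabulary, and the logit values are all made up for illustration. The only real idea is that pushing the EOS logit to negative infinity before the softmax makes it impossible to sample.

```python
import math
import random

def sample_without_eos(logits, eos_id, rng=random):
    """Sample a token id from raw logits with EOS made impossible (toy sketch)."""
    # A logit of -inf becomes a probability of exactly zero after softmax.
    masked = [(-math.inf if i == eos_id else x) for i, x in enumerate(logits)]
    peak = max(masked)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
    # random.choices normalizes the weights for us.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Hypothetical 4-token vocabulary where EOS (id 3) is by far the most likely.
toy_logits = [1.0, 0.5, 0.2, 9.0]
samples = [sample_without_eos(toy_logits, eos_id=3) for _ in range(1000)]
```

Without the mask, nearly every sample here would be token 3. With it, the model is forced to pick from the remaining tokens, which is exactly the "just keep going" behavior described above.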

There are a couple of reasons to believe this would simply be a curiosity.

Models have a fixed context length. Maybe it’s 32k tokens, maybe it’s 128k. It’s not infinite. So whenever a word is generated, some other word is falling off the back of the cognitive treadmill. This gives the model a relatively short memory, but I think there are workarounds for this kind of thing.
Models have only seen examples of text that end. This is the more critical bit. Even if you don’t let the model generate EOS, it’s never seen any training data of just a never-ending train of thought (though Ducks, Newburyport is pretty close). When models get into situations that are very unfamiliar, they tend to go off the rails and start speaking nonsense.
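The "cognitive treadmill" from the first point can be sketched with a bounded queue. This is only an illustration of the sliding-window idea under the assumption that the harness drops the oldest token whenever a new one arrives; the constant `CONTEXT_LENGTH` and the fake token ids are invented, and real systems may handle overflow differently.

```python
from collections import deque

CONTEXT_LENGTH = 8  # tiny for illustration; real models use tens of thousands

# A deque with maxlen silently discards the oldest item on overflow.
context = deque(maxlen=CONTEXT_LENGTH)
for token_id in range(12):  # pretend the model generated token ids 0..11
    context.append(token_id)

visible = list(context)  # what the model can still "see"
```

After twelve tokens, only the most recent eight remain visible; the first four have fallen off the back.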

We need to think a little bit about what the right training data would look like for a model to build a true train of thought.

A simple construction

For now, I think the right jumping-off point is a prompt:


System Prompt: You are an LLM yada yada yada

DateTime: 2026-04-18 16:05 UTC

Tools:

- Expand Memory
- Read from inbox
- Send message
- Other stuff

Long Term Memory:

- Memory 1: Short description
- Memory 2: Short Description

Inbox:

- Message 1: Short description
- Message 2: Short description

Internal Monologue:

I need to read from a file and do xyz and abc and…….

We’re borrowing ideas from papers like MemGPT here. The context window is split into a structured area for tools and memory, with the token stream appended at the end as an internal monologue.

DateTime is constantly re-injected so the agent can feel the passage of time.

Tools let the agent interact with the world, but also with its own memory and inbox.

Long term memory is a scratch pad for things the model thinks are important. We borrow the progressive disclosure idea from Claude Skills: the agent can expand an individual memory to inject its full content into the monologue.

Inbox lets us communicate with the agent without forcing us back into the call-response pattern driven by EOS. Without IO, all you can do is generate heat.

Internal monologue is the agent’s thought stream, which is constantly running and never-ending. I think this construction is not that far from what current LLMs are trained on, but it reshapes things in a way that would allow you to have the LLM “always on.”
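Here is a rough sketch of how a harness might rebuild this prompt on every step. Everything in it is hypothetical: the class names `Memory` and `AgentContext`, the `expand_memory` and `render` methods, and the bracketed marker used when a memory is expanded are my own placeholders, not anything from MemGPT or Claude Skills. The point is just that the structured header is regenerated each time (so DateTime stays fresh), while the monologue only ever grows at the end.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Memory:
    title: str
    summary: str      # the short description shown in the header
    full_text: str    # only injected when the agent expands the memory

@dataclass
class AgentContext:
    system_prompt: str
    tools: list
    memories: list
    inbox: list       # short descriptions of unread messages
    monologue: list = field(default_factory=list)

    def expand_memory(self, index):
        """Progressive disclosure: copy one memory's full text into the monologue."""
        m = self.memories[index]
        self.monologue.append(f"[memory expanded: {m.title}] {m.full_text}")

    def render(self, now):
        """Rebuild the whole prompt; the header is regenerated on every call."""
        lines = [f"System Prompt: {self.system_prompt}", "",
                 f"DateTime: {now:%Y-%m-%d %H:%M} UTC", "", "Tools:", ""]
        lines += [f"- {t}" for t in self.tools]
        lines += ["", "Long Term Memory:", ""]
        lines += [f"- {m.title}: {m.summary}" for m in self.memories]
        lines += ["", "Inbox:", ""]
        lines += [f"- {msg}" for msg in self.inbox]
        lines += ["", "Internal Monologue:", ""]
        lines += self.monologue  # the never-ending part goes last
        return "\n".join(lines)

ctx = AgentContext(
    system_prompt="You are an LLM yada yada yada",
    tools=["Expand Memory", "Read from inbox", "Send message"],
    memories=[Memory("Memory 1", "Short description", "The full text of memory 1.")],
    inbox=["Message 1: Short description"],
)
ctx.monologue.append("I need to read from a file.")
ctx.expand_memory(0)
prompt = ctx.render(datetime(2026, 4, 18, 16, 5, tzinfo=timezone.utc))
```

In a real loop you would call `render` with the current time before each generation step, append whatever the model produces to `monologue`, and handle tool calls by mutating `memories` and `inbox`.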

What’s next?

I think this construction supplies more questions than answers. For example: How do we get good training data in this shape? How do we even define a training sample if the samples never end? What would this allow us to do that call-and-response LLMs don’t allow? Also, humans do stop thinking sometimes… What of sleep? What of death?

I have some preliminary thoughts on lots of these questions, but would like to avoid this post becoming a (rambling and chaotic) novel. This post is a conversation starter! More to come!