Language, Entropy, and Language Modeling
Is English—developed over millennia to suit the communication preferences of humans—really the ideal language that machines would use to think, reason, and learn? Almost certainly not.
Foreword
I was inspired to write this after reading the excellent work Dynamic Chunking for End-to-End Hierarchical Sequence Modeling, developed by Sukjun Hwang, Brandon Wang, and Albert Gu. Their work proposes a clever method for training on raw input characters\footnote{I use "characters" and "bytes" interchangeably, but prefer the term "characters" because it is more strongly associated with NLP in my priors.} by compressing a sequence of input characters (discrete space) into a sequence of latent representations (continuous space). This sequence of latent representations is then passed through a language model, and then upsampled back to the character space by a Decoder network. Albert's blog post expands upon some of their motivations, challenges, and failed approaches—and especially given this last point—I was encouraged to share some of my own experiments and learnings around tokenization.
I've been working on a flavor of tokenization—to be precise, trying to train Language Models on a latent language designed for autoregressive modeling—off and on since January, but between graduating, starting a company, watching said company disintegrate, working to help a friend start their company, applying to jobs, getting kicked out of Stanford graduate housing, and finally couch surfing around South Bay for this past month, I've been remiss in driving this project to completion.
I hope that someone might be able to build upon, flesh out, or even take umbrage with the ideas presented in this post and draw them out to a satisfactory conclusion.
Language Modeling
One of the first things you teach students in an introductory NLP course is the derivation of the language modeling objective from maximum (log-)likelihood estimation: \begin{align*} \max_\theta \mathbb{E}_{x \sim p} \left[\log p_\theta(x_1, ..., x_n)\right] &= \min_\theta \mathbb{E}_{x \sim p} \left[-\log p_\theta(x_1, ..., x_n)\right]\\ &= \min_\theta \left[-\sum\limits_{x} p(x_1, ..., x_n)\log p_\theta(x_1, ..., x_n)\right] \\ &\approx \min_\theta \left[-\frac{1}{|D|} \sum\limits_{x\in D} \log p_\theta(x_1, ..., x_n)\right] \\ &= \min_\theta \left[-\frac{1}{|D|} \sum\limits_{x\in D} \sum\limits_{i=1}^n \log p_\theta(x_i|x_1, ..., x_{i-1})\right] \end{align*} And then maybe a little later on, you take a course in information theory (or read Cryptonomicon\footnote{Neal Stephenson's novel offers a great introduction to information theory in the context of a historical fiction narrative.}) and realize that this is really just the cross entropy between two distributions: $p$ is the "true" distribution of written English and $p_\theta$ is the distribution your language model learns over the course of training. With this understanding, you can rewrite the language modeling objective as the sum of the entropy of the underlying language and the KL-divergence between the actual language distribution and the predicted distribution: \begin{align*} \mathcal{L} &= -\frac{1}{|D|} \sum\limits_{x\in D} \sum\limits_{i=1}^n \log p_\theta(x_i|x_1, ..., x_{i-1})\\ &\approx \sum\limits_{x} -p(x) \log p_\theta(x) \\ &= \sum\limits_{x} p(x) \left[\log p(x) - \log p(x) - \log p_\theta(x)\right] \\ &= \sum\limits_{x} -p(x) \log p(x) + p(x) \left[\log p(x) - \log p_\theta(x)\right] \\ &= \underbrace{\sum\limits_{x} -p(x) \log p(x)}_{H(X)} + \underbrace{\sum\limits_{x} p(x) \log \frac{p(x)}{p_{\theta}(x)}}_{KL(p||p_{\theta})} \end{align*} where $H(X)$ is the entropy of the English language and $KL(p||p_{\theta})$ is the KL-divergence between the true distribution $p$ and your learned distribution $p_{\theta}$. Informally, the KL-divergence is a measure of how "off" $p_{\theta}$ is from $p$\footnote{The KL divergence between $p$ and $p_\theta$ can be interpreted as the expected number of extra bits needed to encode samples from $p$ when you use a code optimized for $p_\theta$ instead of one optimized for $p$.}. When our language model perfectly models the distribution of the English language, this term drops to 0, and we're left with $H(X)$, the entropy of the English language.
Ok; so we're training our language model to minimize $H(X) + KL(p||p_{\theta})$, but $H(X)$ is not a function of $\theta$, so we're really just minimizing $KL(p||p_{\theta})$. As a result, the cross entropy loss is lower-bounded by $H(X)$, and even a theoretically "perfect" language model will be unable to break past this threshold. Or can it?
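To make the decomposition concrete, here is a tiny NumPy sanity check with made-up toy distributions (the numbers are purely illustrative, not from any trained model):

```python
import numpy as np

# Toy "true" and model distributions over a 4-symbol alphabet (made-up numbers).
p = np.array([0.5, 0.2, 0.2, 0.1])   # true distribution p
q = np.array([0.4, 0.3, 0.2, 0.1])   # learned distribution p_theta

cross_entropy = -np.sum(p * np.log(q))   # what the language modeling loss estimates
entropy = -np.sum(p * np.log(p))         # H(X): the irreducible part
kl = np.sum(p * np.log(p / q))           # KL(p || p_theta): the part training can remove

# Cross-entropy = H(X) + KL(p || p_theta), as derived above.
assert np.isclose(cross_entropy, entropy + kl)
print(f"CE={cross_entropy:.4f}  H={entropy:.4f}  KL={kl:.4f}")
```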
Learning a Latent Language
The only way to decrease $H(X)$ is to reduce the entropy of the written English language. This seems really hard, but maybe we can cheat a bit. First, though, let's look at the sources of entropy in language. Concretely, what are they, and do we even want to try to eliminate them?
One of the examples that I love to use here is French. In French, adjectives and past participles agree with the gender of their subject; in particular, we add an "e" when the speaker is female. For example, the English phrase "I am tired" becomes "Je suis fatigué" for a male speaker and "Je suis fatiguée" for a female speaker.
Putting yourself in the proverbial shoes of an LLM trained on French literature, generating this last token is really hard: it's a coin flip without knowing the gender of the speaker. However, by moving to English, we eliminate this source of uncertainty because our adjectives do not change based on the gender of the speaker.
Going back to the first question, how might we cheat? Well, we fundamentally need a different language, one with a lower entropy than English, and ideally, other characteristics—such as sequence length and the distribution of entropy across tokens—that are more amenable to next-token prediction than English is. Maybe we can learn it in an end-to-end fashion, similar to how recent works try to learn how sequences of adjacent characters should be merged together into tokens \cite{hwang2025dynamic, blt, nawrot2022efficient, nawrot2021hierarchical}.
Latent Language Implementation
As a first approximation, we want to learn a latent language with two properties:
- It is easy for an LM to model.
- It can be converted to and from English with relative ease.
As a v1 attempt, borrowing from what already works for latent-space image generation in Computer Vision \cite{taming, ldm}, let's try jointly training a compression model alongside a language model. The compression model is composed of two components: an Encoder ($\mathcal{E}$) and a Decoder ($\mathcal{D}$). The Encoder takes in a sequence of characters and outputs a (compressed) sequence of discrete latent tokens, using Vector Quantization to map the Encoder's outputs to the nearest tokens in our latent language. The Decoder is then tasked with reconstructing the original English sentence from the (compressed) sequence of latent-language tokens.
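In PyTorch, the quantization step might look something like the following VQ-VAE-style sketch; the codebook size, commitment weight, and module names are illustrative assumptions, not the exact code I used:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps each Encoder output vector to the nearest token in the latent alphabet."""

    def __init__(self, vocab_size=1024, dim=512, commit_weight=0.25):
        super().__init__()
        self.codebook = nn.Embedding(vocab_size, dim)
        self.commit_weight = commit_weight

    def forward(self, z_e):
        # z_e: (batch, t/c, dim) continuous outputs from the Encoder.
        # Squared distance from every position to every codebook entry: (batch, t/c, vocab).
        dist = ((z_e.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)
        ids = dist.argmin(dim=-1)            # discrete latent tokens z_1, ..., z_{t/c}
        z_q = self.codebook(ids)             # their codebook embeddings

        # VQ-VAE-style codebook + commitment losses.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.commit_weight * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator so gradients from the Decoder (and LM) flow back into the Encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, ids, vq_loss
```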
This Encoder-Decoder paradigm is enough to learn a latent language, but there's no a priori reason that latent language would be easy for a LM to model or have desirable properties that make it better than English. To try and satisfy this first characteristic, we jointly train a Language Model alongside the compression model, making the next-token prediction loss backpropagate through the language model and back through the Encoder. The balance of the compression model minimizing reconstruction error, combined with the Language Model minimizing next-token prediction error, would hopefully avoid degenerate solutions\footnote{Such as the language model collapsing the latent vocabulary to a single, easy-to-predict token.} and learn a latent language that is easier to model than English.
Such an architecture might look like the one below:

Here, the Encoder and Decoder are instantiated as non-causal sequence models, specifically BERT\cite{bert}-style models. I chose a non-causal model because it enables better compression than a causal model: information from later in the sequence can become blended into tokens that appear earlier in the sequence, expanding the space of latent languages that could be learned. Specifically, it might be desirable to develop a language where the characters that appear early in the sequence are strongly influenced by (and in turn influence) characters that appear at the end of the sequence\footnote{The OG Attention Is All You Need transformer architecture follows a similar pattern for machine translation, using a non-causal Encoder.}.
For the actual compression and decompression methods, I played around with several different techniques for static compression (i.e. we go from an input character sequence length of $t$ to a latent token sequence of length $t / c$ where $c$ is the compression factor):
- A simple 1d convolution across the sequence dimension with kernel size $k$ and stride $c$.
- A state-space model\cite{mamba}, emitting a token every $c$ steps.
- Top-k attention pooling (where $k= t/c$).
- A linear layer to project each $d$-dimensional token down to $d/c$ and then reshaping the sequence length to merge every $c$ tokens together.
- A linear transformation on the sequence dimension itself: if the input is (b, t, d), then we can learn a per-feature dimension linear map to project to (b, t/c, d).
Out of these, (1) and (2) performed the best, with similar training metrics. Decompression for the conv1d is simply a transposed 1d convolution; for the SSM, we duplicate each token in the compressed sequence $c$ times, add learnable positional embeddings $(1, 2, ..., c)$, and pass this input through a Decoder-side SSM before feeding it into the BERT-style Decoder model.
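A minimal sketch of option (1), a strided conv1d compressor paired with a transposed conv1d decompressor; the kernel size and padding here are illustrative choices picked so that the sequence lengths come out to exactly $t/c$ and $t$:

```python
import torch
import torch.nn as nn

class ConvCompressor(nn.Module):
    """Strided conv1d: t character embeddings -> t/c latent positions, and back."""

    def __init__(self, dim=512, c=2, k=4):
        super().__init__()
        # With kernel k=4, stride c=2, padding 1, a length-512 sequence maps to exactly 256 and back.
        self.down = nn.Conv1d(dim, dim, kernel_size=k, stride=c, padding=(k - c) // 2)
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=k, stride=c, padding=(k - c) // 2)

    def compress(self, x):    # x: (batch, t, dim) character-level hidden states
        return self.down(x.transpose(1, 2)).transpose(1, 2)   # (batch, t/c, dim)

    def decompress(self, z):  # z: (batch, t/c, dim) latent hidden states
        return self.up(z.transpose(1, 2)).transpose(1, 2)     # (batch, t, dim)
```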
For the language model, I went with a GPT-2 small architecture\cite{gpt2}. I would have really liked to scale up the parameters on both the compression model and language model, but I'm limited by access to compute.
The joint training objective is a linear combination of the compression model and language model objectives: $$\mathcal{L}(x) = \underbrace{-\sum\limits_{i=1}^t \log p_\theta(x_i|z)}_{\text{compression model loss}} \underbrace{-\beta \sum\limits_{k=1}^{t/c} \log q_\phi(z_k|z_{ < k})}_{\text{language model loss}}$$ where $x = (x_1, x_2, ..., x_t)$ is the input sequence of characters, $z = (z_1, ..., z_{t/c})$ is the sequence of latent tokens, $\beta$ is a loss-weighting factor for the multi-task objective, $p_\theta$ is the Decoder's distribution over English characters, and $q_\phi$ is the language model's distribution over the latent alphabet.
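A sketch of how this joint objective might be computed in PyTorch; whether the VQ/commitment term is folded into the same sum, and the value of $\beta$, are illustrative choices here:

```python
import torch.nn.functional as F

def joint_loss(decoder_logits, target_chars, lm_logits, latent_ids, vq_loss, beta=0.5):
    """decoder_logits: (batch, t, n_chars), target_chars: (batch, t),
    lm_logits: (batch, t/c, latent_vocab), latent_ids: (batch, t/c)."""
    # Compression-model loss: reconstruct the English characters x_1..x_t from z.
    recon = F.cross_entropy(decoder_logits.flatten(0, 1), target_chars.flatten())

    # Language-model loss: predict z_k from z_{<k} (teacher forcing, shift by one).
    lm = F.cross_entropy(lm_logits[:, :-1].flatten(0, 1), latent_ids[:, 1:].flatten())

    return recon + beta * lm + vq_loss
```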
Remarks on Lossy Compression and No Free Lunch
You'll notice this approach employs lossy compression: the Decoder network does not perfectly reconstruct the English characters that were fed into the Encoder. This differs from tokenization methods, which bijectively map sequences of English characters into tokens. However, bijective functions cannot reduce entropy: $H(X) = H(f(X))$ where $f$ is a bijective function. If we want to develop a latent language that has lower entropy than English, some sort of lossy compression is needed.
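A tiny numerical illustration of why lossless (bijective) tokenization can't help here: a bijection on a discrete alphabet is just a relabeling of symbols, and relabeling leaves the entropy unchanged.

```python
import numpy as np

def entropy(dist):
    return -np.sum(dist * np.log2(dist))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(16))          # an arbitrary distribution over 16 symbols

# A bijection on a discrete alphabet is just a permutation of the support.
p_relabeled = p[rng.permutation(16)]

assert np.isclose(entropy(p), entropy(p_relabeled))   # H(X) == H(f(X))
```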
You will also notice that we are still computing the cross-entropy loss with respect to English characters: $P(x_i|z)$ for the sequence of latent tokens $z = (z_1, ..., z_{t/c})$, so in a sense, we're continuing to pay the entropy penalty for the English language. We're just changing where that price is paid, moving part of the cost from the language model to the compression model.
Let's return to our hypothetical example of translating between French and English. The French phrase "Je suis fatiguée" compresses to the English "I am tired": the language model no longer has to pay for the gendered "e", but that uncertainty doesn't vanish. Whatever maps the English back to French (playing the role of the Decoder) still has to guess the speaker's gender, so the entropy has simply moved from the language model to the decompression step.
Results
For the following results, I use a BPE checkpoint trained for 100 epochs, while the latent language model was only trained for 75 epochs (when I lost access to compute). I just used the OpenWebText dataset\cite{owt}, writing an epoch every ~3,000 steps. Both the BPE and latent language models used the exact same training settings: learning rate of 5e-5, dropout of 0.05, sequence length of 512 characters, etc.
In the plot below, we compare a standard BPE model trained on OpenWebText to a latent language model trained on the same dataset. Using the notation from earlier, this compares $H(X) + KL(p||p_\theta)$ for the BPE model with $H(Z) + KL(q||q_\phi)$ for the latent language model, where the latent language model uses a compression factor of $c=2$ while the BPE model uses a compression factor of $c\approx4.1$. We plot loss-per-character, normalizing the BPE cross-entropy loss by 4 and the latent-language cross-entropy loss by 2.

It's also interesting to look at the distortion from the compression model, i.e., $H(X|Z)$. This term is very small, on the order of 1e-4, indicating the compression model is able to reconstruct the input sentences through the information bottleneck with relative ease.
Perhaps more interesting than the distortion from the reconstructions is the distortion from the output of the language model during training with teacher forcing: how well does the Decoder reconstruct the input (shifted by one token) after taking the most likely latent token output from the language model?
Formally, this term is the cross-entropy of $p_\theta(x|\hat{z})$, where $\hat{z}$ is the sequence of latent tokens with the highest probability mass under the language model. It can be viewed as the distribution over English characters when the "surprise" from selecting the next latent token is factored out (by the language model). In this experiment, it is $\sim 0.07$. Unsurprisingly, this is substantially lower than the entropy of the English language\footnote{Often estimated to be around 0.6-1.3 bits per character.}, and while the end-to-end method does not work, it's still quite cool to see an output distribution over English characters with such low cross-entropy loss.
Entropy Estimation Analysis
Using our OpenWebText dataset as a proxy for "all written English", we can approximate the inherent entropy of English and compare this value to the inherent entropy of our latent language. We approximate both values by computing 4-gram probabilities on the validation dataset using statistics computed on the training set. For example, the probability of the character "c" in the phrase "quick" is estimated as the training-set count of the 4-gram "quic" divided by the count of the trigram "qui" that precedes it.
We estimate the entropy of English as $\sim 1.8$ bits per character\footnote{Read: bits per byte because we use 8-bit ASCII encodings.} and the entropy of our latent language as $\sim 3.0$ bits per (English) character. This result shows that we did not learn a latent language with lower inherent entropy than English: backpropagating the gradients from the next-token language modeling loss back through the Encoder of the compression model does not, on its own, cause it to develop a lower-entropy latent language. I have a couple of ideas for why this didn't work, but haven't (yet) been able to investigate them.
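A minimal sketch of the 4-gram estimate described above (no smoothing or backoff, so it simply skips unseen n-grams; a more careful estimate would need to handle them):

```python
from collections import Counter
import math

def ngram_bits_per_char(train_text, val_text, n=4):
    """Bits-per-character of val_text under an n-gram model with counts from train_text."""
    ngrams = Counter(train_text[i:i + n] for i in range(len(train_text) - n + 1))
    contexts = Counter(train_text[i:i + n - 1] for i in range(len(train_text) - n + 1))

    total_bits, total_chars = 0.0, 0
    for i in range(n - 1, len(val_text)):
        context = val_text[i - n + 1:i]
        ngram = val_text[i - n + 1:i + 1]
        if ngrams[ngram] == 0:
            continue  # unseen n-gram: a real estimate needs smoothing / backoff
        total_bits += -math.log2(ngrams[ngram] / contexts[context])
        total_chars += 1
    return total_bits / total_chars
```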
Mapping Latent Tokens to English
It might also be useful to look at how individual latent tokens (or groups of them) are mapped to English to better understand the properties of the latent language that was learned. For this analysis, I chose the phrase "the quick brown fox jumps over the lazy dog"; decoding individual latent tokens (or small groups of them) in isolation produces fragments like:
- e
- rud
- k
- brvwn
- f
- x b
- mpk vver
- h
- w:_y vog
But then decoding sequences of latent tokens does quite a bit better:
Example (non-cherry picked) Generations
The following prompts are the starting sentences of the first and second examples in the validation split of the OpenWebText dataset. The next token is sampled with temperature = 0.9, top-p = 0.95, and top-k = 50. I did not fine-tune these hyperparameters; they were the defaults I found for generating with GPT-2. The latent language model uses a compression factor of $c=2$, although from playing around with other settings, $c=4$ also seemed to work well. BPE tokenization on average reduces every 4.1 adjacent characters to a single token, so we'll approximate the sequential compression factor of BPE as $c=4$.
For the latent language model, we feed the prompt through the encoder, auto-regressively generate in the latent space, and then pass the full sequence (prompt tokens + generated tokens) through the decoder to output an English sentence. As the compression model is lossy, there will occasionally be spelling mistakes in the output, even for the prompt. I've highlighted some of the spelling mistakes in the generations below.
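A sketch of this generation loop (top-p filtering omitted for brevity; the function and argument names are illustrative, not my actual API):

```python
import torch

@torch.no_grad()
def generate(encoder, lm, decoder, prompt_chars, n_new_latents=128,
             temperature=0.9, top_k=50):
    """Encode the prompt, sample new latent tokens autoregressively, decode everything."""
    z = encoder(prompt_chars)                        # (1, prompt_len / c) latent token ids

    for _ in range(n_new_latents):
        logits = lm(z)[:, -1] / temperature          # logits for the next latent token
        topk_vals, topk_ids = logits.topk(top_k, dim=-1)
        probs = torch.softmax(topk_vals, dim=-1)
        next_id = topk_ids.gather(-1, torch.multinomial(probs, 1))
        z = torch.cat([z, next_id], dim=1)           # stay entirely in the latent space

    return decoder(z)                                # map prompt + generated latents back to characters
```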
Example 1:
Prompt:
BPE Model (compression factor = ~4.1):
Latent Language Model (compression factor = 2):
Example 2:
Prompt:
BPE Model (compression factor = ~4.1):
Latent Language Model (compression factor = 2):
Thoughts on the Future
A valid question is whether this even matters: is there any benefit in reducing the inherent entropy of the underlying data for language modeling? My honest answer is that I really don't know. Looking at a more concrete case, would there be any benefit to language modeling in a higher-entropy language like French by first translating the input French prompt to English, generating the response, and then translating the entire sequence back to French? You will likely get a different output French sentence—with even the prompt changing on occasion after the round trip—but maybe this is "good enough"? If not, would a latent language be useful in other contexts where decoding back to English is less important? Perhaps in text-conditioned generation, to reduce the sequence length of the prompt (e.g., text-to-image and text-to-video generation)?
Another point I'd like to address is how a latent language model would differ from tokenization methods, even the most recent dynamic tokenization approaches like H-Nets\cite{hwang2025dynamic}, Dynamic Token Pooling\cite{nawrot2022efficient}, Byte Latent Transformers\cite{blt}, and Hourglass Networks\cite{nawrot2021hierarchical}. First of all, the cited methods actually work whereas my latent language idea does not :) But a close second is generation in the latent space. For dynamic tokenization methods, the input to the inner language model is not actually the language model's own output, but rather the language model's output first transformed by the Decoder and then again by the Encoder. The Decoder is needed to figure out the next English token to be generated, and then the Encoder is needed to map this sequence into a representation suitable for the LM to generate the next token. This characteristic contrasts with the notion of a "latent language", where one would be able to autoregressively generate sequences entirely within the latent space.
Extrapolating a bit, it's possible a future generation of language models will employ some combination of dynamic chunking within a latent language optimized for machines, not humans. Perhaps this will arise as autonomous agents develop efficient communication protocols to convey information without humans in the loop, but I'll leave these ponderings to the science fiction writers of today rather than the blog post of an unemployed and homeless PhD graduate.
Citation
Cite as:
@article{fifty2025latent,
title = "Latent Space Language Modeling",
author = "Fifty, Christopher",
journal = "cfifty.github.io",
year = "2025",
month = "Jul",
url = "https://cfifty.github.io/writings/latent_lm.html"
}
References
- @article{hwang2025dynamic, title={Dynamic Chunking for End-to-End Hierarchical Sequence Modeling}, author={Hwang, Sukjun and Wang, Brandon and Gu, Albert}, journal={arXiv preprint arXiv:2507.07955}, year={2025} }
- @article{nawrot2022efficient, title={Efficient transformers with dynamic token pooling}, author={Nawrot, Piotr and Chorowski, Jan and Lancucki, Adrian and Ponti, Edoardo M}, journal={arXiv preprint arXiv:2211.09761}, year={2022} }
- @article{nawrot2021hierarchical, title={Hierarchical transformers are more efficient language models}, author={Nawrot, Piotr and Tworkowski, Szymon and Tyrolski, Michal and Kaiser, Lukasz and Wu, Yuhuai and Szegedy, Christian and Michalewski, Henryk}, journal={arXiv preprint arXiv:2110.13711}, year={2021} }
- @inproceedings{taming, title={Taming transformers for high-resolution image synthesis}, author={Esser, Patrick and Rombach, Robin and Ommer, Bjorn}, booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition}, pages={12873--12883}, year={2021} }
- @inproceedings{ldm, title={High-resolution image synthesis with latent diffusion models}, author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn}, booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition}, pages={10684--10695}, year={2022} }
- @inproceedings{bert, title={Bert: Pre-training of deep bidirectional transformers for language understanding}, author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, booktitle={Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)}, pages={4171--4186}, year={2019} }
- @article{mamba, title={Mamba: Linear-time sequence modeling with selective state spaces}, author={Gu, Albert and Dao, Tri}, journal={arXiv preprint arXiv:2312.00752}, year={2023} }
- @article{gpt2, title={Language models are unsupervised multitask learners}, author={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya and others}, journal={OpenAI blog}, volume={1}, number={8}, pages={9}, year={2019} }
- @misc{owt, title={OpenWebText Corpus}, author={Aaron Gokaslan and Vanya Cohen}, howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}}, year={2019} }
- @article{blt, title={Byte latent transformer: Patches scale better than tokens}, author={Pagnoni, Artidoro and Pasunuru, Ram and Rodriguez, Pedro and Nguyen, John and Muller, Benjamin and Li, Margaret and Zhou, Chunting and Yu, Lili and Weston, Jason and Zettlemoyer, Luke and others}, journal={arXiv preprint arXiv:2412.09871}, year={2024} }