Well, somehow, most short-form content on YouTube doesn't have this problem. Perfectly clear dialogue.
I think the main problem is that producers and audio people are stupid, pompous wankers. And I guess it doesn't help that some people go to the cinema for the vibrations and don't care about the content.
Nice! I haven't read Axiomatic yet, but this has been my "Greg Egan year". I have read Permutation City and Diaspora: maybe the two most stimulating scifi novels I have ever read.
Read Diaspora last year w/o knowing anything about it. Easily one of my favorite sci-fi books to date—I can’t believe it was there waiting for me the entire time. Permutation City is one of my next 3 reads.
I also highly recommend _Distress_ as it continues some cosmology ideas from Permutation City.
There are also several novels which are kind of similar to Diaspora: Schild's Ladder and Incandescence, plus stories set in the Incandescence universe: "Riding the Crocodile", "Hot Rock", and "Glory".
As noted in the article, Sage sent emails to hundreds of people with this gimmick:
> In the span of two weeks, the Claude agents in the AI Village (Claude Sonnet 4.5, Sonnet 3.7, Opus 4.1, and Haiku 4.5) sent about 300 emails to NGOs and game journalists.
That's definitely "multiple" and "unsolicited", and most would say "large".
This is a definition of spam, not the only definition of spam.
In Canada, which is relevant here, the legal definition of spam requires no bulk.
Any company sending an unsolicited email to a person (where permission doesn't exist) is spamming that person. And the law expands the definition even further than that.
I got really interested in LLMs in 2020, after the GPT-3 release demonstrated in-context learning. But I had tried running an LLM a year before that: playing AI Dungeon 2 (based on GPT-2).
Back in 2020 people were discussing how transformer-based language models are limited in all sorts of ways (operating on a tiny context, etc.). But as I learned how transformers work, I got really excited: it's possible to use raw vectors as input, not just text. So I got this idea that all kinds of modules could be implemented on top of pre-trained transformers via adapters which translate arbitrary data into the representations of a particular model. E.g. you can make a new token representing some command, etc.
A lack of memory was one of the hot topics, so I did a little experiment: since the KV cache has to encode 'run-time' memory, I tried transplanting parts of the KV cache from one model forward pass into another - and apparently only a few middle layers were sufficient to make the model recall a name from the prior pass. But I didn't go further, as it was too time-consuming for a hobby project. So that's where I left it.
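For anyone curious, the experiment looks roughly like the sketch below, written against the Hugging Face transformers API with GPT-2 as a stand-in model. The prompts, the `MID_LAYERS` range, and the helper names are arbitrary placeholders for illustration, not my original setup, and it assumes a transformers version that still accepts the legacy per-layer (key, value) cache format.

```python
# Sketch of a KV-cache transplant: splice a few middle layers of the cache from
# a pass that saw a name into the cache of an unrelated pass, then check whether
# the model recalls the name. Assumes a GPT-2-style model and a transformers
# version that accepts legacy per-layer (key, value) tuples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def kv_cache(text):
    """One forward pass; return the cache as a legacy tuple of (key, value) per layer."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, use_cache=True)
    pkv = out.past_key_values
    return pkv.to_legacy_cache() if hasattr(pkv, "to_legacy_cache") else pkv

cache_name    = kv_cache("My name is Alice and I really like strong green tea.")
cache_neutral = kv_cache("The weather has been quite nice and warm this week.")

# Trim both caches to the same length so the per-layer shapes line up.
min_len = min(cache_name[0][0].shape[2], cache_neutral[0][0].shape[2])

def trim(kv):
    return kv[0][:, :, :min_len], kv[1][:, :, :min_len]

# Hybrid cache: a few middle layers from the "name" pass, the rest from the neutral pass.
MID_LAYERS = range(4, 8)  # arbitrary choice for illustration
hybrid = tuple(
    trim(cache_name[i]) if i in MID_LAYERS else trim(cache_neutral[i])
    for i in range(len(cache_neutral))
)

# Greedily decode a few tokens on top of the hybrid cache and see what comes out.
ids = tok(" My name is", return_tensors="pt").input_ids
cache, out_tokens = hybrid, []
with torch.no_grad():
    for _ in range(3):
        out = model(ids, past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        ids = out.logits[:, -1:].argmax(dim=-1)
        out_tokens.append(ids)
print(tok.decode(torch.cat(out_tokens, dim=-1)[0]))
```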
Over the years, academic researchers worked through the same ideas I had and gave them names:
* arbitrary vectors injected in place of fixed token embeddings are called a "soft prompt" (see the sketch after this list)
* a custom KV prefix added before the normal context is called "prefix tuning"
* a "soft prompt" used to generate a KV prefix which encodes a memory is called "gisting"
* a KV prefix encoding a specific collection of documents was recently dubbed a "cartridge"
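To make the first item concrete, here's a minimal soft-prompt sketch against the Hugging Face API (GPT-2 as a stand-in; the prompt length, learning rate and training text are arbitrary): a handful of trainable vectors is prepended in embedding space via `inputs_embeds`, and only those vectors get gradient updates.

```python
# Soft prompt: trainable vectors prepended in embedding space instead of real tokens.
# The base model stays frozen; only the soft prompt receives gradients.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.requires_grad_(False)

N_SOFT = 8  # number of soft tokens (arbitrary)
emb_dim = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(torch.randn(1, N_SOFT, emb_dim) * 0.02)

def loss_with_soft_prompt(text):
    ids = tok(text, return_tensors="pt").input_ids                    # [1, T]
    tok_embeds = model.get_input_embeddings()(ids)                    # [1, T, d]
    embeds = torch.cat([soft_prompt, tok_embeds], dim=1)              # soft prefix + real tokens
    labels = torch.cat([torch.full((1, N_SOFT), -100), ids], dim=1)   # no loss on the prefix
    return model(inputs_embeds=embeds, labels=labels).loss

opt = torch.optim.Adam([soft_prompt], lr=1e-3)
for _ in range(10):  # toy training loop on a single example
    opt.zero_grad()
    loss = loss_with_soft_prompt("Text whose style the soft prompt should steer towards.")
    loss.backward()
    opt.step()
```

Prefix tuning is the same idea one level deeper: instead of learning input embeddings, you learn per-layer key/value vectors that get prepended to the KV cache.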
Opus 4.5 running in Claude Code can pretty much run an experiment of this kind on its own, starting from a general idea. But it still needs some help - to make sure the prompts and formats actually make sense, to find the best dataset, etc.
The prefix-tuning approach was largely abandoned in favor of LoRA: the training process is much the same whether you tune a prefix or some adapter layers, but LoRAs are more flexible to train.
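For comparison, this is roughly what that looks like with the PEFT library (module names and hyperparameters below are typical for GPT-2 and will differ for other models); PEFT also still ships a PrefixTuningConfig if you want the older approach.

```python
# LoRA via the PEFT library: low-rank adapters on the attention projections,
# with the base weights frozen. GPT-2 used as a stand-in; hyperparameters are arbitrary.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor
    target_modules=["c_attn"],  # GPT-2 fuses q/k/v into one Conv1D projection
    fan_in_fan_out=True,        # GPT-2's Conv1D stores weights transposed
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small A/B matrices are trainable
```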
The Skills concept emerges naturally once you see how coding agents use docs, CLI tools and code. Their advantage is that they can be edited on the fly to incorporate new information and can learn from any feedback source: human, code execution, web search or LLMs.
KV-based "skill capsules" are very different from LoRAs / classic prefix tuning:
* A "hypernetwork" (which can be, in fact, same LLM) can build
a skill capsules _from a single example_.
You can't get LoRA or KV-prefix using just one example.
* It can be inserted at any point, as needed. I.e. if during reasoning you find that you need particular skill, you can insert it.
* They are composable, and far less likely to over-write some information, as they only affect KV cache and not weights.
Skills as used by Anthropic & OpenAI are just textual instructions. A KV-based skill capsule can be a lot more compact (and thus would contribute less to context rot) and might encode information which is difficult to convey through instructions (e.g. style).
It is not a decoration. Karpathy juxtaposes ChatGPT (which feels like a "better Google" to most people) with Claude Code, which, apparently, feels different to him. It's a comparison between the two.
You might find this statement non-informative, but without the two parts there's no comparison. That's really the semantics of the statement Karpathy is trying to express.
The ChatGPT-ish "it's not just" is annoying because the first part is usually a strawman, something the reader considers trite. But that's not the case here.
Indeed, I was probably grumpy at the time I wrote the comment. I do find some truth in it still.
You're right! The strawman theory is based.
But I think there's more to it: I dislike the structure of these sentences (which I find a bit sensationalist for nothing; I don't know, maybe I am still grumpy).
Well, language is subject to a 'fashion' one-upmanship game: people want to demonstrate their sophistication, often by copying some "cool" patterns, but then over-used patterns become "uncool" clichés.
So it might just be a natural reaction to over-use of a particular pattern. This kind of stuff has been driving language evolution for millennia. Besides that, a pompous style is often used in 'copy' (slogans and ads), which is something most people don't like.
They are comparing the 1B Gemma to a 1B+1B T5Gemma 2. Obviously a model with twice as many parameters can do better. That says absolutely nothing about the benefits of the architecture.
Since the encoder weights only get used for the prefixed context and then the decoder weights take over for generation, the compute requirements should be roughly the same as for the decoder-only model. Obviously an architecture that can make use of twice the parameters in the same amount of time is better. They should've put some throughput measurements in the paper, though...
You may not have seen this part:
"Tied embeddings: We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint — crucial for our new compact 270M-270M model."
I have seen that part. In fact, I checked the paper itself, where they provide more detailed numbers: it's still almost double the base Gemma; reusing the embeddings and attention doesn't make that much difference, as most of the weights are in the MLPs.
Leopold Aschenbrenner was talking about "unhobbling" as an ongoing process. That's what we are seeing here. Not unexpected.