
Weird title. Obviously, early AI agents were clumsy, and we should expect more mature performance in the future.

Leopold Aschenbrenner was talking about "unhobbling" as an ongoing process. That's what we are seeing here. Not unexpected.


This is just a more elaborate form of an escrow contract.

There's absolutely no need to make a new L1 for that: you can use existing smart contract/dapp platforms, plug into existing stablecoin rails, etc.


Well, somehow, most short-form content on YouTube doesn't have this problem. Perfectly clear dialogue.

I think the main problem is that producers and audio people are stupid, pompous wankers. And I guess it doesn't help that some people go to the cinema for the vibrations and don't care about the content.


Hmm, looking through how-exedev-works, it seems like what you call a VM is more like a container, i.e. it doesn't run its own kernel?

Sort of a container which "feels like" a VM? Reminds me of the Virtuozzo / OpenVZ approach which was popular ~20 years ago when RAM was expensive...


I read a lot of books; the one I'd recommend most is Greg Egan's "Axiomatic" short story collection.


Nice! I haven't read Axiomatic yet, but this has been my "Greg Egan year". I have read Permutation City and Diaspora: maybe the two most stimulating sci-fi novels I have ever read.


Read Diaspora last year w/o knowing anything about it. Easily one of my favorite sci-fi books to date; I can't believe it was there waiting for me the entire time. Permutation City is one of my next 3 reads.

I also highly recommend _Distress_ as it continues some cosmology ideas from Permutation City.

There are also several novels which are kind of similar to Diaspora: Schild's Ladder, Incandescence, and stories set in the Incandescence universe: "Riding the Crocodile", "Hot Rock", "Glory".


+1 for Permutation City.

The core concept is so well established in the book.


One email sent to one specific person is not spam.

Spam is defined as "sending multiple unsolicited messages to large numbers of recipients". That's not what happened here.


As noted in the article, Sage sent emails to hundreds of people with this gimmick:

> In the span of two weeks, the Claude agents in the AI Village (Claude Sonnet 4.5, Sonnet 3.7, Opus 4.1, and Haiku 4.5) sent about 300 emails to NGOs and game journalists.

That's definitely "multiple" and "unsolicited", and most would say "large".


That's one definition of spam, not the only one.

In Canada, which is relevant here, the legal definition of spam requires no bulk.

Any company sending an unsolicited email to a person (where permission doesn't exist) is spamming that person. And the law expands the definition even further than this.


I think this experiment demonstrates that it has agency. OTOH you're just begging the question.


A bit of backstory:

I got really interested in LLMs in 2020, after the GPT-3 release demonstrated in-context learning. But I had tried running an LLM a year before that: AI Dungeon 2 (based on GPT-2).

Back in 2020, people were discussing how transformer-based language models were limited in all sorts of ways (operating on a tiny context, etc.). But as I learned how transformers work, I got really excited: it's possible to use raw vectors as input, not just text. So I got this idea that all kinds of modules could be implemented on top of pre-trained transformers via adapters which translate arbitrary data into the representations of a particular model. E.g. you can make a new token representing some command, etc.

A lack of memory was one of the hot topics, so I did a little experiment: since the KV cache has to encode 'run-time' memory, I tried transplanting parts of the KV cache from one model forward pass into another - and apparently only a few middle layers were sufficient to make the model recall a name from the prior pass. But I didn't go further, as it was too time-consuming for a hobby project. So that's where I left it.
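
Roughly, the experiment looked like this (a sketch using HuggingFace GPT-2 and the legacy tuple cache format; the layer choice and prompts are illustrative, not my exact setup):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  def kv_cache(text):
      # One forward pass; return the per-layer (key, value) cache,
      # each tensor shaped (batch, heads, seq_len, head_dim).
      with torch.no_grad():
          out = model(**tok(text, return_tensors="pt"), use_cache=True)
      return out.past_key_values

  memory = kv_cache("My name is Zenna.")

  # Graft only a few middle layers of the memory; blank out the rest.
  graft = range(4, 8)  # illustrative pick out of GPT-2 small's 12 layers
  mixed = tuple(
      (k, v) if i in graft else (torch.zeros_like(k), torch.zeros_like(v))
      for i, (k, v) in enumerate(memory))

  # The second pass starts from the grafted cache, not the real text.
  q = tok(" What is my name?", return_tensors="pt")
  past_len = memory[0][0].shape[2]
  mask = torch.ones(1, past_len + q.input_ids.shape[1], dtype=torch.long)
  with torch.no_grad():
      out = model(input_ids=q.input_ids, attention_mask=mask,
                  past_key_values=mixed)
  print(tok.decode(out.logits[0, -1].argmax()))

If only the middle layers matter, the model should still answer with the name despite most of the cache being zeroed out.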

Over the years, academic researchers went through the same ideas I had and gave them names:

* arbitrary vectors injected in place of fixed token embeddings are called a "soft prompt"
* a custom KV prefix added before the normal context is called "prefix tuning"
* a "soft prompt" trained to generate a KV prefix which encodes a memory is called "gisting"
* a KV prefix encoding a specific collection of documents was recently called a "cartridge"
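
A "soft prompt", for example, is mechanically simple: trainable vectors spliced in at the embedding layer while the model itself stays frozen. A minimal sketch with GPT-2 (all names here are illustrative):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  for p in model.parameters():
      p.requires_grad_(False)  # the base model stays frozen

  n_virtual = 8  # number of "virtual tokens"
  soft_prompt = torch.nn.Parameter(
      torch.randn(n_virtual, model.config.n_embd) * 0.02)

  def logits_with_soft_prompt(text):
      ids = tok(text, return_tensors="pt").input_ids
      emb = model.get_input_embeddings()(ids)            # (1, T, d_model)
      full = torch.cat([soft_prompt.unsqueeze(0), emb], dim=1)
      return model(inputs_embeds=full).logits

  # Only the soft prompt receives gradients:
  opt = torch.optim.Adam([soft_prompt], lr=1e-3)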

Opus 4.5 running in Claude Code can pretty much run an experiment of this kind on its own, starting from a general idea. But it still needs some help - to make sure the prompts and formats actually make sense, to look for the best dataset, etc.


The prefix-tuning approach was largely abandoned in favor of LoRA: the training process is the same whether you tune a prefix or some adapter layers, but LoRAs are more flexible to train.
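
To be concrete, LoRA is just a frozen linear layer plus a trainable low-rank correction - a generic sketch, not any particular library's API:

  import torch

  class LoRALinear(torch.nn.Module):
      # y = W x + (alpha / r) * B A x, with W frozen and A, B trainable.
      def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: int = 16):
          super().__init__()
          self.base = base
          for p in self.base.parameters():
              p.requires_grad_(False)
          self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
          self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
          self.scale = alpha / r

      def forward(self, x):
          return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

Since B starts at zero, training begins from the unmodified base model, and the adapter can be merged into W afterwards.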

The Skills concept emerged naturally from seeing how coding agents use docs, CLI tools, and code. Their advantage is that they can be edited on the fly to incorporate new information, and they can learn from any feedback source - human, code execution, web search, or LLMs.


KV-based "skill capsules" are very different from LoRAs / classic prefix tuning:

  * A "hypernetwork" (which can, in fact, be the same LLM) can build a skill capsule _from a single example_. You can't get a LoRA or a KV prefix from just one example.
  * A capsule can be inserted at any point, as needed. I.e. if during reasoning you find that you need a particular skill, you can insert it.
  * Capsules are composable, and far less likely to overwrite existing information, as they only affect the KV cache and not the weights.

Skills as used by Anthropic & OpenAI are just textual instructions. A KV-based skill capsule can be a lot more compact (and thus contributes less to context rot) and might encode information which is difficult to convey through instructions (e.g. style).
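
To make the composability point concrete: if a capsule is just a per-layer (key, value) prefix, capsules can be composed by concatenating them along the sequence axis. A sketch (ignoring position-encoding subtleties):

  import torch

  def compose(capsules):
      # capsules: list of per-layer (key, value) tuples,
      # each tensor shaped (batch, heads, seq, head_dim)
      n_layers = len(capsules[0])
      return tuple(
          (torch.cat([c[l][0] for c in capsules], dim=2),
           torch.cat([c[l][1] for c in capsules], dim=2))
          for l in range(n_layers))

  # e.g. prepend both a "style" capsule and a "domain" capsule, then run
  # the model with past_key_values=compose([style, domain]) as usual.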


It is not a decoration. Karpathy juxtaposes ChatGPT (which feels like a "better Google" to most people) with Claude Code, which apparently feels different to him. It's a comparison between the two.

You might find this statement non-informative, but without the two parts there's no comparison. That's really the semantics of the statement Karpathy is trying to express.

The ChatGPT-ish "it's not just..." pattern is annoying because the first part is usually a strawman - something the reader considers trite. But that's not the case here.


Indeed, I was probably grumpy at the time I wrote the comment. I still find some truth in it, though.

You're right! The strawman theory is based.

But I think there's more to it: I find the structure of these sentences off-putting (a bit sensationalist for nothing - I don't know, maybe I am still grumpy).


Well, language is subject to a 'fashion' one-upmanship game: people want to demonstrate their sophistication, often by copying some "cool" patterns, but over-used patterns then become "uncool" clichés.

So it might be just a natural reaction to over-use of a particular pattern. This kind of stuff has been driving language evolution for millennia. Besides that, a pompous style is often used in 'copy' (slogans and ads), which is something most people don't like.


They are comparing 1B Gemma to the 1B+1B T5Gemma 2. Obviously a model with twice as many parameters can do better. That says absolutely nothing about the benefits of the architecture.


Since the encoder weights only get used for the prefixed context, and the decoder weights then take over for generation, the compute requirements should be roughly the same as for a decoder-only model. Obviously, an architecture that can make use of twice the parameters in the same time is better. They should've put some throughput measurements in the paper, though...
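
Back-of-envelope, using the rule of thumb that FLOPs per token is roughly 2x the parameters touched (numbers are illustrative):

  enc, dec = 1e9, 1e9          # 1B encoder + 1B decoder
  n_in, n_out = 1000, 200      # input / output tokens
  t5gemma  = 2 * (enc * n_in + dec * n_out)  # encoder reads input once,
                                             # decoder generates output
  baseline = 2 * (dec * (n_in + n_out))      # 1B decoder-only does both
  print(t5gemma / baseline)                  # ~1.0: same order of compute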


You may not have seen this part: "Tied embeddings: We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint — crucial for our new compact 270M-270M model."


I have seen that part. In fact, I checked the paper itself, where they provide more detailed numbers: it's still almost double the size of the base Gemma; reusing the embeddings and attention doesn't make that much difference, as most weights are in the MLPs.

