Hacker News | gwern's comments


There's also severe selection effects: what documents have been preserved, printed, and scanned because they turned out to be on the right track towards relativity?

This.

Especially for London, there is a huge chunk of recorded parliamentary debates.

More interesting for dialogue, anyway, would be training on recorded correspondence in the form of letters.

And that corpus script just looks odd, to say the least: just oversample by X?


Oh! I honestly didn't think about that, but that's a very good point!

I think my takeaway is that you are seeing mostly mode-collapse here. There is a high consistency across all of the supposedly different personalities (higher than the naive count would indicate: remember, the stochastic nature of sampling will inflate the number of 'different' responses, since the OP doesn't say anything about sampling a large number of times to estimate the true modal response).
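The sampling caveat can be made concrete. Here is a minimal sketch (the `ask` callable is a hypothetical stand-in for any LLM API, and the stub distribution is invented for illustration) that samples the same prompt many times and reports the modal answer with its frequency; a high modal frequency across supposedly different personas is evidence of mode-collapse rather than genuine variation:

```python
import random
from collections import Counter

def modal_response(ask, prompt, n=20):
    """Sample the same prompt n times; return (modal answer, its frequency)."""
    answers = [ask(prompt) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n

# Stub sampler: mostly one answer plus stochastic noise,
# mimicking a mode-collapsed model.
random.seed(0)
stub = lambda _prompt: random.choices(["blue", "azure", "teal"], [0.8, 0.1, 0.1])[0]
answer, freq = modal_response(stub, "Favorite color?", n=200)
```

A single sample from `stub` would look 'different' 1 time in 5; only the repeated sampling reveals that the underlying distribution is dominated by one mode.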

You are right about mode-collapse -- and that observation is exactly what makes this interesting.

In my other comment here, I described The Sims' zodiac from 1997: Will Wright computed signs from personality via Euclidean distance to archetypal vectors, displayed them cosmetically, and wrote zero behavioral code. The zodiac affected nothing. Yet testers reported bugs: "The zodiac influence is too strong! Tune it down!"
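The mechanism described above is simple enough to sketch. The archetype vectors below are invented for illustration (the game's actual table is not reproduced here); only the technique, nearest archetype by Euclidean distance over the five personality axes, follows the description:

```python
import math

# Hypothetical archetype vectors over The Sims' five personality axes
# (neat, outgoing, active, playful, nice), each scored 0-10.
ARCHETYPES = {
    "Aries":  (5, 8, 6, 7, 3),
    "Taurus": (5, 5, 3, 3, 8),
    "Virgo":  (9, 2, 6, 3, 4),
    "Pisces": (5, 3, 7, 3, 7),
}

def zodiac(personality):
    """Return the sign whose archetype is nearest in Euclidean distance.

    As described of The Sims: the sign is computed and displayed,
    but no behavioral code ever reads it.
    """
    return min(ARCHETYPES, key=lambda s: math.dist(ARCHETYPES[s], personality))
```

The point of the anecdote survives the sketch: `zodiac()` has no callers on the behavior side, yet players perceived its influence anyway.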

Your "mode-collapse with stochastic noise" is the same phenomenon measured from the other direction. In The Sims: zero computed difference, perceived personality. In this LLM experiment: minimal computed difference, perceived personality. Same gap.

Will called it the Simulator Effect: players imagine more than you simulate. I would argue mode-collapse IS the Simulator Effect measured from the output side.

But here is where it becomes actionable: one voice is the wrong number of voices.

ChatGPT gives you the statistical center -- mode-collapse to the bland mean. The single answer that offends no one and inspires no one. You cannot fix this with better prompting, because it is the inevitable result of single-agent inference.

Timothy Leary built MIND MIRROR in 1985 -- psychology software visualizing personality as a circumplex, based on his 1950 PhD dissertation on the Interpersonal Circumplex. The Sims inherited this (neat, outgoing, active, playful, nice). But a personality profile is not an answer. It is a lens.

The wild part: in 1970, Leary took his own test during prison intake, gamed it to get minimum security classification (outdoor work detail), and escaped by climbing a telephone wire over the fence. The system's own tools became instruments of liberation.

https://github.com/SimHacker/moollm/tree/main/skills/mind-mi...

MOOLLM's response: simulate an adversarial committee within the same call. Multiple personas with opposing propensities -- a paranoid realist, an idealist, an evidence prosecutor -- debating via Robert's Rules. Stories that survive cross-examination are more robust than the statistical center.

https://github.com/SimHacker/moollm/tree/main/skills/adversa...
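The committee idea can be sketched as prompt construction. This is not MOOLLM's actual implementation (see the repo for that); the persona briefs and the `complete`-style single-call framing are illustrative assumptions:

```python
# Sketch of a single-call adversarial committee. The persona briefs are
# invented; the structure (opposing propensities debating under Robert's
# Rules, chair keeps only surviving claims) follows the description above.
PERSONAS = {
    "paranoid realist": "Attack every claim: what could go wrong?",
    "idealist": "Defend the strongest version of the idea.",
    "evidence prosecutor": "Demand sources and concrete numbers.",
}

def committee_prompt(question):
    """Build one prompt that simulates the whole committee in a single call."""
    roles = "\n".join(f"- {name}: {brief}" for name, brief in PERSONAS.items())
    return (
        f"Question: {question}\n\n"
        "Simulate a committee debating under Robert's Rules, with these members:\n"
        f"{roles}\n\n"
        "Each member speaks in turn; then the chair summarizes only the "
        "claims that survived cross-examination."
    )
```

Because all voices share one context window, each persona can attack the others' claims directly, which is the whole point: the survivors are filtered, not averaged.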

I wrote this up with links into the project:

https://github.com/SimHacker/moollm/blob/main/designs/sims-a...

The bigger project is MOOLLM -- treating the LLM as eval() for a microworld OS. K-lines, prototype-based instantiation, many-voiced deliberation. The question I keep wrestling with: mode-collapse as limitation vs feature. The Sims exploited it. MOOLLM routes around it.

Would value your take on the information-theoretic framing -- particularly whether multi-agent simulation actually increases effective entropy or just redistributes it.
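One crude way to operationalize that question: estimate the empirical Shannon entropy of sampled responses under a single-voice prompt versus a committee prompt. A minimal sketch (the comparison protocol is my framing, not anything from the linked repos):

```python
import math
from collections import Counter

def response_entropy(samples):
    """Empirical Shannon entropy (bits) of a list of responses.

    If multi-agent simulation only relabels the same distribution, the
    entropy of committee outputs should match the single-voice baseline;
    if it genuinely increases effective entropy, it should be higher.
    """
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

This treats responses as exact-match symbols, which undercounts diversity for free text; in practice you would cluster semantically-equivalent responses first.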

https://github.com/SimHacker/moollm

The MOOLLM Eval Incarnate Framework: Skills are programs. The LLM is eval(). Empathy is the interface. Code. Graphics. Data. One interpreter. Many languages. The Axis of Eval.

https://github.com/SimHacker/moollm/blob/main/designs/MOOLLM...


Yes, and PKMs in general. Like labeling your emails by topic in Gmail. The problem is that the 'toil' keeps piling up, while the value gained is increasingly hard to see.

I have a little rant about it - "‘Tools for thought’ winds up being a lie: there’s tools, but not much additional thought." https://gwern.net/blog/2024/tools-for-thought-failure https://www.lesswrong.com/posts/CoqFpaorNHsWxRzvz/what-comes...

(My answer, of course, is that almost all of this scutwork is well within the capabilities of a frontier LLM today. We just need to apply them.)


Have you seen any good open source projects using llms to do the scutwork for this kind of PKMs?

No, but I haven't been following the space. (I suspect that with Claude Code-level coding agents, you should be able to do something amazing that thoroughly obsoletes Obsidian/Roam/org-mode, but I don't actually know of anything.)

I've been focused on creative writing, with poetry as my test case, to see what the bottlenecks are to truly amplifying myself through LLMs (as opposed to helping my boss automate away my job or spamming the Internet more efficiently).

I find that frontier LLMs are now there: I can prompt for genuinely good poetry with LLMs. See https://hollisrobbinsanecdotal.substack.com/p/llm-poetry-and... / https://gwern.net/fiction/lab-animals and https://gwern.net/blog/2025/better-llm-writing

So maybe this year I can turn some attention back to PKMs and Quantified Self stuff...


I haven't tried using agents to make a full editor, but Claude Code and Gemini CLI are actually quite good at writing Obsidian plugins, or modifying existing ones. You can start with an existing one that's 90% of what you want (which tends to be the case with note-taking/PKM systems: people are so idiosyncratic that solutions built by others almost work, but not quite) and tweak it to be exactly right for you.

My own Obsidian setup has improved quite a bit in the last couple months because I can just ask Claude to change one or two things about plugins I got from the store.


Writing or tweaking plugins is great, but it's not a paradigm shift (and it risks a lot more toil, because now you have to be your own PM or deal with patches/merges, on top of being a reference librarian and copyeditor etc). If you have a quasi-superintelligence in a box which can run your PKM for you, and you were designing from the ground up with this in mind, knowing that Claude Code is only going to get much better & cheaper, you would not settle for 'write or modify an Obsidian plugin'. You would get something much different. 'Write a plugin' is basically at 'horseless carriage' level for me.

What I have in mind is something far more radical. There's an idea I am calling 'log-only writing', where you stop editing or rearranging your notes at all: you switch to pure note-taking and stream-of-consciousness braindumping, and you simply have the LLM 'compile' your entire history down into whatever specific artifact you need on demand - whether that's a web of flashcards or a blog post or a long essay or whatever. See https://gwern.net/blog/2024/rss + https://gwern.net/nenex , combined with the LLM reasoning and brainstorming 'offline' using the prompts illustrated by my poems.
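The core data-flow of log-only writing fits in a few lines. This is a sketch of the idea only, not anything from the linked pages; the `llm(prompt)` callable is a hypothetical stand-in for a frontier-model API:

```python
from datetime import datetime, timezone

# Append-only log: entries are only ever added, never edited or rearranged.
LOG = []

def note(text):
    """Braindump: timestamp the note and append it. That is the entire 'editing' workflow."""
    LOG.append((datetime.now(timezone.utc).isoformat(), text))

def compile_artifact(llm, kind):
    """Compile the whole history into one artifact on demand
    (a blog post, a web of flashcards, a long essay, ...)."""
    history = "\n".join(f"[{ts}] {text}" for ts, text in LOG)
    return llm(f"From this append-only log, produce: {kind}\n\n{history}")
```

The inversion is the point: all organizing, deduplicating, and structuring happens at read time inside `compile_artifact`, so the human cost of capture drops to near zero.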


That's fair, I guess when I hear "radical overhaul" when discussing PKMs I immediately start worrying about the overload and burnout that doomed my first attempts at Obsidian (see my sibling comment), whereas right now I have a system that works very well for me, especially now that I can just ask Claude to scan the whole directory if I want to ask it questions. But if you do come up with some new blue-sky vision for PKMs, I'd love to at least take a look.

This is the way. If you symlink the .claude directory (so Obsidian can see the files) then you can also super easily add and manage claude skills.

I've spent 20 years living in the terminal, but with claude code I'm more and more drafting markdown specs, organizing context, building custom views / plugins / etc. Obsidian is a great substrate for developing personal software.


The conclusion here seems largely unjustified by the data and indeed is difficult to relate to simple distributions or statistics:

> Increasingly, public institutions seem to exist to manage the obsessions of a tiny number of neurotic—and possibly malicious—complainers.

Why would anyone complain about airport noise when it is ~100% guaranteed to do them no good, and almost all the benefits go to everyone else even if it somehow did anything? Just thinking like an economist here... (Indeed, if a large fraction of locals did complain about something like airport noise, that would itself be highly suspicious to me - as it would indicate an organized campaign or an issue which has become politicized in some way and is now a pretext for something else entirely like a culture war.)

And if there is something I've learned about design and problems, it's that you can have a huge problem, and you are lucky if even 1% will ever tell you.

Your website could be down, and if even 1 person takes the risk of going out of their way to tell you, you should thank your lucky stars that you have such proactive, public-spirited readers!

See also: "Theory of the Nudnik: The Future of Consumer Activism and What We Can Do to Stop It" https://gwern.net/doc/economics/2020-arbel.pdf , Arbel & Shapira 2020; https://pointersgonewild.com/2019/11/02/they-might-never-tel... (commentary: https://gwern.net/ref/chevalier-boisvert-2019); https://en.wikipedia.org/wiki/1%25_rule


> I thought people recognized that they don't appear out of nowhere.

I don't think that paper is widely accepted. Have you seen the authors of that paper, or anyone else, use it to successfully predict (rather than postdict) anything?


I haven't paid attention, but the paper seems to be arguing against the existence of emergent behavior as a phenomenon, and is not related to predicting what is possible with greater scale.

> is not related to predicting what is possible with greater scale.

If they can't predict new emergence, then 'explaining' old emergence by post hoc prediction with bizarre newly-invented metrics would seem to be irrelevant and just epicycles. You can always bend a line as you wish in curve-fitting by adding some parameters.
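The curve-fitting point is elementary but worth making concrete: any n data points can be fit exactly by a degree-(n-1) polynomial, so post-hoc fit quality alone says nothing about predictive power. A self-contained illustration (the data points are arbitrary, chosen only to show that even jagged nonsense interpolates perfectly):

```python
def lagrange_fit(points):
    """Return a function that interpolates the given (x, y) points exactly,
    via Lagrange interpolation: enough free parameters fit anything."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

pts = [(0, 1), (1, -3), (2, 7), (3, 0)]  # arbitrary jagged data
f = lagrange_fit(pts)  # degree-3 polynomial hits all four points exactly
```

Zero residual on the training points, and no claim whatsoever about the next point: that is the epicycles failure mode in miniature.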


> So he's spent $51k on tokens in 3 months to build what exactly? Tools to enable you to spend more money on tokens?

Sounds like that's less than a junior or an intern, who would have cost twice as much in fully loaded cost.


He also put his own time into it.

It's also more than hiring someone overseas esp just for a few months. Honestly it's more than most interns are paid for 3 months outside FAANG (considering housing is paid for there etc)


1. You put a lot of time into an intern or a junior too. 2. I didn't say 'paid', I said, 'fully loaded [total] cost'. The total cost of them goes far beyond their mere salary - the search process like all of the interviews for all candidates, onboarding, HR, taxes etc.


1. Idk, I didn't have to. I managed an intern a few months back and he just did everything we had planned out and written down then started making his own additions on top.

2. Yeah I mentioned that also.

3. It's still more expensive than hiring a contractor esp abroad, even all in.


Or the cherries could be a delicious pastry or PBJ-like treat: _Collect Horse Buttery Stable_...


Could you explain why you think that? I'm looking at the lottery ticket section and it seems like he doesn't disown it; the reason he gives, via Abhinav, for not pursuing it at his commercial job is just that that kind of sparsity is not hardware friendly (except with Cerebras). "It doesn't provide a speedup for normal commercial workloads on normal commercial GPUs and that's why I'm not following it up at my commercial job and don't want to talk about it" seems pretty far from "disowning the lottery ticket hypothesis [as wrong or false]".


I think that was pretty clear even when this paper came out: even if you could find these subnetworks, they wouldn't be faster on real hardware. I never thought much of this paper, but it sure did get a lot of people excited.


It was exciting because of what it implies about how a model learns, regardless of whether or not it's commercially applicable.


(Cerebras is real hardware.)


It is real in that it exists. It is not real in the sense that almost nobody has access to them. Unless you work at one of the handful of organizations with their hardware, it’s not a practical reality.


how long will that be the case?


They have a strange business model. Their chips are massive. So they necessarily only sell them to large customers. Also because of the way they’re built (entire wafer is a single chip) no two chips will be the same. Normally imperfections in the manufacturing result in some parts of the wafer being rejected and other binned as fast or slow chips. If you use the whole wafer you get what you get. So it’s necessarily a strange platform to work with - every device is slightly different.


At least for the foreseeable future (next 50 years say).


i saw how it nerdsniped an extremely capable faculty member


he pretty much always says it offline haha but i may have mixed it up with the subsequent convo we had at neurips https://www.latent.space/p/neurips-2023-startups


> which while describes a curve that shows the same behavior, it's more "accurate" of the actual post creation.

I would say that this graph looks a lot more extreme, actually!

