scellus's comments

Perfect economic substitution in coding won't happen for a long time. Meanwhile, AI acts as an amplifier for the human and vice versa. That the work will change is scary, but the change also opens up possibilities, many of them hard to imagine now.


My work is better than it has been for decades. Now I can finally think and experiment instead of wasting my time on nitty-gritty coding details that are impossible to abstract away. Last autumn was the game changer, basically Codex and later Opus 4.5; the latter is good with any decent scaffolding.


I have to admit, LLMs do save a lot of typing and the associated syntax errors. If you know what you want and can spot and fix mistakes made by the LLM, they can be pretty useful. I don't think it's wise to use them for development if you are not knowledgeable enough in the domain and language to recognize errors or dead ends in the generated code, though.


Thermal energy storage is one gotcha: the heat will eventually leak away, even if the CO2 stays in the container indefinitely, and then you have no energy to extract.

The 75% round-trip efficiency (for shorter time periods) quoted in other threads here is surprisingly high though.
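
As a rough sketch of why storage duration matters (the 2%-per-day leak rate below is purely hypothetical; only the 75% figure is from the thread):

    # Toy model: effective round-trip efficiency vs. storage time,
    # assuming an exponential thermal leak. The 2%/day rate is made up
    # for illustration, not a real spec of the system discussed.
    import math

    BASE_EFFICIENCY = 0.75   # short-duration round-trip efficiency (from the thread)
    LEAK_PER_DAY = 0.02      # hypothetical fractional heat loss per day

    def effective_efficiency(days_stored: float) -> float:
        return BASE_EFFICIENCY * math.exp(-LEAK_PER_DAY * days_stored)

    for d in (0, 7, 30, 90):
        print(f"{d:3d} days: {effective_efficiency(d):.0%}")
    # 0 days: 75%, 7 days: 65%, 30 days: 41%, 90 days: 12%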


It's complicated. Opus 4.5 is actually not that good at the 80% completion threshold, but it is above the others at the 50% threshold. I read there's a single task of around 16 hours that the model completed, and the broad CI comes from that.

METR currently simply runs out of tasks in the 10-20 hour range, so you have a small N and lots of uncertainty there. (They fit a logistic curve to the discrete 0/1 results to get the horizon thresholds you see in the graph.) They need new tasks; then we'll know better.
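
A minimal sketch of that kind of fit, on made-up data (scikit-learn logistic regression of 0/1 success against log task length, then inverted to read off the 50% and 80% horizons):

    # METR-style horizon estimation on hypothetical data: fit a logistic
    # model of task success (0/1) against log2(human task length), then
    # invert it to find the length at which P(success) = 0.5 or 0.8.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    task_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480, 960])
    success      = np.array([1, 1,  1,  1,  1,   0,   1,   0,   0])

    X = np.log2(task_minutes).reshape(-1, 1)
    model = LogisticRegression().fit(X, success)

    def horizon(p: float) -> float:
        # Solve intercept + coef * log2(t) = logit(p) for t.
        logit = np.log(p / (1 - p))
        return 2 ** ((logit - model.intercept_[0]) / model.coef_[0][0])

    print(f"50% horizon: {horizon(0.5):.0f} min, 80% horizon: {horizon(0.8):.0f} min")

With so few long tasks, flipping a single 0/1 outcome shifts the fitted horizons a lot, which is where the wide error bars come from.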


Thanks for this comment. I've been trying to find anything about the huge error bars. Do you have any sources you can share for further reading?


The time (horizon) here is not the time the model takes to complete the task, but the time a human would take to complete it.


I find it odd that the post above is downvoted to grey; it feels like some sort of latent war of viewpoints going on, as under some other AI posts. (Although these misvotes usually get fixed when the US wakes up.)

The point above is valid. I'd like to deconstruct the concept of intelligence even further: what humans are able to do is a relatively artificial collection of skills that a physical and social organism needs. The intelligence around math and the like, valued so highly, is a corner case of those abilities.

There's no reason to think that human mathematical intelligence is unique in its structure, some isolated, well-defined skill. Artificial systems are likely to be able to do much more: maybe not exactly the same peak ability, but adjacent ones, many of which will be superhuman and augmentative to what humans do. This will likely include "new math" in some sense too.


What everybody is looking for is imagination and invention. Current AI systems can give a best-guess statistical answer from the dataset they've been fed. It is always compression.

The problem, and what most people intuitively understand, is that this compression is not enough. There is something more going on, because people can come up with novel ideas and solutions and, what's more important, they can judge and figure out whether a solution will work. So even if the core of an idea is "compressed" or "mixed" from past knowledge, there is some other process going on that leads to the important part of invention and progress.

That is why people hate the term AI: it is just a partial capability of "intelligence", or it might even be a complete illusion of intelligence that is nowhere close to what people would expect.


> Current AI systems can give a best-guess statistical answer from the dataset they've been fed.

What about reinforcement learning? RL models don't train on an existing dataset; they try their own solutions and learn from feedback.

RL models can definitely "invent" new things. Here's an example where they design novel molecules that bind with a protein: https://academic.oup.com/bioinformatics/article/39/4/btad157...
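
To make "learn from feedback" concrete, here's a toy sketch (a made-up three-armed bandit, not the paper's molecule-design setup): the agent never sees a training dataset, only the rewards its own choices produce.

    # Minimal RL loop: epsilon-greedy agent on a 3-armed bandit.
    # There is no dataset anywhere; the agent learns purely from the
    # rewards of its own actions. (Toy illustration, not the paper's method.)
    import random

    true_payouts = [0.2, 0.5, 0.8]   # hidden reward probabilities
    estimates = [0.0, 0.0, 0.0]      # agent's running reward estimates
    counts = [0, 0, 0]
    epsilon = 0.1                    # exploration rate

    for step in range(10_000):
        if random.random() < epsilon:
            arm = random.randrange(3)                        # explore
        else:
            arm = max(range(3), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if random.random() < true_payouts[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean

    print(estimates)  # drifts toward the hidden payouts [0.2, 0.5, 0.8]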


Finding variations in a constrained haystack with measurable, well-defined results is what machine learning has always been good at. Tracing the most efficient Trackmania route is impressive, and the resulting route might be original in the sense that a human would never come up with it. But is it actually novel in a creative, critical way? Isn't it simply computational brute force? And how big would that force have to be in the physical, less constrained world?


> Current AI systems can give a best-guess statistical answer from the dataset they've been fed.

Counterpoint: ChatGPT came up with the new idiom "The confetti has left the cannon"


I like Opus 4.5 a lot, but a general comment on benchmarks: the number of subtasks or problems in each one is finite, and many of the benchmarks are saturating, so the effective number of problems at the frontier is even smaller. If you think of the generalizable capability of a model as a latent feature to be measured by benchmarks, we therefore have only rather noisy estimates of it. People read too much into small differences in the numbers. It's best to aggregate across many benchmarks: Epoch has its Capabilities Index, Artificial Analysis does something similar, and there are probably others I don't know or remember.
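
Back-of-the-envelope, under the crude assumption that benchmark items are independent coin flips, the sampling noise alone gets large once only a few dozen unsaturated problems remain:

    # Rough binomial standard error of a benchmark score as a function of
    # the effective number of unsaturated problems. Numbers are illustrative.
    import math

    def score_stderr(accuracy: float, n_problems: int) -> float:
        return math.sqrt(accuracy * (1 - accuracy) / n_problems)

    for n in (1000, 200, 50):
        ci = 1.96 * score_stderr(0.6, n)
        print(f"N={n:4d}: score 60% +/- {ci:.1%} (95% CI)")
    # N=1000: +/- 3.0%;  N=200: +/- 6.8%;  N=50: +/- 13.6%

A one- or two-point gap between frontier models sits well inside that noise band at the smaller effective N.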

And then there's the part of models that is hard to measure. Opus has some sort of HAL-like smoothness I don't see in other models, but then again, I haven't tried gpt-5.2 for coding yet. (Nor Gemini 3 Pro; I'm not claiming the superiority of Opus, just that something in practical usability is hard to measure.)


Opus 4.5 seems to be able to plan without being asked, but I have used this pattern of "write a plan to an .md", then review and maybe edit it, then execute, possibly in another thread. It works well with Codex too.

Proliferating .md files need some attention, though.


In other words: permanent instructions and context well presented in *.md files, planning and review before execution, agentic loops with feedback, and a good model.

You can do this with any agentic harness, with just plain prompting and "LLM management skills". I don't have Claude Code at work, but all of this applies to Codex and GH Copilot agents as well.

And agreed, Opus 4.5 is next level.


No. In general the statistics look a bit amateurish, which is normal for a scientific paper. I'd actually like to reanalyze the data, just out of curiosity. (Those p-values and other things can still be in the right ballpark even if the models and analyses are not top notch. I'm not exactly doubting them, and the results are interesting even without any correlation to UAP sightings or nukes.)

