
> You could make up some obscure DSL whole cloth, that has therefore never been in the training data, feed it the docs and it will happily use it to create output in said DSL.

I have two stories, which I will attempt to tie together coherently in response.

I'm making a compiler right now. ChatGPT 4 was helpful in the early phases. Even back then, its ability to read and modify the grammar and write parser boilerplate was a real surprise. Today 5.2-Codex is iterating on the implementation and specification as I extend the language and fill in gaps in the compiler.

Don't get me wrong, it isn't a "10x" productivity gain. Not even close. And the model makes decisions that I would not. I spent the last few days completely rewriting the module system that it spit out in an hour. Yeah, it worked, but it's not what I wanted. The downsides are circumstantial.

25 years ago, I was involved in a group whose shared hobby was "ROM hacking". In other words, unofficial modification of classic NES and SNES games. There was a running joke in our group that went something like this: Someone would join IRC and ask for an outlandish feature in some level editor that seemed hopelessly impossible at the time. Like generating a new level with new graphics.

We would extrapolate the request to adding a button labeled "Do My Hack For Me". Good times! Now this feature request seems within reach. It may forever be a pipe dream, who knows. But it went from "unequivocally impossible" to "ya know, with the right foundation and guidance, that might just be crazy enough to work!" Almost all of that shift happened within the last 10 years.

I think the novelty-or-creativity criticism of AI misses the point. Using these tools in novel or creative ways is where I would put my money in the coming decade. It is mind-boggling that today's models can even appear to make sense of my completely made-up language and compiler. But the job postings for adding those "Do My Hack For Me" buttons are the ones to watch for.



Quick napkin math time!

Steam reached a new peak of 42 million concurrent players today [1]. An average/mid-tier gaming PC uses 0.2 kWh per hour [2]. 42 million * 0.2 gives 8,400,000 kWh per hour, or 8,400 MWh per hour.

By contrast, training GPT-3 was estimated to have used 1,300 MWh of energy [3].

This does not account for the training costs of newer models, nor for inference costs. But we know inference is extraordinarily inexpensive and energy efficient [2]. Even at the lowest estimate, 1 hour of Steam's peak concurrent player count uses 6.5x the total energy that went into training GPT-3.
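
For anyone who wants to check the arithmetic, here's the napkin math as a few lines of Python (figures taken from the sources below; the 0.2 kWh per-PC figure is the assumption doing the heavy lifting):

    players = 42_000_000        # Steam peak concurrent players [1]
    kwh_per_player_hour = 0.2   # mid-tier gaming PC, ~200 W average draw [2]
    gpt3_training_mwh = 1_300   # estimated total GPT-3 training energy [3]

    steam_mwh_per_hour = players * kwh_per_player_hour / 1_000  # kWh -> MWh
    print(steam_mwh_per_hour)                      # 8400.0
    print(steam_mwh_per_hour / gpt3_training_mwh)  # ~6.46, i.e. roughly 6.5x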

[1]: https://www.gamespot.com/articles/steam-has-already-set-a-ne...

[2]: https://jamescunliffe.co.uk/is-gen-ai-bad-for-the-environmen...

[3]: https://www.theverge.com/24066646/ai-electricity-energy-watt...


it's very weird to compare LLM training with a subset of gamers.

Who lied to you and told you this was some kind of saving gotcha??


Come again?

I was skeptical of the LLM energy use claim. I went looking for numbers on energy usage in a domain that most people neither worry about nor perceive as a net negative. Gaming is a very big industry ($197 billion in 2025 [1], compared to $252 billion in private AI investment for 2025 [2]) and mostly runs on the same hardware as LLMs. So it's a good gut check.

I have not seen evidence that LLM energy usage is out of control. It appears to be much less than gaming. But please feel free to provide sources that demonstrate this lie.

The question is whether claims about AI energy use have substance, or whether other industries should be more concerning. Either people are truly concerned about the cost of energy, or it's a misplaced excuse to reinforce a negative opinion they already hold.

[1]: https://gameworldobserver.com/2025/12/23/the-gaming-industry...

[2]: https://hai.stanford.edu/ai-index/2025-ai-index-report/econo...


I'd rather people play games, even extremely mediocre ones, than generate ai slop images or code.

I don't have a preference. Both are valuable in their own way.

Honestly? LLMs are currently above average at programming.

We've all been through The Daily WTF at least once. That's representative of the average. (Although some examples are more egregious than others.)


> The Daily WTF at least once. That's representative of the average

I'd say The Daily WTF features the more spectacularly weird and wrong cases, not representative ones; I've seen a few things deserving to be on their site in real life*, but the average I've seen has been much better than that.

It's difficult to be sure, but I think these models are roughly like someone with 1-3 years of experience, though the worst codebase(s) I've seen have been from person(s) who somehow lasted a decade or two in the industry.

* 1000 lines inside an always-true if-block, a pantheon of god classes that poorly re-invent the concept of named properties, copy-pasting class files rather than subclassing even after I'd added comments about needing to de-duplicate things in them, and that was all the same project.


True, but nuanced. The model does not react to "you are an experienced programmer" kinds of prompts. It reacts to being given relevant information that needs to be reflected in the output.

See the examples in https://arxiv.org/abs/2305.14688. They certainly do say things like "You are a physicist specialized in atomic structure ...", but the important point is that the rest of the "expert persona" prompt _calls attention to key details_, and that is what improves the response. The hint about electromagnetic forces in the expert persona prompt is what tipped the model off to mention them in the output.

Bringing attention to key details is what makes this work. A great tip for anyone who wants to micromanage code with an LLM is to include precise details about what they wish to micromanage: say "store it in a hash map keyed by unsigned integers" instead of letting the model decide which data structure to use.
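
As a minimal sketch of what that one precise instruction pins down (Python, hypothetical names; Python has no unsigned integer type, so plain int keys stand in):

    # "store it in a hash map keyed by unsigned integers" fixes both the
    # container and the key type up front, instead of letting the model
    # improvise a list it scans linearly, or string keys.
    results: dict[int, str] = {}
    results[42] = "parsed module"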


You should try a similar task with a recent model (Opus 4.5, GPT-5.2, etc.) and see if there are any improvements since your last attempt. I also encourage using a coding agent: it can use test files that you provide and run compile-test loops to work out any mistakes.

It sounds like you used an older model, and perhaps copy-pasted code from a chat session. (Just guessing, based on what you described.)


You have the upper hand of familiarity with the code base. Any "domain expert" also necessarily has a head start in knowing which parts of a bespoke, complex system need adjustment when making changes.

On the other hand, a highly skilled worker who just joined the team won't have any of that tribal knowledge. No matter how intelligent they are, there is a significant lag getting ramped up, due to sheer scale (and complexity doesn't help).

A general purpose model is more like the latter than the former. It would be interesting to compare how a model fine-tuned on the specific shape of your code base and problem domain performs.


Can't, or you're afraid to?

If you're not afraid of pushing back against an entire industry, you don't have a full appreciation of the risks.

Aside: I love your website! Cool games :)


Thanks!

FWIW, I left my full time job some years ago to do my own thing, in part because pushing back on bad decisions was not really doing me any favors for my mental health. Glad to report I'm in a much better place after finding the courage to get out of that abusive relationship.

Some might argue the risk of not pushing back is far worse.


I was a contractor/consultant from 2020 to 2023; I have a problem w/ authority, so it suited me. But the work/life balance was awful: I have 2 kids now, and I can't do nothing for 6 weeks then work 100-hour weeks for 4 weeks. The maximum instability my life will tolerate is putting the kids to bed at 9 instead of 8:30, lol. I'm also in the Netherlands, so there are other benefits. Worker protections are very strong here, so it's highly unlikely I'll be fired or laid off; I can't be asked to work overtime; I can't be Slack'd after hours; I can drop down to 4 days a week no questions asked; when the kids were born I got a ton of paid leave, etc. Not to imply I work at some awful salt mine; I like my current gig and coworkers/leadership.

Anyway, this is a collective action problem. I don't take any responsibility for the huge plastic island in the Pacific, nor do I take any responsibility for the grift economy built on successive, increasingly absurd hype waves of tech (web 2.0, mobile, SPAs, big data, blockchain, VR, AI). I've also worked in social good, from Democratic presidential campaigns and recounts to helping connect people w/ pro bono legal services, which is to say I've done my time. There are too many problems for me to address; I get to pick which, if any, I battle; I am happy if my kids don't melt down too much during the evening. Maybe when they're both in school I can take more risks or reformulate my work/life balance, but currently I'm focused on furthering the human race.


Same here. Using Codex with GPT-5.2 and it has not once tried to make any git commits. I've only used it about 100 times over the last few months, though.

That claim relies on an absurd "in the kernel" qualifier, making it difficult to agree with. Furthermore, your hypothesis is that "we all" agree with claims that rely on absurd conditions as a matter of course.
