I don't think everything is for certain though. I think it's 50/50 on whether Anthropic/whoever figures out how to turn them into more than a boilerplate generator.
The imprecision of LLMs is real, and a serious problem. And I think a lot of the engineering improvements (little s-curve gains or whatever) have caused more and more of these. Every step or improvement has some randomness/lossiness attached to it.
Context too small?
- No worries, we'll compact (information loss)
- No problem, we'll fire off a bunch of agents each with their own little context window and small task to combat this. (You're trusting the coordinator to do this perfectly, and cutting the sub-agent off from the whole picture)
All of this is causing bugs/issues?
- No worries, we'll have a review agent scan over the changes (They have the same issues though, not the full context, etc.)
Right now I think it's a fair opinion to say LLMs are poison and I don't want them to touch my codebase, because they produce more output than I can handle, and the mistakes they make are so subtle that I can't reliably catch them.
It's also fair to say that you don't care, and your work allows enough bugs/imprecision that you accept the risks. I do think there's a bit of an experience divide here, where people more experienced have been down the path of a codebase degrading until it's just too much to salvage – so I think that's part of why you see so much pushback. Others have worked in different environments, or projects of smaller scales where they haven't been bit by that before. But it's very easy to get to that place with SOTA LLMs today.
There's also the whole cost component to this. I think I disagree with the author about the value provided today. If costs were 5x what they are now, it would be hard for me to decide whether they are worth it. For prototypes, yes. But for serious work, where I need things to work right and be reasonably bug-free, I don't know if the value works out.
I think everyone is right that we don't have the right architecture, and we're trying to fix layers of slop/imprecision by slapping on more layers of slop. Some of these issues/limitations seem fundamental and I don't know if little gains are going to change things much, but I'm really not sure and don't think I trust anyone working on the problem enough to tell me what the answer is. I guess we'll see in the next 6-12 months.
> I do think there's a bit of an experience divide here, where people more experienced have been down the path of a codebase degrading until it's just too much to salvage – so I think that's part of why you see so much pushback.
When I look back over my career to date there are so many examples of nightmare degraded codebases that I would love to have hit with a bunch of coding agents.
I remember the pain of upgrading a poorly-tested codebase from Python 2 to Python 3 - months of work that only happened because one brave engineer pulled a skunkworks project on it.
One of my favorite things about working with coding agents is that my tolerance for poorly tested, badly structured code has gone way down. I used to have to take on technical debt because I couldn't schedule the time to pay it down. Now I can use agents to eliminate that almost as soon as I spot it.
I've used Claude Code to do the same (large refactor). It has worked fairly well but it tends to introduce really subtle changes in behaviour (almost always negative) which are very difficult to identify. Even worse if you use it to fix those issues it can get stuck in a loop of constantly reintroducing issues which are slightly different leading to fixing things over and over again.
Overall I like using it still but I can also see my mental model of the codebase has significantly degraded which means I am no longer as effective in stopping it from doing silly things. That in itself is a serious problem I think.
Yes, if you don't stay on top of things and rule with an iron fist, you will take on tons of hidden tech debt using even Opus 4.5. But if you manage to review carefully and intercede often, it absolutely is an insane multiplier, especially in unfamiliar domains.
My experience with them is limited, but I’m having issues with the LLMs ignoring the skills content. I guess it makes sense, it’s like any other piece of context.
But it’s put a damper on my dream of constraining them with well-crafted skills and producing high-quality output.
Yeah, I'm still trying to figure out why some skills are used every day, while others are constantly ignored. I suspect partially overlapping skill areas might confuse it.
I've added a UserPromptSubmit hook that does basic regex matches on requests, and tries to interject a tool suggestion, but it's still not foolproof.
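In case it's useful, here's a minimal sketch of that kind of hook (Node/TypeScript). The regex rules and skill names are made up for illustration, and it assumes the hook contract I believe Claude Code uses for UserPromptSubmit: the hook receives JSON (including the prompt) on stdin, and anything printed to stdout gets injected as extra context.

```typescript
// Hypothetical UserPromptSubmit hook: regex-match the user's prompt and
// interject a skill/tool suggestion. Assumes stdin carries JSON with a
// "prompt" field and that stdout is appended to the model's context (my
// understanding of the Claude Code hook contract; verify against current docs).
import { readFileSync } from "node:fs";

const rules: Array<{ pattern: RegExp; suggestion: string }> = [
  // These skill names are invented examples, not real skills.
  { pattern: /\b(migration|schema change)\b/i, suggestion: "Consider using the db-migrations skill." },
  { pattern: /\b(benchmark|profil(e|ing))\b/i, suggestion: "Consider using the perf-review skill." },
];

const input = JSON.parse(readFileSync(0, "utf8")); // fd 0 = stdin
const prompt: string = input.prompt ?? "";

for (const { pattern, suggestion } of rules) {
  if (pattern.test(prompt)) console.log(suggestion);
}
```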
Yeah, the context window is a blunt instrument; everything competes for attention. I have better luck with shorter, more opinionated skills that front-load the key constraints vs. comprehensive docs that get diluted. Also, explicitly invoking them ("use the X skill") seems to help vs. hoping they get picked up automatically.
Yes, unfortunately the most reliable way is to inject them into the user prompt at a fresh session. My guess is that biasing the model towards checking for tool availability too often hurts performance, which might explain why I quite rarely see it choose to use a skill without prior prompting.
Yes, part of what is supposed to justify the premium prices is that they can have a different business model.
But it seems Tim Cook can’t leave anything on the table. I’m really going to be irritated if we end up with a premium Siri. It’s going to undermine the privacy aspect, the hardware innovation, and everything else they have going for themselves despite missing the boat on AI.
I think IE, ActiveX and the like were reused in tons of VB5/6 applications... at least in zillions of Spanish shareware programs, such as amateur games, crossword puzzles, home agendas, book databases and the like.
They worked smoothly enough, but in a crazy insecure way. Today it's the reverse: Chromium/Blink can do sandboxing, but they bundle everything. Video and audio codecs, HTML renderers, a JS engine, a CSS engine, TTF rendering engines, 2D drawing engines, their own window and process managers... half an OS.
If these things can ever actually think and understand a codebase, this mindset makes sense, but as of now it's a short-sighted way to work. The quality of the output is usually not great, and in some cases terrible. If you're just blindly accepting code with no review, eventually things are going to implode, and the AI is more limited than you are in understanding why. It's not going to save you in its current form.
> increasing the level of abstraction developers can work at
Something is lost at each step of the abstraction ladder we climb. And the latest rung uses natural language, which introduces a lot of imprecision/slop in a way that prior abstractions did not. On top of that, the new technology providing this abstraction is non-deterministic.
There's also the quality issue of the output you do get.
I don't think the analogy of the assembly -> C transition people like to use holds water – there are some similarities but LLMs have a lot of downsides.
LSP also kind of sucks. But the problem is all the big companies want big valuations, so they only chase generic solutions. That's why everything is a VS Code clone, etc..
Hearing people on tech twitter say that LLMs always produce better code than they do by hand was pretty enlightening for me.
LLMs can produce better code for languages and domains I’m not proficient in, at a much faster rate, but damn it’s rare I look at LLM output and don’t spot something I’d do measurably better.
These things are average text generation machines. Yes you can improve the output quality by writing a good prompt that activates the right weights, getting you higher quality output. But if you’re seeing output that is consistently better than what you produce by hand, you’re probably just below average at programming. And yes, it matters sometimes. Look at the number of software bugs we’re all subjected to.
And let’s not forget that code is a liability. Utilizing code that was “cheap” to generate has a cost, which I’m sure will be the subject of much conversation in the near future.
> These things are average text generation machines.
Funny... seems like about half of devs think AI writes good code, and half think it doesn't. When you consider that it is designed to replicate average output, that makes a lot of sense.
So, as insulting as OP's idea is, it would make sense that below-average devs are getting gains by using AI, and above-average devs aren't. In theory, this situation should raise the average output quality, but only if the training corpus isn't poisoned with AI output.
I have an anecdote that doesn't mean much on its own, but supports OP's thesis: there are two former coworkers in my linkedin feed who are heavy AI evangelists, and have drifted over the years from software engineering into senior business development roles at AI startups. Both of them are unquestionably in the top 5 worst coders I have ever worked with in 15 years, one of them having been fired for code quality and testing practices. Their coding ability, transition to less technical roles, and extremely vocal support for the power of vibe coding definitely would align with OP's uncharitable character evaluation.
After a certain experience level though, I think most of us get to the point of knowing when that difference in quality actually matters.
Some seniors love to bikeshed PRs all day because they can do it better but generally that activity has zero actual value. Sometimes it matters, often it doesn't.
Stop with the "I could do this better by hand" and ask "is it worth the extra 4 hours to do this by hand, or is this actually good enough to meet the goals?"
LLM generated code is technical debt. If you are still working on the codebase the next day it will bite you. It might be as simple as an inconvenient interface, a bunch of duplicated functions that could just be imported, but eventually you are going to have to pay it.
All code is technical debt though. We can't spend infinite hours finding the absolute minima of technical debt introduced for a change, so it is just finding the right balance. That balance is highly dependent on a huge amount of factors: how core is the system, what is the system used for, what stage of development is the system, etc.
I spend about half my day working on LLM-generated code and half my day working on non-LLM-generated code, some written by senior devs, some written by juniors.
The LLM-generated code is by far the worst technical debt. And a fair bit of that time is spent debugging subtle issues where it doesn't quite do what was prompted.
Untested undocumented LLM code is technical debt, but if you do specs and tests it's actually the opposite, you can go beyond technical debt and regenerate your code as you like. You just need testing to be so good it guarantees the behavior you care about, and that is easier in our age of AI coding agents.
> but if you do specs and tests it's actually the opposite, you can go beyond technical debt and regenerate your code as you like.
Having to write all the specs and tests just right so you can regenerate the code until you get the desired output just sounds like an expensive version of the infinite monkey theorem, but with LLMs instead of monkeys.
I use LLMs to generate tests as well, but sometimes the tests are also buggy. As any competent dev knows, writing high-quality tests generally takes more time than writing the original code.
A human SWE can use an LLM to refactor and reduce some of the debt just as easily, too. But I think fundamentally the possible rate of new code, and new technical debt, introduced by LLMs is much higher than that of a human SWE. Even left unchecked, a human still needs sleep, and more humans can't be added with more compute.
There's an interesting aspect to the LLM debt being taken on though in that I'm sure some are taking it on now in the bet/hopes that further advancements in LLMs will make it more easily addressable in the future before it is a real problem.
So I can tell you don’t use these tools, or at least not much, because at the speed of development with them you’ll be knee-deep in tech debt in a day, not a month. But as a corollary, you can have the same agentic coding tools undergo the equivalent of weeks of addressing tech debt the next day. Well, I think this applies to greenfield, AI-first oriented projects that work this way from the get-go and with few humans in the loop (human-to-human communication definitely becomes the rate-limiting step). But I imagine that’s not the nature of your work.
Yes if I went hard on something greenfield I'm sure I'll be knee deep in tech debt in less than a day.
That being said, given the quality of code these things produce, I just don't see that ever stopping being the case. These things require a lot of supervision and at some point you are spending more time asking for revisions than just writing it yourself.
There's a world of difference between an MVP, which, in the right domain, you can get done much faster now, and a finished product.
I think you missed your parent post's phrase "in the specific areas _I_ work in"... LLMs are a lot better at CRUD and boilerplate than at novel hardware interfaces and a bunch of other domains.
But why would it take a month to generate significant tech debt in novel domains? Wouldn't it accrue even faster there? The main idea I wanted to get across is that iteration speed is much faster, so what's "tech debt" in the first pass can be addressed much faster in future passes, which will happen on the order of days rather than sprints in the older paradigm. Yes, the first iterations will have a bunch of issues, but if you keep your hands on the controller you can get things to a decent state quickly. I think one of the biggest gaps I see in devs using these tools is what they do after the first pass.
Also, even for novel domains, using tools like deep research, and the ability of these tools to straight up search through the internet (including public repos) during the planning phase, is a huge level up. (You should be planning first before implementing, right? You're not just opening a window and asking in a few sentences for a vaguely defined final product, I hope.)
If there are repos, papers, articles, etc. of your novel domain out there, there's a path to a successful research -> plan -> implement -> iterate loop imo, especially when you get better at giving the tools ways to evaluate their own results, rather than going back and forth yourself for hours telling them "no, this part is wrong, no, now this part is wrong", etc.
I mean, there's also, "this looks fine but if I actually had written this code I would've naturally spent more time on it which would have led me to anticipate the future of this code just a little bit more and I will only feel that awkwardness when I come back to this code in two weeks, and then we'll do it all over again". It's a spectrum.
And greenfield code is some of the most enjoyable to write, yet apparently we should let robots do the thing we enjoy the most, and reserve the most miserable tasks for humans, since the robots appear to be unable to do this.
I have yet to see an LLM or coding agent that can be prompted with "Please fix subtle bugs" or "Please retire this technical debt as described in issue #6712."
If you're willing to purchase enough tokens, you can prompt an agent to loop and fuzz its way to "retire* this technical debt as described in issue #6712". But then you still need to review it and make sure it's not just doing a "debt-swap", like some kind of metaverse financial swindler. So you're spending money to avoid fixing tech debt, but adding in the need to review "someone else's code". And to take ownership of that code!
*(Of course, depending on the issue, it could be doing anything from suppressing logs so existing tests pass, to making useless-but-passing tests, to brute-forcing special cases, to possibly actually fixing something.)
I think this is an opportunity for that bell curve/enlightenment meme. Of course as you get a little past junior, you often get hung up on the best way of doing things without worrying about best being the enemy of good enough. But truly senior devs know the difference. And those are the ones that by and large still think LLMs are bad at generating code where quality (reliability, sustainability, security, etc) matters. Everyone admits that LLMs are good for low stakes code.
now sometimes that's 4 hours, but I've had plenty of times where I'm "racing" people using LLMs and I basically get the coding done before them. Once I debugged an issue before the robot was done `ls`-ing the codebase!
The shape of the problem is super important in considering the results here
You have the upper hand with familiarity of the code base. Any "domain expert" also necessarily has a head start knowing which parts of a bespoke complex system need adjustment when making changes.
On the other hand, a highly skilled worker who just joined the team won't have any of that tribal knowledge. There is a significant lag time getting ramped up, no matter how intelligent they are due to sheer scale (and complexity doesn't help).
A general purpose model is more like the latter than the former. It would be interesting to compare how a model fine tuned on the specific shape of your code base and problem domain performs.
Yeah totally, for unknown codebases it can help kick you off in the right direction (though it can send you down a totally wrong path as well... projects with good docs tend to be ones where I've found LLMs be worse at their job on this ironically).
But well.. when working with coworkers on known projects it's a different story, right?
My stance is these tools are, of course, useful, but humans can most definitely be faster than the current iteration of these tools in a good number of tasks, and some form of debugging tasks are like that for me. The ones I've tried have been too prone to meandering and trying too many "top results on Google"-style fixes.
But hey maybe I'm just holding it wrong! Just seems like some of my coworkers are too
I worked for a relatively large company (around 400 of the employees are programmers). The people who embraced LLM-generated code clearly share one trait: they are feature pushers who love to say "yes!" to management. You see, management is always right, and these programmers are always so eager to put their requirements, however incomplete, into a Copilot session and open a pull request as fast as possible.
The worst case I remember happened a few months ago when a staff (!) engineer gave a presentation about benchmarks they had done comparing Java and Kotlin concurrency tools and how to write concurrent code. There was a very large and strange difference in performance favoring Kotlin that didn't make sense. When I dug into their code, it was clear everything had been generated by an LLM (lots of comments with emojis, for example) and the Java code was just wrong.
The competent programmers I've seen there use LLMs to generate some shell scripts, small python automations or to explore ideas. Most of the time they are unimpressed by these tools.
I've been playing with vibe coding a lot lately and I think in most cases, the current SOTA LLMs don't produce code that I'd be satisfied with. I kind of feel like LLMs are really, really good at hacking on a messy and fragile structure, because they can "keep track of many things in their head"
BUT
An LLM can write a PNG decoder that works in whatever language I choose in one or a few shots. I can do that too, but it will take me longer than a minute!
(and I might learn something about the png format that might be useful later..)
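To make that concrete, here's a minimal sketch of the easy part of a PNG decoder, just validating the signature and reading the IHDR chunk; a full decoder still needs zlib inflation of the IDAT data and per-scanline unfiltering. (TypeScript/Node purely for illustration.)

```typescript
// Sketch: parse the PNG signature and the IHDR chunk (always the first chunk).
import { readFileSync } from "node:fs";

const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

function readIhdr(path: string) {
  const buf = readFileSync(path);
  if (!buf.subarray(0, 8).equals(PNG_SIGNATURE)) throw new Error("not a PNG file");

  // Each chunk: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
  const length = buf.readUInt32BE(8);
  const type = buf.subarray(12, 16).toString("ascii");
  if (type !== "IHDR" || length !== 13) throw new Error("malformed IHDR chunk");

  return {
    width: buf.readUInt32BE(16),
    height: buf.readUInt32BE(20),
    bitDepth: buf.readUInt8(24),
    colorType: buf.readUInt8(25), // 0 gray, 2 RGB, 3 palette, 4 gray+alpha, 6 RGBA
    interlace: buf.readUInt8(28),
  };
}

console.log(readIhdr(process.argv[2])); // e.g. `node readIhdr.js image.png`
```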
Also, us engineers can talk about code quality all day, but does this really matter to non-engineers? Maybe objectively it does, but can we convince them that it does?
> Maybe objectively it does, but can we convince them that it does?
how long would you give our current civilisation if quality of software ceased to be important for:
- medical devices
- aircraft
- railway signalling systems
- engine management systems
- the financial system
- electrical grid
- water treatment
- and every other critical system
Yet it is rare that anyone ever needs to write a PNG decoder.
In the unlikely event you did, you would be doing something quite special to not be using an off-the-shelf library. Would an LLM be able to do whatever that special thing would be?
It's true that quality doesn't matter for code that doesn't matter. If you're writing code that isn't important, then quality can slip, and it's true an LLM is a good candidate for generating that code.
I tried vibe coding a BMP decoder not too long ago with the rationale being “what’s simpler than BMP?”
What I got was an absolute mess that did not work at all. Perhaps this was because, in retrospect, BMP is not actually all that simple, a fact that I discovered when I did write a BMP decoder by hand. But I spent equal time vibe coding and real coding. At the end of the real coding session, I understood BMP, which I see as a benefit unto itself. This is perhaps a bit cynical but my hot take on vibe coders is that they place little value on understanding things.
Mind explaining the process you tried? As someone who’s generally not had any issue getting LLMs to sort out my side projects (ofc with my active involvement as well), I really wonder what people who report these results are trying. Did you just open a chat with claude code and try to get a single context window to one shot it?
Just out of curiosity (as someone fairly familiar with the BMP spec, and also PNG incidentally): what did you find to be the trickiest/most complex aspects?
None of this is fresh in my mind, so my recollections might be a little hazy. I think the only issue I personally had when writing a decoder was keeping the alignment of various fields right. I wrote the decoder in C# and if I remember correctly I tried to get fancy with some modern-ish deserialization code. I think I eventually resorted to writing a rather ugly but simple low-level byte reader. Nevertheless I found it to be a relatively straightforward program to write and I got most of what I wanted done in under a day.
The vibe coded version was a different story. For simplicity, I wanted to stick to an early version of BMP. I don’t remember the version off the top of my head. This was a simplified implementation for students to use and modify in a class setting. Sticking to early version BMPs also made it harder for students to go off-piste since random BMPs found on the internet probably would not work.
The main problem was that the LLM struggled to stick to a specific version of BMP. Some of those newer features (compression, color table, etc, if I recall correctly) have to be used in a coordinated way. The LLM made a real mess here, mixing and matching newer features with older ones. But I did not understand that this was the problem until I gave up and started writing things myself.
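For what it's worth, the "ugly but simple low-level byte reader" approach I ended up with looked roughly like the sketch below (shown in TypeScript rather than the C# I actually used; offsets follow BITMAPFILEHEADER + BITMAPINFOHEADER, and pixel decoding and error handling are left out):

```typescript
// Sketch of reading the classic 54-byte BMP header with explicit offsets,
// which sidesteps the field-alignment issues mentioned above.
import { readFileSync } from "node:fs";

function readBmpHeader(path: string) {
  const buf = readFileSync(path);
  if (buf.toString("ascii", 0, 2) !== "BM") throw new Error("not a BMP file");

  return {
    pixelDataOffset: buf.readUInt32LE(10), // where the pixel rows start
    headerSize: buf.readUInt32LE(14),      // 40 = BITMAPINFOHEADER
    width: buf.readInt32LE(18),
    height: buf.readInt32LE(22),           // negative height = top-down rows
    bitsPerPixel: buf.readUInt16LE(28),
    compression: buf.readUInt32LE(30),     // 0 = BI_RGB (uncompressed)
  };
}

// Rows are padded to a multiple of 4 bytes; forgetting that is one of the
// easy-to-miss details when you go on to read the pixel data.
console.log(readBmpHeader(process.argv[2]));
```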
You should try a similar task with a recent model (Opus 4.5, GPT 5.2, etc) and see if there are any improvements since your last attempt. I also encourage using a coding agent. It can use test files that you provide, and perform compile-test loops to work out any mistakes.
It sounds like you used an older model, and perhaps copy-pasted code from a chat session. (Just guessing, based on what you described.)
I kinda like Theo's take on it (https://www.youtube.com/watch?v=Z9UxjmNF7b0): there's a sliding scale of how much slop should reasonably be considered acceptable and engineers are well advised to think about it more seriously. I'm less sold on the potential benefits (since some of the examples he's given are things that I would also find easy by hand), but I agree with the general principle that having the option to do things in a super-sloppy way, combined with spending time developing intuition around having that access (and what could be accomplished that way), can produce positive feedback loops.
In short: when you produce the PNG decoder, and are satisfied with it, it's because you don't have a good reason to care about the code quality.
> Maybe objectively it does, but can we convince them that it does?
I strongly doubt it, and that's why articles like TFA project quite a bit of concern for the future. If non-engineers end up accepting results from a low-quality, not-quite-correct system, that's on them. If those results compromise credentials, corrupt databases etc., not so much.
Claude Code is way better than I am at rummaging through Git history, handling merge conflicts, renaming things, writing SQL queries whose syntax I always forget (window functions and the like). But yeah. If I give it a big, non-specific task, it generates a lot of mediocre (or worse) code.
That's funny, those are all the things I don't trust it to do. I actually use it the other way around: give it a big non-specific task, see if it works, specify better, retry, throw away 60%-90% of the generated code, fix bugs in a bunch of places, and out comes an implemented feature.
I give the agent the following standing instructions:
"Make the smallest possible change. Do not refactor existing code unless I explicitly ask."
That directive cut down considerably on the amount of extra changes I had to review. When it gets it right, the changes are close to the right size now.
The agent still tries to do too much, typically suggesting three tangents for every interaction.
As someone who: 1) has a brain that is not wired to think like a computer and write code, 2) falls asleep at the keyboard while writing code for more than an hour or two, and 3) has a lot of passion for sticking with an idea and making it happen, even if that means writing code and knowing the code is crap...
In short, LLMs write better code than I do. And I'm not alone.
LLMs are not "average text generation machines" once they have context. LLMs learn a distribution.
The moment you start the prompt with "You are an interactive CLI tool that helps users with software engineering at the level of a veteran expert" you have biased the LLM such that the tokens it produces are from a very non-average part of the distribution it's modeling.
True, but nuanced. The model does not react to "you are an experienced programmer" kinds of prompts. It reacts to being given relevant information that needs to be reflected in the output.
See examples in https://arxiv.org/abs/2305.14688; They certainly do say things like "You are a physicist specialized in atomic structure ...", but the important point is that the rest of the "expert persona" prompt _calls attention to key details_ that improves the response. The hint about electromagnetic forces in the expert persona prompt is what tipped off the model to mention it in the output.
Bringing attention to key details is what makes this work. A great tip for anyone who wants to micromanage code with an LLM is to include precise details about what they wish to micromanage: say "store it in a hash map keyed by unsigned integers" instead of letting the model decide which data structure to use.
It'd be rather surprising if you could train an AI on a bunch of average code, and somehow get code that's always above average. Where did the improvement come from?
We should feed the output code back in to get even better code.
AI generally can improve through reinforcement learning, but this requires it to be able to compare its output to some form of metric. There aren't a lot of people I'd trust to RLHF for code quality, and anything more automated than that is destined to collapse due to Goodhart's Law.
In my view they’re great for rough drafts, iterating on ideas, throwaway code, and pushing into areas I haven’t become proficient in yet. I think in a lot of cases they write ok enough tests.
How are you judging that you'd write "better" code? More maintainable? More efficient? Does it produce bugs in the underlying code it's generating? Genuinely curious where you see the current gaps.
- it adds superfluous logic that is assumed but isn’t necessary
- as a result the code is more complex, verbose, harder to follow
- it doesn’t quite match the domain because it makes a bunch of assumptions that aren’t true in this particular domain
They’re things that can often be missed in a first pass look at the code but end up adding a lot of accidental complexity that bites you later.
When reading an unfamiliar code base we tend to assume that a certain bit of logic is there for a good reason, and that helps you understand what the system is trying to do. With generative codebases we can’t really assume that anymore unless the code has been thoroughly audited/reviewed/rewritten, at which point I find it’s easier to just write the code myself.
This has been my experience as well. But, these are things we developers care about.
Coding aside, LLMs aren't very good at following nice practices in general unless explicitly prompted to. For example, if you ask an LLM to create an error modal box from scratch, will it also implement the ability to select the text, to ctrl-c to copy the text, or perhaps a copy-message button? Maybe this is a bad example, but they usually don't do things like this unless you explicitly ask them to. I don't personally care too much about this, but I think it's noteworthy in the context of lay people using LLMs to vibe code.
> if you’re seeing output that is consistently better than what you produce by hand, you’re probably just below average at programming
even though this statement does not quite make sense mathematically/statistically, the vast majority of SWEs are “below average.” Therein lies the crux of this debate. I’ve been coding since the ’90s and:
- LLM output is better than mine from the 90’s
- LLM output is better than mine from early 2000’s
- LLM output is worse than any of mine from 2010 onward
- LLM output (in the right hands) is better than 90% of human-written code I have seen (and I’ve seen a lot)
The most prolific coders are also more competent than average. Their prolific output is what has trained these models; these models are trained on incredibly successful projects written by masters of their fields. This is usually where I find the most pushback: the most competent SWEs see it as theft, and also as useless to them, since they have already spent years honing skills to work relentlessly and efficiently towards solutions -- sometimes at great expense.
> The most prolific coders are also more competent than average
This is absolutely not true lol, as anyone who's worked with a fabled 10X engineer will tell you. It's like saying the best civil engineer is the one that builds the most bridges.
I've worked with a 10x engineer and indeed they were significantly more competent than the rest of the team in their execution and design. They've seen so many problems and had a chance to discard bad patterns and adopt/try out new ones.
Nah, just better at marketing and coming up with plausible things that make them sound good. You learn to spot that type after a while. There are a rare few people who do get a lot done, but usually it's because they work 80 hour weeks.
I'd assume most of the code visible on the web leans amateur. A huge portion of github repos seem to be from students these days. You'll see GitHub's Education page listing "5 million students" (https://github.com/education), which I assume is an under-estimate, as that's only the formal program.
It's let me apply my general knowledge across domains, and do things in tech stacks or languages I don't know well. But that has also cost me hours debugging a solution I don't quite understand.
When working in my core stack though it's a nice force multiplier for routine changes.
If you are an experienced developer, in general your code will be better than an LLM's, even in languages you are not proficient in. The LLM code might be more in keeping with the language's conventions, but that does not automatically make it superior. LLMs typically produce bad code along other dimensions.
If firing up old coal plants and skyrocketing RAM prices and $5000 consumer GPUs and violating millions of developers' copyrights and occasionally coaxing someone into killing themselves is the cost of Brian Who Got Peter Principled Into Middle Management getting to enjoy programming again instead of blaming his kids for why he watches football and drinks all weekend instead of cultivating a hobby, I guess we have no choice but to oblige him his little treat.
Asbestos is a miracle material with a serious and hard to really avoid downside.
It's so good that we are genuinely left with crappy options to replace it, and people have died in fires that could have been saved with the right application of asbestos.
Current AI hype is closer to the Radium craze back during the discovery of radioactivity. Yes it's a neat new thing that will have some interesting uses. No don't put it in everything and especially not in your food what are you doing oh my god!
Cloudflare works fine with public relay - they and Fastly provide infrastructure for that service (one half of the blinded pair) so it’s definitely something they test.
> JIT compilation has the potential to exceed native compiled speeds
The battlecry of Java developers riding their tortoises.
Don’t we have decades of real-world experience showing native code almost always performs better?
For most things it doesn’t matter, but it always rubs me the wrong way when people mention this about JIT since it almost never works that way in the real world (you can look at web framework benchmarks as an easy example)
It's not that surprising to people who are old enough to have lived through the "reality" of "interpreted languages will never be faster than about 2x compiled languages".
The idea that an absurdly dynamic language like JS, where all objects are arbitrary property bags with runtime-mutable prototype chains, would execute within a 2x budget of raw native performance was just a matter-of-fact impossibility.
Until it wasn't. And the technology reason it ended up happening was research that was done in the 80s.
It's not surprising to me that it hasn't happened yet. This stuff is not easy to engineer and implement. Even the research isn't really there yet. Most of the modern dynamic-language JIT ideas that came to the fore in the mid-2000s were directly adapting research work on Self from about two decades prior.
Dynamic runtime optimization isn't too hot in research right now, and it never was to be honest. Most of the language theory folks tend to lean more in the type theory direction.
The industry attention too has shifted away. Browsers were cutting edge a while back and there was a lot of investment in core research tech associated with that, but that's shifting more to the AI space now.
Overall the market value prop and the landscape for it just doesn't quite exist yet. Hard things are hard.
You nailed it -- the tech enabling JS to match native speed was Self research from the 80s, adapted two decades later. Let me fill in some specifics from people whose papers I highly recommend, and who I've asked questions of and had interesting discussions with!
Vanessa Freudenberg [1], Craig Latta [2], Dave Ungar [3], Dan Ingalls, and Alan Kay had some great historical and fresh insights. Vanessa passed recently -- here's a thread where we discussed these exact issues:
Vanessa had this exactly right. I asked her what she thought of using WASM with its new GC support for her SqueakJS [1] Smalltalk VM.
Everyone keeps asking why we don't just target WebAssembly instead of JavaScript. Vanessa's answer -- backed by real systems, not thought experiments -- was: why would you throw away the best dynamic runtime ever built?
To understand why, you need to know where V8 came from -- and it's not where JavaScript came from.
David Ungar and Randall B. Smith created Self [3] in 1986. Self was radical, but the radicalism was in service of simplicity: no classes, just objects with slots. Objects delegate to parent objects -- multiple parents, dynamically added and removed at runtime. That's it.
The Self team -- Ungar, Craig Chambers, Urs Hoelzle, Lars Bak -- invented most of what makes dynamic languages fast: maps (hidden classes), polymorphic inline caches, adaptive optimization, dynamic deoptimization [4], on-stack replacement. Hoelzle's 1992 deoptimization paper blew my mind -- they delivered simplicity AND performance AND debugging.
That team built Strongtalk [5] (high-performance Smalltalk), got acquired by Sun and built HotSpot (why Java got fast), then Lars Bak went to Google and built V8 [6] (why JavaScript got fast). Same playbook: hidden classes, inline caching, tiered compilation. Self's legacy is inside every browser engine.
Brendan Eich claims JavaScript was inspired by Self. This is an exaggeration based on a deep misunderstanding that borders on insult. The whole point of Self was simplicity -- objects with slots, multiple parents, dynamic delegation, everything just another object.
JavaScript took "prototypes" and made them harder than classes: __proto__ vs .prototype (two different things that sound the same), constructor functions you must call with "new" (forget it and "this" binds wrong -- silent corruption), only one constructor per prototype, single inheritance only. And of course == -- type coercion so broken you need a separate === operator to get actual equality. Brendan has a pattern of not understanding equality.
The ES6 "class" syntax was basically an admission that the prototype model was too confusing for anyone to use correctly. They bolted classes back on top -- but it's just syntax sugar over the same broken constructor/prototype mess underneath. Twenty years to arrive back at what Smalltalk had in 1980, except worse.
Self's simplicity was the point. JavaScript's prototype system is more complicated than classes, not less. It's prototype theater. The engines are brilliant -- Self's legacy. The language design fumbled the thing it claimed to borrow.
Vanessa Freudenberg worked for over two decades on live, self-supporting systems [9]. She contributed to Squeak EToys, Scratch, and Lively. She was co-founder of Croquet Corp and principal engineer of the Teatime client/server architecture that makes Croquet's replicated computation work. She brought Alan Kay's vision of computing into browsers and multiplayer worlds.
SqueakJS [7] was her masterpiece -- a bit-compatible Squeak/Smalltalk VM written entirely in JavaScript. Not a port, not a subset -- the real thing, running in your browser, with the image, the debugger, the inspector, live all the way down. It received the Dynamic Languages Symposium Most Notable Paper Award in 2024, ten years after publication [1].
The genius of her approach was the garbage collection integration. It amazed me how she pulled a rabbit out of a hat -- representing Squeak objects as plain JavaScript objects and cooperating with the host GC instead of fighting it. Most VM implementations end up with two garbage collectors in a knife fight over the heap. She made them cooperate through a hybrid scheme that allowed Squeak object enumeration without a dedicated object table. No dueling collectors. Just leverage the machinery you've already paid for.
But it wasn't just technical cleverness -- it was philosophy. She wrote:
"I just love coding and debugging in a dynamic high-level language. The only thing we could potentially gain from WASM is speed, but we would lose a lot in readability, flexibility, and to be honest, fun."
"I'd much rather make the SqueakJS JIT produce code that the JavaScript JIT can optimize well. That would potentially give us more speed than even WASM."
Her guiding principle: do as little as necessary to leverage the enormous engineering achievements in modern JS runtimes [8]. Structure your generated code so the host JIT can optimize it. Don't fight the platform -- ride it.
She was clear-eyed about WASM: yes, it helps for tight inner loops like BitBlt. But for the VM as a whole? You gain some speed and lose readability, flexibility, debuggability, and joy. Bad trade.
This wasn't conservatism. It was confidence.
Vanessa understood that JS-the-engine isn't the enemy -- it's the substrate. Work with it instead of against it, and you can go faster than "native" while keeping the system alive and humane. Keep the debugger working. Keep the image snapshotable. Keep programming joyful. Vanessa knew that, and proved it!
Yeah I've heard this my whole career, and while it sounds great it's been long enough that we'd be able to list some major examples by now.
What are the real world chances that a) one's compiled code benefits strongly from runtime data flow analysis AND b) no one did that analysis at the compilation stage?
Some sort of crazy off label use is the only situation I think qualifies and that's not enough.
Compiled Lua vs LuaJIT is a major example imho, but maybe it's not especially pertinent given the looseness of the Lua language. I do think it demonstrates that the concept that it is possible to have a tighter type-system at runtime than at compile time (that can in turn result in real performant benefits) is a sound concept, however.
The major Javascript engines already have the concept of a type system that applies at runtime. Their JITs will learn the 'shapes' of objects that commonly go through hot-path functions and will JIT against those with appropriate bailout paths to slower dynamic implementations in case a value with an unexpected 'shape' ends up being used instead.
There's a lot of lore you pick up with Javascript when you start getting into serious optimization with it; and one of the first things you learn in that area is to avoid changing the shapes of your objects because it invalidates JIT assumptions and results in your code running slower -- even though it's 100% valid Javascript.
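A tiny sketch of what that shape lore looks like in practice (TypeScript, but the same applies to plain JS; the names are just illustrative):

```typescript
// Objects created with the same property layout share a hidden class (shape),
// so the property loads in hot functions stay monomorphic and cheap.
interface Point { x: number; y: number }

function makeStable(x: number, y: number): Point {
  return { x, y }; // every object gets the same shape
}

function makeUnstable(x: number, y: number): Record<string, number> {
  const p: Record<string, number> = {};
  p.x = x;
  if (x > 0) p.y = y; // shape now depends on runtime data
  return p;
}

function sumX(points: Array<{ x: number; y?: number }>): number {
  let total = 0;
  for (const p of points) total += p.x + (p.y ?? 0); // polymorphic if shapes differ
  return total;
}

// Feeding sumX only makeStable objects keeps the JIT on its fast path;
// mixing in makeUnstable-style objects (or adding/deleting properties later)
// invalidates those assumptions, even though both are 100% valid JS.
console.log(sumX([makeStable(1, 2), makeStable(3, 4)]));
```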
Totally agree on JS, but it doesn't have the same easy same-language comparison that you get from compiled Lua vs LuaJIT. I suppose you could pre-compile JavaScript to a binary with e.g. QuickJS, but I don't think that's as apples-to-apples a comparison as compiled Lua vs LuaJIT.
I’ve been thinking about this a lot lately and agree.
Also, it’s amazing how inefficient medium+ businesses are. Given the cost/weight of the bureaucracy they inflict on themselves, I think we should see small businesses thriving, but we don’t.
I think healthcare costs/requirements and unfair access to capital keep the inefficient machine chugging along, cutting off routes for smart people to start businesses, innovate, and improve our economy.
This is one of the manifestations of Crony Capitalism. That is, manipulation (i.e., typically regulation) that puts its thumb on the scale and picks winners and losers.
Most people who complain about capitalism are actually complaining about Crony Capitalism. The fact that they don’t understand the difference** is what makes CC so “magical.”
** The NFL is not the Premier League, and vice versa. Both play football, yet no one would confuse the two. Capitalism and Crony Capitalism should have the same differentiation and clarity. The reason they do not is not accidental.
Capitalism evolves into Crony Capitalism just as Communism evolves into Dictatorship and Tyranny, Feudalism evolves into Serfdom, etc. I'm just surprised it lasted this long...