
Hassabis put forth a nice taxonomy of innovation: interpolation, extrapolation, and paradigm shifts.

AI is currently great at interpolation, and in some fields (like biology) there seems to be low-hanging fruit for this kind of connect-the-dots exercise. A human would still be considered smart for connecting these dots IMO.

AI clearly struggles with extrapolation, at least if the new datum is fully outside the training set.

And we will have AGI (if not ASI) if/when AI systems can reliably form new paradigms. It’s a high bar.


This is a sound tactical move to provide a hedge against future US economic threats.

Strategically, I do think you want to be coming up with a plan to shield core industries like auto, shipping, energy, and some parts of manufacturing (eg “factories for factories” rather than “factories for consumer goods”) from dumping / state subsidies.

It might be OK to let the PRC subsidize your solar cells, assuming you can build wind instead if they try to squeeze you. It’s probably not wise to depend on PRC for your batteries, drones, and cars, where these are key to strategic autonomy and you don’t have an alternative.


The world already relies on them for drones and batteries, often cars.

Yes, of course, that is what I’m saying you need to change.

You change that by developing a competitive industry. You can use China to help with that, and don't blame them; subsidizing everything is impossible, so that's not the reason for their success. Instead, try to explain your own failure - high barriers to entry, meager competition, etc.

It’s a good idea to have Claude write down the execution plan (including todos). Or you can use something like Linear / GH Issues to track the big items. Then small/tactical todos are what you track in session todos.
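
Concretely, the kind of plan file Claude might keep in the repo could look something like this (filename, headings, and ticket IDs are all made up for illustration):

    # Plan: migrate the importer to the new API

    ## Big items (mirror these in Linear / GH Issues)
    - [ ] IMP-101: swap the HTTP client
    - [ ] IMP-102: backfill historical records

    ## Session todos (tactical, cheap to lose)
    - [x] add retry wrapper around fetch()
    - [ ] fix the two failing importer unit tests
    - [ ] update the README usage example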

This approach means you can just kill the session and restart if you hit limits.

(If you hit context limits you probably also want to look into sub-agents to help prevent context bloat. For example, any time you are running and debugging unit tests, it's usually best to start with a subagent to handle the easy errors.)


Did you eval using screenshots or some sort of rendered visualization instead of the CLI? I wonder if Claude has better visual intelligence when viewing images (lots of these in its training set) rather than ASCII schematics (probably very few of these in the corpus).

Computer use and screenshots are context-intensive. Text is not. The more context you give to an LLM, the dumber it gets. Some people think that at 40% context utilization the LLM starts to get into the dumb zone. That is where the limitations are as of today. This is why CLI-based tools like Claude Code are so good. And any attempt at computer use has fallen by the wayside.

There are some potential solutions to this problem that come to mind. Use subagents to isolate the interesting bits about a screenshot and only feed that to the main agent with a summary. This will all still have a significantly higher token usage compared to a text based interface, but something like this could potentially keep the LLM out of the dumb zone a little longer.


> And any attempt at computer use has fallen by the wayside.

You're totally right! I mean, aside from Anthropic launching "Cowork: Claude Code for the rest of your work" 5 days ago. :)

https://claude.com/blog/cowork-research-preview

https://news.ycombinator.com/item?id=46593022

More to the point though, you should be using Agents in Claude Code to limit context pollution. Agents run with their own context, and then only return salient details. Eg, I have an Agent to run "make" and return the return status and just the first error message if there is one. This means the hundreds/thousands of lines of compilation don't pollute the main Claude Code context, letting me get more builds in before I run out of context there.
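
For anyone curious what such an agent looks like, here is a minimal sketch assuming Claude Code's subagent format (a markdown file under .claude/agents/ with YAML frontmatter); the name and wording are illustrative, not the exact agent described above:

    ---
    name: build-runner
    description: Runs make and reports only the exit status and the first error, keeping build output out of the main context.
    tools: Bash
    ---
    Run `make` in the repository root.
    Reply with only:
    - the exit status
    - the first error message (file, line, and the error text), if any
    Never paste the full build log.

The main session only ever sees those few lines rather than the thousands of lines of compiler output.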


I had tried the browser screenshotting feature for agents in Cursor and found it wasn't very reliable - screenshots eat a lot of context, and the agent didn't have a good sense for when to use them. I didn't try it in this project. I bet it would work in some specific cases.

I'm not sure if this proves anything, but I saw this article about Opus playing Pokemon; there it was given actual screenshots, and the article still says it navigated visual space pretty poorly despite the advancements: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...

Claude helped me immensely in getting an image converter to work. Giving it screenshots of the wrong output (lots of layers had unpredictable offsets that weren't supposed to be there) alongside the output I expected helped Claude understand the problems, and it fixed the bugs immediately.

Claude Code has had this feature for a few months now.

> Why does dev-owned testing look so compelling in theory, yet fall short in almost every real-world implementation

Seems like a weird assertion. Plenty of startups do “dev owned testing” ie not hiring dedicated QA. Lots of big-tech does too. Indeed I’d venture it’s by far the most common approach on a market valuation-weighted basis.


It's very common and most software is extremely buggy, so that adds up.

I think most consumer software just would not stand the scrutiny of enterprise or professional applications. It just does not work a lot of the time, which is truly insane to think about.

I mean, imagine if your compiler just didn't work 10% of the time and output a wrong program. But when I log into website XYZ, that's about the rate of failure.


Counterpoint - I think it’s going to become much easier for hobbyists and motivated small companies to make bigger projects. I expect to see more OSS, more competition, and eventually better quality-per-price (probably even better absolute quality at the “$0 / sell your data” tier).

Sure, the megacorps may start rotting from the inside out, but we already see a retrenchment to smaller private communities, and if more of the benefits of the big platforms trickle down, why wouldn’t that continue?

Nicbou, do you see AI as increasing your personal output? If it lets enthusiastic individuals get more leverage on good causes then I still have hope.


When it became cheaper to publish text did the quality go up?

When it became cheaper to make games did the quality go up?

When it became cheaper to mass produce X (sneakers, tshirts, anything really) did the quality go up?

It's a world that is made of an abundance of trash. The volume of low-quality production saturates the market and drowns out whatever high-quality things still remain. In such a world you're just better off reallocating your resources from production quality towards the shouting match of marketing, and trying to win by finding ways to be more visible than the others (SEO hacking and similar shenanigans).

When you drive down the cost of doing something to zero you also effectively destroy the economy based around that thing. Like online print: basically nobody can make a living focusing on publishing news or articles, so alternative revenue streams (ads) are needed. Same for games too.


> When it became cheaper to … did the quality go up?

No, but the availability (more people can afford it) and diversity (different needs are met) increased. I would say that's a positive. Some of the expensive "legacy" things still exist and people pay for it (e.g. newspapers / professional journalism).

Of course low quality stuff increased by a lot and you're right, that leads to problems.


Well yeah more people can afford shitty things that end up in the landfill two weeks later. To me this is the essence of "consumerism".

Rather than think in terms of making things cheaper for people to afford we should think how to produce wealthier people who could afford better than the cheapest of cheapest crap.


But in the context of software, the landfill argument doesn't fit exactly well (sure, someone can argue that storage on, say, GitHub might take more drives, but that's far cheaper than a landfill filled with physical things).

> Rather than think in terms of making things cheaper for people to afford we should think how to produce wealthier people who could afford better than the cheapest of cheapest crap.

This problem actually runs deep and is systemic. I am genuinely not sure how one can do it; what exactly would that wealth derive from? The growth of stock markets, which people call bubbles, or the US debt, which has been ramping up in recent years basically to fuel the consumerism spree itself? I am not sure.

If you were to make people wealthy, they might still buy the cheapest of cheap crap, just at 10x the magnitude in many cases (or at least that's what I've observed in the US, with how many people buy and sell usually very simple SaaS tools at times).


Re software and landfill... true to some extent, but there are still ramifications, as you pointed out: electricity demand and the hardware infrastructure to support it. Also, in the 80s when the computer games market crashed, they literally dumped game cartridges in a hole in the desert!

Maybe my opinion is just biased and I'm in a comfortable position to pass judgment, but I'd like to believe that more people would be more ethical and conscious about their materialistic needs if things had more value and were better quality, and if, instead of focusing on "price" as the primary value proposition, people were actually able to afford something other than the cheapest of things.

Wouldn't the economy also be in much better shape if more people could buy things such as handmade shoes or suits?


> Re software and landfill... true to some extent, but there are still ramifications, as you pointed out: electricity demand and the hardware infrastructure to support it. Also, in the 80s when the computer games market crashed, they literally dumped game cartridges in a hole in the desert!

I hear ya, but I wonder how that reflects on open source software created by an LLM, which was the GP's scenario. Yes, I know it can have bugs, but it's free of cost, and you can own it, modify it (the source code is available), and run it on your own hardware.

There really isn't much of a difference in terms of hardware/electricity just because of these open source projects.

But probably some for the LLMs, so it's a little tricky; still, I feel like open source projects and running far with ideas get incentivized.

At least I feel like it's one of the more acceptable uses of LLMs so far. It's better because you are open sourcing it for others to run. If someone doesn't want to use it, that's their freedom, but you built it for yourself, or ran with an idea that couldn't have existed if you didn't know the implementation details, or that would have taken months or years for zero gain, when now you can do it in less time.

It also makes it much easier to see which ideas would be beneficial or not, and I feel like if AI is so worrying, then if an idea is good and it can be tested, it can always be rewritten or documented heavily by a human. In fact there are even job posts for "slop janitor" on LinkedIn, lol.

> Wouldn't the economy also be in much better shape if more people could buy things such as handmade shoes or suits?

Yes, but it's also far from happening; it would require a real shake-up of everything, and it's just a dream right now. I agree with ya, but it's not gonna happen, or at least it's not something one person can change - trust me, I tried.

This requires system-wide change that one person is very unlikely to bring about, but I wish you the best in your endeavour.

But what I can do, on a more individualistic level, is create open source projects via LLMs when there is a concept I don't know, and then open source them for the general public. If even one or two people find them useful, it's all good, and I am always experimenting.


> Rather than think in terms of making things cheaper for people to afford we should think how to produce wealthier people who could afford better than the cheapest of cheapest crap.

I'm not trying to be snarky, but, if the principle is broadly applied, then what is the difference between these two? (I agree that, if it can only be applied to a limited population, making a few poor people wealthier might be better than making a few products cheaper.)


Newspapers and professional journalism are indeed doing well right now, nothing to worry about.

I think you found, but possibly didn't recognize, the problem. When availability goes up, but the quality of that which is widely available goes down, you get class stratification where the haves get quality, reliable journalism / software / games / etc. while the have-nots get slop. This becomes generational when education becomes part of this scenario.

When it became cheaper to publish text, for example with the invention of the printing press, the quality of what the average person had in his possession went up: you went from very few having hand-copied texts to Erasmus describing himself running into some border guard reading one of his books (in Latin). The absolute quality of texts published might have decreased a bit, but the quality per capita of what individuals owned went up.

When it became cheaper to mass produce sneakers, tshirts, and anything, the quality of the individual product probably did go down, but more people around the world were able to afford the product, which raised the standard of living for people in the aggregate. Now, if these products were absolute trash, life wouldn't make much sense, but there's a friction point in there between high quality and trash, where things are acceptable and affordable to the many. Making things cheaper isn't a net negative for human progress: hitting that friction point of acceptable affordability helps spread progress more democratically and raise the standard of living.

The question at hand is whether AI can more affordably produce acceptable technical writing, or if it's trash. My own experiences with AI make me think that it won't produce acceptable results, because you never know when AI is lying: catching those errors requires someone who might as well just write the documentation. But, if it could produce truthful technical writing affordably, that would not be a bad thing for humanity.


> When it became cheaper to publish text, for example with the invention of the printing press, the quality of what the average person had in his possession went up: you went from very few having hand-copied texts to Erasmus describing himself running into some border guard reading one of his books (in Latin). The absolute quality of texts published might have decreased a bit, but the quality per capita of what individuals owned went up.

Today the situation is very different, and I'm not quite sure why you compare a time in history where the average person was illiterate and (printed) books were limited to a very small audience who could afford them, with the current era where everybody is exposed to the written word all the time and is even dependent on it, in many cases even dependent on its accuracy (think public services). The quality of AI writing in some cases is so subpar it resembles word salad. Example from Goodreads: the blurb of this book https://www.goodreads.com/book/show/237615295-of-venom-and-v... was so surreal I wrote to the author to correct it (see the comments on the author's own review). It's better now, but it still has mistakes. This is in no way comparable with the past's "decreased a bit"; this is destroying trust even more than everything else, because if this gets to be the norm for official documents, people are going to be hurt.


I agree with you, however:

One of the qualia of a product is cost. Another is contemporaneity.

If we put these together, we see a wide array of products which, rather than just being trash, hit a sweet spot for "up-to-date yet didn't break the wallet" and you end up with https://shein.com/

These are not thought of as the same people that subscribe to the Buy It For Life subreddit, but some may use Shein for a club shirt and BIFL for an espresso machine. They make a choice.

What's more, a “Technivorm Moccamaster” costs 10x a “Mr. Coffee” because of the build and repairability, not because of the coffee. (Amazon Basics cost ½ that again.)

Maybe Fashion was the original SEO hack. Whoever came up with the phrase "gone out of style" wrought much of this.


I think for 'technical' writing, there is going to be some end-state crash.

What happens when the engineers who are left can't figure something out, they start opening up manuals, and those are also all wrong and trash? The whole world grinds to a halt because nobody knows anything.


>When it became cheaper to x did the quality go up? ...yes?

It introduces a lower barrier to entry, so more low-quality things are also created, but it also increases the quality of the higher tier as well. It's important to note that in FOSS, we (or at least... I) don't generally care who wrote the code, as long as it compiles and isn't malicious. This overlaps with the original discussion... If I were paying you to read your posts, I'd expect them to be hand-written. If I'm paying for software, it better not be AI slop. If you're offering me something for free, I'm not really in a position to complain about the quality.

It's undeniable that, especially in software, cheaper costs and a lower barrier to get started will bring more great FOSS software. This is like one of the pillars of FOSS, right? That's how we got LetsEncrypt, OpenDNS, etc. It will also 100% bring more slop. Both can be true at the same time.


I'd say that those high-quality things that still exist do so despite the higher volume of junk, and they mostly exist because of other reasons/unique circumstances (individual pride, craftsmanship, people doing things as a hobby/without financial constraints, etc).

In a landscape where the market is mostly filled with junk, any commercial product that spends anything on "quality" is essentially losing money.


>people doing things as a hobby/without financial constraints

Isn't this the exact point I was making...? I get you're arguing it's only a single factor, but I feel like the point still stands. More hobbyists, less financial constraints


The problem is that with the amount of low-quality stuff we're seeing, and with the expansion of the low-quality frenzy into the realm of information dissemination, it can become prohibitively difficult to distinguish the high-quality stuff. What matters is not the "total quality" but sort of like the expected value of the quality you can access in practice, and I feel like in at least some areas that has gone down.

> but it also increases the quality of the higher-tier

I truly don't see this happening anymore. Maybe it did before?

If there's real competition, maybe this does happen. We don't have it and it'll never last in capitalism since one or a few companies will always win at some point.

If you're a higher tier X, cheaper processes means you'll just enjoy bigger profit margins and eventually decide to start the enshittification phase since you're a monopoly/oligopoly, so why not?

As for FOSS, well, we'll have more crappy AI generated apps that are full of vulnerabilities and will become unmaintainable. We already have hordes of garbage "contributions" to FOSS generated by these AI systems worsening the lives of maintainers.

Is that really higher quality? I reckon it's only higher quantity, with more potential to lower the quality of even higher-tier software.


> When it became cheaper to publish text did the quality go up?

Obviously, yes? Maybe not the median or even mean, but peak quality for sure. If you know where to look there are more high-quality takes available now than ever before. (And perhaps more meaningfully, peak quality within your niche subgenre is better than ever).

> When it became cheaper to make games did the quality go up?

Yes? The quality and variety of indie games is amazing these days.

> When it became cheaper to mass produce X (sneakers, tshirts, anything really) did the quality go up?

This is the case where I don’t see a win, and I think it bears further thought; I don’t have a clear explanation. But I note this is the one case where production is not actually democratized. So it kinda doesn’t fit with the digital goods we are discussing.

> basically nobody can make a living with focusing on publishing news or articles

Is this actually true? Substack enables more independent career bloggers than ever before. I would love to see the numbers on professional indie devs. I agree these are very competitive fields, and an individual’s chances of winning are slim, but I suspect there are more professional indie creators than ever before.


I don't think peak quality is a very meaningful measure. As you say, it turns everything into "if you know where to look".

When was the last time that speed of development was the limiting factor? 15-20 years ago?

Nowadays the problem is that both technical and legal means are used to prevent adversarial interoperability. It doesn't matter if you (or AI) can write software faster if said software is unable to interface with the thing everyone else uses.


I suggest that you read my comment again. It will answer your question.

It's a risk/convenience tradeoff. The biggest threat is Claude accidentally accessing and leaking your SSL keys, or getting prompt-hijacked to do the same. A simple sandbox fixes this.

There are theoretical risks of Claude getting fully owned and going rogue, and doing the iterative malicious work to escape a weaker sandbox, but it seems substantially less likely to me, and therefore perhaps not (currently) worth the extra work.
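
For what it's worth, a minimal sketch of the kind of "simple sandbox" meant here, assuming Docker and an npm install of Claude Code (image, flags, and package name are my assumptions; adjust to taste):

    # Mount only the project directory; ~/.ssh, ~/.aws, browser profiles, etc. stay outside the container.
    # ANTHROPIC_API_KEY is assumed to be set on the host and passed through.
    docker run --rm -it \
      -v "$(pwd)":/workspace \
      -w /workspace \
      -e ANTHROPIC_API_KEY \
      node:22 \
      bash -c "npm install -g @anthropic-ai/claude-code && claude"

The agent can still modify the mounted project (keep it under git), but credentials that were never mounted can't be read or leaked.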


How does a simple sandbox fix this at all? If Claude has been prompt-hijacked you need a VM to be anywhere near safe.

Prompt-hijacking is unlikely. GP is most likely trying to prevent mistakes, not malicious behavior.

Does it matter? Hypothetically, if these pre-training datasets disappeared, you could distill from the smartest current model, or have it write textbooks.

If LLMs happened 15 years ago, I guess that we wouldn’t have had the JS framework churn we had.

They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed.

As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)

This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.


So basically “you’re holding it wrong?”

Every time this is what I'm told. The difference between learning how to Google properly and then the amount of hoops and in-depth understanding you need to get something useful out of these supposedly revolutionary tools is absurd. I am pretty tired of people trying to convince me that AI, and very specifically generative AI, is the great thing they say it is.

It is also a red flag to see anyone refer to these tools as intelligence, as it seems the marketing of calling this "AI" has finally woven its way into our discourse, such that even tech forums think the prediction machine is intelligent.


I heard it best described this way: if you put in an hour of work, you get five hours of work out of it. Most people just type at it and don't put in an hour of planning, discussion, and scaffolding. They just expect it to work 100% of the time, exactly like they want. But you wouldn't expect that from a junior developer. You would put an hour of work into them, teaching them things, showing them where the documentation is, your patterns, how you do things, and then you would set them off. They would probably make mistakes, and you would document their mistakes for them so they wouldn't make them again, but eventually they'd be pretty good. That's more or less where we are today; that will get you success on a great many tasks.

Exactly my experience, and how I leverage Claude, while some of my coworkers remain unconvinced.

"The thing I've learned years ago that is actually complex but now comes easy to me because I take my priors for granted is much easier than the new thing that just came out"

Also, that "it's not really intelligence" horse is so dead, it has already turned into crude oil.


The point I am making is that this is supposed to be some revolutionary tool that threatens our very society in terms of labor and economics, yet the fringe enthusiasts (yes, that is what HN and its users are, an extreme minority of users), the very people plugged into the weekly changes and additions of models and the tools to leverage them, still struggle to show me the value of generative AI day to day. They make big claims, but I don't see them. In fact, I see the negatives overwhelming the gains, and that's before even getting into the product and its usability.

In practice I have seen: flowery emails no one bothers to read, emoji filled summaries and documentation that no one bothers to read or check correctness on, prototypes that create more work for devs in the long run, a stark decline in code quality because it turns out reviewing code is a team's ultimate test of due diligence, ridiculous video generation... I could go on and on. It is blockchain all over again, not in terms of actual usefulness, but in terms of our burning desire to monetize it in irresponsible, anti-consumer, anti-human ways.

I DO have a use for LLMs. I use them to tag data that has no tagging. I think the tech behind generative AI is extremely useful. Otherwise, what I see is a collection of ideal states that people fail to demonstrate to me in practice, when in reality it won't be replacing anyone until "the normies" can use it without 1000 lines of instructions markdown. Instead it will just fool people with its casual, authoritative, and convincing language, since that is what it was designed to do.


> reviewing code is a team's ultimate test of due diligence

Even further, if you are actually thinking about long-term maintenance during code review, you get seen as a nitpicky obstacle.


> Also, that "it's not really intelligence" horse is so dead, it has already turned into crude oil.

Why? Is it intelligence now? I think not.


Would you mind defining "intelligence" for me?

There are many types of intelligence. If you want to go to useless places, using certain definitions of intelligence, yes, we can consider AI “intelligent.” But it’s useless.

If you're the one saying it exists, you go first. :p

I’d say “skill issue” since this is a domain where there are actually plenty of ways to “hold it wrong” and lots of ink spilled on how to hold it better, and your phrasing connotes dismissal of user despair which is not my intent.

(I’m dismissive of calling the tool broken though.)


Remember when "Googling" was a skill?

LLMs are definitely in the same boat. It's even more specific where different models have different quirks so the more time you spend with one, the better the results you get from that one.


Those skills will age faster than Knockout.js.

Why would a skill that's being actively exercised against the state of the art, daily, age poorly?

Do you think it's impossible to ever hold a tool incorrectly, or use a tool in a way that's suboptimal?

If that tool is sold as "This magic wand will magically fix all your problems" then no, it's not possible to hold it incorrectly.

Gotcha. I don't see these tools as being a magic wand nor being able to magically fix every problem. I agree that anyone who sells them that way is overstating their usefulness.

If your position is that any product that doesn't live up to all its marketing claims is worthless, you're going to have a very limited selection.

Why does it matter how it's sold? Unless you're overpaying for what it's actually capable of, it doesn't really matter.

We all have skin in the game when how it’s sold is “automated intelligence so that we can fire all our knowledge workers”

Might be good in some timelines. In our current timeline this will just mean even more extreme concentration of wealth, and worse quality of life for everyone.

Maybe when the world has a lot more safety nets so that not having a job doesn’t mean homelessness, starvation, no healthcare, then society will be more receptive to the “this tool can replace everybody” message.


If a machine can do your job, whether it's harvesting corn or filing a TPS report, then making a person sit and do it for the purpose of survival is basically just torture.

There are so many better things for humans to do.


I agree in theory. In practice people who are automated out of jobs are not taken care of by society in the transition period where they learn how to do a new job.

Once having a job is not intimately tied to basic survival needs then people will be much more willing to automate everything.

I, personally, would be willing to do mind numbing paperwork or hard labor if it meant I could feed myself and my family, have housing, rather than be homeless and starving.


You might as well stop being a software developer. Not because you'll be out of a job, but because you're directly contributing to other people being out of jobs. We've been automating work (which is ultimately human labor) since the dawn of computers. And humans have been automating work for centuries now. We actually call that progress. So let's stop progressing entirely so people can do pointless labor.

If the problem is with society, the solution is with society. We have to stop pretending that it's anything else. AI is not even the biggest technological leap -- it's a blip on the continuum.


>There are so many better things for humans to do.

For the time being, at least.


There will always be better things for people to do. We don't exist on this planet just to sit at a desk and hit buttons all day.

The only reason we exist is as a carrier for our genes to make more of our genes. Everything after that is an accidental byproduct.

I can already think of something more useful to pass on my genes than typing on keyboard all day.

I found this a pretty apt - if terse - reply. I'd appreciate someone explaining why it deserves being downvoted?

It's just dismissive of the idea that you have to learn how to use LLMs, vs. a design flaw in a cell phone that was dismissed as user error.

It’s the same as if he had said “I keep typing HTML into VS code and it keeps not displaying it for me. It just keeps showing the code. But it’s made to make webpages, right? people keep telling me I don’t know how to use it but it’s just not showing me the webpage.”


There are two camps who have largely made up their minds just talking past each other, instinctively upvoting/downvoting their camp, etc. These threads are nearly useless, maybe a few people on the fringes change their minds but mostly it's just the same tired arguments back and forth.

Because in its brevity it loses all ability to defend itself from any kind of reasonable rebuttal. It's not an actual attempt to continue the conversation, it's just a semantic stop-sign. It's almost always used in this fashion, not just in the context of LLM discussions, but in this specific case it's particularly frustrating because "yes, you're holding it wrong" is a good answer.

To go further into detail about the whole thing: "You're holding it wrong" is perfectly valid criticism in many, many different ways and fields. It's a strong criticism in some, and weak in others, but almost always the advice is still useful.

Anyone complaining about getting hurt by holding a knife by the blade, for example, is the strongest example of the advice being perfect. The tool is working as designed, cutting the thing with pressure on the blade, which happens to be their hand.

Left-handers using right-handed scissors provides a reasonable example: I know a bunch of left-handers who can cut properly with right-handed scissors and not with left-handed scissors. Me included, if I don't consciously adjust my behaviour. Why? Because they have been trained to hold scissors wrong (by positioning the hand to create opposite push/pull forces to natural), so that they can use the poor tool given to them. When you give them left-handed scissors and they try to use the same reversed push/pull, the scissors won't cut well because their blades are being separated. There is no good solution to this, and I sympathise with people stuck on either side of this gap. Still, learn to hold scissors differently.

And, of course, the weakest, and the case where the snark is deserved: if you're holding your iPhone 4 with the pad of your palm bridging the antenna, holding it differently still resolves your immediate problem. The phone should have been designed such that it didn't have this problem, but it does, and that sucks, and Apple is at fault here. (Although I personally think it was blown out of proportion, which is neither here nor there.)

In the case of LLMs, the language of the prompt is the primary interface -- if you want to learn to use the tool better, you need to learn to prompt it better. You need to learn how to hold it better. Someone who knows how to prompt it well, reading the kind of prompts the author used, is well within their rights to point out that the author is prompting it wrong, and anyone attempting to subvert that entire line of argument with a trite little four-sentence bit of snark in whatever the total opposite of intellectual curiosity is deserves the downvotes they get.


Except this was posted because the situation is akin to the original context in which this phrase was said.

Initial postulate: you have a perfect tool that anybody can use and is completely magic.

Someone says: it does not work well.

Answer: it’s your fault, you’re using it wrong.

In that case it is not a perfect tool that anybody can use. It is just yet another tool, with its flaws and learning curve, that may or may not work depending on the problem at hand. And that's ok! It is definitely a valid answer. But the "it's magic" narrative has got to go.


>Initial postulate: you have a perfect tool that anybody can use and is completely magic.

>Someone says: it does not work well.

Why do we argue with two people that are both building strawmen? It doesn't accomplish much. We keep calling AI 'unintelligent', but people's eager willingness to make incorrect arguments does cast some doubt on humanity itself.


It's something of a thought terminating cliché in Hacker News discussions about large language models and agentic coding tools.

Needing the right scaffolding is the problem.

Today I asked 3 versions of Gemini “what were sales in December” with access to a SQL model of sales data.

All three ran `WHERE EXTRACT(MONTH FROM date) = 12` with no year (except 2.5 Flash, which sometimes gave me sales for Dec 2023).

No sane human would hear “sales from December” and sum up every December. But it got numbers that an uncritical eye would miss being wrong.

That's the type of logical error these models produce that bothers the author. They can be very poor at analysis in real-world situations because they do these things.
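
To illustrate the gap (table and column names are invented for the example; only the WHERE clause comes from the models' output), the careful version pins the year instead of summing every December:

    -- What the models generated: sums December of every year in the table.
    SELECT SUM(amount) FROM sales WHERE EXTRACT(MONTH FROM date) = 12;

    -- What was actually meant: one specific December (2024 here is an assumption).
    SELECT SUM(amount)
    FROM sales
    WHERE EXTRACT(YEAR FROM date) = 2024
      AND EXTRACT(MONTH FROM date) = 12;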


I'm referring to these kinds of articles as "Look Ma, I made the AI fail!"

Still, I would agree we need some of these articles when other parts of the internet are saying "AI can do everything, sign up for my coding agent for $200/month".

"They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed."

Isn't this the same thing? I mean this has to work with like regular people right?


I’ve seen some correlation between people who write clean and structured code, follow best practices and communicate well through naming and sparse comments, and how much they get out of LLM coding agents. Eloquence and depth of technical vocabulary seem to be a factor too.

Make of that what you will…


Having to prime it with more context and more guardrails seems to imply they're getting worse. That means there's less it can infer/intuit on its own.

No, they are not getting worse. Again, look at METR task times.

The peak capability is very obviously, and objectively, increasing.

The scaffolding you need to elicit top performance changes each generation. I feel it’s less scaffolding now to get good results. (Lots of the “scaffolding” these days is less “contrived AI prompt engineering” and more “well understood software engineering best practices”.)


Why the downvotes? This comment makes sense. If you need to write more guardrails, that does increase the work, and at some point the amount of guardrails needed to make these things work in every case would just be impractical. I personally don't want my codebase to be filled with babysitting instructions for code agents.
