visarga's comments | Hacker News

Talking about embarrassing bugs: Claude chat (both the web and iOS apps) has lately tended to lose the user message when there is a network error. This happens to me every day now. It is frustrating to retype a message from memory; the first time you are "in the flow", the second time it feels like unjust punishment.

With all the Claude Code in the world, how come they don't write good enough tests to catch UI bugs? I have reached the point where I preemptively copy the message to the clipboard to avoid retyping it.


This is an old bug. I can't believe they haven't fixed it yet. My compliments for the Claude frontend start and end at artifacts.

Ctrl+Z usually recovers the missing text, even across page refreshes.

A quick glance over the 200-LOC implementation - I see no error handling. This is the core of the agent loop: you need to pass some errors back to the LLM so it can adapt, while other errors should be handled by the code itself. There is also no model-specific code for structured decoding.

This article could make for a good interview problem: "what is missing and what would you improve about it?"
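
To make the error-handling point concrete, here is a minimal sketch of the distinction I mean; call_llm and run_tool are hypothetical stand-ins, not anything from the article. Errors the model can fix (bad JSON, an unknown tool) go back into the conversation, while errors it cannot fix are re-raised for the surrounding code to handle:

    import json

    def agent_loop(call_llm, run_tool, task, max_steps=10):
        # call_llm(messages) -> str and run_tool(name, args) -> str are
        # assumed interfaces, standing in for whatever the real loop uses
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_llm(messages)
            messages.append({"role": "assistant", "content": reply})
            try:
                action = json.loads(reply)  # structured decoding can fail
            except json.JSONDecodeError as e:
                # recoverable: feed the error back so the model can retry
                messages.append({"role": "user",
                                 "content": f"Invalid JSON ({e}); reply again with valid JSON."})
                continue
            if action.get("tool") is None:
                return action.get("answer")  # model signals it is done
            try:
                result = run_tool(action["tool"], action.get("args", {}))
            except KeyError as e:
                # recoverable: unknown tool, let the model pick another one
                messages.append({"role": "user", "content": f"Unknown tool: {e}"})
            except OSError:
                # not the model's problem: let the calling code handle it
                raise
            else:
                messages.append({"role": "user", "content": str(result)})
        raise RuntimeError("agent did not finish within max_steps")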


Do you think the SDD approach is fundamentally wrong, or that Amazon's implementation was at fault?

It sounds like the initial spec is wrong, which compounds over time.

With SDD, the spec should be really well thought out: considered, direct, and clear.


I had an idea - take SerpAPI, save the top 10 or 20 links for many queries (millions of them), and put that in a RAG database. Then it can power a local LLM doing web search without ever touching Google.

The index would just point a local crawler towards hubs of resources, links, feeds, and specialized search engines. Fresh information would then come from the crawler itself. My thinking is that reputable sites don't appear every day; updating your local index once every few months is sufficient.

The index could host 1-10 million or even 100 million stubs, each one touching on a different topic and concentrating the best entry points on the web for that topic. A local LLM can RAG-search it and use an agent to crawl onward from there. If you solve search this way, without Google, and you also have a local code execution sandbox and a local model, you can cut the cord. Search was the missing ingredient.

You can still call regular search engines for discovery. You can build your personalized cache of search stubs using regular LLMs that have search integration, like ChatGPT and Gemini; you only need to do it once per topic.
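
A rough sketch of what such a stub index could look like, using SQLite FTS5 for the keyword side (the schema and helper names are just mine for illustration, and this assumes your SQLite build includes FTS5, which the stock Python one usually does):

    import json
    import sqlite3

    db = sqlite3.connect("stubs.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS stubs "
               "USING fts5(topic, summary, links)")

    def add_stub(topic, summary, links):
        # links = the top 10-20 URLs saved from SerpAPI (or an LLM with search)
        db.execute("INSERT INTO stubs VALUES (?, ?, ?)",
                   (topic, summary, json.dumps(links)))
        db.commit()

    def entry_points(query, k=3):
        # the local LLM retrieves matching stubs, then its crawler takes over
        rows = db.execute("SELECT topic, links FROM stubs WHERE stubs MATCH ? LIMIT ?",
                          (query, k)).fetchall()
        return [(topic, json.loads(links)) for topic, links in rows]

    add_stub("rust async runtimes",
             "Hub pages and docs for tokio and async-std",
             ["https://tokio.rs", "https://docs.rs/async-std"])
    print(entry_points("rust async"))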


Fetching web pages at the volume needed to keep the index fresh is a problem unless you're Googlebot. It requires manual intervention: whitelisting yourself with the likes of Cloudflare, cutting deals with the likes of Reddit, and building a good reputation with any other bot-blocking software that's unfamiliar with your user agent. Even then, you may still find yourself blocked from critical pieces of information.

No, I think we can get by with CommonCrawl, pulling the fresh content every few months and updating the search stubs. The idea is that you don't change the entry points often; you open them up when you need to get fresh content.
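
For the refresh step, Common Crawl exposes a CDX index that can be queried per URL; a minimal sketch (the crawl ID below is only an example, the current one is listed on index.commoncrawl.org):

    import json
    import urllib.parse
    import urllib.request

    CDX = "https://index.commoncrawl.org/CC-MAIN-2024-51-index"

    def captures(url, limit=5):
        query = urllib.parse.urlencode({"url": url, "output": "json", "limit": limit})
        with urllib.request.urlopen(f"{CDX}?{query}") as resp:
            return [json.loads(line) for line in resp.read().splitlines()]

    # each record points into a WARC file (filename, offset, length),
    # so the page body can later be fetched from Common Crawl's public bucket
    for record in captures("tokio.rs"):
        print(record["timestamp"], record["url"])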

Imagine this stack: local LLM, local search stub index, and local code execution sandbox - a sovereign stack. You can get some privacy and independence back.


CC is not on the same scale as Google and not nearly as fresh. It's around a hundredth of the size, and there's not much chance of it having recent versions of a page.

I imagine you'd get on just fine for short-tail queries, but the other cases (longer tail, recent queries, things that haven't been crawled) begin to add up.


> Anyone who understands how easy it is to copy bits should know that the original intent of copyright can't work anymore.

AI makes this even more acute. You cannot protect the "vibe" of your works; AI can replicate it in seconds. If you make "vibe infringement" the new rule, then creativity becomes legally risky. A catch-22.

In 1930, Judge Hand said, in relation to Nichols v. Universal Pictures:

> Upon any work...a great number of patterns of increasing generality will fit equally well. At the one end is the most concrete possible expression...at the other, a title...Nobody has ever been able to fix that boundary, and nobody ever can...As respects plays, plagiarism may be found in the 'sequence of events'...these trivial points of expression come to be included.

And since then, a litany of judges and tests has expanded the notion of infringement towards vibes and away from expression:

- Hand's Abstractions / The "Patterns" Test (Nichols v. Universal Pictures)

- Total Concept and Feel (Roth Greeting Cards v. United Card Co.)

- The Krofft Test / Extrinsic and Intrinsic Analysis

- Structure, Sequence, and Organization (Whelan Associates v. Jaslow Dental Laboratory)

- Abstraction-Filtration-Comparison (AFC) Test (Computer Associates v. Altai)

The trend has been to make infringement more and more abstract over time, but this makes testing for it an impossible burden. How do you ensure you are not infringing a protected abstraction, at any level, in any prior work? Due diligence has become too difficult.


Why train to pedal fast when we already got motorcycles? You are preparing for yesterday's needs. There will never be a time when we need to solve this manually like it's 2019. Even in 2019 we would probably have used Google; solving was already based on extensive web resources. In 1995, though, you really would have needed to do it manually.

Instead of manual coding training, your time is better invested in learning to channel coding agents: how to test code to our satisfaction, how to know if what AI did was any good. That is what we need to train for. Testing without manual review, because manual review is just vibes, while tests are hard. If we treat AI-generated code like human code that requires a line-by-line peer review, we are just walking the motorcycle.

How do we automate our human-in-the-loop vibe reactions?


> Why train to pedal fast when we already got motorcycles? You are preparing for yesterday's needs.

This is funny in the sense that, in a properly built urban environment, bicycles are one of the best ways to add some physical activity to a time-constrained schedule, as we're discovering.


> Instead of manual coding training your time is better invested in learning to channel coding agents

All channelling is broken when the model is updated. Being knowledgeable about the foibles of a particular model release is a waste of time.

> how to test code to our satisfaction

Sure, testing has value.

> how to know if what AI did was any good

This is what code review is for.

> Testing without manual review, because manual review is just vibes

Calling manual review vibes is utterly ridiculous. It's not vibes to point out an O(n!) structure. It's not vibes to point out missing cases.

If your code reviews are 'vibes', you're bad at code review.

> If we treat AI-generated code like human code that requires a line-by-line peer review, we are just walking the motorcycle.

To fix the analogy: you're not reviewing the motorcycle, you're reviewing the motorcycle's behaviour during the lap.


> This is what code review is for.

My point is that visual inspection of code is just "vibe testing", and you can't reproduce it. Even you yourself, six months later, can't fully repeat the "LGTM" vibe check. That is why the proper form is a code test.


Yes and no.

Yes, I reckon coding is dead.

No, that doesn't mean there's nothing to learn.

People like to make comparisons to calculators rendering mental arithmetic obsolete, so here's an anecdote: in my first year of university, I went to a local store and picked up three items each costing less than £1, and the cashier rang up a total of more than £3 (I'd calculated the exact total and pre-prepared the change before reaching the head of the queue, but the exact price of 3 items isn't important enough to remember 20+ years later). The till itself was undoubtedly perfectly executing whatever maths it had been given; I assume the cashier mistyped or double-scanned. As I said, I had the exact total; the fact that I had to explain "three items costing less than £1 each cannot add up to more than £3" to the cashier shows that even this trivial level of mental arithmetic is not universal.

I now code with LLMs. They are so much faster than doing it by hand. But if I didn't already have experience of code review, I'd be limited to vibe-coding (by the original definition, not even checking). I've experimented with that to see what the result is, and the result is technical debt building up. I know what to do about that because of my experience with it in the past, and I can guide the LLM through that process, but if I didn't have that experience, the LLM would pile up more and more technical debt and grind the metaphorical motorbike's metaphorical wheels into the metaphorical mud.


> But if I didn't already have experience of code review, I'd be limited to vibe-coding (by the original definition, not even checking).

Code review done visually is "just vibe testing" in my book. It is not something you can reproduce; it depends on the context in your head at this moment. So we need actual code tests. Relying on "Looks Good To Me" is hand-waving, code-smell-level testing.

We are discussing vibe coding, but the problem is actually vibe testing. You don't even need to be in the AI age to vibe test; it's how we always did it when manually reviewing code. And in this age it means "walking your motorcycle" speed; we need to automate this with more extensive code tests.
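
A small example of what I mean by making the check reproducible: instead of a reviewer noting "this should keep the first occurrence and preserve order" and approving, that judgement becomes a test that re-runs on every future change (dedupe here is just a made-up function under review):

    def dedupe(items):
        # keep the first occurrence of each item, preserving order
        seen, out = set(), []
        for x in items:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    def test_dedupe_keeps_first_occurrence_and_order():
        assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]

    def test_dedupe_handles_empty_input():
        assert dedupe([]) == []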


I agree that actual tests are also necessary, that code review is not enough by itself. As LLMs can also write tests, I think getting as close as is sane to 100% code coverage is almost the first thing people should be doing with LLM assistance (and also, "as close as is sane": make sure that it really is a question of "I thought carefully and have good reason why there's no point testing this" rather than "I'm done writing test code, I'm sure it's fine to not test this", because LLMs are just that cheap).

However, code review can spot things like "this is O(n^2) when it could be O(n•log(n))", or "you're doing a server round trip for each item instead of parallelising them" etc.
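
To make that concrete: here are two functions that pass the same correctness tests, where only a review (human or LLM) is likely to flag the quadratic one:

    def has_duplicates_quadratic(items):
        # O(n^2): compares every pair
        return any(items[i] == items[j]
                   for i in range(len(items))
                   for j in range(i + 1, len(items)))

    def has_duplicates_linear(items):
        # O(n): a set does the same job in one pass (for hashable items)
        return len(set(items)) != len(items)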

You can also ask an LLM for a code review. They're fast and cheap, and whatever the LLM catches is something you get without having to waste a coworker's time. But LLMs have blind spots, and more importantly all LLMs (being trained on roughly the same stuff in roughly the same way) have roughly the same blind spots, whereas human blind spots are less correlated and expand coverage.

And code smells are still relevant for LLMs. You do want to make sure they're e.g. using a centralised UI style system and not copy-pasting style into each widget, because duplication wastes tokens and is harder to correctly update with LLMs for much the same reason it is with humans: stuff gets missed during the process when it's copypasta.


I am personally working on formalizing the design stage as well, the core concepts being Architecture, Goal, Solution, and Implementation. That would make something like the complexity of an algorithm an explicit decision in a graph. It would make constraints and dependencies explicitly formalized. You can trace any code back to its solution (design stage) and goals, account for everything top-down and bottom-up, and assign tests to all nodes.

Take a look here: https://github.com/horiacristescu/archlib/blob/main/examples... (but it's still WIP, I am not there yet)
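
To give a flavour of it, here is a toy sketch of the node graph; the names and fields are illustrative only, and the actual archlib format is still changing:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        kind: str                # "Architecture" | "Goal" | "Solution" | "Implementation"
        name: str
        rationale: str = ""
        depends_on: list = field(default_factory=list)  # upstream node names
        tests: list = field(default_factory=list)       # test ids proving this node

    graph = {
        "fast-lookup": Node("Goal", "fast-lookup",
                            rationale="lookups must stay sub-millisecond"),
        "hash-index": Node("Solution", "hash-index",
                           rationale="O(1) expected lookup, chosen over a B-tree",
                           depends_on=["fast-lookup"],
                           tests=["test_lookup_latency"]),
        "index.py": Node("Implementation", "index.py",
                         depends_on=["hash-index"],
                         tests=["test_lookup_latency", "test_collisions"]),
    }

    def trace(name):
        # walk bottom-up from an implementation node to the goals it serves
        node = graph[name]
        return [name] + [n for dep in node.depends_on for n in trace(dep)]

    print(trace("index.py"))  # ['index.py', 'hash-index', 'fast-lookup']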


Everyone is so fixated on the output as the commodity, whether it's a blog post or a piece of code, that they fail to see the interaction itself as the locus of value. You can still do your rewarding work in a chat session; it can force you to think and challenge your ideas, and if you introduce your own spices into the soup it won't taste like slop. I like to explain my ideas until the LLM "gets it" and then ask it to "formalize" them in a nice piece of text, which I consume later as a meditation to deepen my thinking. I can't stand passive media anymore; I need to be able to push back to feel satisfied, and that is only possible on forums and in AI chats.

Not all AI-generated outputs are slop; usually it's the low-effort prompts that create slop. When you bring in external data or extensive human curation, it is almost certainly not slop. I think many people put all AI outputs in the slop bucket, but this is unfair to those who put a lot of thinking into their AI interactions. Slop is not determined by the LLM, but by the human effort associated with the task. For code, it is the quality of the testing framework that sets the bar.

I'd be happy if the original expression remained protected but abstractions were open to reuse. Right now a work blocks more than its specific expression; a whole space around it is forbidden, like fan fiction. In other words, what is needed is the freedom to build on top of existing works.

In the present moment, time passes slowly when something boring or painful is happening, and quickly during exciting or pleasant experiences.

But in hindsight, time seems to have passed slowly in periods when you had many new experiences, and is almost missing from periods of routine.

So an exciting experience might be fast in the moment and slow in hindsight.

