
A better way to put it is with an example: I put my symptoms into ChatGPT and it gives some generic info behind a massive "not medical advice" disclaimer and refuses to give specific recommendations. My wife (an NP) puts in anonymized medical questions and gets highly specific, medical-terminology-heavy guidance.

That's all to say: the learning curve with LLMs is learning how to phrase things a specific way to reliably get the outcome you want.


> Picture MS Word where the GUI is just a page and a sidebar for telling an LLM what you want it to do.

Done. And it seems absolutely awful.

"Please bold the text I have selected" instead of a preexisting bold button.

Oh wait, I can just tell it all the tools I commonly use and where to put them... Hmm, top bar or sidebar? Wow, so much fun getting to make all these decisions!

OK, time to change fonts. "Please add a font picker so I can pick a font."


Hallucinations are not solved, memory is not solved, prompt injection is not solved, context limits are waaay too low while tokens are way too expensive to take advantage of them, etc. These problems have existed since the very early days of GPT-4, and there is no clear path to solving them any time soon.

You basically need AGI and we are nowhere close to AGI.


All of the issues you talk about are real. I don't personally care about AGI - it's kind of a mishmash of a real thing and a nice package for investors - but what I do care about is what has been released and what it can do.

All of the issues you mention: they aren't solved, but we've made amazing progress on all of them. Continual learning is a big one, and labs are likely close to some POCs.

Token cost per unit of performance is rapidly going down: GPT-4-level performance costs 10x less today than it did two years ago, and this will continue to be the case as we keep pushing efficiency up.

As for the AGI question ("are we close?"): tbh, to me these questions are just rabbit holes and bait for flame wars, because no one can agree on what AGI means, and even if you do (superhuman performance on all economically viable tasks is maybe a more solid starting point), everyone fights about the ecological validity of the evals.

All I'm saying is: taking coding in a complete vacuum, we're very, very close to the point where it becomes so obviously beneficial, and failure rates for many things fall below the critical thresholds, that automating even the things people say make engineers unique (working with people to navigate ambiguous issues they can't articulate well, making the right tradeoffs, etc.) starts looking like less of a research challenge and more of an exercise in deployment.


But millions of discussions are needed and will always be needed?

"Create a copy of Amazon.com"

OK, how did you want to handle 3PL fulfilment and international red tape?

"No not that complicated, a minimal copy"

How minimal? How many servers should I provision? How vertically integrated should we get?

Etc.

I really want to see someone build an app of any value with minimal decisions made.


Amazon is not one app, it's hundreds of them bundled into some giant monster.

You could easily replicate the store part of it minimally; at its core it's just an index of products, a basket, and a checkout system. There are other parts that make up the whole thing, of course.
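To make that concrete, here's roughly all the core is - a minimal sketch in Python (every name here is illustrative, and payments/fulfilment are hand-waved):

```python
from dataclasses import dataclass, field

@dataclass
class Product:
    sku: str
    name: str
    price_cents: int

@dataclass
class Basket:
    items: dict[str, int] = field(default_factory=dict)  # sku -> quantity

    def add(self, sku: str, qty: int = 1) -> None:
        self.items[sku] = self.items.get(sku, 0) + qty

def checkout(basket: Basket, index: dict[str, Product]) -> int:
    """Return the order total in cents. Payment capture, inventory,
    and fulfilment are exactly the 'other parts' of the whole thing."""
    return sum(index[sku].price_cents * qty for sku, qty in basket.items.items())
```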

There is a lot of room between no value and trillion-dollar company.


It would be great if LLMs did this (the relevant, and very pointed, follow-up questions). Instead, today they kind of just go "okay sure yeah here it is. Here's as much of Amazon.com as I can code within my token budget. Good luck drawing the rest of the owl."

This assumes an educated, passionate, and patient user, which 99% of people are not. They won't ask for a hammer - they will ask for a rock tied to a stick and get pissed off when it doesn't work like a hammer. They will ask for paint that doesn't drip. They will ask for electrical sockets they can install in their bathtub.

The turn-key option is ostris's ai-toolkit, which has good tutorials on YouTube and can be run completely locally or via RunPod. Claude Code can set everything up for you (speaking from experience) and can even SSH into RunPod.

I wouldn't say the US gov has a good track record regarding its "job" in healthcare.


Hell of a lot better than without the government though. Read more history.


In theory, yes. In practice, the shit data you are working with (descriptions that are one or two words, or the same word plus a ref id) really benefits from a) an agent that understands who you are and what you are likely spending money on, and b) tool calls to dig deeper into what `01-03 PAYPAL TX REFL6RHB6O` actually is by cross-referencing an export of PayPal transactions.

I think the smarter play is having an agent take the first crack at it and build up a high-confidence regex rule set. And then from there handle things that don't match, and do periodic spot checks to maintain the rule set.
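Roughly this shape, as a sketch (the rule contents and the agent call are hypothetical stand-ins for whatever your stack provides):

```python
import re

# High-confidence rules the agent proposed and a human spot-checked.
RULES: list[tuple[re.Pattern, str]] = [
    (re.compile(r"AMAZON|AMZN"), "shopping"),
    (re.compile(r"SHELL OIL|CHEVRON"), "fuel"),
]

def ask_agent(description: str) -> str:
    """Placeholder for the LLM fallback: this is where the agent uses
    tool calls (e.g. cross-referencing a PayPal export) to work out what
    '01-03 PAYPAL TX REFL6RHB6O' actually is, and proposes a new regex
    rule for the next spot check."""
    return "uncategorized"  # stub

def categorize(description: str) -> str:
    for pattern, category in RULES:
        if pattern.search(description):
            return category  # cheap, deterministic, auditable path
    return ask_agent(description)  # only the misses hit the agent
```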


> "...then from there handle things that don't match..."

Curious: what are the inputs for an agent handling your dataset? What can you feed the agent so it can later "learn" from your _manual labeling_?


The reality is we went from LLMs as chatbots editing a couple of files per request with decent results to running multiple coding agents in parallel, implementing major features based on a spec document and some clarifying questions - in a year.

Even IF LLMs don't get any better, there is a mountain of lemons left to squeeze in their current state.


I'm in Claude Code 30+ hrs/wk and always have at least three tabs of CC agents open in my terminal.

Agree with the other comments: pretty much running vanilla everything, with only the Playwright MCP (IMO way better than the native Chrome integration) and ccstatusline (for fun). Subagents can be as simple as saying "do X task(s) with subagent(s)". Skills are just self-@-ing markdown files.

Two of the most important things are 1) maintaining a short (<250 lines) CLAUDE.md and 2) having a /scratch directory where the agent can write one-off scripts to do whatever it needs to.
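For example, the scratch rule can be a couple of lines in CLAUDE.md (the wording here is hypothetical, not a prescribed format):

```markdown
## Scratch space
- Write one-off/debug scripts and temporary artifacts to /scratch (git-ignored); never the repo root.
- Check /scratch for an existing script before writing a new one.
```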


I also specifically instruct Claude how to use a globally git-ignored scratch folder, "tmp", in each repo. Curious what your approach is.


You store your project context in an ignored tmp folder? Share more plz - what does it look like? What do you store?


Not memory, I just instruct it to freely experiment with temporary scripts and artifacts in a specific folder.

This helps it organize the temporary things it does, like debugging scripts, and lets it (or me) reference/build on them later without filling the context window. Nothing fancy, just a bit of organization that collects in the repo (git-ignored).


How can you - or any human - review that much code?


When I'm coding I have about six instances of VSCode on the go at once, each with its own worktree and a terminal running a dangerous CC in Docker. Most of the time they are sitting waiting for me. Generally a few are doing spec work/reporting for me to understand something - sometimes with issue context; these are used to plan or to redirect my attention if I might've missed something. A few will be just hacking on issues with little to no oversight - I just want them to iterate tests+code+screenshots to come up with a way to do a thing / fix a thing; I'll likely not use the code they generate directly.

Then one or two are actually doing work that I'll end up PR'ing, or if I'm reviewing they'll be helping me do the review - either mechanically ("hey Claude, give me a script to launch n instances with a configuration that would show X ... ok, launch them ... ok, change to this ... grab X from the db ... etc.") or insight-based ("hey Claude, check issue X against code Y - does the code reflect their comments; look up the docs for A and compare to the usage in B, give me references").

I've TL'd and PM'd as well as IC'd. Now my IC work feels a lot more like a cross between being a TL and being a senior with a handful of exuberant and reasonably competent juniors. Lots of reviewing, but still having to get into the weeds quickly and then get out of their way.


>I've TL'd and PM'd as well as IC'd. Now my IC work feels a lot more like a cross between being a TL

Interesting... I've been in management for a few years now and recently doing some AI coding work. I've found my skills as a manager/TL are far more adaptable to getting the best out of AI agents than my skills as a coder.


Same, I was a very average dev coming out of CS, and a PM before this. I find that my product training has been more useful, especially with prototypes, but I do leave nearly all of the hard systems, infra, and backend work to my much, much more competent engineering teammates.


TBH I'm not building "production grade" apps depended on by hundreds of thousands of users - our clients want to get to a live MVP as fast as possible and love the ability to iterate quickly.

That said, it's well known that Anthropic uses CC for production. You just slow things down a bit, spend more time on the spec/planning stage, and manually approve each change. IMO the main hurdle to broader Claude Code adoption isn't code quality; it's mostly getting over the "that's not how I would have written it" mindset.


From personal experience, most of my time in Claude Code is spent experimenting, iterating, and refining approaches. The amount of code it produces relative to the time spent working with it tends to be pretty logarithmic in practice.


They don't - they just push garbage, someone else quickly looks over it (or asks another LLM to review it for them), and merges.

