Hacker News | visarga's comments

Thank you for posting the project. I was actively looking for a solution and even vibe-coded a throwaway one. One question: how do you pass the credentials to agents inside the cage? I would be interested in a way to use not just Claude Code but also Codex CLI and other coding agents inside. Considering the many subscription types and storage locations credentials can have (as with Claude), it can get complicated.

Of course, the question comes up because we always run short on tokens and have to dance around many providers.


The credentials have been a PITA. I was working on a PR this morning before work; I should have it done tonight. You have to be careful, because if you look like you're spoofing the client, you can get banned.

For Claude specifically, there are two places where it tracks state:

~/.claude.json -- contains a bunch of identity stuff and something about OAuth

~/.claude/ -- also contains something about OAuth, plus conversation history, etc.

If they're not _both_ present and well-formed, then it forces you back through the auth flow. On an ordinary desktop setup, that's transparent. But if you want to sandbox each thread, then sharing just the token requires a level of involvement that feels icky, even if the purpose is TOS-compliant.
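
For what it's worth, here is a minimal sketch of seeding a fresh sandbox home with both pieces of state, assuming the two locations above; the /home/agent mount point and the agent-cage image name are made up, and this is an illustration rather than a documented workflow:

    import shutil
    import subprocess
    import tempfile
    from pathlib import Path

    def seed_sandbox_home() -> Path:
        """Copy the host's Claude state into a throwaway home directory."""
        home = Path(tempfile.mkdtemp(prefix="agent-home-"))
        host = Path.home()
        # Both pieces must be present and well-formed, or Claude forces
        # you back through the auth flow.
        shutil.copy(host / ".claude.json", home / ".claude.json")
        shutil.copytree(host / ".claude", home / ".claude")
        return home

    sandbox_home = seed_sandbox_home()
    subprocess.run([
        "docker", "run", "--rm", "-it",
        "-v", f"{sandbox_home}:/home/agent",  # hypothetical home path in the image
        "agent-cage",                         # hypothetical image name
    ], check=True)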


Codex has auth.json. Claude uses credentials.json on Linux and the Keychain on macOS. Because of this, I prefer to just use a long-lived token for Claude.

I have my own Docker image for a similar purpose, covering multiple agent providers. It works great so far.
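
A rough sketch of the long-lived-token approach, assuming the token is exported as ANTHROPIC_API_KEY on the host; the image name is a stand-in:

    import os
    import subprocess

    # Pass the long-lived token via an environment variable instead of
    # sharing OAuth state files with the container.
    subprocess.run([
        "docker", "run", "--rm", "-it",
        "-e", f"ANTHROPIC_API_KEY={os.environ['ANTHROPIC_API_KEY']}",
        "multi-agent-image",  # hypothetical image name
    ], check=True)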


> it's for the same reason that construction workers wear hard hats in environments that will eventually be safe for children.

Good response, but more practically: while you are developing a project you allow the agent to do many things on that VM, but when you deliver code it has to actually pass tests. The agent's work in progress is not tested live, but the delivered code is tested before use. I think tests are the core of the new agent-engineering skill -- if you have good tests, you have automated your human-in-the-loop work to a large degree. You can only trust code up to the level of its testing. LGTM is just vibes.
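
As a trivial illustration of that gate (the pytest suite and the output directory here are made up, not specific to any agent):

    import subprocess

    def accept_delivery(workdir: str) -> bool:
        """Gate the agent's delivered code on the test suite passing."""
        result = subprocess.run(["pytest", "-q"], cwd=workdir)
        return result.returncode == 0

    if accept_delivery("./agent-output"):
        print("tests pass: deliverable accepted")
    else:
        print("tests fail: send it back to the agent")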


This is GPT, right?

There are grammatical mistakes and abbreviations, big tells that it's NOT ChatGPT.

I had a conversation (prompts) with Claude about this article because I didn't feel I could describe my point as succinctly on my own.

> The judge always has to be you.

But you can automate much of that work by having good tests. Why vibe-test AI code when you can code-test it? Spend your extra time thinking about how to make testing even better.


> But how much will it cost to maintain those features in the future?

Very little if they have good specs and tests.


I see a parallel to how Google Search created incentives for SEO and social network feeds created incentives for attention-grabbing slop: platforms optimizing for their own interests at the expense of both upstream and downstream.

Is there any platform that does not use these dark patterns? I hope the agent era will allow users to bypass the crappy search responses and the slop on feeds. But by the looks of it, OpenAI is moving in the same conflict-of-interest direction with respect to its users.


Of course they are. It was obvious from day one that ads were going to be shoved in. At first they’ll be obvious and clearly separated, then they’ll influence the responses you get without you even knowing. I can’t fathom why anyone ever believed that wouldn’t be the case.

Not a full platform, but F-Droid is the only app store I know that feels customer-first and isn't predatory.

The problem is when you use your "copy" as inspiration and actually create and publish something. It is very hard to be certain you are safe: besides literal expression, close paraphrasing is also infringing, as is using world-building elements or any original abstraction (the AFC test). You can only know for sure after a lawsuit.

It is impossible to tell how much AI any creator used in secret, so now all works are under suspicion. If copyright maximalists successfully copyright style (vibes), then creativity will be threatened. If they don't succeed, then copyright protection will be meaningless. A catch-22.


> close paraphrasing is also infringing, as is using world-building elements or any original abstraction (the AFC test)

World-building elements? Do you have more details on that? Because that feels wrong to me.

Unless you mean the specific names of things in the world like "Hobbits".


Well said; I have been saying the same. Besides helping agents code, it helps us trust the outcome more. You can't trust code that isn't tested, and you can't read every line of code; it would be like walking a motorcycle. So tests (back pressure, deterministic feedback) become essential. You only know something works as well as its tests show.

What we often like to do in a PR -- look over the code and say "LGTM" -- I call "vibe testing", and I think it is the really bad pattern to use with AI. You can't commit your eyes to the git repo, and you are probably not doing as good a job as when you have actual test coverage. LGTM is just vibes. Automating tests removes manual work from you too; it doesn't just make the agent more reliable.

But my metaphor for tests is that they are the "skin" of the agent: they allow it to feel pain. The docs/specs are the "bones": they give it structure. The agent itself is the muscle and cerebellum, and the human in the loop is the PFC.


For anyone else who briefly got very lost at PFC, probably "prefrontal cortex".

The "pattern matching" perspective is true if you zoom in close enough, just like "protein reactions in water" is true for brains. But if you zoom out you see both humans and LLMs interact with external environments which provide opportunity for novel exploration. The true source of originality is not inside but in the environment. Making it be all about the model inside is a mistake, what matters more than the model is the data loop and solution space being explored.

But the trend line is less ambiguous: models got better year over year, much, much better.

I don't dispute that the situation is rapidly evolving. It is certainly possible that we could achieve AGI in the near future. It is also entirely possible that we might not. Claims that AGI is close or that we will soon be replacing developers entirely are pure hype.

When someone says something to the effect of "LLMs are on the verge of replacing developers any day now" it is perfectly reasonable to respond "I tried it and it came up with crap". If we were actually near that point you wouldn't have gotten crap back when you tried it for yourself.


There's a big difference between "I tried it and it produced crap" and "it will replace developers entirely any day now"

People who use this stuff every day know that people who are still saying "I tried it and it produced crap" just don't know how to use it correctly. Those developers WILL get replaced - by ones who know how to use the tool.


> Those developers WILL get replaced - by ones who know how to use the tool.

Now _that_ I would believe. But note how different "those who fail to adapt to this new tool will be replaced" is from "the vast majority will be replaced by this tool itself".

If someone had said that six (give or take) months ago, I would have dismissed it as hype. But there have been at least a few decently well-documented AI-assisted projects by veteran developers that have made the front page recently. Importantly, they've shown clear and undeniable results as opposed to handwaving and empty aspirations. They've also been up front about the shortcomings of the new tool.


You probably mean antirez porting Flux to C. There were not too many shortcomings in his breakdown; the biggest one, as I saw it, was that his knowledge and experience building large C programs really were a requirement. But given one of these experts, don't you see how that person plus Claude Code just replaces a team? The less capable people on the team cannot do what he does, so before, they were just entering code and getting corrected in reviews or asking for help. Now the AI can do that, but on 10 projects in parallel. In a weekend you won't have time for that, but not everything has to be done in a weekend.
