
This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR.

The implied future here is _unreal cool_. Swarms of coding agents that can build anything, with little oversight. Long-running projects that converge on high-quality, complex results.

But the examples feel thin. Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets. The closest to real code is what they've done with Cursor's codebase .... but it's not merged yet.

I don't want to say "call me when it's merged." But I'm not worried about agents' ability to produce millions of lines of code. I'm worried about their ability to intersect with the humans in the real world, both as users of that code and as developers who want to build on top of it.


> This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR.

I would go even further: why have they not created at least one less complex project that is working and ready to be checked out? To me it sounds like dangling a carrot in front of VC investors: 'Look, we are almost there to replace legions of software developers! Imagine the market size and potential cost reductions for companies.'

LLMs are definitely an exciting new tool and they are going to change a lot. But are they worth billions for everything stamped 'AI'? The future will tell. Looking back, the dotcom boom hype felt exactly the same.

The difference with the dotcom boom is that at the time there was a lot more optimism about building a better future. The AI gold rush seems to be focused on getting giga-rich while fscking over the greater part of humanity.


>> why haven't they merged that PR.

Because it is absolutely impossible to review that code, and there are a gazillion issues in it.

The only way it can get merged is to YOLO it and then fix issues in prod for months, which kinda defeats the purpose and brings the gains close to zero.


On the other hand, finding and fixing issues for months is still training data.

> Long-running projects that converge on high-quality, complex results

In my experience agents don't converge on anything. They diverge into low-quality monstrosities which at some point become entirely unusable.


Yeah, I don't think they're built for that either; you need a human to steer the "convergence", otherwise they indeed end up building monstrosities.

> Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets.

There are just a bit over three browsers, one serious Excel-like, and a small part of Windows' user-facing side. That's really not enough training data for replicating those specific tasks.


> Long-running projects that converge

This is how I think about it. I care about asymptotics. What initial conditions (model(s) x workflow/harness x input text artefacts) cause convergence to the best steady state? The number of lines of code doesn't have to grow, it could also shrink. It's about the best output.


Pretty much everything exists in the training sets. All non-research software is just a mishmash of various standard modules and algorithms.

Not everything, only code-bases of existing (open-source?) applications.

But what would be the point of re-creating existing applications? It would be useful if you can produce a better version of those applications. But the point in this experiment was to produce something "from scratch" I think. Impressive yes, but is it useful?

A more practically useful task would be for Mozilla Foundation and others to ask AI to fix all bugs in their application(s). And perhaps they are trying to do that, let's wait and see.


You have to be careful which codebase to try this on. I have a feeling that if someone unleashed agents on the Linux kernel to fix bugs, it'd lead to a ban on agents there.

Re-creating closed source applications as open source would have a clear benefit because people could use those applications in a bunch of new ways. (implied: same quality bar)

I'm interested in why Claude loses its mind here,

but also, getting shut down for safety reasons seems entirely foreseeable when the initial request is "how do I make a bomb?"


That wasn't the request; that's how Claude understood the Armenian when it short-circuited.

Does Google also not handle this well?

Copy-pasted from the chat: https://www.google.com/search?q=translate+%D5%AB%D5%B6%D5%B9...


There's something about this that's unsatisfying to me. Like it's just a trivia trick.

My first read of this was "this seems impossible." You're asked to move bits around without any working space, because you're not allowed to allocate memory. I guess you could interpret this pedantically in C/C++ land and decide that they mean no additional usage of the heap, so there are other places (registers, stack, etc.) to store bits. The title is "in constant memory," so I guess I'm allowed some constant memory, which is vaguely at odds with "can you do this without allocating additional memory?" in the text.

But even with that constraint ... std::rotate allocates memory! It'll throw std::bad_alloc when it can't. It's not using it for the core algorithm (... which only puts values on the heap ... which I guess is not memory ...), but that function can 100% allocate new memory in the right conditions.

It's cool you can do this simply with a couple rotates, but it feels like a party trick.
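
For anyone curious, the rotate itself really is tiny: the classic constant-memory version is the three-reversal trick that Programming Pearls describes. A rough sketch (my own illustration, not the article's actual code):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Rotate [first, last) left by (mid - first) positions using three
    // in-place reversals. std::reverse swaps elements in place and never
    // allocates, so this genuinely uses only constant extra memory.
    template <typename It>
    void rotate_by_reversal(It first, It mid, It last) {
        std::reverse(first, mid);   // reverse the left block
        std::reverse(mid, last);    // reverse the right block
        std::reverse(first, last);  // reverse the whole range
    }

    int main() {
        std::vector<int> v{1, 2, 3, 4, 5, 6, 7};
        rotate_by_reversal(v.begin(), v.begin() + 3, v.end());
        for (int x : v) std::printf("%d ", x);  // prints: 4 5 6 7 1 2 3
        std::printf("\n");
    }

Each element gets moved at most twice (once per reversal it takes part in), and no temporary buffer proportional to the block sizes is ever needed.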


To be fair, it originates from a time when memory was tighter. It's discussed with some motivating text in Programming Pearls. I can't remember the context, but I think it was in a text editor. I can look it up, if folks want some of that context here.


I did something similar back in the day to support block-move for an editor running on a memory constrained 8-bit micro (BBC Micro). It had to be done in-place since there was no guarantee you'd have enough spare memory to use a temporary buffer, and also more efficient to move each byte once rather than twice (in/out of temp buffer).


Also useful for cache locality, a more recent trend. But I guess that's just another slightly different case of tight memory, this time in the cache rather than RAM generally.


The problem seems less arbitrary if the chunks being rotated are large enough. Implicit in the problem is that any method that would require additional memory to be allocated would probably require memory proportional to the sizes of stuff being swapped. That could be unmanageable.

As for whether std::rotate() uses allocations, I can't say without looking. But I know it could be implemented without allocations. Maybe it's optimal in practice to use extra space. I don't think a method involving reversal of items is generally going to be the fastest. It might be the only practical one in some cases or else better for other reasons.


No - std::rotate is just doing this with in-place swaps.

Say you have "A1 A2 B1" and want to rotate (swap) adjacent blocks A1-A2 and B1, where WLOG the smaller of these is B1, and A1 is the same size as B1.

What you do is first swap B1 with A1 (putting B1 into its final place).

B1 A2 A1

Now recurse to swap A2 and A1, giving the final result:

B1 A1 A2

Swapping same-size blocks (which is what this algorithm always chooses to do) is easy since you can just iterate through both, swapping corresponding pairs of elements. Each block only gets moved once since it gets put into its final place.
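
A minimal sketch of that recursion, using nothing but element-wise swaps (an illustration of the scheme above, assuming random-access iterators; not any particular standard-library implementation):

    #include <algorithm>
    #include <cstdio>
    #include <iterator>
    #include <vector>

    // Swap adjacent blocks [first, mid) and [mid, last) in place: swap the
    // smaller block into its final position (exchanging it with the
    // equal-sized piece of the larger block sitting there now), then recurse
    // on what's left. Only element-wise swaps, so nothing is allocated.
    template <typename It>
    void swap_adjacent_blocks(It first, It mid, It last) {
        auto left = std::distance(first, mid);
        auto right = std::distance(mid, last);
        if (left == 0 || right == 0) return;
        if (left == right) {
            std::swap_ranges(first, mid, mid);  // same size: one pass of swaps
        } else if (left > right) {
            // "Swap B1 with A1": the right block lands in its final place.
            std::swap_ranges(mid, last, first);
            // "Now recurse to swap A2 and A1."
            swap_adjacent_blocks(first + right, mid, last);
        } else {
            // Mirror image: the left block lands in its final place.
            std::swap_ranges(first, mid, last - left);
            swap_adjacent_blocks(first, mid, last - left);
        }
    }

    int main() {
        std::vector<int> v{1, 2, 3, 4, 5, 6, 7};   // A = {1..5}, B = {6, 7}
        swap_adjacent_blocks(v.begin(), v.begin() + 5, v.end());
        for (int x : v) std::printf("%d ", x);     // prints: 6 7 1 2 3 4 5
        std::printf("\n");
    }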


You are thinking of std::swap; std::rotate does throw bad_alloc.


I see it says that it may throw bad_alloc, but it's not clear why, since the algorithm itself (e.g. see "Possible implementation" below) can easily be done in-place.

https://en.cppreference.com/w/cpp/algorithm/rotate.html

I'm wondering if the bad_alloc might be because a single temporary element (of whatever type the iterators point to) is going to be needed to swap each pair of elements, or maybe to allow for an inefficient implementation that chose not to do it in-place?


> But even with that constraint ... std::rotate allocates memory! It'll throw std::bad_alloc when it can't.

This feels kinda crazy. Is there a reason why this is the case?


That's only for the parallel overload. The ordinary sequential overload doesn't allocate: the only three ordinary STL algorithms that allocate are stable_sort, stable_partition, and (ironically) inplace_merge.


They mention this at the end of the article:

> And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.


The verification asymmetry framing is good, but I think it undersells the organizational piece.

Daniel works because someone built the regime he operates in. Platform teams standardized the patterns and defined what "correct" looks like and built test infrastructure that makes spot-checking meaningful and and and .... that's not free.

Product teams are about to pour a lot more slop into your codebase. That's good! Shipping fast and messy is how products get built. But someone has to build the container that makes slop safe, and have levers to tighten things when context changes.

The hard part is you don't know ahead of time which slop will hurt you. Nobody cares if product teams use deprecated React patterns. Until you're doing a migration and those patterns are blocking 200 files. Then you care a lot.

You (or rather, platform teams) need a way to say "this matters now" and make it real. There's a lot of verification that's broadly true everywhere, but there's also a lot of company-scoped or even team-scoped definitions of "correct."

(Disclosure: we're working on this at tern.sh, with migrations as the forcing function. There are a lot of surprises in migrations, so we're starting there, but eventually this notion of "organizational validation" is a big piece of what we're driving at.)


This is a really good article. Don't get caught up in the tone of "anti-politics" or "slow is good." It's describing a brand of politics and impact that is just as mercurial as product development if you do it wrong. Infra and DevEx behave fundamentally differently, and it can be a really great path if it suits your personality.

For context: my last job was PM for the infra team at Slack. I did it for 5 years. I didn't learn about Slack's product launch process until year 4. Everything until that point was internal work, on our k8s/service mesh and DB infrastructure.

The important insight here is about customer success and shadow management. Every engineer's success (and my own) derived from figuring out what product engineers needed and delivering it. The "Shadow Hierarchy" feedback was make-or-break for those promotions. It's _hard_ to optimize for that, because you need to seek that feedback out, understand whether addressing it will actually fix the problem, and deliver it quickly enough to matter in the product org.

If you're willing to optimize for that internal success, you'll be rewarded, both in your career and in stability in the organization. I disagree this is only at Big Tech -- companies as small as 100 engineers have real and strong cultures in the right team, under the right manager.

But don't think this is some magical cheat code to ignoring what's important to the business. It's just a different, perhaps more palatable, route to managing the alignment and politics that are a necessary part of growth at any company.


I've seen the same dynamics play out at mid-sized companies.


I agree with the author that these are different career paths, but I think your take is better and more nuanced. I've done both. I started my career being a spotlight engineer, quickly advancing the ranks and being sent on an educational path for leadership. I didn't like it though. Don't get me wrong. I basked in the glory for the first year, but then when it got to be just a regular day, I really started to miss solving what I see as interesting problems. Which is often working on stability and safety, building tools and infrastructure that make it easier for other people to serve the business.

Maybe it's because I've done the spotlight thing, but I don't really care about praise from management anymore. In fact I suspect that most of my direct managers have never really known what I did, but since they've seen the impact for other people, it's not really been an issue for my reviews or pay. I don't get a lot of credit and I don't get a lot of praise. I don't necessarily see someone who's built a great product with the tools I've made as stealing my credit either. That's sort of the point of what I do.

I completely agree with you that this is not just Big Tech, or Enterprise or even in organisations that are focused on Tech. I also agree that it's not about ignoring the business, you're still going to want to build things that are useful. You're still going to do change management from the ground up to make sure people know the tools are useful and how to use them. You're still going to network and be friendly with your co-workers.

What you can skip is a lot of the corporate politics and frankly most of the "financial" information. I don't even think the price is very high: you don't get the publicity, but it's not like spotlight engineers necessarily get better pay or better career paths unless they want to go into management.


> the gap to GPT-4o, Gemini 2 ... is shrinking fast

Are you ... aware that OpenAI and Google have launched more recent models?


Almost every single AI doomer I listen to hasn't updated any of their priors in the last 2 years. These people are completely unaware of what is actually happening at the frontier or how much progress has been made.


Their ignorance is your opportunity.


That jumped out at me too. Like a time-traveling comment or something!


Like "someone" who's knowledge cutoff is from a while back...


This is what happens when someone copies and pastes their old comment; note the other tells.


Nope, more like an LLM that doesn't know about GPT-5 and Gemini 3.


Even the punctuation marks suggest this is an LLM.


Between this and http://myticker.com (posted recently), I want to share a theory of mine:

1) the internet is mostly made up of spaces where the median opinion is vanishingly rare among actual humans.

2) the median internet opinion is that of a person who is deep into the topic they're writing about.

The net result is that for most topics, you will feel moderate to severe anxiety about being "behind" on what you should be doing.

I'm 40, and I'm active. I ran a half marathon last weekend. I spent 5 hours climbing with my kids this weekend. My reaction to these articles, emotionally, was "I'm probably going to die of heart disease," because my cholesterol is a bit high and my BMI is 30. When I was biking 90 miles a week, my VO2 max was "sub-standard."

Let's assume this information is true. That's OK. It's all dialed up to 11, and you don't have to do anything about it right now.


Obesity is an independent risk factor, even if otherwise active/healthy. It's worth getting under control for lifespan and healthspan.


Across the population as a whole, BMI 30 is basically a negligible increase in all-cause mortality. For someone otherwise reasonably active, I wouldn't stress about the number. Ideal is somewhere around 27.


BMI is useful for screening purposes but on an individual basis it's meaningless as a predictor of all-cause mortality. What really matters is body composition, or more specifically amount of visceral fat (subcutaneous fat doesn't matter nearly as much).

https://my.clevelandclinic.org/health/diseases/24147-viscera...


Where are you getting this number? Over 27% body fat is a health risk. For an active but not muscular individual, 30 BMI is at least 33% body fat, likely higher.


From the multitude of studies, in particular the relatively recent huge multi-study analysis of all-cause mortality vs BMI.


BMI over 25 is overweight. Obese is 30 or higher.


Yes, those are the definitions we have assigned to the number. However, independent of the arbitrary labels, the actual impact on health matters more to me.


BMI doesn't mean much on its own, though.

Size, body composition, and ethnicity will give very different meanings to the same BMI.


Sure. I lost 20lbs in the last 12 months. I agree it’s worth working on, but not that it’s worth stressing about.


Don't feel bad about your VO2 max, the baseline and ceiling are largely genetic. Most people can only bump VO2 max by about 10-15% even with absurd training regimens. Same goes with many of the markers people track - you can control them to an extent, but some people just have high blood pressure or poor lipid profiles and thus need intervention.


Thanks for saying that. Even when I ran every day — with the occasional VO2max sprint day — my Apple Watch never placed me anywhere but Below Average for VO2max. It was disheartening. Some of these metrics actually put you off training.


Appreciate it. The Apple-Watch-measured version has come up to 44 since I started running. I've been pleased.

None of my markers are high enough to trigger a doctor to care.


44 is nothing to sneeze at; that's solidly at the upper end of the average range.


BMI isn't a great predictor of health. Waist-to-hip ratio and body-fat percentage are probably somewhat better indicators.

https://www.yalemedicine.org/news/why-you-shouldnt-rely-on-b...


I don't think we ever get away from the code being the source of truth. There has to be one source of truth.

If you want to go all in on specs, you must fully commit to allowing the AI to regenerate the codebase from scratch at any point. I'm an AI optimist, but this is a laughable stance with current tools.

That said, the idea of operating on the codebase as a mutable, complex entity, at arms length, makes a TON of sense to me. I love touching and feeling the code, but as soon as there's 1) schedule pressure and 2) a company's worth of code, operating at a systems level of understanding just makes way more sense. Defining what you want done, using a mix of user-centric intent and architecture constraints, seems like a super high-leverage way to work.

The feedback mechanisms are still pretty tough, because you need to understand what the AI is implicitly doing as it works through your spec. There are decisions you didn't realize you needed to make, until you get there.

We're thinking a lot about this at https://tern.sh, and I'm currently excited about the idea of throwing an agentic loop around the implementation itself. Adversarially have an AI read through that huge implementation log and surface where it's struggling. It's a model that gives real leverage, especially over the "watch Claude flail" mode that's common in bigger projects/codebases.


> There are decisions you didn't realize you needed to make, until you get there.

Is the key insight and biggest stumbling block for me at the moment.

At the moment (encouraged by my company) I'm experimenting with as-hands-off-as-possible agent usage for coding. And it is _unbelievably_ frustrating to see the agent get 99% of the code right in the first pass, only to misunderstand why a test is now failing and then completely mangle both its own code and the existing tests as it tries to "fix" the "problem". And if I'd just given it a better spec to start with, it probably wouldn't have started producing garbage.

But I didn't know that before working with the code! So to develop a good spec I either have to have the agent stopping all the time so I can intervene, or dive into the code myself to begin with, and at that point I may as well write the code anyway, since writing the code is not the slow bit.


For sure. One of our first posts was called "You Have To Decide" -- https://tern.sh/blog/you-have-to-decide/

And my process now (and what we're baking into the product) is:

- Make a prompt

- Run it in a loop over N files. Full agentic toolkit, but don't be wasteful (no "full typecheck, run the test suite" on every file).

- Have an agent check the output. Look for repeated exploration, look for failures. Those imply confusion.

- Iterate the prompt to remove the confusion.

First pass on the current project (a Vue 3 migration) went from 45 min of agentic time on 5 files to 10 min on 50 files, and the latter passed tests/typecheck/my own scrolling through it.


>Adversarially have an AI read through that huge implementation log and surface where it's struggling.

That's a good idea: have a specification, divide it into chunks, have an army of agents each implementing a chunk, have an agent identify weak points, incomplete implementations, and bugs, and have an army of agents fixing the issues.


The reason code can serve as the source of truth is that it's precise enough to describe intent, since programming languages are well-specified. Compilers have freedom in how they translate code into assembly, and two different compilers (or even different optimization flags) will produce distinct binaries. Yet all of them preserve the same intent and observable behaviour that the programmer cares about. Runtime performance or instruction order may vary, but the semantics remain consistent.

For spec driven development to truly work, perhaps what’s needed is a higher level spec language that can express user intent precisely, at the level of abstraction where the human understanding lives, while ensuring that the lower level implementation is generated correctly.

A programmer could then use LLMs to translate plain English into this “spec language,” which would then become the real source of truth.


What about pseudocode? It is high level enough.


Right, but it needs to be formalized.


Tern looks very interesting.

On your homepage there is a mention that Tern “writes its own tools”; could you give an example of how this works?


If you're thinking about, e.g., upgrading to Django 5, there are a bunch of changes that are sort of codemod-shaped. It's possible that there's not a codemod for it that works for you.

Tern can write that tool for you, then use it. It gives you more control in certain cases than simply asking the AI to do something that might appear hundreds of times in your code.


My conversation

"Design a schema like Calendly" --> Did it

"OK let's scale this to 100m users" --> Tells me how it would. No schema change.

"Did you update the schema?" --> Updates the schema, tells me what it did.

We've been running into this EXACT failure mode with current models, and it's so irritating. Our agent plans migrations, so it's code-adjacent, but the output is a structured plan (basically: tasks, which are prompt + regex. What to do; where to do it.)

The agent really wants to talk to you about it. Claude wants to write code about it. None of the models want to communicate with the user primarily through tool use, even when (as I'm sure ChartDB is) HEAVILY prompted to do so.

I think there's still a lot of value there, but it's a bummer that we as users are going to have to keep reminding all LLMs, for a little while yet, to keep using their tools beyond the first prompt.


I asked it to abstract an event-specific table into a general-purpose "events" table, which it did, but it kept the specific table. I asked it to delete that table and it said it did, but did not. I got stuck in a loop asking it to remove a table that the LLM insisted was not part of the schema, but which was still present in the diagram.

It was easier to close the tab than fire a human, but other than that not a great experience.


Isn't this what the agents are for? You assign them jobs to make changes, then evaluate those changes. There is a necessary orchestration piece, and maybe even a triage role to sort through things to do and errors to fix.


