More

bjornsing · 2025-10-19T14:25:56 1760883956

Will Anthropic/OpenAI really hire anyone who can fine-tune an LLM?

gdiamos · 2025-10-19T14:32:26 1760884346

They will hire anyone who can produce a model better than GPT5, which is the bar for fine tuning

Otherwise, you should just use gpt5

Preparing a few thousands training examples and pressing fine tune can improve the base LLM in a few situations, but it also can make the LLM worse at other tasks in hard to understand ways that only show up in production because you didn’t build evals that are good enough to catch them. It also has all of the failure modes of deep learning. There is a reason why deep learning training never took off like LLMs did despite many attempts at building startups around it.

Andrej karpathy has a rant about it that captures some of the failure modes of fine tuning - https://karpathy.github.io/2019/04/25/recipe/

criemen · 2025-10-19T17:31:53 1760895113

> They will hire anyone who can produce a model better than GPT5, which is the bar for fine tuning

Depends on what you want to achieve, of course, but I see fine-tuning at the current point in time primarily as a cost-saving measure: Transfer GPT5-levels of skill onto a smaller model, where inference is then faster/cheaper to run. This of course slows down your innovation cycle, which is why generally this is imo not advisable.

gdiamos · 2025-10-19T19:41:59 1760902919

I agree this is the main case where it makes sense.

But a recent trend that cut into the cost savings is that foundation model companies have started releasing small models. So you can build a use case with qwen 235B, then shrink down to 30B, or even all the way down to 0.6B if you really want to.

The smaller models lose some accuracy, but some use cases are solvable even by these smaller and much more efficient models.

kgwgk · 2025-10-19T19:34:14 1760902454

> but it also can make the LLM worse at other tasks

The problem is easily avoided by not using it for other tasks.

gdiamos · 2025-10-19T19:39:00 1760902740

Users often found it hard to know exactly where the boundaries are.

This is a reason why general purpose models shine. You don’t have to carefully characterize a task and put guard rails around it.

kgwgk · 2025-10-19T19:51:44 1760903504

There is also a reason why you don’t have general purpose applications. Most users understand that Excel is for data tables and Paint is for images even though some people have fun playing with the boundary and creating Excel paintings.

gdiamos · 2025-10-19T20:06:49 1760904409

This is exactly the intuition that leads to excitement about fine tuning.

However, I personally think that this intuition applies to products and interfaces, not to AI.

Intelligence and learning is general. Intelligence without generalization is memorization, which seems to be less useful in practice.

kgwgk · 2025-10-19T21:40:20 1760910020

What people use are products and interfaces, not "AI".

fragmede · 2025-10-20T06:32:59 1760941979

Interesting you bring up Excel. ChatGPT's chat interface is going to be Excel for the AI era. Everyone knows there's a better interface to be had, but it just works.

kgwgk · 2025-10-20T11:26:54 1760959614

In the pre-AI era Excel was not « the » interface. Most people didn’t use Excel at all!

yunwal · 2025-10-19T14:47:09 1760885229

It’s quite easy to produce a model that’s better than GPT-5 at arbitrarily small tasks. As of right now, GPT-5 can’t classify a dog by breed based on good photos for all but the most common breeds, which is like an AI-101 project.

gdiamos · 2025-10-19T14:53:09 1760885589

Try doing a head to head comparison using all LLM tricks available including prompt engineering, rag, reasoning, inference time compute, multiple agents, tools, etc

Then try the same thing using fine tuning. See which one wins. In ML class we have labeled datasets with breeds of dogs hand labeled by experts like Andrej, in real life users don’t have specific, clearly defined, and high quality labeled data like that.

I’d be interested to be proven wrong

I think it is easy for strong ML teams to fall into this trap because they themselves can get fine tuning to work well. Trying to scale it to a broader market is where it fell apart for us.

This is not to say that no one can do it. There were users who produced good models. The problem we had was where to consistently find these users who were willing to pay for infrastructure.

I’m glad we tried it, but I personally think it is beating a dead horse/llama to try it today

mountainriver · 2025-10-19T20:19:45 1760905185

There are tons of problems this simply doesn’t apply to. In the limited API world this may be true but agents are far from reliable

yunwal · 2025-10-19T15:02:02 1760886122

I mean, at the point where you’re writing tools to assist it, we are no longer comparing the performance of 2 LLMs. You’re taking a solution that requires a small amount of expertise, and replacing it with another solution that requires more expertise, and costs more. The question is not “can fine tuning alone do better than every other trick in the book plus a SOTA LLM plus infinite time and money?” The question is: “is fine tuning useful?”

gdiamos · 2025-10-19T15:28:29 1760887709

Fair didn’t seem to matter to users who just wanted to build solutions with reasonable time and budget

echelon · 2025-10-19T14:59:53 1760885993

If your customers can't fine tune, do it for them instead.

gdiamos · 2025-10-19T15:13:01 1760886781

How can you hire enough people to scale that while making the economics work?

Why would they join you rather than founding their own company?

CaptainOfCoit · 2025-10-19T16:28:20 1760891300

> How can you hire enough people to scale that while making the economics work?

Once you (as in you the person) have the expertise, what you need all the people for exactly? To fine-tuning you need to figure out the architecture, how to train, how to infer, pick together the dataset and then run the training (optionally setup a pipeline so the customer can run the "add more data -> train" process themselves). What in this process you need to hire so many people for?

> Why would they join you rather than founding their own company?

Same as always, in any industry, not everyone wants to lead and not everyone wants to follow.

gdiamos · 2025-10-19T17:59:32 1760896772

llm.finetune(data) is a leaky abstraction

Read Andrej’s blog that I linked earlier in the thread if you want to understand why.

CaptainOfCoit · 2025-10-19T18:52:53 1760899973

If it works it works? :shrug:

gdiamos · 2025-10-19T19:44:24 1760903064

The problem is that it doesn’t always work and when it does fail it fails silently.

Debugging requires knowing some small detail about your data distribution or how you did gradient clipping which take time and painstakingly detailed experiments to uncover.

CaptainOfCoit · 2025-10-20T09:54:57 1760954097

> The problem is that it doesn’t always work and when it does fail it fails silently.

Right, but why does that mean you need more employees? You need to figure out how to surface failures, rather than just adding more meat to the problem.

echelon · 2025-10-19T15:24:36 1760887476

> How can you hire enough people to scale that while making the economics work?

Pick the right customers.

> Why would they join you rather than founding their own company?

The network effects of having enough resources in one place. For having other teams deal with the training data, infrastructure, deployment, etc.

gdiamos · 2025-10-19T15:30:39 1760887839

I think you are saying to go after the very high end of the market.

That’s fair, one market segment of this is sometimes called sovereign compute.

Another common model that I have seen is to become the deepmind for one very large and important customer.

I think this works.

Der_Einzige · 2025-10-20T17:22:01 1760980921

Yup! That's why civit.ai doesn't exist right?

They'll pay for anyone that can personalize models to be meaningfully diverse.

danielmarkbruce · 2025-10-20T02:36:08 1760927768

I think you misunderstand what they are saying - doing a good job of fine tuning is difficult.

Training an LLM from scratch is trivial - training a good one is difficult. Fine tuning is trivial - doing a good job is difficult. Hitting a golf ball is trivial - hitting a 300 yard drive down the middle of the fairway is difficult.

bjornsing · 2025-10-18T07:50:24 1760773824

I ran a small ISP around the same time that used this behavioral pattern to bring down the customer acquisition cost to near zero. Essentially we sold ADSL connections with Wi-Fi and a second SSID where anybody could connect and sign up for internet access. If too may signed up we sent out personal offers for ADSL service to some of them and wired up their homes too. Fun project, but stressful and not very profitable.

bjornsing · 2025-10-15T04:01:19 1760500879

Yes. But I don’t think the OP is suggesting this as an alternative to using an array. As I read / skimmed it the linked list is just a simplified example. You can use this trick in more complex situations too, eg if you’re searching a tree structure and you know that some paths through the tree are much more common than others.

bjornsing · 2025-10-15T03:56:36 1760500596

But that works on a different level, right? At least as I understand it data speculation is about prefetching from memory into cache. This trick is about using the branch predictor as an ultra-fast ”L0” cache you could say. At least that’s how I understand it.

stinkbeetle · 2025-10-15T06:43:43 1760510623

This is doing value speculation in software, using the branch predictor. The hardware of course does do that and instead uses different tables for deriving a predicted value, and misprediction will be detected and flushed in a slightly different way.

But the effect on the main sequence of instructions in the backend will be quite similar. In neither case is it a "prefetch" as such, it is actually executing the load with the predicted value and the result will be consumed by other instructions, decoupling address generation from dependency on previous load result.

bjornsing · 2025-10-15T09:43:40 1760521420

Yeah that’s sort of how I understand the OP too: The CPU will execute speculatively on the assumption that the next element in the linked list is consecutive in memory, so it doesn’t have to wait for L1 cache. It needs to check the real value in L1 of course, but not synchronously.

bjornsing · 2025-10-12T04:52:21 1760244741

Yeah I think this is a general principle. Just look at the quality of US presidents over time, or generations of top physicists. I guess it’s just a numbers game: the number of genuinely interested people is relatively constant while the number of gamers grows with the compensation and perceived status of the activity. So when compensation and perceived status skyrockets the ratio between those numbers changes drastically.

godelski · 2025-10-12T05:40:04 1760247604

I think the number of generally interested people goes up. Maybe the percent stays the same? But honestly, I think we kill passion for a lot of people. To be cliche, how many people lose the curiosity of a child? I think the cliche exists for a reason. It seems the capacity is in all of us and even once existed.

bjornsing · 2025-10-16T06:52:40 1760597560

To some extent I think that’s just human nature, or even animal nature. The optimal explore / exploit tradeoff changes as we age. When we’re children it’s beneficial to explore. As adults it’s often more beneficial to exploit. But you need cultural and organizational safeguards that protect those of us who are more childish and explorative from those that are more cynical and exploitative. Otherwise pursuits of truth aren’t very fruitful.

bjornsing · 2025-10-04T03:05:09 1759547109

Swedish banks (even the Riksbank linked above) regularly refuse to turn cash into digital money unless you can ”prove” where you got it from. It’s not sufficient to say (with immense credibility) that you worked hard all your life and saved it. Entire inheritances are regularly wiped out due to this, when high denomination bills are obsoleted by the Riksbank. So I’d say it’s not only a common sentiment but also government policy.

bjornsing · 2025-10-03T15:02:08 1759503728

You’d have to explain where that innate knowledge is stored though. The entire human genome is less than a GB if I remember correctly. Some of that being allocated to ”priors” for neural circuit development seems reasonable, but it can’t be very detailed across everything a brain does. The rest of the body needs some bytes too.

arcwhite · 2025-10-03T22:10:43 1759529443

Not really - that 1GB is the seed for a procedural generation mechanism that has been finely tuned to its unfolding in an environment over 4 billion years.

DNA is the ultimate demoscene exe

bjornsing · 2025-10-04T03:13:41 1759547621

Sure. But that’s just compression, right? I guess you could argue that some information is stored outside the genome, in the structure of proteins etc. But the counter argument is that that information is quickly lost in cell divisions. Only DNA has the error correcting mechanisms needed to reliably store information, is my impression.

bjornsing · 2025-10-03T06:17:40 1759472260

"Exactly-Once Event Processing" is possible if (all!) the processing results go into a transactional database along with the stream position marker in a single transaction. That’s probably the mechanism they are relying on.

bjornsing · 2025-10-03T06:14:06 1759472046

> Yes, in any durability framework there's still the possibility that a process crashes mid-step, in which case you have no choice but to restart the step.

Golem [1] is an interesting counterexample to this. They run your code in a WASM runtime and essentially checkpoint execution state at every interaction with the outside world.

But it seems they are having trouble selling into the workflow orchestration market. Perhaps due to the preconception above? Or are there other drawbacks with this model that I’m not aware of?

1. https://www.golem.cloud/post/durable-execution-is-not-just-f...

vineyardmike · 2025-10-03T10:33:56 1759487636

That still fundamentally suffers the same idempotency problem as any other system. When interacting with the outside world, you, the developer, need to be idempotent and enforce it.

For example, if you call an API (the outside world) to charge the user’s credit card, and the WASM host fails and the process is restarted, you’ll need to be careful to not charge again. This can happen after the request is issued, but before the response is received/processed.

This is no different than any other workflow library or service.

The WASM idea is interesting, and maybe lets you be more granular in how you checkpoint (eg for complex business logic that is self-contained but expensive to repeat). The biggest win is probably for general preemption or resource management, but those are generally wins for the provider not the user. Also, this requires compiling your application into WASM, which restricts which languages/libraries/etc you can use.

bjornsing · 2025-10-03T14:00:41 1759500041

The challenges around idempotency remain to some extent, yes. But you have that problem even in non-workflow code, so the usual patterns will just work with no extra mental effort from the developer.

qianli_cs · 2025-10-03T06:28:10 1759472890

I think one potential concern with "checkpoint execution state at every interaction with the outside world" is the size of the checkpoints. Allowing users to control the granularity by explicitly specifying the scope of each step seems like a more flexible model. For example, you can group multiple external interactions into a single step and only checkpoint the final result, avoiding the overhead of saving intermediate data. If you want finer granularity, you can instead declare each external interaction as its own step.

Plus, if the crash happens in the outside world (where you have no control), then checkpointing at finer granularity won't help.

bjornsing · 2025-10-03T10:20:16 1759486816

Sure you get more control with explicit state management. But it’s also more work, and more difficult work. You can do a lot of writes to NVMe for one developer salary.

jedberg · 2025-10-03T16:03:01 1759507381

It's not really more work to be explicit about the steps and workflows. You already have to break your code into steps to make your program run. Adding a single decorator isn't much extra work at all.

jedberg · 2025-10-03T16:01:05 1759507265

The biggest downsides to their methodology are that the snapshots can get really big really quickly, and that they are hard to introspect since they are binary blobs of memory dumps.

bjornsing · 2025-10-04T03:16:13 1759547773

Yeah the whole methodology depends on forgetting about state and treating it as a long-running program. If you need to look at the state then you connect a debugger, etc.

bjornsing · 2025-10-02T06:19:37 1759385977

But couldn’t an LLM search for documents in that enterprise knowledge base just like humans do, using the same kind of queries and the same underlying search infrastructure?

z3dd · 2025-10-02T06:53:46 1759388026

I wouldn't say humans are efficient at that so no reason to copy, other than as a starting point.

carlmr · 2025-10-02T07:35:38 1759390538

Maybe not efficient, but if the LLMs can't even reach this benchmark then I'm not sure.

zwaps · 2025-10-02T08:10:09 1759392609

Yes but that would be worse than many RAG approaches, which were implemented precisely because there is no good way to cleanly search through a knowledge base for a million different reasons.

At that point, you are just doing Agentic RAG, or even just Query Review + RAG.

I mean, yeah, agentic RAG is the future. It's still RAG though.