"Conclusion
Our monorepo isn't about following a trend. It's about removing friction between things that naturally belong together, something that is critical when related context is everything.
When a feature touches the backend API, the frontend component, the documentation, and the marketing site—why should that be four repositories, four PRs, four merge coordination meetings?
The monorepo isn't a constraint. It's a force multiplier."
Nit: regarding (2), Phil Blunsom did (the same Blunsom from the article, who led language modeling at DeepMind for about 7-8 years). He would often opine at Oxford (where he taught) that solving next-word prediction is a viable meta-path to AGI. Almost nobody agreed at the time. He also called out early that scaling and better data were the key, and they did turn out to be, although Google wasn't as "risk on" as OpenAI about gathering the data for GPT-1/2. Had they been, history could easily have been different. People forget the position OAI was in at the time: Elon/funding had left, key talent had left. Risk appetite was high for that kind of thing… and it paid off.
"This is not the reason, the reason is that this data is private. LLMs do not just learn from data, they can often reproduce it verbatim, you cannot give medical records or bank records of real people, that will put them at a very real risk."
(OP) You make great points. I think we're actually more in agreement than might be obvious. Part of the reason you need to "give" data to an LLM is because of the way LLMs are constructed... which creates the privacy risk.
The attribution-based control suggested in this article would break that dependency, enabling each data owner to control which AI predictions their data makes more intelligent (as opposed to only controlling which AI models they help train).
So to your point... this is a very rigorous privacy protection. Another way to TLDR the article is "if we get really good at privacy... there's a LOT more data out there... so let's start really caring about privacy"
Anyway... I agree with everything in your comment. Just thought I'd drop by and try to lend clarity to how the article agrees with you (sounds like there's room for improvement on how to describe attribution-based control though).
I think this is the right question to ask, and I think it depends on the task. For example, if you want to predict whether someone has cancer, then access to vast amounts of medical information would be important.
This article is meant for a policy audience, so that does keep the technical depth pretty thin. It's rooted in more rigorous deep learning work. Happy to send your way if interested.
I agree with you in a way - that it seems likely that new data will be incorporated in more inference-like ways. RAG is a little extreme... but I think there are going to be middle grounds between full pre-training and RAG. Git-rebasin, MoE, etc.
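To make the "inference-like" end of that spectrum concrete, here is a rough, hypothetical sketch (not from the article): the private records never enter training, they are only retrieved and injected into a single prompt, so the owner could in principle withhold them per prediction. The retrieval is a toy keyword overlap, and names like answer_with_context are made up for illustration.

    def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
        # Toy retrieval: rank documents by naive keyword overlap with the query.
        q_words = set(query.lower().split())
        return sorted(documents, key=lambda d: -len(q_words & set(d.lower().split())))[:k]

    def answer_with_context(query: str, documents: list[str]) -> str:
        # RAG-style inference: the private data stays outside the model's weights
        # and is only injected into the prompt for this one prediction.
        context = "\n".join(retrieve(query, documents))
        return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"  # prompt for a frozen LLM

    private_records = [
        "patient 17: biopsy negative, follow-up in 6 months",
        "patient 42: elevated markers, oncology referral issued",
    ]
    print(answer_with_context("which patients need an oncology referral?", private_records))

Each call only touches the records the owner chooses to expose for that one prediction, which is (loosely) the kind of per-prediction control this thread is gesturing at.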