They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC-style questions.
What I've noticed when testing previous versions of Grok is that on paper they were better at benchmarks, but when I actually used them the responses were always worse than Sonnet's and Gemini's, even though Grok had higher benchmark scores.
Occasionally I test Grok to see if it could become my daily driver, but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.
> They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC-style questions
That's kind of the idea behind ARC-AGI. Training on available ARC benchmarks does not generalize. Unless it does... in which case, mission accomplished.
It still seems possible to put the effort into building up an ARC-style dataset, and that would game the test. The ARC questions I saw weren't on some completely unknown topic; they were generally hard versions of existing problems in well-known domains. I'm not super familiar with this area in general though, so I'd be curious if I'm wrong.
ARC-AGI isn't question- or knowledge-based, though, but "Infer the pattern and apply it to a new example you haven't seen before." The problems are meant to be easy for humans but hard for ML models, like a next-level CAPTCHA.
They have walked back the initial notion that success on the test requires, or demonstrates, the emergence of AGI. But the general idea remains, which is that no amount of pretraining on the publicly-available problems will help solve the specific problems in the (theoretically-undisclosed) test set unless the model is exhibiting genuine human-like intelligence.
Getting almost 16% on ARC-AGI-2 is pretty interesting. I wish somebody else had done it, though.
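To make the "infer the pattern and apply it to a new example" part concrete, here's a toy sketch of the task shape (the real tasks ship as JSON with train/test pairs of small integer grids; the hidden rules are far less obvious than the flip used here):

```python
# Toy ARC-style task: a few demonstration pairs, then a test input.
# The hidden rule here is "mirror each row left-to-right" -- trivially easy
# for a human to infer from the examples, which is the whole point of ARC.
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0, 0], [0, 0, 6]]}],  # expected: [[0, 0, 5], [6, 0, 0]]
}

def solve(grid):
    # The rule a solver has to infer from the train pairs alone.
    return [row[::-1] for row in grid]

assert all(solve(p["input"]) == p["output"] for p in task["train"])
print(solve(task["test"][0]["input"]))
```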
It's not hard to build datasets with these types of problems in them, and I would expect LLMs to generalize from them well. I don't see how this is really any different from any other type of problem LLMs are good at, given they have a dataset to study.
I get that they keep the test updated with secret problems, but I don't see why companies couldn't game this just by investing in building their own datasets, even if it means paying teams of smart people to generate them.
The other question is whether enough examples of this type of task are helpful and generalizable in some way. If so, why wouldn't you integrate such a dataset into an LLM's training pipeline?
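As a purely hypothetical sketch of what "building your own dataset" could look like mechanically (toy transformations only; whether training on anything like this transfers to the hidden test set is exactly what ARC-AGI is supposed to measure):

```python
import random

# Hypothetical generator for synthetic ARC-style tasks: pick a transformation,
# stamp out demonstration pairs plus a held-out test pair. Real tasks encode
# much richer rules (object counting, symmetry completion, recoloring, ...).
TRANSFORMS = {
    "flip_rows": lambda g: [row[::-1] for row in g],
    "flip_cols": lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def random_grid(h=3, w=3, colors=4):
    return [[random.randint(0, colors) for _ in range(w)] for _ in range(h)]

def make_task(n_train=3):
    name, fn = random.choice(list(TRANSFORMS.items()))
    pairs = []
    for _ in range(n_train + 1):
        g = random_grid()
        pairs.append({"input": g, "output": fn(g)})
    return {"rule": name, "train": pairs[:n_train], "test": pairs[n_train:]}

synthetic_dataset = [make_task() for _ in range(1000)]
```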
I use Grok with repomix to review my code, and it tends to give decent answers and is a bit better at giving actual actionable issues with code examples than, say, Gemini 2.5 Pro.
But the lack of a CLI tool like codex, claude code, or gemini-cli is preventing it from being a daily driver. Launching a browser and having to manually upload repomixed content is just blech.
With gemini I can just go `gemini -p "@repomix-output.xml review this code..."`
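In the meantime a tiny wrapper script gets you most of the way there. Rough sketch only: it assumes xAI's OpenAI-compatible endpoint and uses `grok-4` as a placeholder model name, so adjust both.

```python
#!/usr/bin/env python3
# Sketch of a homemade `grok -p` for repomix output.
# Assumptions (adjust as needed): xAI exposes an OpenAI-compatible API at
# https://api.x.ai/v1, XAI_API_KEY is set, and "grok-4" is a placeholder
# model name. `repomix-output.xml` is whatever `npx repomix` produced.
import os
import sys
from openai import OpenAI  # pip install openai

prompt = sys.argv[1] if len(sys.argv) > 1 else "Review this code."
with open("repomix-output.xml", encoding="utf-8") as f:
    repo = f.read()

client = OpenAI(base_url="https://api.x.ai/v1", api_key=os.environ["XAI_API_KEY"])
resp = client.chat.completions.create(
    model="grok-4",
    messages=[
        {"role": "system", "content": "You are a code reviewer. Give concrete, actionable issues with code examples."},
        {"role": "user", "content": f"{prompt}\n\n{repo}"},
    ],
)
print(resp.choices[0].message.content)
```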
As I said, either by benchmark contamination (it is semi-private and could have been obtained by people from the other companies whose models have been benchmarked) or by having more compute.
I still don't understand why people point to this chart as if it means anything. Cost per task is a fairly arbitrary X axis and in no way represents any sort of time scale. I would love to be told how they didn't underprice their model and give it an arbitrary amount of time to work.
The beginning of a new kind of discrimination - call it 'synthetic racism.' AI-generated music is being dismissed outright even before listening to it, not based on quality or enjoyment but purely on its artificial origin. Just as past prejudices dismissed art based on heritage rather than merit, we're now seeing a new bias against anything not 'human-made.'
Art is an emotional experience. Sometimes people enjoy art because it elicits an emotion in them. And sometimes they enjoy art because of the emotional effort that went into it from the creator.
It’s the same reason some people don’t like generic pop music due to its formulaic commercialism.
So if people are discriminating against AI art because they want to experience the emotions that the authors put into the pieces, then I'm ok with that right up until it can be argued that AI experiences emotions.
Humans are not the center of the universe. There is nothing intrinsically magical about being human. Humans don't have souls. We were not created in "God"'s image. We are not special. We are just like other animals and continuously evolving. AI is just the next step in evolution.
All the emotions serve as shortcuts for behaviors that make evolutionary sense.
Things suck before they get better and then keep on getting better.
100% atheist here. Humans are the center of my universe, and humans are special to me. I'm team human. I don't care if AI is the next step in the evolution, I would absolutely kill it if necessary to save humanity.
You may get most of that by prompting. You can create really based or emotional characters even with “aligned” models (with a little realignment). Have you ever talked to an LLM that allows system prompting? Heard of [E]RP? You can even teach them to not produce idiotic bullet points.
I won’t argue about the art part, but these common AI stereotypes are not true.
The AI does not possess those attributes. It’s behaving as told. It has no experience. It has no senses. It has no thoughts or ability to reason. It has no motivation.
Could we not - for the sake of argument here - surmise that, since these AIs need prompts, and usually a few rounds of refinement, and then a selection for uploading to (in this case) YouTube, that the -human- in charge of prompting/refinement/uploading has a point of view, an opinion, an aesthetic preference, emotions?
After all, there are artists that collate "samples" from other artists and produce music from all those different samples. They did not play any instrument, they merely arranged and modified these samples into a product that they presumably find pleasing.
The only way we can make that assumption is if the -human- makes it obvious. Tell me the people mass producing AI slop for YouTube/Spotify are approaching this with sincere intent.
It’s not even told, it continues a text (or denoises an image) in a way that closely resembles what was in the training data. Experience, senses, thoughts, reasoning and motivation were all there in original human- and nature-produced data. https://news.ycombinator.com/item?id=43003186
Doesn’t mean the result is ideal, far from it. But your “bullet points” imply some specialness which has to be explained. Personally I don’t love the style of refusal you’re demonstrating, because it’s similar to “if you know, you know” and other self-referential nonsense. At least add some becauses into your arguments, because “it has no X” is a simplification far below the level of usefulness here.
Anyway, how does that prevent creating AI personas again?
I am willing to bet that even when driverless taxis are operating in at least 50% of big cities around the world, you will still see comments like "auto driving is a pipe dream like NFT" on HN every other day.
This kind of hypocrisy only exists in a made-up person. Anyone saying that autonomous vehicles are still a ways away is not talking about the very impressive but very much semi-autonomous vehicles deployed today, but rather about vehicles that have no need for a human operator ever. The kind you could buy off the shelf, switch into taxi mode, and let it do its thing.
Semi-autonomous vehicles are impressive for the fact that one driver can now scale well beyond a single vehicle. Fully-autonomous vehicles are impressive because they can scale limitlessly. The former is evolutionary, the latter is revolutionary.
Have we ever observed revolutionary change in tech which ran contrary to evolutionary change?
This seems like such an odd thing to expect will just "happen". Any other world-changing or impressive tech I'm familiar with has evolved to its current state over time; it's not like when Jobs announced the iPhone and changed the game there weren't decades of mobile computing whose shoulders it stood on. Are you talking about something like crypto?
It's admittedly a bit confusing what you're asking for here.
What did Musk's promised driverless taxis provide that existing driverless taxis don't? The tech has arrived; it's a car that drives itself while the passenger sits in the back. Is the "gotcha" that the car isn't a Tesla?
He promised that you'd be able to turn your own Tesla into an autonomous taxi that would earn you money.
That is a massive lie, not splitting hairs.
Obviously, we're very desensitized to lying rats - but that's what he did.
Looks like it's still in the works. Sometimes when technologists promise something their timelines end up being overly optimistic. I'm sure this isn't news to you.
By your language and past commentary though this seems like the kind of thing which elicits a pretty emotional response from you, so I'm not sure this would be a productive conversation either way.
I don't understand it either. Why is Elon Musk a "terrorist"? And why is this the most upvoted post? Maybe being European limits my ability to comprehend American political rhetoric.
> The American left, who hates Musk because he’s an outspoken and unrepentant capitalist (among other reasons)
Ah yes, all the criticism I hear about Musk from my liberal friends has everything to do with his "unrepentant capitalism".
You either have a small or very peculiar social circle. Your comment isn't technically wrong since you were thoughtful enough to add "(among other reasons)" but it almost beggars belief that you'd be serious here. Do I really have to spell out why the left hate Musk?
First: I did not use the word terrorist, nor do I know anyone that has. I am sure some people feel that way, particularly those that are close to communities in which he is inspiring fear and anxiety.
As for why I and "the left" don't like him? To be honest, I don't really want to have that conversation in an internet forum, and the incuriosity of your original comment, coupled with putting words in my mouth, gives me zero belief that this will be a good-faith conversation. Just ask ChatGPT if you really want to know, or I'm sure there are reddit threads on the subject.
How about a consequentialist argument? In some fields, AI has already surpassed physicians in diagnosing illnesses. If breaking copyright laws allows AI to access and learn from a broader range of data, it could lead to earlier and more accurate diagnoses, saving lives. In this case, the ethical imperative to preserve human life outweighs the rigid enforcement of copyright laws.
No, the development of an artificial general intelligence does seem like a special case compared to usual IP debates, particularly in the potential multiplicative positive-sum effects on society overall.
https://x.com/arcprize/status/1943168950763950555