z7's comments | Hacker News

How do you explain Grok 4 achieving new SOTA on ARC-AGI-2, nearly doubling the previous commercial SOTA?

https://x.com/arcprize/status/1943168950763950555


They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC-style questions.

What I've noticed when testing previous versions of Grok is that on paper they were better at benchmarks, but when I actually used them the responses were always worse than Sonnet's and Gemini's, despite the higher benchmark scores.

Occasionally I test Grok to see if it could become my daily driver, but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.


> They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC-style questions

That's kind of the idea behind ARC-AGI. Training on available ARC benchmarks does not generalize. Unless it does... in which case, mission accomplished.


It still seems possible to spend effort building up an ARC-style dataset, and that would game the test. The ARC questions I saw were not on some completely unknown topic; they were generally hard versions of existing problems in well-known domains. I'm not super familiar with this area in general, though, so I'd be curious if I'm wrong.


ARC-AGI isn't question- or knowledge-based, though, but "Infer the pattern and apply it to a new example you haven't seen before." The problems are meant to be easy for humans but hard for ML models, like a next-level CAPTCHA.
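
For a concrete sense of the format, here's a toy task in the ARC style (a hypothetical illustration, not from the actual benchmark): a few input/output pairs demonstrate a hidden rule, and the solver has to infer it and apply it to a fresh input.

    # Toy ARC-style task (hypothetical, not from the real test set).
    # A few "train" pairs demonstrate a hidden rule; the solver must
    # infer it and apply it to an unseen test input.
    train_pairs = [
        ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
        ([[0, 0], [2, 0]], [[0, 0], [0, 2]]),
    ]
    test_input = [[3, 0], [0, 0]]

    # The hidden rule here: mirror each row left to right.
    def apply_rule(grid):
        return [row[::-1] for row in grid]

    # Check the rule against the train pairs, then solve the test input.
    assert all(apply_rule(i) == o for i, o in train_pairs)
    print(apply_rule(test_input))  # -> [[0, 3], [0, 0]]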

They have walked back the initial notion that success on the test requires, or demonstrates, the emergence of AGI. But the general idea remains, which is that no amount of pretraining on the publicly-available problems will help solve the specific problems in the (theoretically-undisclosed) test set unless the model is exhibiting genuine human-like intelligence.

Getting almost 16% on ARC-AGI-2 is pretty interesting. I wish somebody else had done it, though.


I’ve seen some of the problems before, like https://o3-failed-arc-agi.vercel.app/

It’s not hard to build datasets that contain these types of problems, and I would expect LLMs to generalize well from them. I don’t see how this is really any different from any other type of problem LLMs are good at, given a dataset to study.

I get they keep the test updated with secret problems, but I don’t see how companies can’t game this just by investing in building their own datasets, even if it means paying teams of smart people to generate them.
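
A minimal sketch of what that investment could look like (illustrative only; real ARC tasks are far more varied, and all names here are made up): sample a transformation from a hand-written rule library and apply it to random grids, mass-producing ARC-style training pairs.

    import random

    # Hand-written rule library; a serious effort would need thousands
    # of far richer rules (object movement, symmetry, counting, ...).
    TRANSFORMS = {
        "mirror": lambda g: [row[::-1] for row in g],
        "flip":   lambda g: g[::-1],
        "rot180": lambda g: [row[::-1] for row in g[::-1]],
    }

    def random_grid(h=3, w=3):
        return [[random.randint(0, 9) for _ in range(w)] for _ in range(h)]

    def make_task(n_train=3):
        # Pick a rule, demonstrate it on a few grids, hold out a test pair.
        name, fn = random.choice(list(TRANSFORMS.items()))
        train = [(g, fn(g)) for g in (random_grid() for _ in range(n_train))]
        test = random_grid()
        return {"rule": name, "train": train,
                "test_in": test, "test_out": fn(test)}

    print(make_task())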


The other question is whether enough examples of this type of task are helpful and generalize in some way. If so, why wouldn't you integrate such a dataset into an LLM's training pipeline?


I use Grok with repomix to review my code, and it tends to give decent answers; it's a bit better at giving actual actionable issues with code examples than, say, Gemini 2.5 Pro.

But the lack of a CLI tool like Codex, Claude Code, or gemini-cli is preventing it from being a daily driver. Launching a browser and having to manually upload repomixed content is just blech.

With Gemini I can just run `gemini -p "@repomix-output.xml review this code..."`
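
In the meantime, a rough stopgap is to script it against xAI's OpenAI-compatible API. A minimal sketch, where the endpoint and the `grok-4` model id are assumptions (check xAI's docs for current values):

    import os
    from openai import OpenAI

    # Assumed: XAI_API_KEY set in the environment and xAI's
    # OpenAI-compatible endpoint; adjust per their docs.
    client = OpenAI(api_key=os.environ["XAI_API_KEY"],
                    base_url="https://api.x.ai/v1")

    # Feed the repomix output as the user message.
    with open("repomix-output.xml") as f:
        repo = f.read()

    resp = client.chat.completions.create(
        model="grok-4",  # assumed model id
        messages=[
            {"role": "system",
             "content": "Review this codebase; give actionable issues with code examples."},
            {"role": "user", "content": repo},
        ],
    )
    print(resp.choices[0].message.content)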


Well, try it again and report back.


As I said, either by benchmark contamination (the set is semi-private and could have been obtained by people from other companies whose models have been benchmarked) or by having more compute.


I still don't understand why people point to this chart as meaning anything. Cost per task is a fairly arbitrary x-axis and in no way represents any sort of time scale. I would love to be told how they didn't underprice their model and give it an arbitrary amount of time to work.


"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%."

"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."

https://x.com/arcprize/status/1943168950763950555


Quoting Chollet:

>I have repeatedly said that "can LLM reason?" was the wrong question to ask. Instead the right question is, "can they adapt to novelty?".

https://x.com/fchollet/status/1866348355204595826



Why are you hallucinating feelings? Also, appeal to authority. ("Why are your feelings relevant to the wizarding laws of Hogwarts?")


>For starters, this completely blocks generation of anything remotely related to copy-protected IPs

It did Dragon Ball Z here:

https://old.reddit.com/r/ChatGPT/comments/1jjtcn9/the_new_im...

Rick and Morty:

https://old.reddit.com/r/ChatGPT/comments/1jjtcn9/the_new_im...

South Park:

https://old.reddit.com/r/ChatGPT/comments/1jjyn5q/openais_ne...


The beginning of a new kind of discrimination - call it 'synthetic racism.' AI-generated music is being dismissed outright even before listening to it, not based on quality or enjoyment but purely on its artificial origin. Just as past prejudices dismissed art based on heritage rather than merit, we're now seeing a new bias against anything not 'human-made.'


Art is an emotional experience. Sometimes people enjoy art because it elicits an emotion in them. And sometimes they enjoy art because of the emotional effort that went into it from the creator.

It’s the same reason some people don’t like generic pop music due to its formulaic commercialism.

So if people are discriminating against AI art because they want to experience the emotions that the authors put into the pieces, then I’m ok with that, right up until it can be argued that AI experiences emotions.


I've favorited this comment because it could easily pass for satire of certain types of comments on HN.


Too Poe For Poe (“synthetic racism” is unreal!)


Show me an AI with a point of view.

Show me an AI with an opinion; an aesthetic preference.

Show me an AI that has emotions.

Show me an AI that’s chosen to make sacrifices to produce its art.

AI cannot produce art. Images are not intrinsically art. Sound is not intrinsically art. Art requires thought and intent. AI is not capable of either.


Humans are not the center of the universe. There is nothing intrinsically magical about being human. Humans don't have souls. We were not created in "God"'s image. We are not special. We are just like other animals, continuously evolving. AI is just the next step in that evolution.

All emotions serve as shortcuts for behaviors that make evolutionary sense.

Things suck before they get better and then keep on getting better.


100% atheist here. Humans are the center of my universe, and humans are special to me. I'm team human. I don't care if AI is the next step in evolution; I would absolutely kill it if necessary to save humanity.


Gee let’s hope free will is real, otherwise this take gets really awkward.


I’m not worried about questions we can’t answer. I’m worried about businesses destroying artists.


Ok, your original comment seemed more concerned with the genesis of opinion and creativity, but that’s good to know too.


You may get most of that by prompting. You can create really based or emotional characters even with “aligned” models (with a little realignment). Have you ever talked to an LLM that allows system prompting? Heard of [E]RP? You can even teach them to not produce idiotic bullet points.

I won’t argue the art part, but these common AI stereotypes are not true.


The AI does not possess those attributes. It’s behaving as told. It has no experience. It has no senses. It has no thoughts or ability to reason. It has no motivation.


Could we not - for the sake of argument here - surmise that, since these AIs need prompts, and usually a few rounds of refinement, and then a selection for uploading to (in this case) YouTube, the -human- in charge of prompting/refinement/uploading has a point of view, an opinion, an aesthetic preference, emotions?

After all, there are artists who collate "samples" from other artists and produce music from all those different samples. They did not play any instrument; they merely arranged and modified these samples into a product that they presumably find pleasing.


The only way we can make that assumption is if the -human- makes it obvious. Tell me the people mass producing AI slop for YouTube/Spotify are approaching this with sincere intent.


It’s not even told; it continues a text (or denoises an image) in a way that closely resembles what was in the training data. Experience, senses, thoughts, reasoning and motivation were all there in the original human- and nature-produced data. https://news.ycombinator.com/item?id=43003186

Doesn’t mean the result is ideal, far from it. But your “bullet points” imply some specialness which has to be explained. Personally I don’t love the style of refusal you’re demonstrating, because it amounts to “if you know, you know” and other self-referential nonsense. At least add some becauses into your arguments; “it has no X” is a simplification far below the level of usefulness here.

Anyway, how does that prevent creating AI personas again?


There is no intent, so it is no more discriminatory than removing autumn leaves is discrimination against trees.


Waymo's driverless taxis are currently operating in San Francisco, Los Angeles and Phoenix.


I am willing to bet that even when driverless taxis are operating in at least 50% of big cities around the world, you will still see comments like "auto driving is a pipe dream like NFT" on HN every other day.


This kind of hypocrisy only exists in a made-up person. Anyone saying that autonomous vehicles are still a ways away isn't talking about the very impressive but very much semi-autonomous vehicles deployed today, but about vehicles that will never need a human operator: the kind you could buy off the shelf, switch into taxi mode, and let do their thing.

Semi-autonomous vehicles are impressive for the fact that one driver can now scale well beyond a single vehicle. Fully-autonomous vehicles are impressive because they can scale limitlessly. The former is evolutionary, the latter is revolutionary.


Have we ever observed revolutionary change in tech which ran contrary to evolutionary change?

This seems like such an odd thing to expect will just "happen". Any other world-changing or impressive tech I'm familiar with has evolved to its current state over time; it's not as if, when Jobs announced the iPhone and changed the game, there weren't decades of mobile computing whose shoulders it stood on. Are you talking about something like crypto?

It's admittedly a bit confusing what you're asking for here.


Notably, not Musk's, and very different promised functionality.


What did Musk's promised driverless taxis provide that existing driverless taxis don't? The tech has arrived; it's a car that drives itself while the passenger sits in the back. Is the "gotcha" that the car isn't a Tesla?

Seems like we're splitting hairs a bit here.


He promised that you'd be able to turn your own Tesla into an autonomous taxi that would earn you money. That is a massive lie, not splitting hairs. Obviously, we're very desensitized to lying rats - but that's what he did.


https://www.reuters.com/technology/tesla-robotaxis-by-june-m...

Looks like it's still in the works. Sometimes when technologists promise something, their timelines end up being overly optimistic. I'm sure this isn't news to you.

By your language and past commentary though this seems like the kind of thing which elicits a pretty emotional response from you, so I'm not sure this would be a productive conversation either way.


I don't understand it either. Why is Elon Musk a "terrorist"? And why is this the most upvoted post? Maybe being European limits my ability to comprehend American political rhetoric.


[flagged]


> The American left, who hates Musk because he’s an outspoken and unrepentant capitalist (among other reasons)

Ah yes, all the criticism I hear about Musk from my liberal friends has everything to do with his "unrepentant capitalism".

You either have a small or very peculiar social circle. Your comment isn't technically wrong since you were thoughtful enough to add "(among other reasons)" but it almost beggars belief that you'd be serious here. Do I really have to spell out why the left hate Musk?


Yes, I’m apparently not the only one wondering why you consider him a terrorist, so please do.


First: I did not use the word terrorist, nor do I know anyone who has. I am sure some people feel that way, particularly those close to communities in which he is inspiring fear and anxiety.

As for why I and "the left" don't like him? To be honest, I don't really want to have that conversation in an internet forum, and your incuriosity in your original comment, coupled with inserting words into my mouth, gives me zero belief that this will be a good-faith conversation. Just ask ChatGPT if you really want to know, or I'm sure there are Reddit threads on the subject.


We are responding below a comment that called him a financial terrorist. Did you not even read it?

And, predictably, you have nothing of substance to offer.


[flagged]


The guy that is privately accumulating as much money as possible is trying to benefit us all?


How about a consequentialist argument? In some fields, AI has already surpassed physicians in diagnosing illnesses. If breaking copyright laws allows AI to access and learn from a broader range of data, it could lead to earlier and more accurate diagnoses, saving lives. In this case, the ethical imperative to preserve human life outweighs the rigid enforcement of copyright laws.


There’s nothing particular to AI about your comment; it’s a general downside of IP.


No, the development of an artificial general intelligence does seem like a special case compared to usual IP debates, particularly in the potential multiplicative positive-sum effects on society overall.

