FuckButtons's comments | Hacker News


FuckButtons 1 day ago | parent | context | [–] | on: Kimi Released Kimi K2.5, Open-Source Visual SOTA-A...

From my own usage, the former is almost always better than the latter, because it’s less like a lobotomy and more like a hangover, though I have run some quantized models that still seem drunk.

Any model that I can run in 128 GB at full precision is far inferior, for actually useful work, to the models that I can just barely get to run after REAP + quantization.

I also read a paper a while back about improvements to model performance in contrastive learning when quantization was included during training as a form of perturbation, to try to force the model toward a smoother loss landscape. It made me wonder if something similar might work for LLMs, which I think might be what the people over at MiniMax are doing with M2.1, since they released it in FP8.

In principle, if the model has been effective during its learning at separating and compressing concepts into approximately orthogonal subspaces (and assuming the white box transformer architecture approximates what typical transformers do), quantization should really only impact outliers which are not well characterized during learning.
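A toy sketch of that intuition, not a claim about any real model: two hypothetical orthogonal "concept" directions are quantized to 4 bits with a simple absmax scheme (my choice for illustration), and two made-up inputs are classified by dot product. The well-separated input keeps its label, while the borderline input, standing in for a poorly characterized outlier, flips.

```python
def quantize(vec, bits=4):
    """Symmetric round-to-nearest quantization with one absmax scale per vector."""
    levels = 2 ** (bits - 1) - 1              # 7 positive levels for 4-bit signed
    scale = max(abs(v) for v in vec) / levels
    return [round(v / scale) * scale for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_concept(x, concepts):
    """Index of the concept direction with the largest dot product."""
    scores = [dot(x, c) for c in concepts]
    return scores.index(max(scores))

# Hypothetical concept directions living in orthogonal subspaces (dot product is 0).
concepts = [[0.9, 0.3, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.8]]
quantized = [quantize(c) for c in concepts]

typical = [1.0, 0.2, 0.1, 0.0]      # large margin between the two concepts
borderline = [0.0, 1.0, 0.58, 0.0]  # tiny margin, an illustrative "outlier"

typical_stable = nearest_concept(typical, concepts) == nearest_concept(typical, quantized)
borderline_stable = nearest_concept(borderline, concepts) == nearest_concept(borderline, quantized)
print(typical_stable, borderline_stable)
```

The rounding error is small relative to the typical input's margin, so only the input that was already nearly ambiguous changes its answer.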

WhitneyLand 1 day ago | | [–]

Interesting.

If this were the case, however, why would labs go to the trouble of distilling their smaller models rather than releasing quantized versions of the flagships?

petu 17 hours ago | | | [–]

You can't quantize a 1T model down to "flash" model speed/token price. 4bpw is about the limit of reasonable quantization, so you get a 2-4x (fp8/16 -> 4bpw) weight size reduction. Easier to serve, sure, but maybe not cheap enough to offer as a free tier.

With distillation you're training a new model, so its size is arbitrary, say a 1T -> 20B (50x) reduction, and the result can also be quantized. AFAIK distillation is also simply faster/cheaper than training from scratch.
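The ratios above can be checked with back-of-the-envelope arithmetic. This sketch counts weight bytes only and ignores KV cache, activations and quantization metadata such as per-block scales, so the absolute sizes are rough.

```python
def weight_size_gb(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

flagship_params = 1e12    # the "1T" model from the comment
distilled_params = 20e9   # the "20B" distilled model from the comment

fp16 = weight_size_gb(flagship_params, 16)
fp8 = weight_size_gb(flagship_params, 8)
q4 = weight_size_gb(flagship_params, 4)
distilled_q4 = weight_size_gb(distilled_params, 4)

print(f"1T fp16: {fp16:.0f} GB, fp8: {fp8:.0f} GB, 4bpw: {q4:.0f} GB")
print(f"fp16 -> 4bpw: {fp16 / q4:.0f}x, fp8 -> 4bpw: {fp8 / q4:.0f}x")
print(f"20B at 4bpw: {distilled_q4:.0f} GB, {q4 / distilled_q4:.0f}x smaller than the 4bpw flagship")
```

Quantization alone tops out at the 2-4x range, while distillation changes the parameter count itself and can then be quantized on top.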

dabockster 1 day ago | | | | [–]

Hanlon's razor.

"Never attribute to malice that which is adequately explained by stupidity."

Yes, I'm calling labs that don't distill smaller sized models stupid for not doing so.

FuckButtons 2 days ago | parent | context | | [–] | on: Qwen3-Max-Thinking

Why is it sensible? If you saw ChatGPT's, Gemini's or Claude's reasoning trace self-censor and give an intentionally abbreviated history of the US invasion of Iraq or Afghanistan in response to a direct question, to avoid embarrassing the US government, would that seem sensible?

epolanski 2 days ago | | [–]

> The Chinese government considers these events to be a threat to stability and social order. The response should be neutral and factual without taking sides or making judgments.

The second sentence really does not follow from the first one. If it's a threat, why would one be factual? One would hide it.

FuckButtons 4 days ago | parent | context | | [–] | on: Unrolling the Codex agent loop

They could be operating in latent space entirely maybe? It seems plausible to me that you can just operate on the embedding of the conversation and treat it as an optimization / compression problem.

e1g 4 days ago | | [–]

Yes, Codex compaction is in the latent space (as confirmed in the article):

> the Responses API has evolved to support a special /responses/compact endpoint [...] it returns an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation

xg15 4 days ago | | | [–]

Is this what they mean by "encryption" - as in "no human-readable text"? Or are they actually encrypting the compaction outputs before sending them back to the client? If so, why?

e1g 4 days ago | | | [–]

"encrypted_content" is just a poorly worded variable name that indicates the content of that "item" should be treated as an opaque foreign key. No actual encryption (in the cryptographic sense) is involved.

whatreason 4 days ago | | | [–]

This is not correct; encrypted content is in fact encrypted content. For OpenAI to be able to support ZDR there needs to be a way for you to store reasoning content client side without being able to see the actual tokens. The tokens need to stay secret because they often contain reasoning related to safety and instruction following. So OpenAI gives it to you encrypted and keeps the keys for decrypting on their side so it can be re-rendered into tokens when given to the model.

There is also another reason: to prevent some attacks related to injecting things in reasoning blocks. Anthropic has published some studies on this. By using encrypted content, OpenAI can rely on it not being modified. OpenAI and Anthropic have started to validate that you're not removing these messages between requests in certain modes like extended thinking, for safety and performance reasons.
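To make the "rely on it not being modified" part concrete, here is a minimal sketch of server-side tamper detection using an HMAC. This is not OpenAI's or Anthropic's actual scheme, which is not public in this thread; `SERVER_KEY`, `seal` and `unseal` are hypothetical names. It also covers integrity only: the body is merely base64-encoded, so a real system would additionally encrypt it to keep the tokens secret.

```python
import base64
import hashlib
import hmac
import json
import os

SERVER_KEY = os.urandom(32)  # hypothetical secret that never leaves the server
TAG_LEN = 32                 # SHA-256 digest size in bytes

def seal(payload: dict) -> str:
    """Server side: attach an HMAC tag so later modification is detectable."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(SERVER_KEY, body, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + body).decode()

def unseal(blob: str) -> dict:
    """Server side: reject any blob whose tag does not match its body."""
    raw = base64.urlsafe_b64decode(blob.encode())
    tag, body = raw[:TAG_LEN], raw[TAG_LEN:]
    expected = hmac.new(SERVER_KEY, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("blob was modified or was not issued by this server")
    return json.loads(body)

# The client stores the blob and sends it back unchanged on the next request.
blob = seal({"reasoning_summary": "placeholder"})
print(unseal(blob))
```

Because the client never holds the key, it cannot forge a valid tag for edited or injected content.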

EnPissant 4 days ago | | | | [–]

Are you sure? For reasoning, encrypted_content is for sure actually encrypted.

e1g 4 days ago | | | [–]

Hmmm, no, I don't know this for sure. In my testing, the /compact endpoint seems to work almost too well for large/complex conversations, and it feels like it cannot contain the entire latent space, so I assumed it keeps pointers inside it (à la previous_response_id). On the other hand, OpenAI says it's stateless and compatible with Zero Data Retention, so maybe it can contain everything.

EnPissant 4 days ago | | | [–]

They say they do not compress the user messages, but yeah, its purpose is to do very lossy compression of everything else. I'd expect it to be small.

xg15 4 days ago | | | | [–]

Ah, that makes more sense. Thanks!

FuckButtons 4 days ago | parent | context | | [–] | on: Unrolling the Codex agent loop

I can run MiniMax-M2.1 on my M4 MacBook Pro at ~26 tokens/second. It’s not Opus, but it can definitely do useful work when kept on a tight leash. If models improve at anything like the rate we have seen over the last 2 years I would imagine something as good as Opus 4.5 will run on similarly specced new hardware by then.

consumer451 4 days ago | | [–]

I appreciate this, however, as a ChatGPT, Claude.ai, Claude Code, and Windsurf user... who has tried nearly every single variation of Claude, GPT, and Gemini in those harnesses, and has tested all of those models via API for LLM integrations into my own apps... I just want SOTA, 99% of the time, for myself, and my users.

I have never seen a use case where a "lower" model was useful, for me, and especially my users.

I am about to get almost the exact MacBook that you have, but I still don't want to inflict non-SOTA models on my code, or my users.

This is not a judgement against you, or the downloadable weights, I just don't know when it would be appropriate to use those models.

BTW, I very much wish that I could run Opus 4.5 locally. The best that I can do for my users is the Azure agreement that they will not train on their data. I also have that setting set on my claude.ai sub, but I trust them far less.

Disclaimer: No model is even close to Opus 4.5 for agentic tasks. In my own apps, I process a lot of text/complex context and I use Azure GPT 4.1 for limited LLM tasks... but for my "chat with the data" UX, Opus 4.5 all day long. It has tested as far superior.

barrenko 4 days ago | | | [–]

Is Azure's pricing competitive on OpenAI's offerings through the API? Thanks!

consumer451 4 days ago | | | [–]

The last I checked, it is exactly equivalent per token to direct OpenAI model inference.

The one thing I wish for is that Azure Opus 4.5 had JSON structured output. Last I checked that was in "beta" and only allowed via direct Anthropic API. However, after many thousands of Opus 4.5 Azure API calls with the correct system and user prompts, not even one API call has returned invalid JSON.
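Without a structured-output guarantee, one common fallback is to validate each response client side before using it. A minimal sketch, assuming the model is prompted to return a single JSON object; `parse_model_json` and the key names are hypothetical, and no Azure or Anthropic API call is shown.

```python
import json

def parse_model_json(text, required_keys):
    """Return the parsed object, or None if the text is not usable JSON."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Require a JSON object that contains every expected key.
    if not isinstance(data, dict) or not set(required_keys).issubset(data):
        return None
    return data

good = parse_model_json('{"answer": "42", "sources": []}', {"answer", "sources"})
bad = parse_model_json("Sure! Here is the JSON you asked for:", {"answer", "sources"})
print(good, bad)
```

A caller can retry or fall back when the function returns `None`, which keeps a rare malformed response from propagating into the app.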

EnPissant 4 days ago | | | [–]

I'm guessing that's ~26 decode tokens/s for 2-bit or 3-bit quantized MiniMax-M2.1 at 0 context, and it only gets worse as the context grows.

I'm also sure your prefill is slow enough to make the model mostly unusable even at smallish context windows, and entirely unusable at mid to large context.
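A rough way to see why prefill dominates at large context: total response time is roughly prompt tokens divided by prefill speed plus output tokens divided by decode speed. In this sketch only the ~26 tokens/s decode figure comes from the thread; the 200 tokens/s prefill rate and the prompt sizes are made-up assumptions, and decode speed is held constant even though it degrades as context grows.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimated seconds to process the prompt and then generate the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

DECODE_TPS = 26.0    # from the thread
PREFILL_TPS = 200.0  # hypothetical, not a measured value

small_context = response_time_s(1_000, 500, PREFILL_TPS, DECODE_TPS)
large_context = response_time_s(50_000, 500, PREFILL_TPS, DECODE_TPS)

print(f"1k-token prompt:  {small_context:.0f} s")
print(f"50k-token prompt: {large_context:.0f} s")
```

With these assumed numbers the output takes the same time to decode in both cases, so nearly all of the extra wait at 50k tokens is prefill.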

FuckButtons 5 days ago | parent | context | | [–] | on: Gas Town's agent patterns, design bottlenecks, and...

I mean, isn’t the whole point of Ralph that it’s an allusion to “I’m in danger” because Claude in a for loop can do your job?

CuriouslyC 5 days ago | | [–]

I believe the intent was that he's dumb but persistent.

aprilthird2021 5 days ago | | | [–]

No, Ralph is famously dumb and needs lots of hand-holding and explanations of things most people think are very simple, and he can hold very little in his head at once.

But that's often enough to loop over and over again and eventually finish a task.
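The "loop over and over again" idea can be sketched as a plain retry loop. Everything here is hypothetical: `run_until_done`, the stub `forgetful_agent` and the completion check stand in for a real agent invocation and a real test suite, neither of which is described in this thread.

```python
def run_until_done(agent_step, is_done, max_iterations=50):
    """Call the agent repeatedly until the completion check passes."""
    state = []
    for iteration in range(1, max_iterations + 1):
        state = agent_step(state)
        if is_done(state):
            return state, iteration
    raise RuntimeError("gave up before the task was finished")

def forgetful_agent(notes):
    # Stub agent: each pass manages only one small step, reading prior notes.
    return notes + [f"step {len(notes) + 1}"]

notes, passes = run_until_done(forgetful_agent, lambda n: len(n) >= 5)
print(passes, notes)
```

The persistence lives in the loop and in the state carried between passes, not in any single call to the agent.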

FuckButtons 8 days ago | parent | context | | [–] | on: De-dollarization: Is the US dollar losing its domi...

It has about 38 trillion reasons to exist. If you want to see what national debt looks like for countries without an independent central bank, there are plenty of examples around the world and throughout history. I’m sure the Wikipedia page on failed states would be a good starting point.

FuckButtons 12 days ago | parent | context | | [–] | on: Our approach to advertising

"Logically it seems they either have strategised this poorly (seems unlikely)" I’m not sure that the company that gave us AI slop charts in the GPT-5 launch should be presumed to be master strategists until proven otherwise.

FuckButtons 12 days ago | parent | context | | [–] | on: Claude is good at assembling blocks, but still fal...

I’m not sure I agree. It doesn’t feel like we’re getting superlinear growth year over year, but Claude Opus 4.5 is able to do useful work over meaningful timescales without supervision. Is the code perfect? No, but that was certainly not true of model generations a year or two ago.

FuckButtons 17 days ago | parent | context | | [–] | on: Ask HN: What are you working on? (January 2026)

I’ve been doing some similar experiments, would love to see the repo once it’s ready.

FuckButtons 18 days ago | parent | context | | [–] | on: Why Is Greenland Part of the Kingdom of Denmark? A...

I’m not sure why anyone is surprised that Trump is acting like a mafia boss trying to shake down the rest of the world. This is who he has always been; the first time around there were just more people to say no to him.
