Much greater than now, given the open discoverability of the original post here, versus the walled-off content we have today, locked away in Discord servers and the like.

Furthermore, the act of replying to that post will have bumped it right back to the top for everyone to see.


I agree with this. We sorely miss these forums with their civil replies, now overshadowed by "influencer" culture, which is optimized for engagement incentives. Pure discussions like this one are such stalwarts of the open web.

On the other hand, small websites and forums can disappear, but that openness allows platforms like archive.org to capture and "fossilize" them.


These forums still exist, typically with much older and more mature discussions, as the users have aged alongside the forums. Nothing is stopping you from joining them now.

My Something Awful forums account is over 25 years old at this point. The software, standards, and moderation style are approximately unchanged, complete with the 10 dollar sign-up fee to keep out the spam.


Like mosquitos trapped in amber, preserving hidden blocks of knowledge

That is a well recognised part of the LLM cycle.

A model or new model version X is released, and everyone is really impressed.

3 months later, "Did they nerf X?"

It's been this way since the original ChatGPT release.

The answer is typically no; it's just that your expectations have risen. What was previously a mind-blowing improvement is now expected, and any missteps feel amplified.


This is not always true. LLMs do get nerfed, and quite regularly: usually because the provider discovers that users are using them more than expected, because of user abuse, or simply because the model attracts a larger user base. One of the recent nerfs is the Gemini context window, which was drastically reduced.

What we need is an open and independent way of testing LLMs, and stricter regulation requiring disclosure of product changes when the product is paid for under a subscription or prepaid plan.


There's at least one site doing this: https://aistupidlevel.info/

Unfortunately, it has paywalled most of the historical data since I last looked at it, but it's interesting that Opus has dipped below Sonnet in overall performance.


Interesting! I was just thinking about pinging the creator of simple-bench.com and asking them whether they intend to re-benchmark models after 3 months. I've noticed Gemini models in particular dramatically reducing in quality after the initial hype cycle. Gemini 3 Pro _was_ my top performer and has slowly declined to 'is it worth asking', complete with gpt-4o style glazing. It's been frustrating. I had been working on a very custom benchmark, and over the course of it Gemini 3 Pro and Flash both started underperforming by 20% or more. I wondered if I had subtly broken my benchmark, but ultimately started seeing the same behavior in general online queries (Google AI Studio).

> What we need is an open and independent way of testing LLMs

I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.


How hard is benchmarking models actually?

We've got a lot of available benchmarks & modifying at least some of those benchmarks doesn't seem particularly difficult: https://arc.markbarney.net/re-arc

To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.

What am I missing here?


Except the time it got to the point that Anthropic had to acknowledge it? Which also revealed they don't have monitoring?

https://www.anthropic.com/engineering/a-postmortem-of-three-...


I usually agree with this. But I am using the same workflows and skills that were a breeze for Claude, and they are now causing it to run in cycles and require intervention.

This is not the same thing as an "omg, the vibes are off"; it's reproducible. I am using the same prompts and files and getting way worse results than with any other model.


When I once had that happen in a really bad way, I discovered I had written something wildly incorrect into the readme.

It has a habit of trusting documentation over the actual code itself, causing no end of trouble.

Check your claude.md files (both the local and the user-level one) too; there could be something lurking there.

Or maybe it has horribly regressed, but that hasn't been my experience, certainly not back to Sonnet levels of needing constant babysitting.


I'm an x20 Max user who's on it daily. It's been unusable for the last 2 days. GLM in OpenCode and my local Qwen were more reliable. I wish I were exaggerating.

Also, people who were lucky and had lots of success early on, but then start to run into the actual problems of LLMs, will experience that as "it was good and then it got worse" even when it didn't actually change.

If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.

People are really failing to understand the probabilistic nature of all of this.

"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.


Just because it's been true in the past doesn't mean it will always be the case.

Opus was, is, and will remain a non-deterministic probability machine for the foreseeable future. The variance eventually shows up when you push it hard.

Eh, I've definitely had issues where Claude can no longer easily do what it previously did. That's with constantly documenting things well in appropriate markdown files and resetting context here and there to keep confusion to a minimum.

Indeed, increasing the incentive for companies to reject (and then sometimes silently fix anyway) even valid reports would only create further misery for everyone.

This makes me think it would be interesting to set LLMs up in a game of Diplomacy, an entirely text-based game that softly (rather than hard) requires a degree of backstabbing to win.

The finding in this game that the "thinking" model never did any thinking seems odd; does the model not always show its thinking steps? It seems bizarre that it wouldn't once reach for that tool when it must be bombarded with seemingly contradictory information from other players.



Thanks, it would be fascinating to repeat that today; a lot has changed since 2022, especially with respect to the consistency of longer-term outcomes.

It’s been done before

https://every.to/diplomacy (June 2025)


Reading more, I'm a little disappointed that the write-up has seemingly leant so heavily on LLMs too, because it detracts from the credibility of the study itself.

Fair point. The core simulation and data collection were done programmatically: 162 games, raw logs, win rates. The analysis of gaslighting phrases and patterns was human-reviewed. I used LLMs to help with the landing page copy, which I should probably disclose more clearly. The underlying data and methodology are solid; you can check them here: https://github.com/lout33/so-long-sucker

So, this link is actually 5 days old; if you hover over the "2 hours ago" you'll see a date from 5 days ago.

HN second-chance pool shenanigans.


Can you point to any documentation which explains how this works?

Genuinely interested.



There was one much more successful EV, although it too was niche: The UK had "perhaps 40,000 milk floats" in the 1970s and 1980s before supermarkets took over as primary milk distributors. (https://zavanak.com/transport-topics/british-electric-cv-his...)

When I was a kid in Edinburgh, no milk was delivered by ICE vehicle. It was either electric or horse-drawn. Also Sean Connery's first job.

If only there were some kind of international system of standard units.

Olympic swimming pools for liquids, times around the earth for length, and number of double-decker buses for height.

You jest, but times around the Earth is the actual origin of the meter. Kinda.

The history is quite interesting and well worth checking out.

I can't recommend a book on the subject, but I do heartily recommend "Longitude", which is about the challenges of inventing the first maritime chronometers for the purpose of accurately measuring longitude.


The original meter (1790s France) was defined as 1/10,000,000 of the distance from the equator to the North Pole along a meridian.

Not sure if you're correcting me, but yes, that is "a" path around the Earth.

It's not the most aesthetic one, but at the time it was the most practical one to measure.
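
A quick back-of-the-envelope check tying the two comments together, using the 1/10,000,000 quadrant definition quoted above:

    quadrant_m = 10_000_000              # metres from equator to pole, by the 1790s definition
    meridian_km = 4 * quadrant_m / 1000  # four quadrants make a full polar circumference
    print(meridian_km)                   # 40,000 km: one trip around the Earth via the poles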


For smaller lengths, and for radiation, bananas are also acceptable.

A good physicist can calculate the banana equivalent dose in their head. Always ask for it when dealing with radiation.

Don't forget packs of cigarettes as a more convenient unit for measuring volumes significantly smaller than Olympic-sized pools.

There is, of course, no more need to standardize on a specific brand or style of cigarette than on a specific depth of Olympic-sized pool.


Don’t forget cheetahs for velocity and elephants for weight.

A horse as a measure of power and a crocodile bite as a measure of compression strength

Or vice versa.

I thought all measurements in data centers were in US football fields.

For the floor area or length/width it is, but if you want the height then that's in Empire State Buildings.

There are standard units, yes.

Business rates are a devolved matter; Scotland sets its own rates.

I always think that, by law, any ISP that advertises a speed and has a cap should have to express the cap in terms of the advertised speed.

So telcos can advertise "Up to 200Mbps" for their package.

But then if they have a 2GB cap, they also need to say, "Caps at 80 seconds of usage".

Because that's what you're paying for at that speed, 80 seconds of usage per month.

Sure, you're not always (or perhaps ever) doing 200Mbps, but then you're not getting the speed you paid for.
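
For reference, the 80-second figure is just the cap converted to bits and divided by the advertised rate; a quick sketch (decimal gigabytes and the function name are my own assumptions):

    # Express a data cap as seconds of usage at the advertised speed.
    def cap_in_seconds(cap_gigabytes: float, advertised_mbps: float) -> float:
        cap_bits = cap_gigabytes * 8e9          # decimal GB -> bits
        advertised_bps = advertised_mbps * 1e6  # Mbps -> bits per second
        return cap_bits / advertised_bps

    print(cap_in_seconds(2, 200))  # 80.0 -> a 2GB cap lasts 80 seconds at 200Mbps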


I don't think that makes sense; most connections you make never reach 200Mbps because they don't need to.

That's kind of my point: ISPs use that max speed in their advertising when it isn't really relevant, especially if running at it would hit your cap in a minute or two.

It is relevant, though. I have 1.2 Gbps down with a 2 TB monthly cap. I've never hit the monthly cap even once, but by your standard I have "1.2 Gbps down for 3 hours, 42 minutes".

But that doesn't change the reality that it matters to me that a 20 GB video that a friend took at my wedding downloads in just 2 minutes rather than the ~30 minutes it would take if I had a 100 Mbps connection.


Right, but 3+ hours of top speed per month is a lot, 80 seconds isn't.

Your cap is over 150 times that equivalent. If you had an 80 second hard cap, you couldn't even download that 20GB video.
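
The same arithmetic applied to the numbers above, again assuming decimal TB/GB:

    cap_seconds = 2e12 * 8 / 1.2e9    # 2TB cap at 1.2Gbps
    video_seconds = 20e9 * 8 / 1.2e9  # 20GB video at 1.2Gbps
    print(cap_seconds)                # ~13,333 s, i.e. about 3 h 42 min
    print(cap_seconds / 80)           # ~167x the 80-second example
    print(video_seconds)              # ~133 s, so it wouldn't fit in an 80 s cap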


1.2Gbps down but only a 2TB cap? I hope that's really cheap, since if I paid for that I'd expect to do stuff like downloading LLMs all the time.

The lichess one might be in "multi-line" mode
