> Second, and no less important, AI Studio is genuinely the best chat interface on the market. It was the first platform where you could edit any message in the conversation, not just the last one, and I think it's still the only platform where you can edit AI responses as well! So if the model goes on an unnecessary tangent, you can just remove it from the context. It's still the only platform where if you have a long conversation like R(equest)1, O(utput)1, R2, O2, R3, O3, R4, O4, R5, O5, you can click regenerate on R3 and it will only regenerate O3, keeping R4 and all subsequent messages intact.
Isn't discussion editing a standard feature in chat interfaces? I've been using koboldcpp since I first tried LLMs (mainly because it is written in C++ and largely self-contained) and you can edit the entire discussion as a single text buffer, but even the example HTTP server for llama.cpp allows editing the discussion.
And yeah it can be useful for coding since you can edit the LLM's response to fix mistakes (and add minor features/tweaks to the code) and pretend it was correct from the get go instead of trying to roleplay with someone who makes mistakes you then have to correct :-P
> you can click regenerate on R3 and it will only regenerate O3, keeping R4 and all subsequent messages intact.
What's a use case for this? I'm trying to imagine why you'd want that, but I can't see it. Is it for the horny people? If you're trying to do anything useful, having messages edited should re-generate the following conversation as well (tool calls, etc).
Imagine in R2 you ask it to write a Pong game in C using SDL, in R3 you ask it to write a CMake file, and in R4 you ask it to make the paddles red and green. Then around R6 you want to modify the structure and you realize what a catastrophic mistake for your sanity CMake was, so you go back and ask it to use premake in R3 instead. That way R6 will only show how to update the premake file, wiping clean the existence of CMake (from the discussion and your project).
In R3 you ask it to implement feature 1, then you move on to building stuff on top. You request feature 2 in R4, and after looking at O4 you see that there was an unintended consequence of a particular design choice in O3. So you can go back, update prompt R3, regenerate O3, and have your detailed prompt R4 remain in place.
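To make the behaviour concrete, here is a minimal sketch of "regenerate O3 but keep R4 onward". It is not how AI Studio is implemented; `Turn`, `regenerate_in_place`, and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    request: str  # R_n, what the user typed
    output: str   # O_n, what the model answered


def regenerate_in_place(
    turns: List[Turn],
    index: int,
    generate: Callable[[List[str]], str],
    new_request: Optional[str] = None,
) -> List[Turn]:
    """Regenerate only turns[index].output; every later turn is left untouched."""
    if new_request is not None:
        turns[index].request = new_request
    # The model only sees the history up to and including the edited request.
    context: List[str] = []
    for earlier in turns[:index]:
        context += [earlier.request, earlier.output]
    context.append(turns[index].request)
    turns[index].output = generate(context)
    return turns


def fake_generate(context: List[str]) -> str:
    # Hypothetical stand-in for a model call, so the sketch runs offline.
    return f"reply to {context[-1]!r} given {len(context) - 1} earlier messages"


turns = [Turn(f"R{i}", f"O{i}") for i in range(1, 6)]
regenerate_in_place(turns, 2, fake_generate, new_request="R3 (use premake)")
```

After the call, only the third turn has changed; R4/O4 and R5/O5 are still the original objects. The usual "edit and truncate" behaviour in other chat UIs would instead drop `turns[index + 1:]`.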
I'm sceptical of these Google-made AI builders. I just had a bad experience with Firebase Studio, which was stuck on a vulnerable version of Next.js, and Gemini couldn't update it to a non-vulnerable version properly.
It tries to force vendor lock-in from the start. Guh... avoid.
Social movements don't need to be quantifiably better to take off.
When the relevant audience is bored enough to be open to something new, it only takes a few influential people to tip the scales.
People don't want to be truly revolutionary; that takes actual risk. They want the appearance of being revolutionary with minimal downside and social reassurance.
(w/r/t GitHub there's already enough buzz in the right circles and it will likely happen this year.)
We're not yet to the point where a single PCIe device will get you anything meaningful; IMO 128 GB of RAM available to the GPU is essential.
So while you don't need a ton of compute on the CPU, you do need the ability to address multiple PCIe lanes. A relatively low-spec AMD EPYC processor is fine if the motherboard exposes enough lanes.
There is plenty that can run within 32/64/96 GB of VRAM.
IMO models like Phi-4 are underrated for many simple tasks.
Some quantized Gemma 3 models are quite good as well.
There are larger/better models as well, but those tend to really push the limits of 96 GB.
FWIW when you start pushing into 128 GB+, the ~500 GB models really start to become attractive because at that point you’re probably wanting just a bit more out of everything.
IDK all of my personal and professional projects involve pushing the SOTA to the absolute limit. Using anything other than the latest OpenAI or Anthropic model is out of the question.
Smaller open source models are a bit like 3D printing in the early days; fun to experiment with but really not that valuable for anything other than making toys.
Text summarization, maybe? But even then I want a model that understands the complete context and does a good job. Even things like "generate one sentence about the action we're performing" I usually find I can just incorporate it into the output schema of a larger request instead of making a separate request to a smaller model.
It seems to me like the use case for local GPUs is almost entirely privacy.
If you buy a 15k AUD RTX 6000 96GB, that card will _never_ pay for itself on a gpt-oss:120b workload vs just using OpenRouter, no matter how many tokens you push through it, because the cost of residential power in Australia means you cannot generate tokens cheaper than the cloud even if the card were free.
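A rough way to check this kind of claim for your own setup is to work out the electricity cost per million tokens and compare it with whatever your provider charges. This is only a sketch: the function name is made up, and the wattage, throughput, and tariff below are placeholder numbers, not measurements of any real card or power plan.

```python
def electricity_cost_per_million_tokens(
    power_watts: float,
    tokens_per_second: float,
    price_per_kwh: float,
) -> float:
    """Electricity-only cost of generating one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second
    kwh = (power_watts / 1000) * (seconds / 3600)
    return kwh * price_per_kwh


# Placeholder inputs (assumptions): 600 W draw, 100 tokens/s, 0.35 AUD per kWh.
local_cost = electricity_cost_per_million_tokens(
    power_watts=600, tokens_per_second=100, price_per_kwh=0.35
)
print(f"{local_cost:.2f} AUD per million tokens (electricity only)")
```

If that number is above the cloud price per million tokens, the hardware can never pay for itself on cost alone, whatever you paid for it. Batching many requests at once raises effective tokens per second and changes the picture, so plug in your own measured throughput.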
> because the cost of residential power in Australia
So this doesn't really matter to your overall point, which I agree with, but:
The rise of rooftop solar and home battery energy storage flips this a bit now in Australia, IMO. At least where I live, every house has a solar panel on it.
Not worth it just for local LLM usage, but an interesting change to energy economics IMO!
- You can use the GPU for training and run your own fine-tuned models
- You can have much higher generation speeds
- You can sell the GPU on the used market in ~2 years time for a significant portion of its value
- You can run other types of models like image, audio or video generation that are not available via an API, or cost significantly more
- Psychologically, you don’t feel like you have to constrain your token spending and you can, for instance, just leave an agent to run for hours or overnight without feeling bad that you just “wasted” $20
- You won’t be running the GPU at max power constantly
This is simply not true. Your heuristic is broken.
The recent Gemma 3 models, which are produced by Google (a little startup - heard of em?) outperform the last several OpenAI releases.
Closed does not necessarily mean better. Plus the local ones can be finetuned to whatever use case you may have, won't have any inputs blocked by censorship functionality, and you can optimize them by distilling to whatever spec you need.
Anyway, all that is extraneous detail. The important thing is to decouple "open" and "small" from "worse" in your mind. The most recent Gemma 3 model specifically is incredible, and it makes sense, given that Google has access to many times more data than OpenAI for training (something like a factor of 10 at least). That is a very straightforward idea to wrap your head around: Google was scraping the internet for decades before OpenAI even entered the scene.
So just because their Gemma model is released in an open-source (open weights) way, doesn't mean it should be discounted. There's no magic voodoo happening behind the scenes at OpenAI or Anthropic; the models are essentially of the same type. But Google releases theirs to undercut the profitability of their competitors.
DDR5 is ~8GT/s, GDDR6 is ~16GT/s, GDDR7 is ~32GT/s. It's faster but the difference isn't crazy and if the premise was to have a lot of slots then you could also have a lot of channels. 16 channels of DDR5-8200 would have slightly more memory bandwidth than RTX 4090.
Yeah, so DDR5 is 8GT and GDDR7 is 32GT.
Bus width is 64 vs 384. That already makes the VRAM 4*6 (24) times faster.
You can add more channels, sure, but each channel makes it less and less likely for you to boot. Look at modern AM5 struggling to boot at over 6000 with more than two sticks.
So you’d have to get an insane six channels to match the bus width, at which point your only choice to be stable would be to lower the speed so much that you’re back to the same orders of magnitude difference, really.
Now we could instead solder that RAM, move it closer to the GPU and cross-link channels to reduce noise. We could also increase the speed and oh, we just invented soldered-on GDDR…
The bus width is the number of channels. They don't call them channels when they're soldered but 384 is already the equivalent of 6. The premise is that you would have more. Dual socket Epyc systems already have 24 channels (12 channels per socket). It costs money but so does 256GB of GDDR.
> Look at modern AM5 struggling to boot at over 6000 with more than two sticks.
The relevant number for this is the number of sticks per channel. With 16 channels and 64GB sticks you could have 1TB of RAM with only one stick per channel. Use CAMM2 instead of DIMMs and you get the same speed and capacity from 8 slots.
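The arithmetic behind the channel comparison, as a quick sketch. Peak bandwidth is just transfer rate times total bus width; the helper name is made up, and the 21 GT/s figure is the commonly published spec for the RTX 4090's GDDR6X, so treat it as an assumption here.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    return transfer_rate_mt_s * 1e6 * (bus_width_bits / 8) / 1e9


DDR5_CHANNEL_BITS = 64  # one DDR5 channel, as discussed in the thread

# 16 channels of DDR5-8200 versus a 384-bit GDDR6X bus at 21 GT/s (assumed spec).
ddr5_16ch = peak_bandwidth_gb_s(8200, 16 * DDR5_CHANNEL_BITS)
rtx_4090 = peak_bandwidth_gb_s(21000, 384)

# A 384-bit bus is the same width as 6 DDR5 channels.
equivalent_channels = 384 // DDR5_CHANNEL_BITS

print(round(ddr5_16ch), round(rtx_4090), equivalent_channels)
```

This comes out to roughly 1050 GB/s for the 16-channel DDR5 setup against roughly 1008 GB/s for the card. These are theoretical peaks; sustained numbers on real hardware will be lower on both sides.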
But it would still be faster than splitting the model up on a cluster though, right? But I’ve also wondered why they haven’t just shipped GPUs like CPUs.
Man I'd love to have a GPU socket. But it'd be pretty hard to get a standard going that everyone would support. Look at sockets for CPUs, we barely had cross over for like 2 generations.
But boy, a standard GPU socket so you could easily BYO cooler would be nice.
The problem isn't the sockets. It costs a lot to spec and build new sockets, we wouldn't swap them for no reason.
The problem is that the signals and features that the motherboard and CPU expect are different between generations. We use different sockets on different generations to prevent you plugging in incompatible CPUs.
We used to have cross-generational sockets in the 386 era because the hardware supported it. Motherboards weren't changing so you could just upgrade the CPU. But then the CPUs needed different voltages than before for performance. So we needed a new socket to not blow up your CPU with the wrong voltage.
That's where we are today. Each generation of CPU wants different voltages, power, signals, a specific chipset, etc. Within the same ±1 generation you can swap CPUs because they're electrically compatible.
To have universal CPU sockets, we'd need a universal electrical interface standard, which is too much of a moving target.
AMD would probably love to never have to tool up a new CPU socket. They don't make money on the motherboard you have to buy. But the old motherboards just can't support new CPUs. Thus, new socket.
Would that be worth anything, though? What about the overhead of clock cycles needed for loading from and storing to RAM? Might not amount to a net benefit for performance, and it could also potentially complicate heat management I bet.
It might seem minor, but the little things add up. Making your dev environment mirror prod from the start will save you a bunch of headaches. Then, when you're ready to deploy, there is nothing to change.
Even better, stage to a production-like environment early, and then deploy day can be as simple as a DNS record change.
Just a warning for those not on the max plan; if you pay by the token or have the lower tier plans you can easily blow through $100s or cap your plan in under an hour. The rates for paying by the token are insane and the scaling from pro to max is also pretty crazy.
They made pro have many times more value than paying per token and then they made max again have 25x more tokens than pro on the $200 plan.
It’s a bit like being offered rice at $1 per grain (pay per token) or a tiny bag of rice for $20 (pro) or a truck load for $200. That’s the pricing structure right now.
So while I agree you can’t easily exceed the quota on the big plans, it’s a little crazy how they’ve tiered pricing. I hope no one out there’s paying per token!
Some companies are. Yes, for Claude Code. My co used to be like that as it's an easy ramp up instead of giving devs who might not use it that much a $150/mo seat; if you use it enough you can have a seat and save money, but if you're not touching $150 in credits a month just use the API. Oxide also recommends using API pricing. [0]
Wait, if I am providing essential data to your service, why am I paying you?
Perfect opportunity to run a project that benefits its users (monetarily) if you only did the legwork to market that value to map consumers. And, as a consumer, you don't need the sophisticated hardware, anyway.
"you don't need the sophisticated hardware, anyway."
It depends on what kind of map you are building, for which use cases, and how passive you want it to be. Sure, you can use an iPhone or Android device, but it's not very passive (requires starting up, etc.) and it will quickly overheat in hot conditions. We tried it, and most people gave up after a few weeks because it's not passive.
For most commercial fleets there is real value in the services we provide, eg monitoring, accident detection, remote video retrieval in case of accident, ELD compliance, etc.
You should read the article about rewards/incentives as it talks about that.
That's definitely a possible future abstraction and one area of the future of technology I'm excited about.
First we get to tackle all of the small ideas and side projects we haven't had time to prioritize.
Then, we start taking ownership of all of the software systems that we interact with on a daily basis; hacking in modifications and reverse engineering protocols to suit our needs.
Finally our own interaction with software becomes entirely boutique: operating systems, firmware, user interfaces that we have directed ourselves to suit our individual tastes.
It doesn't have to be easy to be factual. You simply are not owed entry into any country if you are not a citizen of that country, that is a fundamental part of what things like "citizenship" and "sovereign state" mean in the modern world.
And then wonder if they'll try to take your citizenship away anyway - the exact boat I'm in. Naturalized after almost 20 years of holding a GC, because I expected trouble with this administration - and now wondering if they'll try to take away my citizenship because I did it recently.
I actually expected to leave and have my right to come back not dependent on GC status (which expires after 6 months), but due to family have stayed so far. By the by - I'm a citizen of that dangerous country bordering the US - Canada.
I started using Django before the official 1.0 release and used it almost exclusively for years on web projects.
Lately I prefer to mix my own tooling and a couple major packages in for backends (FastAPI, SQLAlchemy) that are still heavily inspired by patterns I picked up while using Django. I end up with a little more boilerplate, but I also end up with a little more stylistic flexibility.
And then goes on to recommend AI Studio as a primary dev tool?! Baffling.