I agree that it's kind of magical that you can download a ~10GB file and suddenly your laptop is running something that can summarize text, answer questions and even reason a bit.

The trick is balancing model size vs RAM: 12B–20B is about the upper limit for a 16GB machine without it choking.
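
A rough sketch of the arithmetic behind that limit, assuming a ~4-5 bit (Q4-ish) quant and counting only the weights:

    # back-of-envelope weight footprint, assuming ~4.5 bits/weight for a Q4_K-style quant
    def weight_ram_gb(params_billions, bits_per_weight=4.5):
        return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

    for size in (8, 12, 20):
        print(f"{size}B ~= {weight_ram_gb(size):.1f} GB")  # 8B ~4.2, 12B ~6.3, 20B ~10.5

The KV cache, the runtime and macOS itself still need room on top of that.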

What I find interesting is that these models don't actually hit Apple's Neural Engine; they run on the GPU via Metal. Core ML isn't great for custom runtimes, and Apple hasn't given low-level developer access to the ANE afaik. And then there are memory bandwidth and dedicated SRAM issues. Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.



I feel like Apple needs a new CEO; I've felt this way for a long time. If I had been in charge of Apple, I would have embraced local LLMs and built an inference engine that optimizes models designed for Nvidia. I also would probably have toyed around with the idea of selling server-grade Apple Silicon processors and opening up the GPU spec so people can build against it. Apple seems to play it too safe. While Tim Cook is good as a COO, he's still running Apple as a COO. They need a man of vision, not a COO at the helm.


I think if Cook had vision, he could have started something called Apple Enterprise, sold Apple Silicon servers, and made AI chips. I agree he’s too conservative and has no product vision. Great manager though.


I was pleasantly surprised Apple Silicon came out at all. Someone at Apple has their eye on the long-term vision at least; they didn't do this on a whim.


Or someone told Tim "we can save $XYZ per phone if we switch to custom designed silicon, and potentially expand it to Mac as well so we no longer have Intel overheating our Macbooks."

He was after all more of an operations guy than a product guy before moving into the CEO role.


The unified GPU and unified memory design was pretty important. They didn’t just go and replace Intel; they replaced AMD/NVIDIA too. The GPUs in high-end Apple silicon are even good enough for mid-size model inference, and unified memory makes it somewhat cost effective…that advantage probably wasn’t planned and comes from just a lot of good execution and smart R&D.


To be fair, Apple HATES Nvidia after the 8400M and 9400M debacle. They probably saw replacing Nvidia as a bigger benefit than replacing Intel.


They did have Xserve back in the day. As great as Apple silicon is for running local LLMs along with being a general-purpose computing device, it’s not clear that Apple silicon has enough of a differentiating advantage over a rack of Nvidia GPUs to make it worthwhile in the enterprise…


Strange to be saying this about Apple products, but its advantage is that it's way, way cheaper.


Would probably be different if NVIDIA viewed it as competition for data center market share


I think that would spread Apple’s chip team too thinly between competing priorities - and require them to do E2E stuff they’d never be interested in doing. What’s always happened, even during Jobs, is that Apple would do something nice and backend-y, and then not be able to keep it up as they’d pour resources into some consumer product. (See: WebObjects, Xserve, Mac OS Server.)


WebObjects was from NeXT. The Xserve saw regular updates until 2010. Mac OS Server was a GUI for a bunch of open source tools, and where it wasn’t (Workgroup Manager), they replaced with the MDM.

Apple had more money when they killed these than they did when these products were introduced. It’s not a resources issue. It’s a care issue. Same reason they fired the Mac OS Automation team. Same reason their documentation sucks hot diarrhea and all their good stuff is in the “documentation archive”. Penny pinching. “Shareholder value”. New blood destroying shit they didn’t understand.


Apple silicon does not compete well in multicore spaces. People seem to think that because it can run single-core things really well on a laptop, it can do anything. Servers regularly have 100-200 CPU cores maxed out with rapid-fire threads. This is not what Apple silicon excels at.

On top of that, it only performs so well on consumer devices because they control the hardware and OS and can tune both together. Creating server hardware would mean allowing linux to be installed on it, and would need to run equally well. Apple would never put the development time into linux kernel/drivers to make this happen.


Both Intel and AMD sell server CPUs with fewer than 100, hell, fewer than 32 cores.

There is of course a market for that. Not everyone needs a $4000 electric bill. Apple just can’t take their typical lion’s share of the profits in that market, so they don’t bother.


I know off the top of my head at least 3 places that would happily purchase a couple of Xserves (one of which probably still has one) running macOS Server. Linux isn't as hard a requirement as you think.


Hell... I can think of loads of places running servers on WINDOWS (namely all of my employers, including F500 companies). I am not surprised that someone would run macOS as a server. At least macOS is Unix-based ;)


> This is not what Apple silicon excels at

Not at the moment, no. I feel like the Apple silicon team probably would rise to that challenge though


> Apple silicon does not compete well in multicore spaces.

Can you elaborate on this? Maybe with some useful metrics?


Is expansion to all possible markets really a sign of product vision? Windows is in everything from ATMs to servers to cheap laptops, and I am not sure it’s a better product for it OR that Microsoft makes more money that way. Certainly the support burden for a huge number of long tail applications is huge.

And I suppose we’re giving credit to other people for Watch, AirPods, Vision Pro?


It doesn't just end with AI, but AI seems the most blatant example. At a bare minimum, he could assign someone to fulfill that vision for AI. Google has their own chips which they scale. Apple doesn't need to rebuild ChatGPT, but they could very much do what Microsoft does with Phi and provide Apple Silicon-trained and optimized base models for all their users. It seems they are already doing something for Xcode and Swift, but they're just barely scratching the surface.

I remember when the iPhone X became a thing; it happened because consumers were extremely underwhelmed by Apple at the time. It's like they kicked it up less than a notch, sadly.

If Tim Cook decided to be a little more of a visionary, I would say keep him. I would at least prefer that he delegate the visionary work to someone; he will eventually need a successor anyway.


[flagged]


By calling everyone who buys Apple products 80 IQ, you are lowering the quality of the discourse here. Please don't do that.


[flagged]


Don't do this here either.


Anyone who doesn’t happen to do exactly what I do and have the same interests as me is ‘80 IQ’ — whatever that means. Got it.


Doesn't Google sell $2000 phones? I really dont get the argument here.


Oh look, it's a poor, green-text Google apologist who thinks phones with preinstalled crapware, an energy management model that doesn't stop any app from saturating your bandwidth, CPU or battery draw, and a security model that ensures you stand a good chance of becoming part of a crypto farm or botnet just because you downloaded an emulator from a third-party app store, means you have above an 80 IQ! LOL, way to virtue-signal your poverty, bro. These are tough times, I get it... But the first 2 Android phones I ever tried, I crashed within 5 minutes just by... get this... turning on their fucking Bluetooth. WHAT QUALITY. More like "what Chinese shovelware," amirite?

(How does it feel? Literally turning around your inane opinion back onto you.)


It feels like you are particularly insecure and didn't need to spout that any more than the parent did.


Nope. I just want to show a douche what it looks like.


Oh, good. Your comment was materially indistinct from someone who took the "iPad and Vision Pro are toys" thing a bit too personally.


Local LLMs... everybody is worried about privacy... and many people (still) don’t want to buy subscriptions.

Just sell a proper HomePod with 64GB-128GB of RAM that handles everything: your personal LLM, Time Machine if needed, Back to My Mac-style remote access (Tailscale/ZeroTier).

Plus they could compete efficiently with the other cloud providers.


It’s a mistake to generalize from the HN population.

Most people don’t care about privacy (see: success of Facebook and TikTok). Most people don’t care about subscriptions (see: cable TV, Netflix).

There may be a niche market for a local inference device that costs $1000 and has to be replaced every year or two during the early days of AI, but it’s not a market with decent ROI for Apple.


An iPhone, MacBook, etc. all cost in the $1000 range.

There was a post about the new iPhone using the A19, which includes a feature that makes local inference much easier.

If that makes it to M5, I think the local inference case continues to grow with each M processor.


> Just sell a proper HomePod with 64GB-128GB ram

The same Homepod that almost sold as poorly as Vision Pro despite a $349.99 MSRP? Apple charges $400 to upgrade an M4 to 64GB and a whopping $1,200 for the 128GB upgrade.

The consumer demand for a $800+ device like this is probably zilch, I can't imagine it's worth Apple's time to gussy up a nice UX or support it long-term. What you are describing is a Mac with extra steps, you could probably hack together a similar experience with Shortcuts if you had enough money and a use-case. An AI Homepod-server would only be efficient at wasting money.


> The same Homepod that almost sold as poorly as Vision Pro despite a $349.99 MSRP?

The HomePod did poorly because competitor offerings with similar and better performing features were priced under $100. The difference in sound quality was not worth the >3x markup.


Have a team pushing out optimised open-source models. Over time this thing could become the house AI. Basically Star Trek's computer.


They have local LLMs, the Apple Foundation Models: https://developer.apple.com/documentation/FoundationModels


Apple often wants to do it their way. Unfortunately, their foundation models are way behind even the open models.


There are local LLM coding models that ship with Xcode now too.


Sounds like you’ve got a solid handle on things - go do it!


Give me a majority share in AAPL if that's what you want ;)


There are things we can all do in our own lives, not necessarily running Apple. I for one am grateful not to be in the public spotlight running Apple! Everyone has opinions. It’s what you do with them that counts.


I think shareholders are fine with Tim Cook as a CEO.


I sometimes read posts on here and just laugh.

It's easy to sit in the armchair and say "just be a visionary bro" when people forget Tim worked under Steve for a while before his death - he has some sense and understanding of what it takes to get a great product out the door.

Nvidia is generating a lot of revenue, sure - but what is the downstream impact on its customers with the hardware? All they have right now is negative returns to show for their spending. Could this change? Maybe. Is it likely? Not in my view.

As it stands, Apple has made the absolute right choice in not wasting its cash and is demonstrating discipline, which shareholders will respect when all this LLM mania quietens down.


Arguably, it’s why investors go in for Apple in the first place: Apple’s revenue fundamentally comes from consumer spending, whose prospects are relatively well understood by the average investor.

(I think it’s why big shareholders don’t get angry that Apple doesn’t splash their cash around: their core value proposition is focused in a dizzying tech market; take it or leave it. It’s very Warren Buffett.)


This. I wouldn’t exactly give them bonus points for the handling of Apple Intelligence, but beyond that, they’ve taken a much more measured and evidence-based approach to LLMs than the rest of big tech.

If it ends up that we are in a bubble and it pops, Apple may be among the least impacted in big tech.


Friend of mine, used to work for Apple.

He told me that a popular Apple saying is "We're late to the party, but always best-dressed."

I understand this. I'm not sure their choice of outfit has always been the best, but they have had enough success to continue making money.


Toyota did this with the EV mania until they lost their nerve and got rid of Toyoda as CEO. I hope Apple doesn't fall into the same trap. (I never thought Toyota would give in either.)


Yes. And everyone is glossing over the benefit of unified memory for LLM applications. Apple may not have the models, but it has customer goodwill, a platform, and the logistical infrastructure to roll them out. It probably even has the cash to buy some AI companies outright; maybe not the big ones (for a reasonable amount, anyway) but small to midsize ones with domain-specific models that could be combined.

Not to mention the “default browser” leverage it has with iPhones, iPods, and watches.


Unified memory, and examples like the M1 Ultra still being able to hold its own years later, might be among the things that many Mac users and non-Mac users alike haven't experienced.

It's nice to see 16 GB becoming the minimum; to me it should have been 32 a long time ago.


Not to mention, build a car with all that cash they have. Xiaomi makes awesome cars; an Apple-branded electric car could scoop up all the brand equity that Elon squandered.


One does not simply put a 5090 into an existing chip.


Not what I am suggesting. However, having trained a few different things on a modest M4 Pro chip (so not even their absolute most powerful chips mind you), and using it for local-first AI inference, I can see the value. A single server could serve an LLM for a small business and cost a lot less than running the same inference through a 5090 in terms of power usage.

I could also see universities giving students this kind of compute access more cheaply, to work on more basic, less resource-intensive models.


I think a 5090 will handily beat it on power usage.


I'm glad Tim is the CEO instead of you.


Why? This is something that plays into all of Apple's supposed strengths: privacy/no strict cloud dependency/on-device compute, hardware/software optimization while owning the stack, combined with good UI/UX for a broad target audience without sacrificing too much for the power users. OP never said that local AI would be the only topic a new CEO should focus on.


Under Cook, Apple’s market cap has increased 10x, at a CAGR of 18%.

Do you really think that they need something different? As a shareholder would you bet on your vision of focusing on server parts?


Software-wise, it makes sense: Nvidia has the IP lead, industry buy-in and supports the OSes everyone wants to use.

Hardware-wise though, I actually agree - Apple has dropped the ball so hard here that it's dumbfounding. They're the only TSMC customer that could realistically ship a comparable volume of chips to Nvidia, even without really impacting their smartphone business. They have hardware designers who can design GPUs from scratch, write proprietary graphics APIs and fine-tune for power efficiency. The only organizational roadblock that I can see is the executive vision, which has been pretty wishy-washy on AI for a while now. Apple wants to build a CoreML silo in a world where better products exist everywhere; it's a dead-end approach that should have died back in 2018.

Contextually it's weird too. I've seen tons of people defend Cook's relationship with Trump as "his duty to shareholders" and the like. But whenever you mention crypto mining or AI datacenter markets, people act like Apple is above selling products that people want. Future MBAs will be taught about this hubris once the shape of the total damages comes into view.


> They have hardware designers who can design GPUs from scratch, write proprietary graphics APIs and fine-tune for power efficiency. The only organizational roadblock that I can see is the executive vision, which has been pretty wishy-washy on AI for a while now.

The vision since Jobs has always been “build a great consumer product and own as much as you can while doing so”. That’s exactly how all of the design parameters of Ax/Mx series were determined and relentlessly optimized for - the fact that they have a highly competitive uarch was a salutary side-effect, but not a planned goal.


> But whenever you mention crypto mining or AI datacenter markets, people act like Apple is above selling products that people want.

People also want comfortable mattresses and high quality coffee machines. Should Apple make them too?

Apple not being in a particular industry is a perfectly valid choice, which is not remotely comparable to protecting their interests in the industries they are currently in. Selling datacenter-bound products is something Apple is not _remotely_ equipped for, and staffing up to do so at reasonable scale would not be a trivial task.

As for crypto mining... JFC.


Apple is perfectly well equipped to sell datacenter products. They've done it in the past, even supporting Nvidia's compute drivers along the way. If they have the staff to design consumer-facing and developer-facing experiences, why wouldn't they address the datacenter?

Money is money. 10 years ago people would have laughed at the notion of Nvidia abandoning the gaming market; now it's their most lucrative option. Apple can and should be looking at other avenues of profit while the App Store comes under scrutiny and the Mac market share refuses to budge. It should be especially urgent if unit margins are going down as suppliers leave China.


> They've done it in the past, even supporting Nvidia's compute drivers along the way. If they have the staff to design consumer-facing and developer-facing experiences, why wouldn't they address the datacenter?

They did a horrific job of it before. The staff to design consumer facing experiences are busy doing exactly that. The developer facing experiences are very lean. The bandwidth simply isn't there to do DC products. Nor is the supply chain. Nor is the service supply chain. Etc, etc.


Apple makes more profit on iPhones than Nvidia does on its entire datacenter business. Why would they want to enter a highly competitive market that they have no expertise in on a whim?


From reverse-engineered information (in the context of Asahi Linux, which can have raw hardware access to the ANE), it seems that the M1/M2 Apple Neural Engine provides only statically scheduled MADDs of INT8 or FP16 values.[0] This wastes a lot of memory bandwidth on padding in the context of newer local models, which generally are more heavily quantized.

(That is, when in-memory model values must be padded to FP16/INT8 this slashes your effective use of memory bandwidth, which is what determines token generation speed. GPU compute doesn't have that issue; one can simply de-quantize/pad the input in fast local registers to feed the matrix compute units, so memory bandwidth is used efficiently.)
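
To put rough numbers on it (purely illustrative: assume decode is entirely bandwidth-bound, ignore the KV cache, and take 400 GB/s as a ballpark for a Max-class part):

    # decode speed ~ bandwidth / bytes of weights streamed per token (illustrative only)
    def tokens_per_sec(params_billions, bits_per_weight, bandwidth_gb_s):
        bytes_per_token = params_billions * 1e9 * bits_per_weight / 8
        return bandwidth_gb_s * 1e9 / bytes_per_token

    bw = 400  # GB/s, assumed
    print(tokens_per_sec(8, 4, bw))   # ~100 tok/s: 4-bit weights dequantized in GPU registers
    print(tokens_per_sec(8, 16, bw))  # ~25 tok/s: the same weights padded to FP16 for the ANE

Padding 4-bit weights out to FP16 quadruples the bytes moved per token, so the token rate drops by the same factor no matter how much compute the ANE has.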

The NPU/ANE is still potentially useful for lowering power use in the context of prompt pre-processing, which is limited by raw compute as opposed to the memory bandwidth bound of token generation. (Lower power usage in this context will save on battery and may help performance by avoiding power/thermal throttling, especially on passively-cooled laptops. So this is definitely worth going for.)

[0] Some historical information about bare-metal use of the ANE is available from the Whisper.cpp pull req: https://github.com/ggml-org/whisper.cpp/pull/1021 Even older information at: https://github.com/eiln/ane/tree/33a61249d773f8f50c02ab0b9fe... .

More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the Tinygrad folks) seems to basically confirm the above.

(The jury is still out for M3/M4 which currently have no Asahi support - thus, no current prospects for driving the ANE bare-metal. Note however that the M3/Pro/Max ANE reported performance numbers are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)


I too found it interesting that Apple's Neural Engine doesn't work with local LLMs. Seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?


NPUs are almost universally too weak for serious LLM inference. Most of the time you get better performance-per-watt out of GPU compute shaders; the majority of NPUs are dark silicon.

Keep in mind - Nvidia has no NPU hardware because that functionality is baked-into their GPU architecture. AMD, Apple and Intel are all in this awkward NPU boat because they wanted to avoid competition with Nvidia and continue shipping simple raster designs.


Apple is in this NPU boat because they are optimized for mobile first.

Nvidia does not optimize for mobile first.

AMD and Intel were forced by Microsoft to add NPUs in order to sell “AI PCs”. Turns out the kind of AI that people want to run locally can’t run on an NPU. It’s too weak like you said.

AMD and Intel both have matmul acceleration directly in their GPUs. Only Apple does not.


Nvidia's approach works just fine on mobile. Devices like the Switch have complex GPGPU pipelines and don't compromise whatsoever on power efficiency.

Nonetheless, Apple's architecture on mobile doesn't have to define how they approach laptops, desktops and datacenters. If the mobile-first approach is limiting their addressable market, then maybe Tim's obsessing over the wrong audience?


MacBooks benefit from mobile optimization. Apple just needs to add matmul hardware acceleration into their GPUs.


Perhaps due to size? AI/NN models before LLMs were orders of magnitude smaller, as evident in effectively all LLMs carrying "Large" in their name regardless of relative size differences.


I guess the hardware doesn’t make things faster (yet?). If it did, I guess they would have mentioned it in https://machinelearning.apple.com/research/core-ml-on-device.... That page is updated for Sequoia and says:

“This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”


If it uses a lot less power it could still be a win for some use cases: while on battery you might still want to run transformer-based speech-to-text, RTX Voice-like microphone denoising, or image generation/infill in photo editing programs. And in cases like RTX Voice-style processing during multiplayer gaming, you might want the GPU free to run the game, even if there's still some memory bandwidth impact from having the NPU workload running.


There is no NPU "standard".

Llama.cpp would have to target every hardware vendor's NPU individually and those NPUs tend to have breaking changes when newer generations of hardware are released.

Even Nvidia GPUs often have breaking changes moving from one generation to the next.


I think OP is suggesting that Apple / AMD / Intel do the work of integrating their NPUs into popular libraries like `llama.cpp`. Which might make sense. My impression is that by the time the vendors support a certain model with their NPUs the model is too old and nobody cares anyway. Whereas llama.cpp keeps up with the latest and greatest.


I think I saw something that got Ollama to run models on it? But it only works with tiny models. Seems like the neural engine is extremely power efficient but not fast enough to do LLMs with billions of parameters.


I am running Ollama with 'SimonPu/Qwen3-Coder:30B-Instruct_Q4_K_XL' on an M4 Pro MBP with 48 GB of memory.

From Emacs/gptel, it seems pretty fast.
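
For reference, the same local server can also be queried outside Emacs through Ollama's HTTP API; a minimal sketch (assumes ollama serve is running on the default port, and the prompt is arbitrary):

    import json, urllib.request

    payload = {
        "model": "SimonPu/Qwen3-Coder:30B-Instruct_Q4_K_XL",
        "prompt": "Write a Python function that reverses a linked list.",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])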

I have never used the proper hosted LLMs, so I don't have a direct comparison. But the above LLM answered coding questions in a handful of seconds.

The cost of memory (and disk) upgrades in apple machines is exorbitant.



> Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.

If you want to convert models to run on the ANE there are tools provided:

> Convert models from TensorFlow, PyTorch, and other libraries to Core ML.

https://apple.github.io/coremltools/docs-guides/index.html
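
A minimal sketch of that conversion path, using a toy traced PyTorch module rather than a real LLM (the model and shapes are placeholders):

    # trace a PyTorch module and convert it with coremltools
    import torch
    import coremltools as ct

    class TinyNet(torch.nn.Module):
        def forward(self, x):
            return torch.nn.functional.relu(x @ x.T)

    example = torch.rand(4, 4)
    traced = torch.jit.trace(TinyNet().eval(), example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=example.shape)],
        minimum_deployment_target=ct.target.macOS13,
    )
    mlmodel.save("TinyNet.mlpackage")

Even after conversion, whether a given op actually lands on the ANE is still Core ML's decision at load time.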


I thought Apple MLX could do that if you convert your model using it: https://mlx-framework.org/


MLX does not support the ANE.

https://github.com/ml-explore/mlx/issues/18


Yes it does.

That’s just an issue with stale and incorrect information. Here are the docs https://opensource.apple.com/projects/mlx/


No, it categorically doesn't. Not just that, its CPU support is quite lacking (fp32 only). Currently, there are two ways to support the ANE: CoreML and MPSGraph.


Nothing in that documentation says anything about the Apple Neural Engine. MLX runs on the GPU.


None of that uses the ANE.


It does indeed, and is more modern than Core ML.


It is less about conversion and more about extending ANE support for transformer-style models or giving developers more control.

The issue is in targeting specific hardware blocks. When you convert with coremltools, Core ML takes over and doesn't provide fine-grained control over whether an op runs on the GPU, CPU or ANE. Also, the ANE isn't really designed with transformers in mind, so most LLM inference defaults to the GPU.
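
The coarsest knob coremltools does give you is a compute-unit preference when loading a converted model; a sketch (placeholder path):

    import coremltools as ct

    # CPU_AND_NE asks Core ML to prefer the ANE, but per-op placement is still
    # Core ML's call; unsupported ops silently fall back.
    model = ct.models.MLModel("Model.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_NE)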


Neural Engine is optimized for power efficiency, not performance.

Look for Apple to add matmul acceleration into the GPU instead. That's how to truly speed up local LLMs.


I can run GLM 4.5 Air and gpt-oss-120b both very reasonably. GPT OSS has particularly good latency.

I'm on a 128GB M4 macbook. This is "powerful" today, but it will be old news in a few years.

These models are just about getting as good as the frontier models.


You're better served using Apple's MLX if you want to run models locally.
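
A minimal sketch with the mlx-lm package (the model name is just one of the community 4-bit conversions; pick whatever fits your RAM):

    # pip install mlx-lm -- runs on the GPU via Metal, not the ANE
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
    print(generate(model, tokenizer,
                   prompt="Summarize the tradeoffs of running LLMs locally.",
                   max_tokens=256))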


ONNX Runtime purports to support CoreML: https://onnxruntime.ai/docs/execution-providers/CoreML-Execu... , which gives a decent amount of compatibility for inference. I have no idea to what extent workloads actually end up on the ANE though.
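
Requesting it is at least a one-liner (placeholder model path); whether anything then hits the ANE is opaque from the Python side:

    import onnxruntime as ort

    # falls back to CPU for any ops the CoreML execution provider can't take
    sess = ort.InferenceSession(
        "model.onnx",
        providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
    )
    print(sess.get_providers())  # shows which providers actually got enabled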

(Unfortunately ONNX doesn't support Vulkan, which limits it on other platforms. It's always something...)


I find it surprising that you can also do that from the browser (e.g. WebLLM). I imagine that in the near future we will run these engines locally for many use cases, instead of via APIs.


Don't try 12-20B on 16GB; stick with 4-8B instead. Going bigger on a 16GB machine gets you way too slow tps for only marginal improvements.


Don't get me started. Many new computers come with an NPU of some kind, which is superfluous alongside a GPU.

But what's really going on is that we never got the highly multicore and distributed computers that could have started going mainstream in the 1980s, and certainly by the late 1990s when high-speed internet hit. So single-threaded performance is about the same now as 20 years ago. Meanwhile video cards have gotten exponentially more powerful and affordable, but without the virtual memory and virtualization capabilities of CPUs, so we're seeing ridiculous artificial limitations like not being able to run certain LLMs because the hardware "isn't powerful enough", rather than just having a slower experience or borrowing the PC in the next room for more computing power.

To go to the incredible lengths that Apple went to in designing the M1, not just in hardware but in adding yet another layer of software emulation (as they have since the 68000 days), without actually bringing multicore with local memories to the level that today's VLSI design rules could allow, is laughable to me. If it wasn't so tragic.

It's hard for me to live and work in a tech status quo so far removed from what I had envisioned growing up. We're practically at AGI, but also mired in ensh@ttification. Reflected in politics too. We'll have the first trillionaire before we solve world hunger, and I'm bracing for Skynet/Ultron before we have C3P0/JARVIS.



[flagged]


It's useless to mention the number of parameters without also mentioning quantization, and to a lesser-but-still-significant extent context size, which together determine how much RAM is needed.

"It will run" is a different thing than "it will run without swapping or otherwise hitting a slow storage access path". That makes a speed difference of multiple orders of magnitude.

This is one thing Ollama is good for. Possibly the only thing, if you listen to some of its competitors. But the choice of runner does nothing to avoid the fact that all LLMs are just toys.



