edunteman's comments | Hacker News

Good question, I imagine you’d need to set up an ngrok endpoint to tunnel to local LLMs.

In those cases perhaps an open source (maybe even local) version would make more sense. For our hosted version we’d need to charge something, given the storage requirements of running such a service, but charging feels wrong for local models especially. I’ve been considering open-sourcing it for this reason.


I’d love your opinion here!

Right now, we assume the first call is correct, and will eagerly take the first match we find while traversing the tree.

One of the worst things that could currently happen is that we cache a bad run, and now instead of occasional failures you get failures 100% of the time.

A few approaches we’ve considered:

- maintain a staging tree, and only promote to live if multiple sibling nodes (messages) look similar enough. The decision to promote could be via templating, regex, fuzzy matching, semantic similarity, or an LLM judge (rough sketch below)

- add some feedback APIs for a client to score end-to-end runs, so that a path can develop some reputation
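
Roughly what that staging-tree promotion could look like, as a sketch (names and thresholds are made up, and the fuzzy matcher is just a stand-in for whichever judge we pick):

  from difflib import SequenceMatcher

  PROMOTION_THRESHOLD = 0.9   # similarity required between sibling messages
  MIN_SIBLINGS = 2            # how many agreeing runs before we trust a node

  def similar(a: str, b: str) -> bool:
      return SequenceMatcher(None, a, b).ratio() >= PROMOTION_THRESHOLD

  def maybe_promote(staging_siblings: list[str], live_tree: dict, key: str) -> bool:
      # Promote to the live tree only once enough sibling runs agree.
      # similar() could be swapped for templating, regex, semantic similarity, or an LLM judge.
      matches = [m for m in staging_siblings if similar(m, staging_siblings[0])]
      if len(matches) >= MIN_SIBLINGS:
          live_tree[key] = staging_siblings[0]
          return True
      return False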


I’d assume RL would be baked into the request structure. I’m surprised the OAI spec doesn’t include it, but I suppose you could hijack a conversation flow to do so.


Very, very common approach!

Wrote more on that here: https://blog.butter.dev/the-messy-world-of-deterministic-age...


What a great overview!

I’d love your thoughts on my addition, autolearn.dev — Voyager behind MCP.

The proxy format is exactly what I needed!

Thanks


Awesome to hear you’ve done something similar. JSON artifacts from runs seem to be a common approach for building this in house, similar to what we did with Muscle Mem. Detecting cache misses is a bit hard without seeing exactly what the model sees, which is part of what inspired this proxy direction.
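
For anyone building this in house, the kind of per-run artifact I mean looks roughly like this (field names are made up for illustration):

  import json

  # Hypothetical per-run artifact; the key part is capturing the exact
  # prompt the model saw, since that's what you need to detect cache misses
  run_artifact = {
      "run_id": "run-001",
      "messages": [
          {"role": "user", "content": "Submit the signup form"},
      ],
      "tool_calls": [
          {"name": "click", "args": {"selector": "#submit"}},
      ],
      "outcome": "success",
  }

  print(json.dumps(run_artifact, indent=2))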

Thanks for the nice words!


I feel the same - we’ll use it as long as we can since it’s customer-aligned, but I wouldn’t be surprised if competitive pressure or COGS forces us to change in the future.


It’s bring-your-own-key, so any calls proxied to OpenAI just end up billing directly to your account as normal.

You’d only pay Butter for calls that don’t go to the provider. That’d be a separate billing account with Butter.


I couldn’t see how it wouldn’t be, as it’s a free-market, opt-in decision to use Butter.


It wouldn't be the first API service to disallow someone from selling a cache layer for their API. After all, this would likely result in OpenAI (or whatever provider) making less money.


Ah yes that makes sense, have heard of those cases too but hadn’t put much thought into it. Thanks for pointing it out!


I've seen the OpenRouter guys here on HN before, so you can probably ask them what to look out for.


I've got a blog on this from the launch of Muscle Mem, which should paint a better picture https://erikdunteman.com/blog/muscle-mem

Computer use agents (as an RPA alternative) are the easiest example to reach for: UIs change, but not often, so the "trajectory" of click and key-entry tool calls is mostly fixed over time and worth feeding to the agent as a canned trajectory. I discuss the flaws of computer use and RPA in the blog above.
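
To make "canned trajectory" concrete, a rough sketch (tool names and the execute/validate hooks are hypothetical):

  # A canned trajectory: the fixed sequence of tool calls an agent can replay
  # as long as the UI still matches what it saw when the run was recorded
  canned_trajectory = [
      {"tool": "click",     "args": {"selector": "#login"}},
      {"tool": "type_text", "args": {"selector": "#email", "text": "user@example.com"}},
      {"tool": "click",     "args": {"selector": "#submit"}},
  ]

  def replay(trajectory, execute, validate) -> bool:
      for step in trajectory:
          if not validate(step):   # cache miss: UI drifted, hand control back to the model
              return False
          execute(step)
      return True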

A counterexample is coding agents: it's a deeply user-interactive workflow, reading from a codebase that's evolving. So the set of things the model is inferencing on is always different, and trajectories are never repeated.

Hope this helps


Still not clear - the tool calls come from the model, so what is being cached by Muscle Memory?

Also:

  After my time building computer-use agents, I’m convinced that the hybrid approach of Muscle Memory is the only viable way to offer 100% coverage on an RPA workload.
100% coverage of what?

I guess it'd be great if you could clarify the value proposition; many folks will be even less patient than I am.

Best of luck!


Thanks! For LangChain you can repoint your base_url in the client. AutoGPT I'm not as familiar with. Closed-loop robotics using LLMs may be a stretch for now, especially since vision is a heavy component, but theoretically the patterns baked into on-device small language models, or hosted LLMs in higher-level planning loops, could be emulated by a Butter cache if observed in high enough volume.
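
For example, something like this for LangChain (the proxy URL is a placeholder, not a real endpoint):

  # Sketch only: point LangChain's OpenAI client at the proxy instead of api.openai.com
  from langchain_openai import ChatOpenAI

  llm = ChatOpenAI(
      model="gpt-4o-mini",
      base_url="https://your-proxy.example.com/v1",  # placeholder proxy URL
      api_key="YOUR_OPENAI_KEY",                     # bring-your-own-key, billed to your account
  )
  print(llm.invoke("hello").content)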


An interesting alternative product to offer is injecting prompt cache tokens into requests where they could be helpful: not bypassing generations, but at least low-hanging fruit for cost savings.
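
As a sketch of what that injection could look like, using Anthropic-style cache_control breakpoints (the function and threshold are made up; other providers expose caching differently):

  # Mark a long, stable system block as cacheable before forwarding the request.
  # The cache_control field follows Anthropic's prompt-caching format.
  def inject_cache_marker(request: dict, min_chars: int = 4096) -> dict:
      system = request.get("system")
      if isinstance(system, list):
          for block in system:
              if block.get("type") == "text" and len(block.get("text", "")) >= min_chars:
                  block["cache_control"] = {"type": "ephemeral"}
      return request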

