Good question. I imagine you'd need to set up an ngrok endpoint to tunnel to local LLMs.
In those cases perhaps an open source (maybe even local) version would make more sense. For our hosted version we’d need to charge something, given storage requirements to run such a service, but especially for local models that feels wrong. I’ve been considering open source for this reason.
Right now we assume the first call is correct, and we'll eagerly take the first match we find while traversing the tree.
One of the worst things that could currently happen is that we cache a bad run, and now instead of occasional failures you get 100% failures.
A few approaches we've considered:
- maintain a staging tree, and only promote to live if multiple sibling nodes (messages) look similar enough. The decision to promote could be via templating, regex, fuzzy matching, semantic similarity, or an LLM judge (see the sketch after this list)
- add some feedback APIs for a client to score end-to-end runs, so that a path can develop a reputation
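To make the staging idea concrete, here's a minimal sketch of a promotion check, assuming a fuzzy-similarity cutoff; the names and thresholds are illustrative, not the actual internals:

```python
# Hypothetical sketch: promote a staged node to the live cache tree only if
# its sibling messages agree closely enough. StagedNode, PROMOTION_THRESHOLD,
# and MIN_SIBLINGS are illustrative names, not real Butter internals.
from dataclasses import dataclass, field
from difflib import SequenceMatcher

PROMOTION_THRESHOLD = 0.9  # assumed similarity cutoff
MIN_SIBLINGS = 3           # assumed minimum observations before promoting

@dataclass
class StagedNode:
    # candidate completions observed at the same point in the tree
    siblings: list[str] = field(default_factory=list)

def fuzzy_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def should_promote(node: StagedNode) -> bool:
    """Promote only when enough siblings exist and every pair is similar."""
    if len(node.siblings) < MIN_SIBLINGS:
        return False
    pairs = (
        (a, b)
        for i, a in enumerate(node.siblings)
        for b in node.siblings[i + 1:]
    )
    return all(fuzzy_similarity(a, b) >= PROMOTION_THRESHOLD for a, b in pairs)
```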
I'd assume RL would be baked into the request structure. I'm surprised the OAI spec doesn't include it, but I suppose you could hijack a conversation flow to do so.
Awesome to hear you've done something similar. JSON artifacts from runs seem to be a common approach for building this in-house, similar to what we did with Muscle Mem. Detecting cache misses is hard without seeing what the model sees, which is part of what inspired this proxy direction.
I feel the same. We'll use it as long as we can since it's customer-aligned, but I wouldn't be surprised if competitive pressure or COGS forces us to change in the future.
It wouldn't be the first API service to disallow someone from selling a cache layer for their API. After all, this would likely result in OpenAI (or whichever provider) making less money.
Computer use agents (as an RPA alternative) are the easiest example to reach for: UIs change, but not often, so the "trajectory" of click and key-entry tool calls is mostly fixed over time and worth feeding to the agent as a canned trajectory. I discuss the flaws of computer use and RPA in the blog above.
A counterexample is coding agents: it's a deeply user-interactive workflow reading from a codebase that's constantly evolving. So the set of things the model is inferencing on is always different, and trajectories are never repeated.
Still not clear - the tool calls come from the model, so what is being cached by Muscle Memory?
Also:
After my time building computer-use agents, I’m convinced that the hybrid approach of Muscle Memory is the only viable way to offer 100% coverage on an RPA workload.
100% coverage of what?
I guess it'd be great if you could clarify the value proposition; many folks will be even less patient than I am.
Thanks! For LangChain you can repoint your base_url in the client; AutoGPT I'm not as familiar with. Closed-loop robotics using LLMs may be a stretch for now, especially since vision is a heavy component, but in theory the patterns baked into small language models running on-device, or hosted LLMs in higher-level planning loops, could be emulated by a Butter cache if observed at high enough volume.
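For reference, a minimal sketch of that base_url repointing; the proxy URL is a placeholder, substitute whatever endpoint the cache exposes:

```python
# Point an OpenAI-compatible client at a caching proxy instead of api.openai.com.
# The proxy URL below is hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.example.com/v1",  # hypothetical cache proxy endpoint
    api_key="sk-...",                         # your normal provider key
)

# LangChain's ChatOpenAI accepts the same override:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://proxy.example.com/v1",
    model="gpt-4o-mini",
)
```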
An interesting alternative product to offer is injecting prompt cache tokens into requests where they could be helpful: not bypassing generations, but at least low-hanging fruit for cost savings.
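As a rough sketch of what that injection could look like, assuming Anthropic-style cache_control content blocks and a made-up size threshold (the helper name and cutoff are illustrative):

```python
# Hedged sketch of the "inject prompt cache hints" idea: mark a large, stable
# system prompt as cacheable before forwarding the request to the provider.
# Uses Anthropic's cache_control field as the example; MIN_CACHEABLE_CHARS and
# inject_cache_hints are assumptions, not an existing API of this proxy.
MIN_CACHEABLE_CHARS = 4_000  # only worth caching reasonably large prefixes

def inject_cache_hints(request_body: dict) -> dict:
    system = request_body.get("system")
    if isinstance(system, str) and len(system) >= MIN_CACHEABLE_CHARS:
        # Convert the plain system string into a cacheable content block.
        request_body["system"] = [
            {
                "type": "text",
                "text": system,
                "cache_control": {"type": "ephemeral"},
            }
        ]
    return request_body
```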