We need someone to write a compatibility layer for those APIs, so they can be used across various "multiplayer-providers" and distribution platforms. Call it Sroton. Or maybe someone could email Gabe and ask if he could just straight up open source it and let others implement it too.
Committing and pushing to test small incremental changes, self-hosted runners' time still counting towards CI minutes, and an ecosystem hellbent on presenting security holes as new features. I'm a bit unimpressed :)
That the entire ecosystem seems to have moved to GitHub Actions is such a loss for productivity. I remember when CircleCI first launched: being able to "Rebuild with SSH", which gave you a bash command to connect to the running instance whenever you wanted, was such a no-brainer, and I'm sure it's why many of us ended up using CircleCI for years. Eventually CircleCI became too expensive, but I still thought that if other services learnt anything from CircleCI, it would be this single feature, because of the amount of hours it saved thousands of developers.
Lo and behold, when GitHub Actions first launched, that feature was nowhere to be seen, and I knew from that moment on that betting on GitHub Actions would be a mistake, if they couldn't even launch with such a table-stakes feature. It seems Microsoft still hasn't gotten its thumb out, and is wasting countless developers' time with this. Sad state of affairs.
Thank you pbiggar for the time we got with CircleCI :) Here's hoping we'll see a CircleCI.V2 appearing at some point in the future; I just know it involves DAGs and "Rebuild with SSH" somehow :)
I am surprised Docker didn't launch into the CI market. Running a container build as CI seems like it would be a boon both for simplifying CI caching and for debugging, since it's ~reproducible locally.
They _are_ in the CI market. Two of their products are Docker Build Cloud and Testcontainers Cloud. IIRC Docker Hub also came with automated builds at some point (not sure if it still does).
I do get your sentiment though. For the position they are in, a CircleCI-like product would seem to be quite fitting.
This could've been a "change runs-on to be this" product like all the other faster-GHA-runner startups, but instead, the way they set it up, I would have to keep paying for GHA while also paying for their build cloud. No fun!
I've gotten used to this essential feature too via Semaphore CI, and I just can't stand not being able to SSH into a GitHub Action. Debugging is so slow.
I've seen people spend something like 2 hours fixing something that could be fixed in minutes if you had a normal feedback cycle instead of the 5-minute "change > commit > push > wait > see results" feedback cycle GitHub Actions forces people into. It's baffling until you realize Microsoft charges per usage, so why fix it? I guess the baffling part is how developers put up with it anyway.
Does not sound like a GitHub failure; it sounds like the company's failure. They haven't invested in the developer experience, and they have developers who cannot run stuff locally and are having to push to CI in order to get feedback.
Can't do much about that when the thing you're troubleshooting is the CI platform itself. Say you're troubleshooting why the deployment doesn't work, and somehow the environment variable is wrong for whatever reason. So you edit and add an "env | sort" before that step, commit it, push it, and so on. With "Rebuild with SSH", you are literally inside the job as it runs.
Yes, you can't really debug CI-specific stuff locally, like if you're setting up build caching or something. But it seems like 90%+ of the time people are complaining about builds/tests that should have local feedback loops.
Yeah, fair point, I see that a lot in the wild too. I guess I kind of assumed we all here had internalized the practice of isolating everything into one command that runs remotely, like "make test" or whatever, rather than what some people do and put entire shell-scripts-but-YAML in their pipeline configs.
You can't run a GitHub CI pipeline locally (in general; there are some projects that try, but they're limited). Even if you make as much of it runnable locally as possible (which you should), you're inevitably going to end up debugging some stuff by making commits and pushing them. Release automation. Test reporting. Artifact upload. Pipeline triggers. Permissions.
Count yourself lucky you've never had to deal with any of that!
Yes, there are a few things you can't do locally. But the vast majority of complaints I see (90%+) are for builds/tests etc. that should have the same local feedback loops. CI shouldn't be anything special; it should be a 'shell as a service' with some privileged credentials for pushing artefacts.
> Release automation. Test reporting. Artifact upload.
Those I can actually all do locally for my open source projects on GitHub, if I have the correct credentials in my env. It is all automated (which I developed/tested locally), but I can break glass if needed.
Still using CircleCI. I do not love YAML at all; in fact I hate it, because it's basically a 1980s text preprocessor on steroids with dependency management. Too much logic applied to config that depends on implicit syntax and unintuitive significant whitespace.
I mean, I had an issue once where this broke the pipeline:
key:
- value 1
- value 2
But this was fine:
key:
- value 1
- value 2
Fuck that noise!
Otherwise it works just as well as it ever did, and I don't miss GitHub Actions, where every pipeline step is packaged into a dependency. I think GitHub has stagnated harder than CircleCI.
> I mean, I had an issue once where this broke the pipeline:
It seems fair to dislike YAML (I dislike it too), but I don't understand how this broke for you unless CircleCI (or whoever) isn't actually using a spec-compliant YAML parser.
irb(main):009:0> YAML.load <<EOD
irb(main):010:0" key:
irb(main):011:0"   - value 1
irb(main):012:0"   - value 2
irb(main):013:0" EOD
=> {"key"=>["value 1", "value 2"]}
irb(main):014:0> YAML.load <<EOD
irb(main):015:0" key:
irb(main):016:0" - value 1
irb(main):017:0" - value 2
irb(main):018:0" EOD
=> {"key"=>["value 1", "value 2"]}
(This works for any number of leading spaces, so long as the spacing is consistent.)
There shouldn't be any difference between those two values. I'm not saying you are wrong and it didn't break, but it's definitely surprising a parser would choke on that vs YAML itself being the problem.
Don't get me wrong, I can empathise with whitespace formatting being annoying, and having both forms be valid just adds confusion; it's just surprising to see this was the problem.
> Authority state (what constraints are actively enforced)
This one I'm not sure what to do about; I think LLMs might just not be a good fit for it.
> Temporal consistency (constraints that persist across turns)
This can be solved by not using LLMs as machines that "can take turns" and only using them as "one-shot answer, otherwise wrong" machines, since prompt following is best early in a conversation and gets really bad quickly as the context grows. Personally, I never go beyond two messages in a chat (one user message, one assistant message), and if it's wrong, I clear everything, iterate on the first prompt, and try again. That tends to make the whole "follow system prompt instructions" part work a lot better.
> Hierarchical control (immutable system policies vs. user preferences)
This I think was at least attempted to be addressed in the release of GPT-OSS, where instead of just having a system prompt and a user prompt, there are now developer, system and user prompts, so there is a bigger difference in how the instructions are used. This document shares some ideas about separating the roles more than just system/user: https://cdn.openai.com/spec/model-spec-2024-05-08.html
Yep, you nailed the problem: context drift kills instruction following.
That's why I’m thinking authority state should be external to the model. If we rely on the System Prompt to maintain constraints ("Remember you are read-only"), it fails as the context grows. By keeping the state in an external Ledger, we decouple enforcement from the context window. The model still can't violate the constraint, because the capability is mechanically gone.
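To sketch what I mean (hypothetical Rust; the Ledger name, the allow-list shape and the tool names are purely illustrative, not a real implementation):

    use std::collections::HashSet;

    /// Hypothetical sketch: constraints live outside the model, as a set of
    /// capabilities the runtime is willing to execute. The model can "ask"
    /// for anything, but a revoked capability simply cannot run.
    struct Ledger {
        allowed: HashSet<&'static str>,
    }

    impl Ledger {
        fn revoke(&mut self, capability: &str) {
            self.allowed.remove(capability);
        }

        /// Gate every tool call on the ledger, not on the prompt.
        fn invoke(&self, capability: &str, args: &str) -> Result<String, String> {
            if self.allowed.contains(capability) {
                Ok(format!("executing {capability}({args})"))
            } else {
                Err(format!("{capability} is not permitted by the ledger"))
            }
        }
    }

    fn main() {
        let mut ledger = Ledger {
            allowed: ["read_file", "write_file"].into_iter().collect(),
        };
        // "You are read-only" becomes a mechanical fact, not a prompt instruction.
        ledger.revoke("write_file");
        assert!(ledger.invoke("read_file", "notes.txt").is_ok());
        assert!(ledger.invoke("write_file", "notes.txt").is_err());
    }

The prompt can drift all it wants; the write capability simply isn't there to call.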
Looks like a fun and educational toy, interesting find. But why the mention of it being popular in homeschooling circles? Mentioning that in the same context makes it seem like you're not recommending the product because of that :P
> maybe in your social circles people have a problem with homeschooling
In general I think it's a very American thing, and considering the education problem the US suffers from, it probably explains a part of that. Most other countries only allow a very limited amount of homeschooling, because of all the drawbacks that come with it.
Huh? I don't think that's true; there are usually some sort of structural elements inside the package meant to be thrown away (usually made of cardboard/paper), and all IKEA boxes definitely have lots of air inside them. Not sure what would make you say otherwise, unless it's some joke I'm missing?
A box that contained a fully assembled kitchen table would contain a lot more air. I think that comment just meant IKEA designs items that can be packaged into a minimal volume.
Ah yes, on second reading it's actually pretty obvious that is what parent meant and I was reading it too literally. Thanks for the clarification, that's certainly correct :)
Like most standards: "Because it's a standard". Kind of like setting a .body for a GET request; you can kind of do that, but why not do it the way it's intended instead? Use POST :)
Sending a URL encoded form or some JSON in a POST request is also easier for most people to understand than the myriad ways you might format a query string in the URL (which may have a stricter limit on size).
You only have to look at how different services handle arrays in query strings to understand that serialising it is conceptually easier.
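A quick illustration (the endpoints are hypothetical; the query-string forms are just common conventions I've seen in the wild, nothing canonical):

    // Illustrative only: the same "tag" array encoded three incompatible ways
    // in a query string, versus one obvious JSON body for a POST.
    fn main() {
        let repeated_key = "/search?tag=ci&tag=yaml";     // repeat the parameter
        let php_brackets = "/search?tag[]=ci&tag[]=yaml"; // PHP/Rails-style brackets
        let comma_list   = "/search?tag=ci,yaml";         // ad-hoc comma list
        let post_body    = r#"{"tag": ["ci", "yaml"]}"#;  // POST body: one clear shape
        println!("{repeated_key}\n{php_brackets}\n{comma_list}\n{post_body}");
    }

A client has to know which of the three query-string forms a given service expects, whereas the JSON body has one obvious reading.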
Comes up a lot in search or filter APIs. I'm sure there was some effort many moons ago to create a QUERY method for that.
Yeah, and also because firewalls sometimes strip the body of GET requests (not responses, mind you, we're talking requests) to a server, and also because it's really uncommon to put a body on a GET request ;)
> They may give you a different result when you run the same spec through the LLM a second time.
Yes kind of, but only different results (maybe) for the things you didn't specify. If you ask for A, B and C, and the LLM automatically made the choice to implement C in "the wrong way" (according to you), you can retry but specify exactly how you want C to be implemented, and it should follow that.
Once you've nailed your "spec" enough so there isn't any ambiguity, the LLM won't have to make any choices for you, and then you'll get exactly what you expected.
Learning this process, and learning how much and what exactly you have to instruct it to do, is you building up your experience learning how to work with an LLM, and that's meaningful, and something you get better with as you practice it.
> Yes kind of, but only different results (maybe) for the things you didn't specify.
No. They will produce a different result for everything, including the things you specify.
It's so easy to verify that I'm surprised you're even making this claim.
> Once you've nailed your "spec" enough so there isn't any ambiguity, the LLM won't have to make any choices for you, and then you'll get exactly what you expected
1. There's always ambiguity, or else you'll spend an eternity writing specs
2. LLMs will always produce different results even if the spec is 100% unambiguous, for a huge variety of reasons, the main one being: their output is non-deterministic. Except in the most trivial of cases. And even then, the simple fact of "your context window is 80% full" can lead to things like "I've rewritten half of your code even though the spec only said that the button color should be green"
> It's so easy to verify that I'm surprised you're even making this claim.
Well, to be fair, I'm surprised you're even trying to say this claim isn't true, when it's so easy to test yourself.
If I prompt "Create a function with two arguments, a and b, which returns adding those two together", I'll get exactly what I specify. If I feel that its using u8 instead of u32 was wrong, I add "two arguments which are both u8", and then I get that.
Is this not the experience you get when you use LLMs? How does what you get differ from that?
> 1. There's always ambiguity, or else you'll end up an eternity writing specs
There isn't though; at some point it does end. Whether it's worth going so deep into specifying the exact implementation is up to you and what you're doing; sometimes it is, sometimes it isn't.
> LLMs will always produce different results even if the spec is 100% unambiguous for a huge variety of reasons, the main one being: their output is non-deterministic.
Again, it's so easy to verify that this isn't true, and also surprising you'd say this, because earlier you said "always ambiguity", yet somehow you also seem to know that you can be 100% unambiguous.
Like with "manual" programming, the answer is almost always "divide and conquer", when you apply that with enough granularity, you can reach "100% umambiguity".
> And even then the simple fact of "your context window is 80% full" can lead to things like "I've rewritten half of your code even though the spec only said that the button color should be green"
Yes, this is a real flaw: once you go beyond two messages, the models absolutely lose track almost immediately. The only workaround for this is constantly restarting the conversation. I never "correct" an agent with more "No, I meant" if it gets something wrong; I rewrite my first message so that no corrections are needed. If your context goes beyond ~20% of what's possible, you're gonna get shit results, basically. Don't trust the "X tokens context length", because "what's possible" is very different from "what's usable".
> If I prompt "Create a function with two arguments, a and b, which returns adding those two together", I'll get exactly what I specify. If I feel like it using u8 instead of u32 was wrong, I add "two arguments which are both u8", then you now get this.
This is actually a good example of how your spec will progress:
First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"
Second pass: "It must take u8 types, not u32 types"
Third pass: "You are not handling overflows. It must return a u8 type."
Fourth pass: "Don't clamp the output, and you're still not handling overflows"
Fifth pass: "Don't panic if the addition overflows, return an error" (depending on the language, this could be "throw an exception" or return a tuple with an error field, or use an out parameter for the result or error)
For just a simple "add two numbers" function, the specification can easily exceed the actual code. So you can probably understand the skepticism when the task is not trivial, and depends on a lot of existing code.
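To make that concrete, here's roughly where that fifth-pass spec lands, as a Rust sketch (checked_add and the string error type are my illustrative choices, not anything the passes above pin down):

    /// Rough sketch of the fifth-pass spec: add two u8 values, returning an
    /// error instead of panicking or clamping when the sum overflows.
    fn add(a: u8, b: u8) -> Result<u8, String> {
        // checked_add returns None on overflow, which we map to an error.
        a.checked_add(b).ok_or_else(|| format!("{a} + {b} overflows u8"))
    }

    fn main() {
        assert_eq!(add(1, 2), Ok(3));
        assert!(add(255, 1).is_err());
    }

The spec ends up roughly as long as the function itself, and that's before it has to fit into any existing codebase.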
So you do know how the general "writing a specification" part works, you just have the wrong process. Instead of iterating and adding more context on top, restructure your initial prompt to include the context.
DON'T DO:
First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"
Second pass: "It must take u8 types, not u32 types"
INSTEAD DO:
First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"
Second pass: "Create a function [in language $X] with two arguments, a and b, both using u8, which returns adding those two together"
----
What you don't want to do is add additional messages/context on top of "known bad" context. Instead, you should take the LLM misunderstanding as a cue that "I need to edit my prompt", not "I need to add more context after its reply to correct what was wrong". The goal should be to completely avoid anything bad, not to correct it.
Together with this, you build up a system/developer prompt you can reuse across projects/scopes that follows how you code. In that, you add stuff as you discover what needs to be added, like "Make sure to always handle Exceptions in X way" or similar.
> For just a simple "add two numbers" function, the specification can easily exceed the actual code. So you can probably understand the skepticism when the task is not trivial, and depends on a lot of existing code.
Yes, please be skeptical, I am as well, which I guess is why I am seemingly more effective at using LLMs than others who are less skeptical. It's a benefit here to be skeptical, not a drawback.
And yes, it isn't trivial to verify work that others have done for you when you have a concrete idea of exactly how it should be. But just as I have managed to work with outsourced/contracting developers before, and to collaborate with developers in the same company as me, I also learned to use LLMs in a similar way, where you have to review and ensure the code follows the architecture/design you intended.
> First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"
> Second pass: "Create a function [in language $X] with two arguments, a and b, both using u8, which returns adding those two together"
So it will create two different functions (and LLMs do love to ignore anything that came before and create a lot of stuff from scratch again and again). Now what?
What? No, I think you fundamentally misunderstand what workflow I'm suggesting here.
You ask: "Do X". The LLM obliges and gives you something you don't want. At this point, you don't accept/approve it, so nothing has changed; you still have an empty directory, or whatever.
Then you start a brand new context with an iteration on the prompt: "Do X with Y", and the LLM again tries to do it. If something is wrong, repeat until you get something you're happy with, extract what you can into reusable system/developer prompts, then accept/approve the change.
Then you end up with one change, and one function, exactly as you specified it. Then if you want, you can re-run the exact same prompt, with the exact same context (nothing!) and you'll get the same results.
"LLMs do love to ignore anything that came before" literally cannot happen in this workflow, because there is nothing that "came before".
> No, I think you fundamentally misunderstand what workflow I'm suggesting here.
Ah. Basically meaningless monkey work of babysitting an eager junior developer. And this is for a simple thing like adding two numbers. See how it doesn't scale at all with anything remotely complex?
> "LLMs do love to ignore anything that came before" literally cannot happen in this workflow, because there is nothing that "came before".
Of course it can. Because what came before is the project you're working on. Unless of course you end up specifying every single utility function and every single library call in your specs. Which, once again, doesn't scale.
> See how it doesn't scale at all with anything remotely complex?
No, I don't. Does outsourcing not work for you with "anything remotely complex"? Then yeah, LLMs won't help you, because that's a communication issue. Once you figure out how to communicate, using LLMs even for "anything remotely complex" becomes trivial, but requires an open mind.
> Because what came before is the project you're working on.
Right, if that's what you meant, then yeah, of course they don't ignore the existing code; if there is a function that already does what it needs, it'll use that. If the agent/LLM you use doesn't automatically do this, I suggest you try something better, like Codex or Claude Code.
But anyway, you don't really seem like you're looking to improve; instead you're trying to dismiss the better techniques available, so I'm not even sure why I'm trying to help you here. Hopefully at least someone who wants to improve comes across this, so the whole conversation wasn't a complete waste of time.
Strange. For a simple "add two integers" you now have to do five different updates to the spec to make it unambiguous, restarting the work from scratch (that is, starting a new context) every time.
What happens when your work isn't to add two integers? How many iterations of the spec do you have to do before you arrive at an unambiguous one, and how big will it be?
> Once you figure out how to communicate,
LLMs don't communicate.
> Right, if that's what you meant, then yeah, of course they don't ignore the existing code, if there is a function that already does what it needs, it'll use that.
Of course it won't since LLMs don't learn. When you start a new context, the world doesn't exist. It literally has no idea what does and does not exist in your project.
It may search for some functionality, given a spec/definition/question/brainstorming skill/thinking or planning mode. But it may just as likely not, because there is no actual proper way for anyone to direct it, and the models don't have learning or object permanence.
> If the agent/LLM you use doesn't automatically does this, I suggest you try something better, like Codex or Claude Code.
The most infuriating thing about these conversations is that people hyping AI assume everyone else but them is stupid, or doing something incorrectly.
We are supposed to always believe people who say "LLMs just work", without any doubt, on faith alone.
However, people who do the exact same things, use the exact tools, and see all the problems for what they are? Well, they are stupid idiots with skill issues who don't know anything and probably use GPT 1.0 or something.
Neither Claude nor Codex are magic silver bullets. Claude will happily reinvent any and all functions it wants, and has been doing so since the very first day it was unleashed onto the world.
> But anyways, you don't really seem like you're looking for improving, but instead try to dismiss better techniques available
Yup. Just as I said previously.
There are some magical techniques, and if you don't use them, you're a stupid Luddite idiot.
Doesn't matter that the person talking about these magical techniques completely ignores and misses the whole point of the conversation and is fully prejudiced against you. The person who needs to improve for some vague condescending definition of improvement is you.
Similarly, some humans seem to be unable to, too. The problem is, you need to be good at communication to effectively use LLMs, and judging by this thread, it's pretty clear what the problem is. I hope you figure it out someday, or just ignore LLMs; no one is forcing you to use them (I hope, at least).
I don't mind what you do, and I'm not "hyping LLMs"; I see them as tools that are sometimes applicable. But even to use them in that way, you need to understand how to use them. Then again, maybe you don't want to, and that's fine too.
"However, people who do the exact same things, use the exact tools, and see all the problems for what they are? Well, they are stupid idiots with skill issues who don't know anything and probably use GPT 1.0 or something."
Maybe I'm easily impressed, but that LLMs even work to output basic human-like text is bananas to me, and I do understand a bit of how they work, yet it's still up there with "amazing that huge airplanes can even fly" for me.
YaCy does a pretty good job, is free, and you can run it yourself, so the quality/experience is pretty much up to you. Paired with a local GPT-OSS-120b with reasoning_effort set to high, I'm getting pretty good results. Validated with questions I do know the answer to, and it seems alright, although it could be better of course; I'm still getting better results out of GPT5.2 Pro, which I guess is to be expected.
The point of my comment was that the AI/LLM is almost irrelevant in light of low-quality search engine APIs/indexes. Is there a way to validate the actual quality and comprehensiveness of YaCy beyond anecdata?
Yeah, if that's how you feel about your own abilities, then I guess that's the way it is. Not sure what that has to do with YaCy or my original comment.
I assume that should be qualified with some basic amount of evidence beyond “I said so”? Anyways, thanks for pointing me in the direction of YaCy, will try it out.