lufenialif2's comments | Hacker News

What kinds of services would you pay for that don’t already exist?

In general, it's difficult to find services that are high-quality and high-trust.

Possibly unlikely to occur if prompt injection remains possible. I’ll just have my counterparty AI prompt-inject yours to negotiate a better deal on my behalf.

I’d imagine it’s less about taxes and more that you want to buy a nice house in the Bay Area, where a lot of people are high earners and would be driving up prices given the low supply.


Yes, and for some weird reason, the Bay and all the nice places to live are all single-family and expensive as hell. Just build some Soviet- or Chinese-style apartment blocks and give people housing like Singapore does; it's not that hard. This is not a Democrat or Republican issue, it is a have-versus-have-not issue.

The logical conclusion is that the residents of these desirable areas like the Bay / San Diego / Seattle / DC actually want housing prices to stay high.


Building giant apartments would change the vibe of the Bay, though, and my guess is that some of the people who want to live there also want to live in it as it is now, not what it would be with high-rise apartments, etc. There’s probably a way to do it well, but it’s a pretty heavy lift versus doing nothing, which is the current status quo.

It also doesn’t help that there’s a lot of red tape, as the other commenter mentioned.


I mean... some people would prefer to live next to a forest or grassland, but nope, houses were built there, because people needed somewhere to live. Now that's not enough, larger buildings are needed, and that includes socialist-style apartment blocks.

I live in a former socialist country (well, part of a country; the country does not exist anymore), and when we needed more housing, we designated land in the city for housing, i.e. large socialist apartment blocks. Then the 1990s came, no more socialism, capitalism now, and no more large building projects, no new neighborhoods. So now we have cows and cornfields in what would be prime real estate because the government won't change the zoning, all three neighbors there complain, and apartments that used to be maybe 120k EUR 20 years ago are now close to 500k EUR.

If you want to live next to cows, move to a village; thousands want apartment buildings there so they can live in a city.


>it's not that hard.

Repealing all the bullcrap from the last 50 years that makes that artificially expensive, to the point of being a non-starter if not outright illegal, is the hard part.



Advanced Gemini Advance Enterprise Boy Advanced Edition 3 (feat. Pitbull) & Knuckles


I sent this to accounting friends, and it aligns with what I've been going through trying to use LLMs to create a game from scratch. It seems like the current best use case for language models (even with agent mode) is to feed it exactly what you want to get out, essentially turning it into a better autocomplete. Still saves tons of time, but it isn't a panacea.


I'm not even sure it saves a ton of time to be honest. It sure _feels_ like I spend more time writing up tasks and researching/debugging hallucinations than just doing the thing myself.


This is consistently my experience too; I'm seriously just baffled by reports of time saved. I think it costs me more time cleaning up its mistakes than it saves me by solving my problems.


There's really pernicious stuff I've noticed cropping up too, over the months of use.

Not just subtle bugs, but unused variables (with names that seem to indicate some important use), comments that don't accurately describe the lines of code they precede, and other things that feel very 'uncanny.'

The problem is, the code often looks really good at first glance. Generally LLMs produce well-structured code with good naming conventions, etc.


I think people are doing one of several things to get value:

0. Use it for research and prototyping, aka throwaway stuff.

2. Use it for studying an existing, complex project. More or less read only or very limited writes.

3. Use it for simple stuff they don't care much about and can validate quickly and reasonably accurately, the standard examples are CLI scripts and GUI layouts.

4. Segment the area in which the LLM works very precisely. Small functions, small modules, ideally they add tests from another source.

5. Boilerplate.

There can be a lot of value in those areas.


What about 1. ?


7 8 1 :-p


I've found that the shorter the "task horizon", the more time is saved.

Essentially, a longer horizon increases the chances of mistakes, increasing the time needed to find and fix them. So at some point that becomes greater than the time saved by not having to do it myself.

This is why I'm not bullish on AI agents: the task horizon is too long and too dynamic.


So here's my problem, ultimately

If the task horizon for the LLM is shorter than writing it yourself, this likely means that the task is well defined and has an easy-to-access answer.

For this type of common, well-defined task, we shouldn't be comparing "how long it takes for the LLM" against "how long it takes to write".

We should be comparing against "how long it takes to find the right answer on SO".

If you use this metric, I bet you the best SO answer, which is also likely the first Google result, is just as fast as the LLM. Maybe faster.


The reports of time saved are so cooked it's not funny. It's just part of the overall AI grift going on - the actual productivity gains will shake out in the next couple of years; we just have to live through the current "game changer" and "paradigm shifting event" nonsense the upper-management types and VCs are pushing.

When I see stuff like "Amazon saved 4500 dev years of effort by using AI", I know it's on stuff that we would use automation for anyways so it's not really THAT big of a difference over what we've done in the past. But it sounds better if we just pretend like we can compare AI solutions to literally having thousands of developers write Java SDK upgrades manually.


This is exactly right. Remember, these models were trained to be functions: f(x) = y. That's an interface at its heart. When x and y are language, then it's a translator.

They have emergent capabilities, like "translating" instructions/questions in X to the probable answers in Y, but I think people are getting way, way ahead of themselves with those. These things still fundamentally can't think, and we can try to mimic thinking with scaffolding, but then you're just going to learn the bitter lesson again.


I feel it does essentially save a lot of time in bookkeeping, but doesn’t negate the need for a human bookkeeper who knows what they’re doing.


"a better auto complete" than what, specifically?


Still no information on the amount of compute needed; would be interested to see a breakdown from Google or OpenAI on what it took to achieve this feat.

Something that was hotly debated in the thread with OpenAI's results:

"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."

It seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Doesn't diminish the result, but doesn't seem too different from classical ML techniques if quality of data in = quality of data out.


Ok, but when reported by mass media, which never uses SI units and instead uses units like Libraries of Congress or elephants, what kind of unit should the media use to compare the computational energy of AI vs. children?


If the models that got a gold medal are anything like those used on ARC-AGI, then you can bet they wrote an insane amount of text trying to reason their ways through these problems. Like, several bookshelves worth of writings.

So funnily enough, "the AI wrote x times the Library of Congress to get there" is a good enough comparison.


Dollars of compute at market rate is what I'd like to see, to check whether calling this tool would cost $100 or $100,000


4.5 hours × 2 "days", at 100 watts including the support system.

I'm not sure how to implement the "no calculator" rule :) but for this kind of problem it's not critical.

Total = 900 Wh = 3.24 MJ
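
Spelling out that arithmetic, here's a minimal sketch in Python (the 100 W figure for a contestant plus "support system" is of course a rough assumption):

    hours = 4.5 * 2            # two 4.5-hour exam sessions
    power_watts = 100          # rough assumption: contestant plus "support system"
    energy_wh = hours * power_watts        # 9 h * 100 W = 900 Wh
    energy_mj = energy_wh * 3600 / 1e6     # 900 Wh * 3600 J/Wh = 3.24 MJ
    print(energy_wh, "Wh =", energy_mj, "MJ")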


100 watts seems very low. A single Nvidia GeForce RTX 5090 is rated at ~600 watts. Probably they are using many GPUs/TPUs in parallel.


I forgot to explain in my comment, but my calculation is for humans.

If the computer uses ~600 W, let's give it 45+45 minutes and we are even (600 W × 1.5 h = 900 Wh) :) If they want to use many GPUs...


Convert libraries, elephants, etc. into SI, of course! Otherwise, they aren't really comparable...


Kilocalories. A unit of energy that equals 4184 Joules.


Human IMO contestants are also trained specifically on IMO problems.


They can train it on “Crux Mathematicorum” and similar journals, which are collections of “interesting” problems and their solutions.

https://cms.math.ca/publications/crux


Some unofficial comparison with the costs of public models (which performed worse): https://matharena.ai/imo/

So the real cost is likely much higher.


>It seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Not sure that's exactly what that means. It's already likely the case that these models contained IMO problems and solutions from pretraining. It's possible this means they were present in the system prompt or something similar.


Does the IMO reuse problems? My understanding is that new problems are submitted each year and 6 are selected for each competition. The submitted problems are then published after the IMO has concluded. How would the training data contain unpublished, newly submitted problems?

Obviously the training data contained similar problems, because that's what every IMO participant already studies. It seems unlikely that they had access to the same problems though.


IMO doesn't reuse problems, but Terence Tao has a Mastodon post where he explains that the first five (of six) problems are generally ones where existing techniques can be leveraged to get to the answer. The sixth problem requires considerable originality. Notably, neither Gemini nor OpenAI's model got the sixth problem. Still quite an achievement, though.


Do you have another source for that? I checked his Mastodon feed and don't see any mention about the source of the questions from the IMO.

https://mathstodon.xyz/@tao


Strange statement; it's not true in general (3 and 6 are typically the hardest, but they certainly aren't fundamentally of a different nature from the other questions). This year P6 did seem to be by far the hardest, but this post-hoc statement should be read cautiously.


>How would the training data contain unpublished, newly submitted problems?

I don't think I or the OP suggested it did.


Or that they did significant retraining to boost IMO performance, creating a more specialized model at the cost of general-purpose performance.


I tried recreating this: gemma3n:e4b (ID 15cb39fd9394) with /set parameter seed 42, and I got two different results on the two times I asked 'were the original macs really exactly 9 inch?'

Strangely, I'm only getting two alternating results every time I restart the model. I was not able to get the same result as you, and certainly not with links to external sources. Is there anything else I could do to try to replicate your result?

I've only used ChatGPT prior, and it'd be nice to use locally run models with consistent results.


Hmm, that's pretty strange. Not sure what might be going wrong, could be terminal shenanigans or a genuine bug in ollama.

Of course, double-checking the basics would be a good thing to cover: ollama --version should return ollama 0.9.3, and the prompt should be copied and pasted to ensure it's a byte-exact match.

Maybe you could also try querying the model through its API (localhost:11434/api/generate)? I'll ask a colleague to try and repro on his Mac like last time just to double check. I also tried restarting the model a few times, worked as expected here.
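
For reference, here's a minimal sketch in Python of what that API call might look like (assuming ollama is on its default local port and using the seed/temperature options of /api/generate; the prompt is the question from upthread, not necessarily the exact one from my original experiment):

    import requests

    # Query ollama's /api/generate endpoint with a pinned seed and zero
    # temperature, then print the single (non-streamed) response.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3n:e4b",
            "prompt": "were the original macs really exactly 9 inch?",
            "stream": False,
            "options": {"seed": 42, "temperature": 0},
        },
        timeout=300,
    )
    print(resp.json()["response"])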

*Update:* getting some serious reproducibility issues across different hardware. A month ago the same experiment with regular quantized gemma3 worked fine between GPU, CPU, and my colleague's Mac, this time the responses differ everywhere (although they are consistent between resets on the same hw). Seems like this model may be more sensitive to hardware differences? I can try generating you a response with regular gemma3 12b qat if you're interested in comparing that.


Yeah trying out gemma3 12b qat sounds great.

I got back 0.9.3, and copied and pasted the prompt (with quotes and without quotes as well, just in case...)

I can try the API as well. I'm using a Legion 15ACH6, but I could also try on my MacBook Pro.


Okay, this has been a ride.

Reverted to ollama 0.8.0, switched to gemma3:12b-it-qat for the model, set the seed to 42 and the temp to 0, and used my old prompt. This way I was able to get consistent results everywhere, and could confirm from old screenshots that everything still matches.

Prompt and output here: https://pastebin.com/xUi3bbGh

However, when using the prompt I used previously in this thread, I'm getting a different response between machines, even with the temp and seed pinned. On the same machine, I initially found that it's reliably the same, but after running it a good few times more, I was eventually able to get the flip-flopping behavior you describe.

API-wise, I just straight up wasn't able to get consistent results at all, so that was a complete bust.

Ultimately, it seems like I successfully fooled myself in the past and accidentally cherry-picked an example? Or at least it's way more brittle than I thought. At this point I'd need significantly more insight into how the inference engine (ollama) works to definitively ascertain whether this is a model or an engine trait, and whether it is essential for the model to work (although I'm still convinced it isn't). Not sure if that helps you much in practice, though.

I wouldn't make a good scientist, apparently :)


I appreciate the effort! And I disagree - that's what it's all about haha

I assume there are more levers we could try pulling to reduce variation? I'll be looking into this as well.

As an aside, because of my own experience with variability using ChatGPT (non-API; I assume there are also more levers to pull there), I've been thinking about LLMs and their application to gaming. To what extent is it possible to use an LLM to interpret a result and then return a variable that drives the usual state updates? This would hopefully add a bit of intentional variability to the game's response to user inputs while keeping consistency in updating the internal game logic.
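
For what it's worth, here's a rough sketch of that idea (assumptions: the same local ollama endpoint and model as above; the action names and state fields are made up purely for illustration). The LLM only picks a label from a fixed set; the actual state updates stay deterministic in code:

    import requests

    # Hypothetical game state and deterministic update rules; only the label
    # chosen by the LLM varies, the updates themselves are fixed code.
    state = {"hp": 10, "gold": 0}
    UPDATES = {
        "attack": lambda s: {**s, "hp": s["hp"] - 2},
        "loot":   lambda s: {**s, "gold": s["gold"] + 5},
        "rest":   lambda s: {**s, "hp": s["hp"] + 1},
    }

    def interpret(player_input: str) -> str:
        """Ask the model to map free-form input to one allowed label."""
        prompt = (
            "Reply with exactly one word from this list and nothing else: "
            "attack, loot, rest.\nPlayer said: " + player_input
        )
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "gemma3n:e4b", "prompt": prompt, "stream": False,
                  "options": {"temperature": 0, "seed": 42}},
            timeout=300,
        )
        label = resp.json()["response"].strip().lower()
        return label if label in UPDATES else "rest"  # safe fallback on unexpected output

    state = UPDATES[interpret("I smash the goblin with my club!")](state)
    print(state)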

edit: found this! https://github.com/rasbt/LLMs-from-scratch/issues/249 Seems that it's an ongoing issue based on various other links I've found, and now when I google "ollama reproducibility" this thread comes up on the first page, so it seems it's an uncommon issue as well :(

