With ollama you can offload a few layers to the CPU if they don't fit in VRAM. This costs some performance of course, but it's much better than the alternative (everything on the CPU).
I'm doing that with a 12GB card; ollama supports it out of the box.
For some reason it only uses around 7GB of VRAM, probably due to how the layers are scheduled. Maybe I could tweak something there, but I didn't bother just for testing.
Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.
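If you want to control the split yourself rather than rely on ollama's automatic scheduling, you can (as far as I know) pass the num_gpu option, i.e. the number of layers to keep on the GPU, through its HTTP API. A minimal sketch, assuming a local ollama server on the default port; the model name here is just a placeholder for whatever you've pulled:

```python
# Minimal sketch: ask a local ollama server to put only some layers on the GPU.
# Assumes ollama is running on its default port; adjust num_gpu to fit your VRAM.
import json
import urllib.request

payload = {
    "model": "llama3:8b",        # placeholder model name, replace with yours
    "prompt": "Explain GPU layer offloading in one sentence.",
    "stream": False,
    "options": {"num_gpu": 20},  # layers kept on the GPU; the rest run on the CPU
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```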
It's quite telling that these discussions often end up at the conclusion that we are becoming a developing (or third world) country again, and not a Star Trek society.
arXiv has about 2.6M articles; assuming about 10 pages per article, that's 26M pages. According to OpenAI, their cheapest embedding model (text-embedding-3-small) costs a dollar for 62.5K pages. So the price of calculating embeddings for the whole of arXiv is about $416.
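Back-of-the-envelope, in case anyone wants to plug in their own numbers (the article count and pages-per-article are rough assumptions, and the per-dollar page rate is my reading of OpenAI's quoted pricing for text-embedding-3-small):

```python
# Rough cost estimate for embedding all of arXiv, using the numbers above.
articles = 2_600_000        # ~2.6M arXiv articles (approximate)
pages_per_article = 10      # rough assumption
pages = articles * pages_per_article

pages_per_dollar = 62_500   # quoted rate for text-embedding-3-small
cost = pages / pages_per_dollar
print(f"{pages:,} pages -> ~${cost:,.0f}")  # 26,000,000 pages -> ~$416
```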
I think doing it locally with an open source model would be a lot cheaper as well, especially because they wouldn't have to keep calling OpenAI's API for each new query.
Edit: I overlooked the about page (https://searchthearxiv.com/about). It seems like they *are* using OpenAI's API, but they only have 300K papers indexed, use an older embedding model, and only calculate embeddings on the abstracts. So this should be pretty cheap.
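For reference, the kind of local setup I had in mind is something like the sketch below, using the sentence-transformers library. The model choice is just an example, not what searchthearxiv actually uses:

```python
# Minimal sketch of embedding abstracts locally instead of via an API.
# Model choice is an arbitrary example; any sentence-transformers model works.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

abstracts = [
    "We propose a new method for ...",
    "This paper studies the asymptotic behaviour of ...",
]

# Embed the corpus once up front, then embed each query as it comes in.
corpus_emb = model.encode(abstracts, normalize_embeddings=True)
query_emb = model.encode(["papers about asymptotic analysis"], normalize_embeddings=True)

# Cosine similarity (embeddings are normalized, so a dot product suffices).
scores = corpus_emb @ query_emb.T
print(abstracts[int(np.argmax(scores))])
```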
It makes sense that humans would have been using it, though; ChatGPT learned from us, after all.