It's possible to identify a surprisingly large number of matching words by learning a linear transformation mapping word vectors from two different languages into the same space (e.g. https://arxiv.org/abs/1805.06297 ).
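To give a sense of what that looks like in practice, here's a toy numpy sketch of the simplest version of the trick: learn an orthogonal map between the two embedding spaces from a small seed dictionary (orthogonal Procrustes), then propose new word pairs by nearest-neighbour search in the shared space. The linked paper does something more elaborate (and IIRC doesn't need a seed dictionary), and all the names below are made up for illustration:

```python
# Toy sketch, not the method from the linked paper: align two embedding
# spaces with a linear map learned from a handful of known word pairs,
# then look for new matches by nearest neighbour in the shared space.
import numpy as np

def learn_mapping(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Orthogonal W minimising ||src_vecs @ W - tgt_vecs||_F (Procrustes)."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

def nearest_word(src_vec: np.ndarray, W: np.ndarray,
                 tgt_matrix: np.ndarray, tgt_words: list[str]) -> str:
    """Map one source-language vector across and return the closest target word."""
    mapped = src_vec @ W
    sims = (tgt_matrix @ mapped) / (
        np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return tgt_words[int(np.argmax(sims))]
```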
But the problem with ancient languages is typically that there's not enough data to usefully constrain a large enough model. Doubly so for undeciphered scripts where scholars might not even agree on how many different letters there are.
Presumably, they'd want to get at embeddings and compare the spaces somehow, along the lines of: 'the relations between tokens a, b, c in this model are close to the relations between tokens a1, b1, c1 in a similar model trained on texts in a known language of (apparently) the same family, and likewise up to aN, bN, cN; and out of these N candidate correspondences, candidate X makes the most sense given existing examples'.
(As you can tell, the argument involves some handwaving, but it may be possible?)
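To make the handwaving slightly more concrete, here's one toy way to score that kind of relational similarity without even aligning the two spaces first: compare the internal similarity structure of a token tuple from the unknown-script model against candidate tuples from the known-language model, and rank the candidates. Purely illustrative, every name here is invented:

```python
# Hand-wavy sketch of the 'relations between tokens' comparison: rank
# candidate tuples from the known-language model by how closely their
# pairwise similarity structure matches that of the unknown-script tuple.
import numpy as np

def cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def relation_signature(vecs: list[np.ndarray]) -> np.ndarray:
    """Pairwise cosine similarities within a tuple of token vectors."""
    n = len(vecs)
    return np.array([cos(vecs[i], vecs[j])
                     for i in range(n) for j in range(i + 1, n)])

def rank_candidates(unknown_tuple: list[np.ndarray],
                    candidate_tuples: list[list[np.ndarray]]) -> list[int]:
    """Indices of candidate tuples, best-matching relational structure first."""
    target = relation_signature(unknown_tuple)
    dists = [np.linalg.norm(relation_signature(c) - target)
             for c in candidate_tuples]
    return list(np.argsort(dists))
```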
In English. The decoder translates the Dhofari into tokens the LLM understands. So you present the LLM with the decoded Dhofari and a question in English, like "Please express the following in modern English", and the LLM would answer in English. There's also a chance the decoded Dhofari would be intelligible to humans directly, though I don't know how large that chance is.
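The pipeline I have in mind is roughly this shape; both functions below are placeholders for a trained decoder and whatever LLM interface you have, not real APIs:

```python
# Toy sketch of the flow described above; both pieces are stand-ins.
def decode_dhofari(text: str) -> str:
    """Placeholder for the decoder that maps Dhofari into tokens the LLM understands."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder for whatever LLM interface is available."""
    raise NotImplementedError

def render_in_english(dhofari_text: str) -> str:
    decoded = decode_dhofari(dhofari_text)
    prompt = "Please express the following in modern English:\n" + decoded
    return ask_llm(prompt)  # the answer comes back in English
```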