I was confused at first. They wrote, "In this paper we present a new method for automatic transliteration and segmentation of Unicode cuneiform glyphs using Natural Language Processing (NLP) techniques". So my reaction was, "huh, why did they need to do any segmentation and transliteration with Unicode?". Turns out, they were doing segmentation and then transliteration from the image of cuneiform tablets to Unicode (cuneiform) glyphs/characters.
This is not what they did. They took Unicode and separated it into words.
Which seems silly since the online archives are already separated, but I guess it's useful for new texts where the word boundaries are not clear.
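To make "took Unicode and separated it into words" concrete, here is a minimal sketch of word segmentation over an unspaced string of cuneiform code points. It is not the paper's method: it uses a plain greedy longest-match against a lexicon as a stand-in. The lexicon, the sign choices, and the `segment` function are all hypothetical; the signs are arbitrary picks from the Unicode Cuneiform block, not real Akkadian words.

```python
# Hypothetical toy lexicon built from arbitrary signs in the Unicode
# Cuneiform block (U+12000 to U+123FF). These are NOT real Akkadian words.
A, B, C, D = (chr(0x12000 + i) for i in range(4))
LEXICON = {A + B, C, A + B + C, D + A}


def segment(signs, lexicon=LEXICON):
    """Greedy longest-match segmentation of an unspaced sign string.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, emit the single sign as an unknown word.
    """
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(signs):
        # Try the longest possible candidate first, then shrink.
        for n in range(min(max_len, len(signs) - i), 0, -1):
            candidate = signs[i:i + n]
            if candidate in lexicon:
                break
        else:
            # No lexicon entry starts here: fall back to one sign.
            candidate = signs[i]
        words.append(candidate)
        i += len(candidate)
    return words


if __name__ == "__main__":
    # Print the segmented words separated by spaces.
    print(" ".join(segment(A + B + D + A + C)))
```

Greedy matching is the simplest possible baseline and fails whenever a shorter word should win over a longer one, which is presumably why statistical or neural sequence models are a better fit for texts where the boundaries really are unclear.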
I find this application exciting. Using natural language processing to crack Akkadian cuneiform is like equipping a historian with a high-speed translation engine. What once took scholars painstaking years of decoding complex logograms and syntax now gets a digital boost, with NLP stepping in and saying, "Let’s breeze through these ancient texts like it’s a weekend crossword puzzle."
How did I not know about this 2020 paper before?
I wrote a blog post about using ChatGPT for translating Akkadian text in 2023 but was unaware of this research.
I’m sure that all of the ancient scripts will be fed into an LLM or equivalent eventually - it will be fascinating to see if this will give new insights into the evolution of language and culture - perhaps even allowing understanding of undeciphered scripts such as Linear A.
Linear A is the undeciphered script with the largest repertoire, about 8000 characters in total, and the largest single fragment is like 300 characters long. It's like trying to understand English given random clippings of signs and maybe a whole sentence from a book that in total amount to 5 or 6 pages of text. And we have a head-start--we make the educated guess of Linear A's phonetic values by comparing to Linear B, which was derived from it. The situation for the other undeciphered scripts is even worse.
It's hard to see how LLMs can help here, because the problem with understanding these scripts is that we just lack enough data to draw any conclusions. And LLMs are famously data-hungry.
Rongorongo has roughly double that number of glyphs. The descendant (Rapa Nui) of the language it encodes is still spoken, and there's a whole family of related languages. But we haven't made much headway in deciphering it. Linear A has one advantage though: we have some understanding of how the script works, and can (or think we can) pronounce parts of the text. There's also Etruscan, which is partially deciphered (about 250 words are understood with any certainty), but it has no surviving relatives and only a couple of bilingual texts, one very short and the other not a literal translation of the known language. So all we have to go on is textual and archaeological context.
Different languages have similar words, people make similar expressions and do similar things. I think everything to solve the puzzle is there, it's just impossibly hard.
But who knows, there might be embarrassingly obvious things we just didn't notice.
At least we will get funny hallucinations of the quality of Hindu Rongorongo. lol
Cuneiform does strike me as potentially the most awesome application of AI. Very exciting. (Second only to talking with animals.)
The spoken language corresponding to Linear A is probably unrelated to the spoken language (Greek) represented by Linear B. If Linear A represents a spoken language--and this is by no means certain--then it is the native language of the Minoans, which has long since been lost. So unfortunately we don't have much to go on.
I know someone who is working on a general LLM to bring back dead languages. They have had great success using old biblical manuscripts and the LLM picks up on the syntax and grammar.
It would certainly be newsworthy, but I'm not sure about revolutionary. We only have essentially a few pages' worth of Linear A, and who knows what's written with it.
Undoubtedly there are tablets and scripts that have been sitting unread in an archive somewhere, in a deciphered language, that we just haven't gotten around to analyzing yet.
It'd be funny if it turned out to be an epic-style rickroll:
"For many moons and many suns have passed, as time, relentless, flows,
Since first their hearts, like twin flames, danced in love’s eternal fire.
Yet, the hero, of noble heart and steadfast will, did speak,
"My heart yearns for a bond unbroken, a pledge that none may sever.
No fleeting fancy, but a troth eternal, one that the gods themselves would envy.
For such a vow, none but I could offer, no other man could dare."
With eyes like the stars in the firmament, he gazed into the soul of his beloved,
And spake thus, with the voice of thunder, yet with the gentleness of the west wind:
"Hear me, O cherished one, for I must reveal the depths of my heart.
Never shall I forsake thee, never shall I let thee fall,
I shall not stray into the shadows nor abandon thee in the wilderness,
Never shall I bring sorrow to thine eyes, nor bid thee farewell.
Never shall my lips craft falsehoods that wound like the sharpest spear."
Through the ages, their bond had grown strong as the oak,
Though the maiden's heart did ache with a silent pain,
For fear held her tongue captive, her love hidden in the shadows.
But the hero, wise as the elders, knew the truth of their shared plight,
For both had danced the dance of fate, and knew well the game they played.
Then again, he spake, his voice a balm to the troubled heart:
"Do not be blind to what lies before us, for the path is clear,
And if thou wouldst ask of my heart's burden, know this:
I shall not waver, nor falter in my love for thee.
Never shall I forsake thee, never shall I let thee fall,
I shall not stray into the shadows nor abandon thee in the wilderness,
Never shall I bring sorrow to thine eyes, nor bid thee farewell.
Never shall my lips craft falsehoods that wound like the sharpest spear."
Through the years they had known each other, their souls intertwined,
Her heart had borne the weight of unspoken love,
Yet now, the veil lifted, they stood as one,
United in purpose, ready to face the trials of love's enduring quest.
And thus, the hero pledged once more, with a heart pure and true,
"Never shall I forsake thee, never shall I let thee fall,
I shall not stray into the shadows nor abandon thee in the wilderness,
Never shall I bring sorrow to thine eyes, nor bid thee farewell.
Never shall my lips craft falsehoods that wound like the sharpest spear."
So spake the hero, and the heavens themselves bore witness,
As love, unyielding, forged a bond that time could not erode.
And thus they walked, hand in hand, into the dawn of a new age,
Their hearts as one, forever bound by the sacred oath."
I can just imagine that somewhere there is a guy trying to twist this into a neovim plugin so our fellow neovimmers can use cuneiform for programming and note-taking.
We are speaking a language that is a similar monstrosity, borrowing both words and structure from so many other languages, pronouncing the vowels differently from everyone else because of the Great Vowel Shift, never having a spelling reform so you have to know where a word came from to know how to spell it.
So I am cool with weird hybrid words that mash together concepts from multiple languages to make puns. I speak the wrong language to be a purist.
No, there are better ways and worse ways to use more distant languages for the development of this one: this Language is very far from being something that could be reduced to cheap puns.
«Hybrid words that mash together concepts from multiple languages» - which is very fine - are not necessarily cheap puns (which the superficial similarity of sounds easily engenders).
But anyway, a coined word suffering from transplant rejection is "not the end of the world" - provided it remains a proper name for a specific item, a brand.
For your amusement: What a physics textbook might look like without borrowing words from French, Greek, or Latin. [0] Such as isotopes that are unstable:
> Most samesteads of every firststuff are unabiding. Their kernels break up, each at its own speed. This speed is written as the half-life, which is how long it takes half of any deal of the samestead thus to shift itself. The doing is known as lightrotting. It may happen fast or slowly, and in any of sundry ways, offhanging on the makeup of the kernel. A kernel may spit out two firstbits with two neitherbits, that is, a sunstuff kernel, thus leaping two steads back in the roundaround board and four weights back in heaviness. It may give off a bernstonebit from a neitherbit, which thereby becomes a firstbit and thrusts the uncleft one stead up in the board while keeping the same weight. It may give off a forwardbit, which is a mote with the same weight as a bernstonebit but a forward lading, and thereby spring one stead down in the board while keeping the same weight. Often, too, a mote is given off with neither lading nor heaviness, called the weeneitherbit. In much lightrotting, a mote of light with most short wavelength comes out as well.
> These are lines 31-34 of the second column of Sennacherib’s clay prism, probably from Nineveh, now in the Israel Museum (IMJ 71.072.0249). The text records eight campaigns of the Assyrian King, including the siege of Jerusalem which is well known from the Book of Kings. The line reads: ‘On my return march, I received a heavy tribute from the distant Medes, of whose land none of the kings, my ancestors, had heard mention.’ (translation adapted from A.K. Grayson and J. Novotny’s edition available on ORACC, Q003497).
Basically an OCR for cuneiform :)