>One of my main motivations for making this library was dealing with the frequent case where using Japanese text isn't an option for technical reasons, or it is an option but comes with downsides.
The annoying thing is that the Japanese are equally guilty of this. I live in Finland so Ä/Ö are used; I have never ever found a single Japanese online store that accepts Ä/Ö in input (not even Amazon.co.jp). Add a name and address and you're suddenly given a "Please enter only English characters" error.
You'd think the Japanese of all people would know how annoying it is when systems accept only a very limited number of character sets.
I wonder how much of this phenomenon is due to some key system somewhere using Shift-JIS or EUC-JP, leading to the restriction to ASCII-ish Latin characters.
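To make the encoding theory concrete, here is a minimal sketch using Python's built-in `shift_jis` codec. The helper name `encodable` is made up for illustration, and nothing here says this is what any particular store actually does; it only shows that a Shift-JIS backend has no slot for Ä/Ö while plain ASCII and katakana go through fine.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical customer names; only the one with "ä" is rejected.
for name in ["Makinen", "Mäkinen", "マキネン"]:
    print(name, encodable(name, "shift_jis"))
```

If a system like this sits anywhere in the order pipeline, rejecting everything outside A-Z at the form is the lazy but safe way to avoid mojibake further down.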
“English characters” are a part of the Japanese written language at this point, so it makes sense from that perspective. They’re consistent in supporting their own written language, far less so any others.
Not really. I'm using Amazon.co.jp in English and I can't enter my address in katakana. Full-width ASCII is also out. Only A-Z here.
It's kinda strange since Amazon.co.jp is really easy and handy for international shopping since you can even use it in English. They even precollect the Finnish VAT and DHL ships stuff from there super fast (package leaves Japan on a Friday, is at my door on Monday).
I agree that it’s silly Amazon Japan doesn’t support this, but was speaking to the broader point that “Japanese, of all people, should know...”
Failing to provide support or consideration to out-groups is quite a common feature of Japanese society. Difficulties arising from those situations tend to be blamed on those on the receiving end of them for being “other.”
I guess devs are thinking there are only two modes: half-width == “English” == `char`, and the full Unicode can-of-worms mode, and they DGAF. Yeah, that’s embarrassing, I feel sorry for you.
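For what it's worth, the middle ground between "A-Z only" and "full can of worms" is not much code. A rough sketch with the standard `unicodedata` module; the function names are my own, and whether dropping diacritics is acceptable is a separate question (Ä is a distinct letter in Finnish, so this is lossy).

```python
import unicodedata


def to_halfwidth(text: str) -> str:
    """Fold full-width Latin letters (and half-width kana) to their standard forms."""
    return unicodedata.normalize("NFKC", text)


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop combining marks. Lossy by design."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_halfwidth("Ｔｏｋｙｏ"))
print(strip_diacritics("Mäkinen"))
```

A form could accept the input as typed, store it, and only fold it like this for the legacy system that can't cope.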
I just want to chime in and say that the author knows what he is talking about when it comes to Japanese and NLP.
Like many other commenters, I am also not sure why you’d want “katsu” transliterated to “cutlet”, but the author didn’t choose to have a tool to do this out of ignorance of the Japanese language.
Regarding the name, it's true that the word "cutlet" isn't typically an example of where you'd want to use the foreign spelling functionality. However it had several other nice points:
1. It follows the tradition of naming Japanese language tools after food the author likes, like MeCab, Sudachi, my own fugashi, etc.
2. It has a concrete image that's easy to convey, which is useful mnemonically
3. It's a very clear demonstration of the foreign spelling feature (even if not a useful one)
4. The word "katsu" is a good example for showing the difference between Hepburn and other romanization systems (katsu/katu)
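To illustrate point 4, here is a toy table-driven sketch of the Hepburn versus Kunrei difference. It is not how cutlet works internally; the `KANA` table and `romanize` function are made up for this example and cover only a handful of katakana.

```python
# Each kana maps to (Hepburn, Kunrei-shiki). Only a tiny illustrative subset.
KANA = {
    "カ": ("ka", "ka"),
    "ツ": ("tsu", "tu"),
    "シ": ("shi", "si"),
    "チ": ("chi", "ti"),
    "フ": ("fu", "hu"),
    "ジ": ("ji", "zi"),
}


def romanize(word: str, system: str = "hepburn") -> str:
    """Romanize a word made only of kana present in the toy table."""
    index = {"hepburn": 0, "kunrei": 1}[system]
    return "".join(KANA[ch][index] for ch in word)


print(romanize("カツ", "hepburn"), romanize("カツ", "kunrei"))
print(romanize("フジ", "hepburn"), romanize("フジ", "kunrei"))
```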
From Wikipedia, that's technically and etymologically correct:
"The cutlet was introduced to Japan during the Meiji period, in a Western cuisine restaurant in the fashionable Ginza district of Tokyo. The Japanese pronunciation of cutlet is katsuretsu.
In Japanese cuisine, katsuretsu or shorter katsu is actually the name for a Japanese version of the Wiener schnitzel, a breaded cutlet. Dishes with katsu include tonkatsu and katsudon."
TIL, but in my defense, an abbreviation of a transliteration, which has come to mean something slightly different should probably be considered a new word and not be transliterated back into its etymological ancestor.
This is amazing! Years ago I wrote a script to convert my music library to romaji for my car, which would show non-English characters as ?????. So many titles were really just English titles in katakana that came out as gibberish. I had my own hacky exception override dictionary and the script ended up being like 70% exceptions.
If Cutlet had been around, this would have been exactly what I needed.
While I’m not a huge fan of romaji itself, I disagree with converting foreign words from their Japanese spelling to their original spelling.
It might be easier for foreigners to understand the word, but at the same time no Japanese person will ever understand you if you talk about cutlet curry instead of katsukaree. And that’s the kind of situation I’d guess this would be used in: reading something out loud that you don’t know how to read.
TFA says this is optional, but it's a good default. The general case for most loan words is that even native Japanese speakers will find the English spelling much more familiar than romanization. E.g. in any Japanese shop you'll see signs that say "sale" or "bargain", but you'd never see "se-ru" or "bāgen" -- writing that way looks as strange to JP speakers as to anyone else.
"Cutlet" strikes me as an edge case, considering that "katsu" has more or less become a word in its own right. So, perhaps an unfortunate choice of name for this project...
A couple weeks ago I ran into some third party code with translated Japanese comments where "debadora" was left as untranslated romaji. I was puzzled and couldn't find it at first but a little search magic revealed that it is a contraction of "debaisu doraiba" which I suspected was the case. Converting this back to "device driver" is the only way to make it understandable. You can't always reconstruct the original words by playing with the pronunciation.
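A contraction like this is exactly the case where no amount of pronunciation rules helps and you need a lookup table. A minimal sketch of that idea, assuming the romaji for each token comes from some other step; the `FOREIGN_SPELLINGS` table and `render` function are hypothetical names, not anything from the library.

```python
# Hand-maintained overrides: katakana token -> preferred foreign spelling.
FOREIGN_SPELLINGS = {
    "セール": "sale",
    "バーゲン": "bargain",
    "デバドラ": "device driver",  # contraction of デバイスドライバ
}


def render(token: str, romaji: str, use_foreign_spelling: bool = True) -> str:
    """Prefer the override when enabled; otherwise fall back to plain romaji."""
    if use_foreign_spelling and token in FOREIGN_SPELLINGS:
        return FOREIGN_SPELLINGS[token]
    return romaji


print(render("デバドラ", "debadora"))
print(render("デバドラ", "debadora", use_foreign_spelling=False))
```

Keeping the switch as a parameter matches the point made above that this behaviour is optional.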
> I ran into some third party code with translated Japanese comments where "debadora" was left as untranslated romaji. I was puzzled and couldn't find it at first but a little search magic revealed that it is a contraction of "debaisu doraiba" which I suspected was the case. Converting this back to "device driver" is the only way to make it understandable.
Translating debadora to "device driver" is the only way to make it understandable to someone who doesn't speak Japanese.
Then again, translating いとこ to "cousin" is also the only way to make that understandable to someone who doesn't speak Japanese (and does speak English), but the goal is to transliterate it as "itoko", not to translate it as "cousin". Like it or not, the Japanese word is "debadora".
No, the Japanese word is デバドラ. Once you translate or transliterate it, you’ve diverged from the original and there are choices to be made with different answers depending on context and intent.
The example above of “sale” is a good one. In a Japanese shop for Japanese and by Japanese, you will see “sale” but you will never see se-ru.
>> Like it or not, the Japanese word is "debadora".
> No, the Japanese word is デバドラ.
False; these are one and the same claim, not two conflicting claims.
> Once you translate or transliterate it, you’ve diverged from the original
Again, this is wrong. If your "transliteration" has diverged from the original, it's not a transliteration. A transliteration is a reflection of the original, just using a different orthographic system.
I guess you're trying to be pedantic, but that's never useful when talking about humans, especially human language. And even if we tried to remove the human element, the transliteration is clearly not the same or even an accurate reflection; if it were, then transliterating to Japanese and back would get you the same word and this wouldn’t even be a problem.
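The round-trip point is easy to demonstrate: several English words collapse onto the same katakana, so the mapping cannot be inverted mechanically. A small sketch with a hand-picked word list (the `TO_KATAKANA` table and function name are just for illustration):

```python
from collections import defaultdict

# English word -> its usual katakana rendering (tiny illustrative sample).
TO_KATAKANA = {
    "light": "ライト",
    "right": "ライト",
    "bus": "バス",
    "bath": "バス",
    "home": "ホーム",
}


def ambiguous_katakana(mapping: dict) -> dict:
    """Return katakana forms that more than one English word maps to."""
    reverse = defaultdict(list)
    for english, kana in mapping.items():
        reverse[kana].append(english)
    return {kana: words for kana, words in reverse.items() if len(words) > 1}


print(ambiguous_katakana(TO_KATAKANA))
```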
The key factor you seem to be missing is that most of these words are themselves very lossy transliterations of original English words. Some become so ingrained in Japanese that they’re considered Japanese words by most people, and would make sense to leave in their lossy, katakana form even if romanized. As many people have mentioned here, “katsu” is ironically one of those words.
However, words like katsu are the exception and most are still considered English (or other language) words with the katakana being a best-effort representation in Japanese. The links shared by Zarel[0] and numpad0[1] demonstrate this perfectly.
The “original” word being expressed is “home” and “ホーム” is just the best you can do in an official Japanese script. “Ho-mu” isn’t useful for anyone, foreign or Japanese. It’s meaningless to non-Japanese speakers, who aren’t familiar with the lossy transliteration into Japanese that they would need to mentally reverse. Japanese speakers could work it out quickly enough, but it would take more time and cause confusion, since it’s neither of the representations they see every day: the original word “home” or the expected Japanese representation “ホーム.”
I think there’s more of a philosophical disagreement between you two about what makes a given word “Japanese” and how orthographic renderings are determinative of that. You highlight lossiness of many loan words and I personally think that when they’ve reached this level of divergence in Japanese, they become equivalent, regardless of orthography (because the Japanese writing system contains within it Roman script). Are “naive,” “ナイーブ,” and “naiーbu” all Japanese words for example?
I don’t think it’s a settled question one way or the other; it is highly dependent on context and history. For example, there have been several pushes, beginning with the Meiji Restoration, to switch Japan to writing exclusively in romaji. If any had succeeded, we’d view this debate quite differently:
So when Japanese romanization is discussed in English (or any other language written in Latin script), the expressions in question are written in the Latin alphabet, e.g. “katsu” or “furafu-pu”, but maybe it should be explicitly noted, in Wikipedia or somewhere, that this form is almost never used in actual Japanese texts.
When an “alphabet” representation of a word is used in Japanese texts, the version of the word in either the original language or English is used.
As the article points out, this is intended to be used for URLs. Converting foreign words to their original spelling is the convention in Japan when transliterating to English.
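For the URL use case, the step after romanization is ordinary slug generation. A sketch assuming the input has already been romanized to plain ASCII (macrons such as ā would simply be dropped by this version); `slugify` is my own helper, not part of the library.

```python
import re


def slugify(romanized: str) -> str:
    """Lowercase the text and join runs of letters/digits with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", romanized.lower())
    return slug.strip("-")


print(slugify("Cutlet curry wa oishii"))
```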
I don't know Japanese, but aren't there also cases where you would want the original spelling? Glancing at a list of English loanwords in Japanese, I'm guessing that for words like batā 'butter' or chāji 'charge' you could plausibly want the original instead of the romaji.
Perhaps it's hard to say much more without knowing who these transliterations are supposed to serve. If it's a general Japanese-speaking population with limited knowledge of English, I'm not sure whether they'd prefer romaji or the original spelling.
I also wonder if the author had non-English origins of words in mind.
E.g., to how many people is it useful to convert アルバイト from "arubaito" to "Arbeit", especially when the Japanese word has a different connotation from the German (part-time work vs. occupation)?
Or if it's a word where the source language doesn't use the Latin Alphabet. I have low confidence in the accuracy of a Tamil ---> Katakana ---> English conversion.
Another example of loan word not meaning the same thing as the original word: ズボン means trousers, comes from the French jupon, which means... underskirt.
Why would you expect this to be used for pronunciation? That’s exactly where it wouldn’t make sense. But it would be much better for semantic understanding by non-Japanese speakers, or titles and such like my example in another comment.
Ironically, OP itself contains "Katsu curry" which is a half-back-romanization.
I myself am not too sure what to think about カツ becoming Cutlet when カツ is a shortened version of カツレツ, which itself is the katakana version of Cutlet. Technically, カツ would be Cut, but then English doesn't have that notion of using Cut to mean Cutlet.
There are very few real-world uses for kana-romanized words typed out in the English alphabet. Words are either written in katakana, or their original form in the source language is inserted, similar to how formulas appear in English text.
There are plenty of real-world examples of “Beef curry” and “ビーフカレー” used in pairs in Japanese; katsu is usually an extra, so it’s rarer though.
Hello, thanks for trying fugashi. I'm the developer. Note that the most recent version of spaCy, 2.3, uses sudachipy instead of fugashi, so maybe that's the source of your problem. If you are having trouble with fugashi, always feel free to open an issue (and note it's fine to write in Japanese).
I'm an American and a few years ago I started getting into Japanese musicians (Senri Kawaguchi, Kanade Sato, Yoyoka Soma, Rie Suzaku, Juna Serita and others). This got me into reading a little about how the Japanese write and it's crazy.
They have four different systems. There is kanji, which comes from Chinese. There are two different systems (hiragana and katakana) that are phonetic but based on syllables rather than single sounds. And finally they use the Roman script like we do. It must be really hard on the elementary school kids who are trying to learn all of this.
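The four systems routinely show up in a single sentence, which you can see by bucketing characters by Unicode block. A rough sketch; the ranges cover only the main Hiragana, Katakana, and CJK Unified Ideographs blocks, and the function names are made up.

```python
def script_of(ch: str) -> str:
    """Classify one character by its (main) Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def scripts_in(text: str) -> set:
    """Return the set of scripts used in `text`, ignoring punctuation etc."""
    return {script_of(ch) for ch in text} - {"other"}


# "Buy a T-shirt": Latin, katakana, hiragana and kanji in five words' worth of text.
print(scripts_in("Tシャツを買う"))
```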
Obviously, language learners should never use this. They should take a week or two to learn kana, and use a furigana tool instead.
I'm also pretty unconvinced of the article's stated purpose of "readable" URL text. I spot checked a few Japanese news sites (Yomiuri, Asahi, Mainichi), and none of them try to do this. The URLs just have random alphanumeric article IDs. I don't think romaji is valuable for most readers, who probably find it more effort to decipher it than to simply read the Japanese text in the title.
(Notice: "big gangan" rather than "biggu gangan".)
While it's true that Japan tends to use numbers for dynamic content URLs, this is more about (and using CMSes that require them) than users actually preferring them.
Japanese URLs do frequently tend to be in English, though.
I'm not very familiar with CMSs, but what kind of use case would actually benefit from the tool?
Your example actually requires human decision-making, since Big Gangan is a proper noun with a canonical English name. I expect most Japanese devs to be as familiar with romanization as English devs are with spelling and grammar, so a tool shouldn't be needed unless you're dealing with a large amount of text and can allow for errors.
Check japanese.io for a recent product. I haven't used them in years but rikai-chan or rikai-kun used to be relatively useful; it looks like there's a newer thing called yomi-chan too.
If you're comfortable with Python you should be able to use fugashi to throw together a tool pretty quickly. I've thought about doing it before but never got around to it.
That's not a converter if you mix both translation and 'kanji to romaji' at the same time in the same word. That's like mixing two sets of tools in one when only one was needed.