KanjiVG – SVGs of Kanji character strokes including order, shape and direction (tagaini.net)
236 points by pabs3 on Feb 21, 2023 | 68 comments



I used this in my Anki deck for John Heisig's Remembering the Kanji. The stroke ordering was helpful and it was animated with CSS. It's been a while, so I can't remember who to give credit to for the RTK deck or the CSS animations.

I also built a little tool that would expand a Kanji character into its constituent radicals and Heisig primitives so I could add them to my study code.

At that time, I was reading NHK Web News Easy on a regular basis and wanted to study those characters with Heisig's visual approach.


I also used this with Anki, but I wanted some interactivity and practice physically writing things out, combined with spaced repetition.

Resulted in some pretty gnarly code using kanjivg, pygame and ankiconnect: https://github.com/eshrh/anki-kunren


There's also a stroke order font which is used in the Anki deck I have for learning kanji. The svgs in the post look great, but as far as I can tell there's not an easy way to copy the actual kanji character (you can get it from the actual SVG source however). The font is nice because as far as the clipboard is concerned it's just the character.

https://www.nihilist.org.uk/


It's generated from KanjiVG though: https://kanjivg.tagaini.net/projects.html


Oh interesting. I skimmed that page to make sure it wasn't already referenced and ironically didn't recognize that the kanji were the same as the link I posted. More practice needed I guess.


The publication of the kanji stroke order font at the nihilist.org.uk site actually predated the publication of the KanjiVG data. The person who made the Kanji stroke order font had access to the KanjiVG data before it was generally published. However, the link to that web site was only added to the list of projects a few months ago.

https://github.com/KanjiVG/kanjivg.github.com/commit/3b72197...


This looks very much like Make Me A Hanzi (MMAH), which is exactly the same (Chinese) characters. It's just that Japanese knows those as Kanji.

https://github.com/skishore/makemeahanzi


(Caveat: there are a handful of cases where the Chinese and Japanese stroke orders differ, so take care treating these as equivalent.)


This. Traditional Chinese and Japanese are two seemingly similar but distinctly different things (the latter is forked from the former).

Simplified Chinese is a whole different beast of corruption that is a fork of Traditional Chinese but otherwise not similar to either.


Kanji were also simplified but not necessarily the same way as simplified Chinese. Simplified Chinese also sometimes uses 'old' characters.

So sometimes modern Japanese is actually similar to simplified Chinese, sometimes it is similar to traditional Chinese, and sometimes it is unique. There is no simple 'fork'.

For instance, 円 (yen) is a Japanese simplification and uniquely Japanese. It used to be the same as traditional Chinese 圓. In Chinese it was separately simplified twice to its current form 元. So when you see prices in 円 in Japan and 元 in China, it's actually the same original character simplified differently.

Interestingly, traditional Chinese 國 was simplified by reusing an old character and is now 国 both in Japanese and simplified Chinese but not in traditional Chinese.

Why not.


Sometimes Blink is similar to WebKit, sometimes it is similar to KHTML and sometimes it is unique.

I think the fork analogy still holds, no one says forks can't have convergent evolution or cross pollination


To further complicate things, there are the "kokuji": characters that were invented in Japan, used only in Japan, and only have Japanese pronunciations, yet are still considered kanji ("Chinese characters"). Examples:

働 "work" 峠 "mountain pass"

https://en.wikipedia.org/wiki/Kanji#Kokuji


I think 'kanji' should be interpreted at large. This is the Chinese writing system, and inventing new characters (which happens everywhere this writing system is used) adds to the whole corpus of kanji/hanzi, even if some are invented or used in specific countries.

峠 has a mandarin pronunciation apparently: https://baike.baidu.com/item/%E5%B3%A0/4336929


>峠 "mountain pass"

Incidentally, the backstory behind that kanji is hilarious.

The kanji is composed of the kanjis for "mountain" on the left side, and "up" and "down" on the upper right and lower right sides respectively.

You know what a mountain pass does? Go up and down a mountain.


This is how a lot of kanji are formed. For example 町 (town) is 田 (rice paddy) + 丁 (street). I guess at some point in language formation a lot of towns were primarily collections of rice paddies.


働 exists in Chinese too, meaning "labor". 峠 does not exist in Chinese, although many Chinese dictionaries (most notably CC-CEDICT) include it.


Definitely not exactly the same. Kanji and Hanzi are two different character sets - they overlap a lot, but each has common everyday characters that aren't in the other, and sometimes the "same" character is in both sets but written differently in various languages (e.g. 骨).


In case anyone is wondering why different glyphs have the same unicode code point, and how an app is supposed to decide which one to render... Well I don't know the reason for the first question actually, though many people appear to have some choice comments.

But as for the second question: for HTML documents, many tags have a lang attribute that decides which version of the glyph to render within that tag. Hacker News has lang="en", so it'll use a user setting to decide. For example, in Firefox's about:config, there's a setting called cjk_pref_fallback_order. If e.g. ja comes first, the little square inside the top square in 骨 is rendered on the right side; if any zh variant comes first, it's rendered on the left side.
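As a rough illustration of the lang attribute point, here is a small Python sketch that wraps the same code point in spans with different language tags. The helper name `tag_with_lang` and the chosen tags (ja, zh-Hans, zh-Hant) are my own picks for the example, and whether the glyph actually changes depends on the browser and the installed fonts.

```python
import html

def tag_with_lang(text: str, lang: str) -> str:
    """Wrap text in a <span> carrying a lang attribute (hypothetical helper)."""
    return f'<span lang="{html.escape(lang, quote=True)}">{html.escape(text)}</span>'

CHAR = "骨"  # one code point, several regional glyph shapes
spans = [tag_with_lang(CHAR, lang) for lang in ("ja", "zh-Hans", "zh-Hant")]

for s in spans:
    print(s)

# The character inside every span is identical; only the language hint differs.
print(f"U+{ord(CHAR):04X}")
```

Pasting the printed spans into an HTML file and opening it in a browser is a quick way to see whether your fonts distinguish the variants.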


> In case anyone is wondering why different glyphs have the same unicode code point, and how an app is supposed to decide which one to render... Well I don't know the reason for the first question actually

https://en.wikipedia.org/wiki/Han_unification

My understanding is that this is basically "white guy says all Asian writing looks the same" in standards form and is largely regarded as a terrible idea.


Unicode had a builtin language tagging system to resolve glyph variants. Han unification was implemented with this in mind. Then the tagging got deprecated in a later version.


The more I learn about unicode, the more it looks like Bad Ideas: The Standard to me. The only good part of it is the UTF-8 encoding and that was just Thompson and Pike sitting down and thinking about the problem for an hour.


UTF-8 is just a simplified improvement on another variable encoding. They didn't conjure it from nothing.


The little inside square is inconsistent in Chinese.

For instance, traditional Chinese in China will have it on the left: 過. Most computer systems will type this one.

But in Taiwan they put it on the right side. That said, I don't entirely understand how it works. You can't even copy/paste the right-hand version into this comment box, for instance, but you can see it on wiki. Maybe they're separate fonts? Really not sure. Maybe somebody knows better.

And simplified is entirely different 过


> This looks very much like Make Me A Hanzi (MMAH), which is exactly the same (Chinese) characters. It's just that Japanese knows those as Kanji.

Well "kanji" literally means "Han character".


That's a neat project. While they are extremely similar, there are still many variations. For example, 今 is written with a horizontal stroke in Japan but a slanted stroke in mainland China.

https://en.wiktionary.org/wiki/%E4%BB%8A#Alternative_forms

https://github.com/KanjiVG/kanjivg/blob/master/kanji/04eca.s...

https://makemeahanzi.herokuapp.com/#/codepoint/20170
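The two project links above seem to identify the same character in different ways: the KanjiVG file name starts with 04eca, which looks like the code point in zero-padded hexadecimal, while the Make Me A Hanzi URL ends in 20170, the same code point in decimal. A minimal Python sketch of that conversion follows. The five-digit lowercase hex convention is only inferred from the URLs in this thread, and the function names are made up for the example.

```python
def kanjivg_stem(char: str) -> str:
    """Five-digit lowercase hex code point, like the '04eca' in the KanjiVG URL.

    Assumption: the padding convention is inferred from links in this thread.
    """
    return f"{ord(char):05x}"

def decimal_codepoint(char: str) -> int:
    """Decimal code point, like the number ending the Make Me A Hanzi URL."""
    return ord(char)

print(kanjivg_stem("今"))       # 04eca
print(decimal_codepoint("今"))  # 20170
```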


If anyone is looking for animated stroke orders for Chinese characters, Hanzi5 is by far the best resource [1]. I built a small app to make them searchable too [2].

1. http://www.hanzi5.com/

2. https://dragonmandarin.com


The above web site offers animated stroke orders, see the "animate" button on

https://kanjivg.tagaini.net/viewer.html

Credit for the project is at the bottom of the page.


The originally posted website offers animated stroke order for japanese kanji, which is not necessarily the same as for chinese characters.

For example, enter 田 into both the original website (https://kanjivg.tagaini.net/viewer.html?kanji=%E7%94%B0) and hanzi5 (http://www.hanzi5.com/bishun/7530.html), and you'll see that the stroke order differs.

There are also several chinese characters which are not present in that japanese viewer, for example 厅


KanjiVG is awesome. I used it for a free kanji app I made for iOS and Android. Figuring out how to write an SVG parser + renderer wasn't as tricky as I thought when I set out to do it. https://www.bjmalicoat.com/projects/kanjibook
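For anyone curious what the parsing side might look like, here is a minimal Python sketch that pulls stroke paths out of an SVG in document order using only the standard library. The inline sample is a made-up two-stroke drawing, not a real KanjiVG file, and it assumes that strokes are stored as path elements whose document order matches the stroke order. Check the actual files before relying on that.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical sample, NOT copied from KanjiVG: two strokes as plain paths.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <g id="strokes">
    <path id="s1" d="M20,30 L90,30"/>
    <path id="s2" d="M55,10 L55,100"/>
  </g>
</svg>"""

def stroke_paths(svg_text: str) -> list[str]:
    """Return the 'd' attribute of every <path>, in document order."""
    root = ET.fromstring(svg_text)
    return [p.get("d", "") for p in root.iter(f"{SVG_NS}path")]

for number, d in enumerate(stroke_paths(SAMPLE), start=1):
    print(number, d)
```

Rendering is then a matter of interpreting each path's drawing commands, which is where most of the work in a hand-written renderer would go.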


Please feel free to add your project to the list at http://kanjivg.tagaini.net/projects.html if you like. You can either announce it to the mailing list or just add a pull request at https://github.com/KanjiVG/kanjivg.github.com/blob/master/pr...


Done! Thanks for letting me know, I totally missed that section.


Very nice. I might even get a sticker :)

One thing that IMO is missing is the audio playback option for readings. This would've greatly facilitated remembering readings. AWS Polly is very easy to integrate with and it costs next to nothing. /nudge /nudge.


Great suggestion. I couldn't find an audio source I liked. I'll take a look, thanks.


On Polly "Takumi/neural" voice is the best option.


I looked around and couldn't find any examples or demo of the SVGs on the website. This is the epitome of a project that would benefit from a visual representation.



There has been a viewer there for a while, which is also the viewer linked from the EDRDG dictionary. See

https://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MKU8cc0

and click "KanjiVG Stroke Diagram".

The listing page was also restored a few days ago:

http://kanjivg.tagaini.net/listing.html

You can click on any character and you should get a page which shows the character

http://kanjivg.tagaini.net/viewer.html?kanji=%E8%B3%80

and can also display its groups by clicking "Show the component groups", and the variant forms if they exist, such as

http://kanjivg.tagaini.net/viewer.html?file=08cc0-Kaisho.svg

and also gives a link to github.com

https://github.com/KanjiVG/kanjivg/blob/master/kanji/08cc0.s...

where the image is also displayed. There is also a new animation feature which was added a couple of weeks ago, and a "Random" button where you can get a random kanji.

There's also a link to Wiktionary on each viewer page

https://en.wiktionary.org/wiki/%E8%B3%80

which was added mostly as a quick way to view the IDS information for the character.
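The viewer links above carry the character percent-encoded as UTF-8 in the kanji query parameter. A small Python sketch that builds such a link, assuming the URL pattern seen in this comment stays stable (the function name is just for illustration):

```python
from urllib.parse import quote

# Base URL as it appears in the comment above.
VIEWER = "http://kanjivg.tagaini.net/viewer.html"

def viewer_url(char: str) -> str:
    """Build a viewer link with the character percent-encoded as UTF-8."""
    return f"{VIEWER}?kanji={quote(char)}"

print(viewer_url("賀"))
```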


To ameliorate this problem, I've moved the "Viewer" link to immediately underneath the "Home" on the left side menu on the latest version of the website.


It’s too bad the pitch data out there for Japanese is either proprietary or pirated without attribution

Another resource: https://github.com/CaptainDario/DaKanji-Single-Kanji-Recogni... Someone made an interesting kanji recognition lib that hasn’t gotten much attention.


> It’s too bad the pitch data out there for Japanese is either proprietary or pirated without attribution

I have serious doubts whether this kind of data is even copyrightable, as long as you're not redistributing it verbatim.

A specific selection of words with pitch data (e.g. the NHK pitch dictionary) might be copyrightable, as there was creative expression involved in picking which exact words to put into the dictionary and in what order. But the data itself? A 猫 is always going to have an HLL pitch accent (in the standard accent). That's a fact. And facts themselves are not copyrightable.

You can't copyright a phone book[1]. Quoting from the case:

> "Notwithstanding a valid copyright, a subsequent compiler remains free to use the facts contained in another's publication to aid in preparing a competing work, so long as the competing work does not feature the same selection and arrangement"

Sounds exactly like the pitch accent data dump that's been floating around which a lot of people use. (Not the verbatim NHK one; the other one.) They probably used the data from NHK's pitch dictionary to compile it along with a few other dictionaries. But does it feature the same selection and arrangement? Nope.

[1] - https://en.wikipedia.org/wiki/Feist_Publications%2C_Inc.%2C_....


I see apps pay to license data from https://www.cjk.org for pitch accent data. I don't know what their data looks like though. Maybe you're right and their business case for this data is essentially a scam, or there is non-factual copyrightable data, or the law is different in Japan, I don't know.


Given a large enough corpus of spoken lines, could you do some ML magic to get the pitch accents (maybe even just FFT and a simple classifier would do)? I'm aware that the "base" pitch accent does change in context so it's not quite trivial, but it seems like you could get pretty close?

Edit: Found https://mizoru.github.io/blog/2021/12/25/Japanese-pitch.html


The pitch variants are also highly regional but that’s an interesting idea. I wouldn’t want to give it to learners in use cases I can think of, when correct data is available though (just with licensing headaches/costs)

The correct data already exists, so I'm not sure what the point is besides having a less accurate but freer option


99% (or more) of the recordings you might find for training would be in the standard accent unless you were specifically digging for regional accents.


it's also changing generationally


It predates modern ML, but that's what Suzuki-kun does:

https://www.gavo.t.u-tokyo.ac.jp/ojad/


I don't think Suzuki-kun was trained on voice data, they trained a classifier on an annotated text corpus. And getting access to such a corpus is probably much harder than finding voice samples.

https://research.ibm.com/publications/accent-sandhi-estimati...


This linked article seems to have gotten special access to the data -- unfortunately only the resulting model is available.


It seems like most people use the NHK pitch accent dictionary (there are TSVs of it online). I don't feel like it's particularly "pirated" though, can you really pirate the way people actually pronounce words?


NHK didn’t give that data an open license

Piracy is rampant, whether it's lifting and reusing/redistributing copyrighted dictionaries or other study materials, or pirating ebook/CD-ROM type content. That’s nice, but as a legit service provider it’s less accessible without just giving users their own ability to side load in materials


I'd never really understood just how complex these characters can be. Sure, there's a lot going on at first glance, but seeing the stroke order and direction, and imagining the process, really hammers it home to this monolingual American.


See also the linked projects (https://kanjivg.tagaini.net/projects.html), which seem like a valuable resource. I only looked at https://www.tanoshiijapanese.com/home/ and already thought this site is just insane with what it offers for free.


KanjiVG is pretty cool. The color coding for radicals and stroke orders is nice. Also, parsing and using the SVGs is fairly straightforward.

Having owned a couple of books that had the stroke orders wrong while I was learning Japanese, I always check for mistakes in the stroke orders of kanji like 右 (right) and 左 (left) to make sure they're correct. KanjiVG gets it right.

On a tangentially related note, I recently purchased an iOS Japanese dictionary app called "Nihongo" (https://apps.apple.com/us/app/nihongo-japanese-dictionary/id...) for my daughter because she wanted to study Japanese. I was just expecting a basic course, but it is probably the best vocabulary/kanji studying app I've ever used. It's a little pricey, but well worth it if you're trying to build a strong Japanese vocabulary or learning to read kanji. I have no affiliation with the people who make the app. I'm just an impressed buyer.


There isn't actually any colour coding in the SVG files. Various viewers add that on, but the SVG files themselves are black and white.

The stroke order is the accepted one for the most part, but it would be going too far to say that it is definitely correct.

https://github.com/KanjiVG/kanjivg/issues?q=is%3Aissue+is%3A...

There are various disputed stroke orders and the radicals are also sometimes disputed. KanjiVG also contains a large number of variant stroke orders which are identified using suffixes.

http://kanjivg.tagaini.net/variants.html


Hmm, this reminds me of Jim Rose's stroke order diagram retrographer-editor[0] (SODER) from some time ago.

[0]: https://archive.org/details/soder


That seems to have been uploaded on February 11 2023. I remember kanjicafe.com, which used to have a giant picture of Jim Rose's face for some reason. The site seems to have gone now.

https://web.archive.org/web/20040925183635/http://kanjicafe....


There's a bunch of 2000s-era Japanese content up at <http://ftp.edrdg.org/pub/Nihongo/00INDEX.html>; I've been uploading some of it to the Internet Archive. The GIF you mentioned is on there, as well as SODER.


I like the stroke orders for the romaji.


I'm not sure why there are stroke order numbers on the alphabet/numerals. Looking at the history of the files,

https://github.com/KanjiVG/kanjivg/commits/master/kanji/0004...

they weren't there originally, then there is a commit where they were added which says "Recover stroke numbers from SVG directory". But in the same commit the stroke orders for kana were also added, so it might have been just a side effect of something useful.

Another thing I don't really understand is why all the ASCII characters were copied into the "wide ascii" positions:

https://github.com/KanjiVG/kanjivg/commits/master/kanji/0ff2...

The commit summary actually says "The ascii characters copied to the full width character positions." which I think was completely pointless. KanjiVG doesn't have the entire JIS character set, since that includes Greek and Russian letters, and various graphical symbols, as well as half-width katakana (narrow katakana), so there wasn't any clear reason to stuff these duplicates into there.

I might bring these two issues up on the mailing list at some point.


Why does stroke order matter? Is it to handle animating them and if each stroke isn’t the same colour?

Sorry but I don’t know much about these kinds of characters.


One thing is balance. If you draw characters out of the typical stroke order, things will often look lopsided or have weird proportions. When written with the proper order, proportions look nicer.

For example, think about writing a capital A. It’ll look different if you draw the middle bar first or if you draw the outer two lines first. F will also look different depending on whether you draw the vertical line or one of the horizontal lines first. Try a Q with the little bottom dash before drawing the circle. It’s not only weird but more difficult.

The difference in these characters is subtle, but you can notice it with your own writing. Now instead of 3 strokes to write a character, imagine those with 15 or even 28 strokes. The odd balance and proportions have cascading effects.


It's partly just that it's tradition, but there are also practical reasons:

1) Chinese characters are traditionally/historically written with a brush, not a modern pen or pencil. Because brushes don't create uniform lines, there is a connection between the specific series of movements and the final appearance of the character (sort of like calligraphy pens). Basically, inconsistent movements tend to produce inconsistent-looking characters, so an agreed-upon standard aids in legibility.

2) Various components (most notably the "radicals") reoccur across many characters. Having a (mostly) consistent set of rules for how each component is written and in which order aids memorization because you're not learning every new character "from scratch".

3) Stroke order affects how mechanically efficient it is to write the character, which can be a pretty big deal when some of the more complex characters are upwards of a dozen strokes.


One thing not mentioned in other comments is that stroke order also helps with software that wants to do character recognition.

If, for example, you are taking handwritten notes on an ipad, and want software to convert the notes into text... well, knowing the order of strokes and having an agreed upon order helps considerably over just trying to match shapes.

Digital dictionaries also usually have a "handwritten input" mode to look up a character, and that mode will also recognize characters much more accurately when input with correct stroke order.


Because when you are handwriting, the stroke order is very important.


Very important due to, tradition?


As a beginner Mandarin learner, my understanding is that historically, people wrote using the traditional stroke order, this informed what people think of as the aesthetically pleasing or "correct" way that the characters look. Now, if you want to write the characters in a legible and aesthetically pleasing way, the easiest method is to write them in the traditional stroke order. I think it's analogous to the way cursive writing in the west was taught, which informed the way it was written and what people thought of as the "correct" way to write cursive. If you wanted to learn to write in cursive, you could just look at existing cursive writing and try to copy it, but if for example you guessed that you should write it from right to left then you'd probably find it harder because cursive evolved to be written from left to right.

You can normally tell when someone uses the incorrect stroke order because things will be the wrong size. For example, when writing 因 you're supposed to write the outer ㄇ first, then the inner 大 and then the bottom horizontal stroke of the 口. If you start with the 大 then it's harder to write the outer 口 the right size.

Again, this is all from a beginner, so take it with a good amount of salt.


Writing by hand leads to optimizations, such as not lifting the pen a lot. This means that the movement between strokes also gets drawn. A 10 stroke character might end up being one single continuous path (effectively 10 strokes + 9 connections). Depending on stroke order, the results can be wildly different. So consistency makes sense for the result to be intelligible.


IMHO it’s high time China and Japan went the enlightened Hangul way Korea took half a millennium ago. There’s no reason to keep the absurdity going any longer; even they don’t know how to type their own words and use pinyin as input. The Vietnamese way would also be easy, with their explicit tones written on each vowel; however, they lost the advantages that blocky characters offer.



