On Anki's Database

kouteiheika · on Feb 23, 2022

I run a website (https://jpdb.io/) which has an Anki importer so I deal with a lot of Anki databases that people send to me and which fail to import, and yeah, Anki's database schema is kind of a mess to be honest. (Which is to be expected for a program of Anki's age and with such a long development history.)

A few extra tidbits:

- A few versions back the cards' "ease" field (that is - how a card was graded) meant something different depending on which phase the card was at the time (so sometimes "2" meant "hard" and sometimes "2" meant "okay"). It was finally fixed and AFAIK in new versions it's consistent now, but apparently the migration didn't always work properly and I still sometimes see databases where the grading is the other way around compared to what it's supposed to be, and I need to heuristically detect that this is the case and handle it.

- Initially JSON blobs were used to store a lot of data; relatively recently that was changed so that it's stored as proper tables, but not completely, so a lot of data's still in the blobs, but this time instead of JSON it's protobuf. (Which seems strange to me considering SQLite has native support for JSON.)

It's a good thing the schema's slowly being cleaned up, but unfortunately it's only done incrementally, so every time any little thing changes I need to add yet another special case to my importer to handle it, and often in various permutations too because some databases are half migrated Frankensteins. (Don't ask me how that happens; I don't know. Maybe it's an issue of people using outdated plugins with their Anki installation, or copying their database between multiple independent Anki implementations, or maybe the current phase of the moon's just wrong.)
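To make the "heuristically detect" part of the first bullet a bit more concrete, here is a minimal Python sketch. It is not what jpdb actually does. It assumes a `revlog` table with `ease` and `type` columns where `type = 0` marks learning-phase reviews (as community documentation of the schema describes it), and it assumes that on the old scale learning reviews never used grade 4. The function names and the threshold of 20 rows are made up for illustration.

```python
import sqlite3


def demo_db(eases):
    """Build a tiny in-memory stand-in for a review log (hypothetical, minimal columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revlog (id INTEGER PRIMARY KEY, ease INTEGER, type INTEGER)"
    )
    # type = 0 is assumed to mean "learning phase".
    conn.executemany(
        "INSERT INTO revlog (ease, type) VALUES (?, 0)", [(e,) for e in eases]
    )
    return conn


def learning_ease_looks_legacy(conn, min_rows=20):
    """Guess whether learning-phase grades still use an old 3-button scale.

    Assumption: if a reasonable number of learning reviews exist and none
    of them has ease 4, the grades were probably recorded on the old scale.
    """
    count, max_ease = conn.execute(
        "SELECT COUNT(*), MAX(ease) FROM revlog WHERE type = 0"
    ).fetchone()
    # Too few rows gives no signal; report "not legacy" in that case.
    return count >= min_rows and max_ease <= 3
```

A real importer would presumably combine several signals like this one, since a user who simply never presses "easy" would trip this check too.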

chaorace · on Feb 23, 2022

I love JPDB, it's cool to randomly see you in the wild!

I can't personally use JPDB (due to my own niche learning strategy, not a flaw in JPDB), but I desperately want to be able to consume the underlying data. It's just that good -- the data that you've curated is unbeatable. If you ever provide a public API, I'll join your Patreon in a heartbeat.

adaszko · on Feb 23, 2022

It's astounding how much quality material and tools exist for learning Japanese, compared to eg Chinese. JPDB looks fantastic.

kouteiheika · on Feb 23, 2022

I've thought about potentially tackling other languages in the future, and I think that would be fun to work on too, but alas, at this point I don't really have the resources to even be able to work on it as-is (since this is currently purely a spare time project, and my TODO list is already hundreds of items long), so I'd be just spreading myself way too thin.

CorrectHorseBat · on Feb 23, 2022

On the other hand, I've never seen something like Pleco for Japanese.

kouteiheika · on Feb 23, 2022

> If you ever provide a public API

Yes, an API will be coming in the future! (:

> due to my own niche learning strategy, not a flaw in JPDB

Just for curiosity's sake - what kind of strategy is it, if I may ask? I have very ambitious plans for the future, so depending on what exactly it is it might be possible someday.

zwayhowder · on Feb 23, 2022

I know I gave jpdb a try a while ago and found that while the dataset is incredible (and like others I'd pay just for it), the built-in tool doesn't work the way I need.

I have always had the most success with Anki and Wanikani when it comes to Japanese. Trying to add in yet another paradigm for learning is frustrating. I appreciate you've put a lot of effort into helping people move from those tools, but I don't want to.

The single biggest reason is offline access. Anki works on my phone on an airplane or in an area with no mobile service. (In Australia there are lots of those).

I only started using WK seriously when I discovered the Android apps that let me do my reviews offline.

If you had a Patreon tier that allowed for Anki exports of your lists I'd sign up in a heartbeat even if it only allowed 1 download per month or something similar. I mean let's be honest, what possible valid need could I have for downloading the whole data set in one go...

Compared to the time it would take me to use Subs2srs across a season of a show I'd rather just give my money to you.

kouteiheika · on Feb 23, 2022

Well, there's nothing wrong with using Anki if it works for you! I know that a lot of folks need offline access and/or other features which Anki provides which I don't have, and that's totally fine.

I always ask this not because I necessarily want to convert people to use my thing, but because I always love to hear what features people need and what I can improve. In case of offline access it is something that's technically on my tentative roadmap, but very far off into the future, so indeed for anyone who needs that Anki's the better choice.

> allowed for Anki exports

That's something that I'm planning to add very soon actually! Well, maybe not exactly Anki exports (I haven't yet researched as to what that would entail), but just generic functionality to be able to export the built-in decks as a .csv (which I'll be happy to tweak/improve to make it easier to import).

chaorace · on Feb 23, 2022

That's exciting to hear!

I'm following an eclectic strategy where I isolate and separately learn spoken and written Japanese. The process looks a little like this:

1. I start with a deck of Anki vocabulary notes that I want to acquire

2. Study begins with "Speech" Anki cards from these notes (The card front is audio-only, including a clip of the word and a clip of an example sentence. The back has the English definition & a helper image). I only consider a card as being "Good" once I am able to recall & replicate the pitch accent with a steady rhythm (I pipe back delayed audio from my microphone while I practice with a metronome running)

3. In parallel, I also do Kanji isolation study using KKLC

4. Each week, I manually enable new "Writing" Anki cards that come from the same set of notes (The writing is on the front. Only the word audio is on the back). I only enable a "Writing" Anki card if I have previously learned BOTH the component Kanji and the spoken word

5. I study my enabled "Writing" Anki cards in parallel with the other two tracks

I like this approach because I effectively have three separate learning tracks that I can switch between -- the variety keeps me motivated. It also helps train your ear to be able to distinguish homophones by pitch and leads you to think of 同訓異字 writings as variations of a spoken word, rather than as true homophones.

kouteiheika · on Feb 23, 2022

Wow, that's indeed a very niche learning strategy. (: I like it though!

I'd definitely like to expand the configurability of jpdb up to a point where you'll actually be able to do something like this in the future. Unfortunately that's not going to be anytime soon, so you're definitely better off with sticking with what you have now. (The most immediate feature that I have planned soon-ish are pure kanji decks; the necessary customizability for the rest will come much later.)

cehrlich · on Feb 23, 2022

jpdb is great, thanks for making it :)

I'm incredibly thankful that Anki exists, I don't think I would have ever learned Japanese to a high level without it. But having spent some time looking at its guts myself, it sure is a mess in there. I thought at one point about building something on top of Anki, but decided against it after discovering some of the same stuff that has already been mentioned.

zozbot234 · on Feb 23, 2022

The main Anki codebase is getting rewritten in Rust (from Python) so they'll probably clean up a whole lot of technical debt in the process and make future contributions easier.

istjohn · on Feb 23, 2022

As a mostly self-taught dev, it's really helpful to read code reviews like this. I also enjoyed the author's code review reference[1]. That document cites Eric Breck's code review reference as inspiration. Is that publicly available? I couldn't find it on his website or via Google.

Does anyone know where one can find high quality public code reviews? I imagine there must be open source projects on GitHub with good public feedback on pull requests. Any ideas of specific projects to look at?

Finally, I haven't found much information on database migration best practices. Any good articles, books, or other resources people would recommend?

1. https://www.natemeyvis.com/code-review-reference/

solarmist · on Feb 23, 2022

Unfortunately, that was the biggest problem with Anki’s DB. It was done by feel by a self-taught dev.

And now he has spent years trying to slowly undo the stuff he baked into it at the lowest levels.

It's fine to do stuff independently without knowing, but please seek feedback and advice on crucial design decisions.

pessimizer · on Feb 23, 2022

> but please seek feedback and advice on crucial design decisions.

Or at least isolate them from the rest of the code. Of course, it's difficult to recognize which design decisions are crucial without already being an expert, so this sort of advice is probably silly. It's probably also the kind of advice that would have been more likely to strangle Anki in its crib rather than giving us the opportunity to discuss how a wildly successful program that has helped many people learn many things should improve its data model.

solarmist · on Feb 23, 2022

Yeah, it's a hard thing to judge and balance.

I'm not saying they need an expert or they shouldn't push forward anyway, but it's good to recognize when you're out of your depth and to get input when possible. There's a lot of value from just running something by someone else to see if they can understand it and if something occurs to them that would make things simpler.

Theaetetus · on Feb 23, 2022

[Author here.]

Thanks for the note! Unfortunately Eric's reference is not publicly available (as far as I know).

I'd love for more people to collect and publicize sets of commonly used code review notes.

Others here will know much more than I do about which publicly viewable projects have the best (public) feedback. Good luck!

xdfgh1112 · on Feb 23, 2022

As you say the schema is clearly not intended for public consumption and closer to a dump of their internal data structures. And people are really protective of their Anki data. My Japanese deck has 11k handmade cards and represents years of progress. I'd really prefer that the devs didn't do dangerous migrations on it!

Theaetetus · on Feb 23, 2022

[Author here.]

Yes, I agree. I hope I was sufficiently clear about that. My motivations for writing this up were that (1) it's a wonderful example of a lot of things I often find myself saying about databases; (2) people's Anki decks are really important to them and I'd like to enable people to work with them if they want; (3) I do hit strange behavior sometimes in Anki and perhaps these issues are behind it.

I hope you're backing up your deck!

aasasd · on Feb 23, 2022

Surprise: for me, Anki seems to have lost some data when changing to the 2.1 scheduler—though thankfully I just laughed it off and restored the backup I made immediately before the migration. Will need to repeat the process and figure out what exactly changes, one day.

Of course, usually changing the schema shouldn't lose data, so there's no need to be afraid of it in general.

LAC-Tech · on Feb 23, 2022

My impression is most databases are designed fairly haphazardly, and while tables may have fields added, the entity relationship diagram stays pretty static. My rationale is as follows:

1 - A database schema requires quite a bit of thought and up-front design, which we're loath to do in this agile world.

2 - Schema migrations are hard and scary, so we avoid doing them

Theaetetus · on Feb 23, 2022

[Author here!]

I'd guess that (2) is more important than (1) here. Or, at least, it seems anecdotally right to me that programmers are very hesitant to do migrations.

As for (1), you must be right that a disconnect between a design and eventual use is important here, but I'd have said that the more central cause is just that systems change and that their eventual use is very hard to predict. I need to think more about your conjecture that there's a misfit between programming practice and what's necessary for database work; I like the idea that we're habituated to implement something very basic up front, and that this habit intersects particularly badly with database work (that is, it causes problems that aren't nearly as bad with non-database parts of computer systems).

edylemond · on Feb 23, 2022

Wish this article had been written a year ago when I added Anki import to my learning app (https://traverse.link/)!

Especially importing media files and de-renaming them was a pain, as well as handling the different types of cloze deletes (some of this is described quite well in anki's docs, for example here https://docs.ankiweb.net/#/templates/generation)

Another link on their DB structure which saved me a lot of time was this one: https://github.com/ankidroid/Anki-Android/wiki/Database-Stru...

emursebrian · on Feb 23, 2022

This post comes at a perfect time because I am working on an app that utilizes spaced repetition and flash cards and happen to be tweaking the spaced repetition algorithm today. I've used Anki quite a bit, and it mostly works but there are limitations and things that don't work well. The post does a good job at highlighting some of the underlying issues.

The app I am working on is focused on language study. A big drawback of Anki when it comes to language study is that the card is tightly coupled to the content it's trying to teach you.

In our implementation, we decided to decouple the presentation of what you're studying and how you're studying. We have the concept of a term which can be a word, definition, sentence or character. We store the progress information along with the term. The term can then be studied with any of the available study methods. This gives quite a bit of flexibility in how the information is studied and reviewed and goes a lot further than just flash cards and multiple choice questions.

Another benefit is that this progress information can then be used to make recommendations to the user on what to study next.
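To illustrate the decoupling described above, here is a rough Python sketch. It is not Emurse's actual code; every name in it (`Term`, `Progress`, `Learner`, the two study methods, the "lowest accuracy" recommendation rule) is a hypothetical stand-in. The point it tries to show is that progress hangs off the term, while a study method only decides how the term is presented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Term:
    text: str
    meaning: str
    kind: str = "word"  # "word", "definition", "sentence" or "character"


@dataclass
class Progress:
    reviews: int = 0
    correct: int = 0

    def record(self, was_correct: bool) -> None:
        self.reviews += 1
        self.correct += int(was_correct)


# A study method only turns a term into a prompt; it owns no progress data.
StudyMethod = Callable[[Term], str]


def flash_card(term: Term) -> str:
    return f"What does '{term.text}' mean?"


def reverse_card(term: Term) -> str:
    return f"Which {term.kind} means '{term.meaning}'?"


@dataclass
class Learner:
    # Progress is keyed by the term, not by a card or a study method.
    progress: Dict[str, Progress] = field(default_factory=dict)

    def study(self, term: Term, method: StudyMethod, was_correct: bool) -> str:
        prompt = method(term)
        self.progress.setdefault(term.text, Progress()).record(was_correct)
        return prompt

    def weakest(self, terms: List[Term]) -> Term:
        """Recommend the term with the lowest accuracy so far (unseen terms count as 0)."""
        def accuracy(term: Term) -> float:
            p = self.progress.get(term.text, Progress())
            return p.correct / p.reviews if p.reviews else 0.0
        return min(terms, key=accuracy)
```

Because both `flash_card` and `reverse_card` feed the same `Progress` record, adding a new study method needs no change to how progress is stored.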

saladuh · on Feb 23, 2022

Refold?

emursebrian · on Feb 23, 2022

Our app is called Emurse. Our focus is on teaching languages with the most efficient path to fluency. With language study, once you get to an advanced beginner or intermediate level of study there's just not a lot of material out there to help you continue your learning journey. We aim to help fix that. A lot of the core functionality is built, but quite a bit of work remains on the content development side.

We're working on a Thai language course right now and hope to be rolling parts of it out in the near future. Other languages are coming later. The link is in my bio, but there's nothing to play around with yet.

dymk · on Feb 23, 2022

Doubtful, refold appears to just be articles on how to learn, not a learning platform itself.

Author's bio links to https://emurse.io/

saladuh · on Feb 26, 2022

No, refold have a basic SRS web app now. It's terrible though, awful. Refold is basically trying to commercialize the community MattVsJapan created with awful things like proprietary alternatives to Anki.

knubie · on Feb 23, 2022

I had the pleasure of implementing an Anki import for my SRS app [0] recently, and when opening up an Anki database I found myself scratching my head a lot. I have much respect and gratitude for the people that decoded and documented this stuff!

[0] https://mochi.cards/

dvko · on Feb 23, 2022

Hey! I _think_ I tried Mochi before discovering Anki has a web version too. But Mochi looks really good! Do you (plan to) support MathJax in cards?

EDIT: Never mind, was able to get behind a computer and yes, I did use Mochi before. And it does seem you support MathJax. Awesome!

jamil7 · on Feb 23, 2022

I really like your landing page! Inline preview is very cool.

zachwill · on Feb 26, 2022

Mochi is great!

aasasd · on Feb 23, 2022

Anki data basically needs wrapper libraries with a humane API, even when working with low-level structures (i.e. without the app itself running). Just to avoid traumatizing external developers.

If everyone is using such libraries, the structure itself can be changed.

kashunstva · on Feb 23, 2022

100%

The AnkiConnect project[1] is about the closest that we get to that right now, but requires running everything through a server inside Anki.

[1]: https://github.com/FooSoft/anki-connect

aasasd · on Feb 23, 2022

Not needing Anki itself is pretty much a requirement for me, because that's the way which makes sense in my worldview. Tools work independently and feed into each other—not get slapped onto each other, which kinda smells like a whiff of bad OOP.

I began a feeble attempt at a lib for my own purposes, but I had to take Anki db structs a little at a time just to keep my frustration in check.

There are in fact some libs for writing Anki data: https://github.com/kerrickstaley/genanki and maybe https://github.com/patarapolw/AnkiTools — but genanki, while looking quite good for one-time generation, doesn't seem to be able to update notes.

Afaik Anki itself does include Python libs for creating and manipulating db records, which can be used in third-party scripts—however dunno if they work without the full app running, and on top of that I personally keep trying to use Lua, since it runs circles around Python in terms of speed.

_dain_ · on Feb 23, 2022

There is also apy[1] which needs an Anki installation but at least doesn't actually run the Anki process.

[1]https://github.com/lervag/apy

kashunstva · on Feb 23, 2022

For simpler tasks on the Anki database, the Python `anki` module[0] provides a level of abstraction that can be helpful without resorting directly to SQLite queries against the db, but just a little. The main problem with the Python module is the insistence on 1:1 correspondence between the module version and the Anki (db) version.

[0]: https://pypi.org/project/anki/

d33 · on Feb 23, 2022

Speaking of Anki... has anybody played with AnkiWeb here? I'd love to see a Telegram bot that I would notify of random English words I sometimes look up and everyone on the channel would immediately have it added to their accounts. My main use case is group language learning - I can imagine it being useful when learning Chinese with my girlfriend.

rsanek · on Feb 23, 2022

For those that are on the creation side, I've found genanki [1] to be a joy to work with. I have a service that auto-generates a bunch of cards from code and then spits out an `apkg`. I've made what I consider the funnest part of that open for registration [2]

[1] https://github.com/kerrickstaley/genanki

[2] https://www.reddit.com/r/Anki/comments/g0zgyc/spotify_anki_l...

solarmist · on Feb 23, 2022

It's a bit limited in what it does for you.

But yeah, that's what I've used as well.

gilgamesh327 · on Feb 23, 2022

I'm learning data analysis/science and Anki's database has been something I'm playing with. It was an insightful read since I was thinking in a similar way about the `id` column and how things can get messy with a time-related value.

I've seen the author reply here, if you don't mind me asking, are there other ways to represent this database? Is there a useful exercise I can perform to do so? e.g. NoSQL version or similar, thank you.
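For anyone else poking at the `id` column, here is a small Python sketch of why a time-based id can get messy. It assumes ids are creation times in milliseconds since the Unix epoch, which is how community documentation of the schema describes them. `next_free_id` is a hypothetical helper, not Anki's code; it just shows that once two rows are created in the same millisecond, an id can no longer be both unique and an exact timestamp.

```python
from datetime import datetime, timezone


def id_to_datetime(anki_id: int) -> datetime:
    """Interpret an id as milliseconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(anki_id / 1000, tz=timezone.utc)


def next_free_id(existing: set, now_ms: int) -> int:
    """Pick a unique id near `now_ms`, bumping by 1 ms on each collision."""
    candidate = now_ms
    while candidate in existing:
        candidate += 1
    return candidate
```

After a bulk import, ids produced this way drift away from the real creation time, so they are only approximately "when the row was created".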

Theaetetus · on Feb 23, 2022

[Author here.]

There are definitely other ways to represent this data! I think a useful exercise is:

(1) Sit down and think of a very basic representation of this data (in whatever language you prefer). (2) Figure out where in the current SQLite database the relevant information lives. (3) Write a function that is given some relevant set of rows from this database and returns an item in the representation you determined in (1). (4) Test it. (5) Think about how to persist it.

You can do some subset or superset of these as you see fit, but I do think it's valuable to think about how to represent the relevant objects before you worry about persistence details.

Thanks for your comment!
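Steps (1) through (4) might look something like this in Python. This is only a sketch under assumptions: the `Note` class is a made-up representation, the row is assumed to be an `(id, flds, tags)` tuple, fields inside `flds` are assumed to be joined by the U+001F unit separator, and tags are assumed to be space-separated. Check those against your own collection before relying on them.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Assumption: individual fields inside `flds` are joined by U+001F.
FIELD_SEPARATOR = "\x1f"


@dataclass(frozen=True)
class Note:
    """Step (1): a very basic representation, independent of storage."""
    note_id: int
    fields: List[str]
    tags: List[str]


def note_from_row(row: Sequence) -> Note:
    """Step (3): turn one (id, flds, tags) database row into a Note."""
    note_id, flds, tags = row
    return Note(
        note_id=int(note_id),
        fields=flds.split(FIELD_SEPARATOR),
        tags=tags.split(),  # split() also drops leading/trailing spaces
    )
```

Step (4) is then an ordinary unit test that feeds hand-written tuples to `note_from_row`, with no database involved at all.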

nojs · on Feb 23, 2022

I got pretty deep into this when automating card creation for Chinese characters. I would highly recommend using the AnkiConnect plugin [1] rather than manipulating tables! This worked a charm (at least for creating new decks).

1. https://ankiweb.net/shared/info/2055492159
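For a flavour of what that looks like, here is a minimal Python sketch of adding a note through AnkiConnect's JSON-over-HTTP interface. It assumes Anki is running with the add-on installed and listening on its default local port 8765, and that a deck named "Hanzi" and a note type "Basic" with "Front" and "Back" fields exist; those names, and the helper function names, are placeholders.

```python
import json
import urllib.request

# Default AnkiConnect address (assumption: not reconfigured by the user).
ANKI_CONNECT_URL = "http://127.0.0.1:8765"


def build_add_note(deck, model, fields, tags=()):
    """Build the JSON payload for AnkiConnect's addNote action."""
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,
                "modelName": model,
                "fields": dict(fields),
                "tags": list(tags),
            }
        },
    }


def invoke(payload, url=ANKI_CONNECT_URL):
    """POST a payload to AnkiConnect and return its result (needs Anki running)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    if reply.get("error") is not None:
        raise RuntimeError(reply["error"])
    return reply["result"]
```

With Anki open, something like `invoke(build_add_note("Hanzi", "Basic", {"Front": "好", "Back": "good"}))` should create one note without touching any tables directly.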

wodenokoto · on Feb 23, 2022

I just use csv when importing generated decks.

What’s the benefit of getting knee-deep (other than out of pure interest)?
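For reference, the csv route can be as small as the following Python sketch. It assumes two-column cards (front, back) and a tab as the delimiter; the function name is made up, and the column-to-field mapping is still chosen in Anki's import dialog.

```python
import csv
import io


def cards_to_tsv(cards):
    """Serialise (front, back) pairs as tab-separated text for a text import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for front, back in cards:
        # csv.writer quotes fields that contain tabs, quotes or newlines.
        writer.writerow([front, back])
    return buffer.getvalue()
```

Writing the returned string to a `.txt` or `.csv` file in UTF-8 gives something a text import can read.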

rkachowski · on Feb 23, 2022

A few years ago I wrote a ruby gem to generate anki decks and spent some time reversing the schema from the android app and open source code. There were some very weird decisions iirc, something around pipe characters to mark individual card fields.

https://github.com/rkachowski/anki-rb

IshKebab · on Feb 23, 2022

> using flds instead of fields appears to be abbreviation for abbreviation's sake

God yes. Why do people do this? I recently moved into hardware and it's even worse! Someone abbreviated a TLA to a single letter. Like instead of CPU_CLOCK it was CCK. Madness. These aren't printed on a silkscreen!

shak360 · on Feb 23, 2022

It'd be cool if there was an ability to learn Anki cards based on their graph relations to other Anki cards.

> compute the Perron-Frobenius eigenvector of a graph of medical school Anki cards based on an automated tagging system that tokenizes the cards

> change medical education forever
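Half-seriously, the eigenvector part is the easy bit. Below is a plain-Python power iteration sketch for the dominant (Perron-Frobenius) eigenvector of a small card graph. It assumes the graph is given as a non-negative adjacency matrix with at least one edge that is connected and not bipartite, so that the iteration converges; the function name and the toy graph in the usage note are invented for illustration, and the tagging/tokenizing step is not covered.

```python
def perron_vector(adjacency, iterations=200):
    """Approximate the dominant eigenvector of a non-negative matrix by power iteration.

    The result is normalised so its entries sum to 1.
    """
    n = len(adjacency)
    v = [1.0 / n] * n
    for _ in range(iterations):
        # One matrix-vector product: w = A v
        w = [sum(adjacency[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        v = [x / total for x in w]
    return v
```

For a toy graph where cards 0, 1 and 2 all link to each other and card 3 links only to card 0, `perron_vector([[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]])` gives card 0 the largest score, which matches the intuition that it is the most central card.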

throwaway5486nv · on Feb 23, 2022

How complex is Anki's algorithm? Is the time interval of every card independent of the others?