I run a website (https://jpdb.io/) which has an Anki importer so I deal with a lot of Anki databases that people send to me and which fail to import, and yeah, Anki's database schema is kind of a mess to be honest. (Which is to be expected for a program of Anki's age and with such a long development history.)
A few extra tidbits:
- A few versions back the cards' "ease" field (that is - how a card was graded) meant something different depending on which phase the card was at the time (so sometimes "2" meant "hard" and sometimes "2" meant "okay"). It was finally fixed and AFAIK in new versions it's consistent now, but apparently the migration didn't always work properly and I still sometimes see databases where the grading is the other way around compared to what it's supposed to be, and I need to heuristically detect that this is the case and handle it.
- Initially JSON blobs were used to store a lot of data; relatively recently that was changed so that it's stored as proper tables, but not completely, so a lot of data's still in the blobs, but this time instead of JSON it's protobuf. (Which seems strange to me considering SQLite has native support for JSON.)
It's a good thing the schema's slowly being cleaned up, but unfortunately it's only done incrementally, so every time any little thing changes I need to add yet another special case to my importer to handle it, and often in various permutations too because some databases are half migrated Frankensteins. (Don't ask me how that happens; I don't know. Maybe it's an issue of people using outdated plugins with their Anki installation, or copying their database between multiple independent Anki implementations, or maybe the current phase of the moon's just wrong.)
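For illustration, a check for the inconsistent "ease" grading described above could start from a simple tally of the review log. This is a minimal sketch, not the importer's actual logic. It assumes the `revlog` table with a `type` column (0 = learning, 1 = review) and an `ease` column; the function names and the rule itself are hypothetical.

```python
import sqlite3
from collections import Counter


def ease_histogram(conn):
    """Count how often each ease value appears, per revlog entry type."""
    hist = {}
    rows = conn.execute(
        "SELECT type, ease, COUNT(*) FROM revlog GROUP BY type, ease"
    )
    for rev_type, ease, count in rows:
        hist.setdefault(rev_type, Counter())[ease] = count
    return hist


def looks_like_three_button_learning(hist):
    """Hypothetical rule: learning entries (type 0) exist but never use ease 4.

    That pattern is consistent with an older layout where learning cards
    had only three answer buttons, so "2" would not mean "hard" there.
    """
    learning = hist.get(0, Counter())
    return sum(learning.values()) > 0 and learning[4] == 0
```

A user who simply never presses the fourth button produces the same pattern, so a real importer would need more signals than this one.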
I love JPDB, it's cool to randomly see you in the wild!
I can't personally use JPDB (due to my own niche learning strategy, not a flaw in JPDB), but I desperately want to be able to consume the underlying data. It's just that good -- the data that you've curated is unbeatable. If you ever provide a public API, I'll join your Patreon in a heartbeat.
I've thought about potentially tackling other languages in the future, and I think that would be fun to work on too, but alas, at this point I don't really have the resources to even be able to work on it as-is (since this is currently purely a spare time project, and my TODO list is already hundreds of items long), so I'd be just spreading myself way too thin.
> due to my own niche learning strategy, not a flaw in JPDB
Just for curiosity's sake - what kind of strategy is it, if I may ask? I have very ambitious plans for the future, so depending on what exactly it is, it might be possible someday.
I know I gave jpdb a try a while ago and found that while the dataset is incredible (and like others I'd pay just for it), the built-in tool doesn't work the way I need.
I have always had the most success with Anki and Wanikani when it comes to Japanese. Trying to add in yet another paradigm for learning is frustrating. I appreciate you've put a lot of effort into helping people move from those tools, but I don't want to.
The single biggest reason is offline access. Anki works on my phone on an airplane or in an area with no mobile service. (In Australia there are lots of those).
I only started using WK seriously when I discovered the Android apps that let me do my reviews offline.
If you had a Patreon tier that allowed for Anki exports of your lists I'd sign up in a heartbeat, even if it only allowed 1 download per month or something similar. I mean, let's be honest, what possible valid need could I have for downloading the whole data set in one go...
Compared to the time it would take me to use Subs2srs across a season of a show I'd rather just give my money to you.
Well, there's nothing wrong with using Anki if it works for you! I know that a lot of folks need offline access and/or other features which Anki provides which I don't have, and that's totally fine.
I always ask this not because I necessarily want to convert people to use my thing, but because I always love to hear what features people need and what I can improve. In case of offline access it is something that's technically on my tentative roadmap, but very far off into the future, so indeed for anyone who needs that Anki's the better choice.
> allowed for Anki exports
That's something that I'm planning to add very soon actually! Well, maybe not exactly Anki exports (I haven't yet researched as to what that would entail), but just generic functionality to be able to export the built-in decks as a .csv (which I'll be happy to tweak/improve to make it easier to import).
I'm following an eclectic strategy where I isolate and separately learn spoken and written Japanese. The process looks a little like this:
1. I start with a deck of Anki vocabulary notes that I want to acquire
2. Study begins with "Speech" Anki cards from these notes (The card front is audio-only, including a clip of the word and a clip of an example sentence. The back has the English definition & a helper image). I only consider a card as being "Good" once I am able to recall & replicate the pitch accent with a steady rhythm (I pipe back delayed audio from my microphone while I practice with a metronome running)
3. In parallel, I also do Kanji isolation study using KKLC
4. Each week, I manually enable new "Writing" Anki cards that come from the same set of notes (The writing is on the front. Only the word audio is on the back). I only enable a "Writing" Anki card if I have previously learned BOTH the component Kanji and the spoken word
5. I study my enabled "Writing" Anki cards in parallel with the other two tracks
I like this approach because I effectively have three separate learning tracks that I can switch between -- the variety keeps me motivated. It also helps train your ear to be able to distinguish homophones by pitch and leads you to think of 同訓異字 writings as variations of a spoken word, rather than as true homophones.
Wow, that's indeed a very niche learning strategy. (: I like it though!
I'd definitely like to expand the configurability of jpdb up to a point where you'll actually be able to do something like this in the future. Unfortunately that's not going to be anytime soon, so you're definitely better off with sticking with what you have now. (The most immediate feature that I have planned soon-ish are pure kanji decks; the necessary customizability for the rest will come much later.)
I'm incredibly thankful that Anki exists, I don't think I would have ever learned Japanese to a high level without it. But having spent some time looking at its guts myself, it sure is a mess in there. I thought at one point about building something on top of Anki, but decided against it discovering some of the same stuff that has already been mentioned.
The main Anki codebase is getting rewritten in Rust (from Python) so they'll probably clean up a whole lot of technical debt in the process and make future contributions easier.
As a mostly self-taught dev, it's really helpful to read code reviews like this. I also enjoyed the author's code review reference[1]. That document cites Erick Breck's code review reference as inspiration. Is that publicly available? I couldn't find it on his website or via Google.
Does anyone know where one can find high quality public code reviews? I imagine there must be open source projects on GitHub with good public feedback on pull requests. Any ideas of specific projects to look at?
Finally, I haven't found much information on database migration best practices. Any good articles, books, or other resources people would recommend?
> but please seek feedback and advice on crucial design decisions.
Or at least isolate them from the rest of the code. Of course, it's difficult to recognize which design decisions are crucial without already being an expert, so this sort of advice is probably silly. It's probably also the kind of advice that would have been more likely to strangle Anki in its crib rather than giving us the opportunity to discuss how a wildly successful program that has helped many people learn many things should improve its data model.
I'm not saying they need an expert or they shouldn't push forward anyway, but it's good to recognize when you're out of your depth and to get input when possible. There's a lot of value from just running something by someone else to see if they can understand it and if something occurs to them that would make things simpler.
As you say the schema is clearly not intended for public consumption and closer to a dump of their internal data structures. And people are really protective of their Anki data. My Japanese deck has 11k handmade cards and represents years of progress. I'd really prefer that the devs didn't do dangerous migrations on it!
Yes, I agree. I hope I was sufficiently clear about that. My motivations for writing this up were that (1) it's a wonderful example of a lot of things I often find myself saying about databases; (2) peoples' Anki decks are really important to them and I'd like to enable people to work with them if they want; (3) I do hit strange behavior sometimes in Anki and perhaps these issues are behind it.
Surprise: for me, Anki seems to have lost some data when changing to the 2.1 scheduler—though thankfully I just laughed it off and restored the backup I made immediately before the migration. Will need to repeat the process and figure out what exactly changes, one day.
Of course, usually changing the schema shouldn't lose data, so there's no need to be afraid of it in general.
My impression is most databases are designed fairly haphazardly, and while tables may have fields added, the entity relationship diagram stays pretty static. My rationale is as follows:
1 - A database schema requires quite a bit of thought and up-front design, which we're loath to do in this agile world.
2 - Schema migrations are hard and scary, so we avoid doing them
I'd guess that (2) is more important than (1) here. Or, at least, it seems anecdotally right to me that programmers are very hesitant to do migrations.
As for (1), you must be right that a disconnect between a design and eventual use is important here, but I'd have said that the more central cause is just that systems change and that their eventual use is very hard to predict. I need to think more about your conjecture that there's a misfit between programming practice and what's necessary for database work; I like the idea that we're habituated to implement something very basic up front, and that this habit intersects particularly badly with database work (that is, it causes problems that aren't nearly as bad with non-database parts of computer systems).
Wish this article had been written a year ago when I added Anki import to my learning app (https://traverse.link/)!
Especially importing media files and de-renaming them was a pain, as well as handling the different types of cloze deletions (some of this is described quite well in Anki's docs, for example here https://docs.ankiweb.net/#/templates/generation)
This post comes at a perfect time because I am working on an app that utilizes spaced repetition and flash cards and happen to be tweaking the spaced repetition algorithm today. I've used Anki quite a bit, and it mostly works, but there are limitations and things that don't work well. The post does a good job of highlighting some of the underlying issues.
The app I am working on is focused on language study. A big drawback of Anki when it comes to language study is that the card is tightly coupled to the content it's trying to teach you.
In our implementation, we decided to decouple the presentation of what you're studying from how you're studying it. We have the concept of a term, which can be a word, definition, sentence or character. We store the progress information along with the term. The term can then be studied with any of the available study methods. This gives quite a bit of flexibility in how the information is studied and reviewed and goes a lot further than just flash cards and multiple choice questions.
Another benefit is that this progress information can then be used to make recommendations to the user on what to study next.
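To make the decoupling concrete, here is a minimal sketch of the idea as described, not the app's real code. Every name in it (`Term`, `STUDY_METHODS`, `next_to_study`) is hypothetical, and the "lowest accuracy first" recommendation rule is only an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Term:
    """What is being studied, plus its progress, independent of any card."""
    kind: str      # "word", "definition", "sentence" or "character"
    content: str
    answer: str
    reviews: int = 0
    correct: int = 0

    def record(self, was_correct: bool) -> None:
        """Store the outcome of one study attempt on the term itself."""
        self.reviews += 1
        self.correct += int(was_correct)


# A study method is just a way of presenting a term; it owns no progress.
StudyMethod = Callable[[Term], str]


def flash_card(term: Term) -> str:
    return f"What does '{term.content}' mean?"


def typing_drill(term: Term) -> str:
    return f"Type the word that means '{term.answer}'."


STUDY_METHODS: Dict[str, StudyMethod] = {
    "flash_card": flash_card,
    "typing_drill": typing_drill,
}


def next_to_study(terms: List[Term]) -> Term:
    """Recommend the term with the lowest accuracy; unseen terms come first."""
    return min(terms, key=lambda t: t.correct / t.reviews if t.reviews else 0.0)
```

Because progress lives on the term, any study method can update it, and the recommendation step never needs to know which method was used.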
Our app is called Emurse. Our focus is on teaching languages with the most efficient path to fluency. With language study, once you get to an advanced beginner or intermediate level of study there's just not a lot of material out there to help you continue your learning journey. We aim to help fix that. A lot of the core functionality is built, but quite a bit of work remains on the content development side.
We're working on a Thai language course right now and hope to be rolling parts of it out in the near future. Other languages are coming later. The link is in my bio, but there's nothing to play around with yet.
No, refold have a basic SRS web app now. It's terrible though, awful. Refold is basically trying to commercialize the community MattVsJapan created with awful things like proprietary alternatives to Anki.
I had the pleasure of implementing an Anki import for my SRS app [0] recently, and when opening up an Anki database I found myself scratching my head a lot. I have much respect and gratitude for the people that decoded and documented this stuff!
Anki data basically needs wrapper libraries with a humane API, even when working with low-level structures (i.e. without the app itself running). Just to avoid traumatizing external developers.
If everyone is using such libraries, the structure itself can be changed.
Not needing Anki itself is pretty much a requirement for me, because that's the way which makes sense in my worldview. Tools work independently and feed into each other—not get slapped onto each other, which kinda smells like a whiff of bad OOP.
I began a feeble attempt at a lib for my own purposes, but I had to take Anki db structs a little at a time just to keep my frustration in check.
Afaik Anki itself does include Python libs for creating and manipulating db records, which can be used in third-party scripts—however dunno if they work without the full app running, and on top of that I personally keep trying to use Lua, since it runs circles around Python in terms of speed.
For simpler tasks on the Anki database, the Python `anki` module[0] provides a level of abstraction that can be helpful without resorting directly to SQLite queries against the db, but just a little. The main problem with the Python module is the insistence on 1:1 correspondence between the module version and the Anki (db) version.
Speaking of Anki... has anybody played with AnkiWeb here? I'd love to see a Telegram bot that I would notify of random English words I sometimes look up and everyone on the channel would immediately have it added to their accounts. My main use case is group language learning - I can imagine it being useful when learning Chinese with my girlfriend.
For those that are on the creation side, I've found genanki [1] to be a joy to work with. I have a service that auto-generates a bunch of cards from code and then spits out an `apkg`. I've made what I consider the funnest part of that open for registration [2]
I'm learning data analysis/science, and Anki's database has been something I'm playing with. It was an insightful read, since I was thinking in a similar way about the `id` column and how things can get messy with a time-related value.
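As a small illustration of the time-related `id` value: the sketch below assumes the `id` is a creation timestamp in milliseconds since the Unix epoch, which is how the schema is commonly documented. The helper name `id_to_datetime` is hypothetical.

```python
from datetime import datetime, timezone


def id_to_datetime(row_id: int) -> datetime:
    """Interpret an id as epoch milliseconds and return a UTC datetime."""
    return datetime.fromtimestamp(row_id / 1000, tz=timezone.utc)


print(id_to_datetime(1_600_000_000_000).isoformat())
# 2020-09-13T12:26:40+00:00
```

The messiness comes from the value doing double duty: it has to be unique as a key, yet it also claims to say when the row was created, and those two requirements can conflict when rows are created in the same millisecond.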
I've seen the author reply here, so if you don't mind me asking: are there other ways to represent this database? Is there a useful exercise I can perform to do so, e.g. a NoSQL version or similar? Thank you.
There are definitely other ways to represent this data! I think a useful exercise is:
(1) Sit down and think of a very basic representation of this data (in whatever language you prefer).
(2) Figure out where in the current SQLite database the relevant information lives.
(3) Write a function that is given some relevant set of rows from this database and returns an item in the representation you determined in (1).
(4) Test it.
(5) Think about how to persist it.
You can do some subset or superset of these as you see fit, but I do think it's valuable to think about how to represent the relevant objects before you worry about persistence details.
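Steps (1) and (3) could look something like the following for notes. This is a minimal sketch under stated assumptions: that a row is read as `(id, flds, tags)` from the `notes` table, that `flds` joins the field values with the U+001F unit separator, and that `tags` is a space-separated string. The `Note` class and `note_from_row` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

FIELD_SEPARATOR = "\x1f"  # assumed separator between field values in `flds`


@dataclass(frozen=True)
class Note:
    """Step (1): a very basic in-memory representation of a note."""
    note_id: int
    fields: Tuple[str, ...]
    tags: Tuple[str, ...]


def note_from_row(row) -> Note:
    """Step (3): turn an (id, flds, tags) database row into a Note."""
    note_id, flds, tags = row
    return Note(
        note_id=note_id,
        fields=tuple(flds.split(FIELD_SEPARATOR)),
        tags=tuple(tags.split()),  # split() also drops padding spaces
    )
```

For step (4), a test only needs a hand-written tuple rather than a real database, which is one reason to settle the representation before worrying about persistence.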
I got pretty deep into this when automating card creation for Chinese characters. I would highly recommend using the AnkiConnect plugin [1] rather than manipulating tables! This worked a charm (at least for creating new decks).
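For anyone who hasn't used it, AnkiConnect exposes a local HTTP endpoint that accepts JSON requests while Anki is running. The sketch below builds an `addNote` request and sends it. It assumes the default address `http://127.0.0.1:8765` and API version 6; the deck, model and field names in the usage example are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request

ANKI_CONNECT_URL = "http://127.0.0.1:8765"  # assumed default AnkiConnect address


def build_add_note_request(deck, model, fields, tags=()):
    """Build the JSON payload for AnkiConnect's addNote action."""
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,
                "modelName": model,
                "fields": fields,
                "tags": list(tags),
            }
        },
    }


def invoke(payload, url=ANKI_CONNECT_URL):
    """POST a payload to AnkiConnect and return its result, or raise on error."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    if reply.get("error") is not None:
        raise RuntimeError(reply["error"])
    return reply["result"]
```

With Anki open and the plugin installed, `invoke(build_add_note_request("Chinese", "Basic", {"Front": "好", "Back": "good"}))` would add one note, provided a deck and note type with those names exist.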
A few years ago I wrote a Ruby gem to generate Anki decks and spent some time reversing the schema from the Android app and open source code. There were some very weird decisions iirc, something around pipe characters to mark individual card fields.
> using flds instead of fields appears to be abbreviation for abbreviation's sake
God yes. Why do people do this? I recently moved into hardware and it's even worse! Someone abbreviated a TLA to a single letter. Like instead of CPU_CLOCK it was CCK. Madness. These aren't printed on a silkscreen!