I remember doing this about ten years ago, taking up almost the entirety of my phone’s storage in the process, and feeling unreasonably delighted by it. I’m not sure why, but this idea really appeals to me. It’s all very good having access to information on some server somewhere, but literally having (a very useful amount of) information on almost every topic humans have so far contemplated available literally in your pocket is somehow incredibly thrilling. For the same reason I’m fascinated by the idea of locally-hosted LLMs. It’s hard to explain. Maybe it’s like some sort of digital version of hoarding behaviour.
56K, such luxury. I know what digital scarcity is, and a 1Gbps fiber line to my house isn’t it. Sure, I have my secret stash over in the corner of that NAS in the garage, just in case. But I otherwise can now download what I want in seconds, minutes at worst. Ergo, I don’t hoard nearly like I used to (never was a true hoarder in the first place, so I can probably be ignored).
Not to say that your point isn’t well taken. One does tend to keep stuff after a long overnight download, and old habits die hard. And I do have a local copy of Wikipedia…just in case.
Local LLM with the entirety of Wikipedia accessible via RAG (or better, newer technique) is legitimately a little super-powered assistant. It works when the grid goes down, and the searchability is orders of magnitude better - especially when you're not sure what you're looking for.
I sometimes daydream about what would happen if I were suddenly transported back in time -- but with everything on my person intact (iPhone included).
Step one in this daydream is always "Crap, no internet for 100 years -- if only I had all of Wikipedia cached".
Check! :)
Step 2 of course is figure out how to fashion some sort of charging system so my magical godlike pocket super abacus continues to function after ~days, but I'm making the bold assumption that somewhere in the annals of Wikipedia I can figure that out.
It'd be nice if there was a curated downloadable subset of Wikipedia pages for that kind of purpose. (Time-traveler in past, stranded/post-apocalyptic survival, history pages but only the text parts.) Even if only mundane events occur, it also represents a certain cross-section of user-interest.
That said, one more daydream item to add to the pile: "What if I was somehow lost in space, what useful information could I provide for maybe locating my home-planet again?"
____
Without Wikipedia, the best I can think of are notable conditions that could help them filter existing records, often using ratios to avoid having to define measurement units, ex:
1. It's the 3rd planet from the Sun, the majority of it is covered with water, etc.
2. It has an unusually large single moon of about 1/81 its mass.
3. The 5th and largest planet in the system has a little over 70% of all the (known) planetary mass.
4. The 5th planet is 5.2x further from the Sun than my 3rd planet.
5. The 6th planet still has notable rings which haven't decayed yet.
With Wikipedia, perhaps I could tell them about pulsar maps or whatever. [0]
Locating the Sun from an unknown position in space, especially if you're many lightyears away, would be a complex task. Here are some strategies you could use:
1. *Spectral Analysis*: Every star has a unique spectrum, and the Sun's spectral signature is well-documented. If you have the capability to perform spectral analysis on nearby stars, you could compare their spectra to that of the Sun. This, however, assumes you have a database of stellar spectra, including the Sun's, for comparison.
2. *Position of Known Stars and Constellations*: By observing the position of known stars and constellations, you might be able to estimate your location. This becomes more difficult the further you are from Earth, as the relative positions of stars change with distance, and familiar constellations may become unrecognizable.
3. *Parallax Measurements*: Parallax is the apparent shift in position of an object when viewed from different perspectives. Measuring the parallax of nearby stars compared to more distant background stars could help you triangulate your position. Knowing your position relative to the Milky Way galaxy would be essential in locating the Sun.
4. *Galactic Landmarks*: Identifying key features of the Milky Way, such as the galactic core, spiral arms, or specific nebulae, could help orient you within the galaxy. The Sun is located in the Orion Arm, about 26,000 lightyears from the galactic center.
5. *Radio Signals*: If your ship can detect radio signals, it might catch broadcasts from Earth. The Sun would be in the general direction of these signals. However, depending on your distance, these signals might be very weak or non-existent.
6. *Automated Stellar Navigation Systems*: Use any advanced stellar navigation systems on board. These systems could automatically identify stars and calculate your position relative to them.
7. *Consult Star Maps and Databases*: If your ship has a database of star maps, you could use these to match up visible stars with known stars and work out your position relative to the Sun.
8. *Communication with Earth*: If communication with Earth is possible, getting assistance from ground-based observatories or space agencies would be invaluable. They could provide information based on your last known position and trajectory.
There's some sweet spot where technology is relatively easy to improve upon with Wikipedia, but one can also find wires and generators. Some time in the first half of the 19th century is probably good. It gets easier if you're allowed to take a solar charger with you; then you can improve tech in loads of ways (though it's probably best to restrict yourself to post-Enlightenment times if you want to stay intact).
One of the other apps I would want to have is an astronomy app. Also, something to consider for your "on my person" kit would be a solar charger. Otherwise, when the natives come to you after the battery dies, your magic abilities are less godlike.
> Step 2 of course is figure out how to fashion some sort of charging system so my magical godlike pocket super abacus continues to function after ~days
In the good old days of USB 2 I'm sure you can figure out how to get a stable-enough ~5V power supply that would charge the smartphone.
With USB-C you'd need to reinvent IC manufacturing and figure out a needlessly-fragile and overcomplicated protocol before the phone would even accept any power. Yay for progress!
USB-C works without USB-PD. Plain USB-C is simpler than USB-A since it supplies 1.5A or 3A based on resistors. The USB-C to USB-A and microUSB adapters are just resistors to use legacy USB. Legacy USB power is fairly complicated, with different standards for 500mA, 1A, and 2.4A power. The reason it looks simple is that a cheap chip is in every charger and device.
You can probably provide at least 2.4A 5V on USB-A, just like providing 3A 5V would be fine with USB-C.
USB-C is a superset of USB-A/B, so you can use exactly the same simple protocol: A short of the D+ and D- pins, per USB Battery Charging.
For a USB-C sink, you don't even need any resistors beyond that, let alone USB-PD. For a USB-C source to be able to charge a legacy USB-B sink, all you need is a single 56 kOhm resistor.
To my understanding, there are ways of wiring a USB-C cable such that only four contacts are active, and behave as a USB 2 cable? (although you can only do USB 2 charging over it, so no PD)
While not necessarily related to Wikipedia, I focus a lot on the offline usage of my phone: I fly a few times per year and I like to travel. Even in 2023, internet is not always working so great. Two important things for me are:
* Offline Map - The times I need a map and I am offline are strongly correlated.
* Stardew Valley - Great game, pay only once, play offline, no ads.
Wikipedia does sound cool though. Maybe I'll trial one of the reduced size versions.
Completely offline, shows more info about the landscape than Google/Apple Maps, especially for hiking, has almost all businesses, has driving directions.
It heavily depends on your area though. It's often way behind Google Maps, in every populated place I've tried it in the USA. (Businesses are years out of date, roads are missing or incorrect, etc.)
I do contribute to OSM, but day to day I find Google Maps + Alltrails + Trailforks to be more practical (all paid, closed source, but high quality data).
Google Maps also has free offline tiles. Those other two have paid offline maps, but it's worth it to not have to always wonder if your map is accurate.
When your phone can “hear” a cell tower (and thus thinks it’s online) but doesn’t have enough signal to get any data, apps like Google Maps and Gaia will often hang instead of showing you the downloaded map data that’s on your phone.
Put your phone in airplane mode to fix that behavior.
Hmmm. The full English Wikipedia ZIM went from 87GB 1.5 years ago to 103GB now. While 512GB microSD cards are fairly reasonable nowadays, iPhone users will struggle.
With the iPhone 15 you could easily plug in a USB C SSD and download it onto that—would probably require a custom client to read the data from the external drive though.
I tried on my iPad Pro M1 using a Crucial X9 Pro drive, and Kiwix works just fine. The non-pro iPhone 15 with its USB2 speed USB-C port won’t be snappy, however.
Or any other phone lacking a microSD slot. I'd assume it works that way on an iPhone, but I can at least plug a flash drive (or card reader) into my Pixel 7a's USB-C port. Not quite as nice as having always-ready extra storage, but handier for quick swaps.
Current iPhones are available with up to 1 TB of storage. I’d argue that whoever would even think about downloading all of Wikipedia on their phone is a particular type of user, and their choice of storage will be on the larger side.
I do have the Maxi Wikipedia ZIM on my iPad Pro, which has 1TB.
I upgraded my wife's iPhone X to a 15 non-Pro, maxed to 512GB, but she has so many videos I doubt the ZIM will fit. I myself will keep my iPhone 12 until it either breaks or they no longer provide security updates for it. Not everyone changes to the latest iPhone.
Back in the year 2001/2002 (I think) I remember doing this with a Wikipedia plugin for Rockbox (https://www.rockbox.org/), which I had running on my Archos FM Recorder. It was truly awesome to have all this knowledge in the palm of your hands without omnipresent Internet.
A (tiny) AI model can not cite sources, it can only hallucinate citations. A (tiny) AI model can not create a representation of knowledge or truths based on sources it knows. It can not provide any truths except by random chance.
You would have a hallucinating make-up-some-fake-facts machine on your phone.
"A (tiny) AI model can not cite sources, it can only hallucinate citations."
I don't think that's definitely true, if you build the system around it well. It would look something like this:
1. User asks a question. LLM extracts key concepts from that question to use as search terms.
2. LLM triggers a search of Wikipedia, getting back snippets of pages, each with their page identifier.
3. LLM is fed those snippets along with the user's question and instructions to cite pages that it uses content from.
4. LLM generates a response which includes formatted citations. This response may be complete garbage, but...
5. Your code can at least confirm that the citations correspond to the pages that you fed into the LLM.
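Step 5 above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `[[page:ID]]` citation format, the `check_citations` helper and the example page identifiers are all hypothetical, not something any particular model or app uses.

```python
import re

# Assumption: the prompt told the LLM to cite sources as [[page:ID]].
CITATION_RE = re.compile(r"\[\[page:([^\]]+)\]\]")


def check_citations(response: str, retrieved_ids: set[str]) -> tuple[set[str], set[str]]:
    """Split the page ids cited in `response` into (valid, unknown).

    `valid` ids were among the snippets fed to the model; `unknown` ids were
    not, so they are either hallucinated or recalled from training data.
    """
    cited = {match.strip() for match in CITATION_RE.findall(response)}
    return cited & retrieved_ids, cited - retrieved_ids


# Hypothetical example: two pages were retrieved, the answer cites one of
# them plus one page that was never retrieved.
retrieved = {"Solar_cell", "Lead-acid_battery"}
answer = (
    "Charge a lead-acid cell [[page:Lead-acid_battery]] "
    "from a hand-cranked dynamo [[page:Dynamo]]."
)
valid, unknown = check_citations(answer, retrieved)
```

This only confirms that a citation points at a page the model was actually shown. It says nothing about whether the cited page supports the sentence it is attached to.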
I've seen this approach work well with larger models. The open question is if it could work with smaller ones.
The 7B models (Mistral and its variants in particular) are getting VERY effective. I'm confident they could mostly-work for the above sequence... and you can just about run a 7B model on a phone.
The bigger question for me is if you could get this to work with a 3B model, since those are much more mobile-device friendly than 7B.
I suspect that ChatGPT was shy about giving me exact quotes of Latin works for fear of disclosing its training sources and risking copyright infringement lawsuits.
LLMs on mobile phones can be surprisingly capable. I've been having fun trying out this one on my iPhone: https://llm.mlc.ai/#ios
Getting that to work with a local copy of Wikipedia - where it can essentially run a "search_wikipedia(...)" function any time it needs extra information - would definitely be feasible, using this pattern: https://til.simonwillison.net/llms/python-react-pattern#user...
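For illustration, here is roughly what a tool like that could look like in Python. It is a toy sketch: the `pages` dict stands in for a real offline dump, the scoring is plain term counting, and `search_wikipedia` is just the name used above, not an existing API.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_wikipedia(query: str, pages: dict[str, str], limit: int = 3) -> list[str]:
    """Return up to `limit` page titles, ranked by how often query terms occur."""
    query_terms = set(tokenize(query))
    scored = []
    for title, text in pages.items():
        counts = Counter(tokenize(title + " " + text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored[:limit]]


# Hypothetical stand-in for a local dump: title -> plain text.
pages = {
    "Dynamo": "A dynamo is an electrical generator that creates direct current.",
    "Paris": "Paris is the capital and most populous city of France.",
    "Voltaic pile": "The voltaic pile was the first electrical battery.",
}
results = search_wikipedia("electrical generator", pages)
```

In a ReAct-style loop, the model would emit something like a search action, your code would call this function, and the matching snippets would go back into the prompt as the observation.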
I think it would need to use regular text search though - you could try calculating embedding vectors for the whole of Wikipedia but even with a small embedding model I doubt you could fit those on the device, they'd take up too much space.
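A back-of-the-envelope estimate supports that doubt. Every number below is an assumption picked for illustration (article count, chunks per article, vector size), so treat the result as an order of magnitude only.

```python
def embedding_index_bytes(articles: int, chunks_per_article: int,
                          dimensions: int, bytes_per_value: int) -> int:
    """Raw size of a flat vector index, ignoring ids and index overhead."""
    return articles * chunks_per_article * dimensions * bytes_per_value


# Assumptions, not measurements:
ARTICLES = 6_700_000      # rough English Wikipedia article count
CHUNKS_PER_ARTICLE = 5    # average passages embedded per article
DIMENSIONS = 384          # a typical small embedding model
FLOAT32 = 4               # bytes per value, unquantized
INT8 = 1                  # bytes per value with 8-bit quantization

full_gb = embedding_index_bytes(ARTICLES, CHUNKS_PER_ARTICLE, DIMENSIONS, FLOAT32) / 1e9
quantized_gb = embedding_index_bytes(ARTICLES, CHUNKS_PER_ARTICLE, DIMENSIONS, INT8) / 1e9
print(f"float32: {full_gb:.1f} GB, int8: {quantized_gb:.1f} GB")
```

Under these assumptions the unquantized vectors come to roughly 51 GB, and about 13 GB with 8-bit quantization, before any index structure is added on top.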
Looking forward to Wikipedia building these models of their corpus and hosting them alongside the content. They could even offer them as a cryptographically signed mutable torrent [1] (or rsync if you want to go old school).
Funny enough, I was doing this just yesterday, on the same app mentioned. I remember seeing offline Wikipedia hardware devices about a decade ago and decided it would be handy to have while on a flight. It was actually quite surprising to see how big Wikipedia has since gotten.
The subdivisions the app provides aren't enough, or at least I was using it wrong. For example, trying to download the lighter versions of the Geography section only seems to offer combinations that ultimately leave you without much usable information (e.g., Paris only providing the introduction text and not the content from the later sections) unless I go for the much larger size.
I like the Wikipedia dumps with Aard 2. Excluding images, it's only about 21 GB, which was small enough to use years ago when I had spotty connectivity and is small enough to leave lying around today.
This reminds me of Aard2. It does the same job as the one mentioned in the article, but the dumps are really small (for example, the English Wikipedia one is around 20-something GB).
No images on it, but the app loads them from the source when needed
That language dropdown list is insane. I can't find spanish/español. And since it's not sorted, I can't confirm that it's missing without scanning the whole thing.
This is great, I have difficulty with the navigation though, it seems you can get stuck in some part of the library and there’s no button to go back to the beginning.
Our usage may differ, but I would think I'm probably most interested in some relatively little-visited articles, because I'm there to look up something I don't know about. By contrast, 'United Kingdom (country)' is probably quite highly visited in total, but not that likely to be visited by me, if that makes some sort of sense.
Might be better to ignore certain topics you know you're not interested in. (That way if there's no result, you know there's either truly no result, or that you don't care anyway because you're not interested.)
Or perhaps most effective for the size goal would be just to cut images. Perhaps selectively include them depending on category - e.g. maybe you want them for science & technology categorised articles (diagrams) and geography, but can do without otherwise.
The frequency of article visits probably resembles a Pareto distribution. So there would be many many articles with very few views, including new articles.
I’d think the good ol’ Page rank would be a better metric for deciding which ones to prune.
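As a rough sketch of that idea, here is a minimal power-iteration PageRank over a toy link graph in Python. The graph, the damping factor of 0.85 and the `keep_top` pruning helper are illustrative assumptions; a real run would use the actual Wikipedia link graph.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power-iteration PageRank. `links` maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def keep_top(rank: dict[str, float], count: int) -> list[str]:
    """Titles of the `count` highest-ranked pages, i.e. the ones to keep."""
    return sorted(rank, key=lambda page: (-rank[page], page))[:count]


# Toy link graph (hypothetical): three pages link to France.
links = {
    "Paris": ["France"],
    "Lyon": ["France"],
    "Obscure_stub": ["France"],
    "France": ["Paris"],
}
ranks = pagerank(links)
kept = keep_top(ranks, 2)
```

Note that this keeps heavily linked pages and drops pages nobody links to, which is not necessarily what the parent wants: the obscure article you actually need to look up may be exactly the one that gets pruned.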
I'm not quite sure what you mean here - English Wikipedia is ~60GB, and there are definitely models trained on Wikipedia that are less than 60GB (eg: Llama 7B), but that's a lossy store of information - you won't get perfect recital of a Wikipedia page from an LLM
Yeah, I find that odd too. I wonder if there's some kind of algorithm that downloads what the hockey articles link to, and thus a lot of what's called "Hockey" could be things like the articles about the cities that the teams are in, articles about the physics of water freezing, etc.