In search of the least viewed article on Wikipedia (2022) (colinmorris.github.io)
386 points by alexmolas on Oct 20, 2023 | 218 comments



> On the other hand, no-one has yet come up with a way to monetize a topic like Pseudoneuroterus mazandarani or use it to push a contentious point of view. Hence articles about species and populated places are generally not deleted, even if the topic is only weakly sourced

This is not why this happens on Wikipedia; sourcing is a factor, but notability derived from the nature and quality of the cited sources is the most commonly used criterion for deleting articles.

For example, Wikipedia once had a notability guideline that a person who competed in a sport at the international level was probably notable, so a weakly sourced article on, say, a footballer with a single international cap could be notable enough to avoid deletion.

But then editors removed this guideline and fell back onto the general notability guideline (GNG), which is much stricter for people: substantial biographical coverage from reliable mainstream sources at the national level. Match reports don't count (too routine), no local interviews (too primary), no enthusiast publications (not reliable enough), etc.

Of course, this meant that hundreds of stubs and dozens of full articles about women's international footballers would never meet GNG due to a relative lack of mainstream media coverage, even if they won a World Cup. So all it took was one editor deciding to do almost nothing but flag hundreds of such articles since the opening of the Women's World Cup this summer, almost all of which were deleted.

So it's not this reason that keeps moths and places:

> On the other hand, no-one has yet come up with a way to monetize a topic like Pseudoneuroterus mazandarani or use it to push a contentious point of view.

But rather, very different standards of notability are why there are more low-traffic articles on items and places that can't sue Wikipedia editors for libel.


It's really funny - just this morning I was reading the Wikipedia article for Interstate 94, which led me to the WikiProject that maintains it (US Roads) and an open letter they wrote a month ago that they were leaving Wikipedia and starting their own wiki, because of how painful it's been having their articles deleted due to arbitrary notability guidelines.

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_U.S._Roa...


A couple points on their complaints: you _can_ cite primary sources, but you can't _rely_ on them. That's beyond notability altogether. It's still relevant to why Roads would want to fork (there aren't a lot of news publications doing feature stories on random highways) but it comes down to informal nonenforcement giving way to a few hardasses flagging everything they can as soon as their attention is turned toward a particular domain.

Because there's very little effort required to flag articles for deletion, and the burden to keep is often on the contributors doing research on a 7-day timeframe if there are even just one or two editors supporting deletion, most times the contributors eventually give up running the research treadmill and the content gets deleted anyway.

And WP:GEOROAD is a subject-specific notability guideline, like WP:NSPORT is. Those aren't immutable; NSPORT was fundamentally changed in 2022,[1] and those changes now justify all these deletions of international footballer articles created before. If the Roads editors have all left, there'll be little or no opposition to changing WP:GEOROAD,[2] and deletions of that content from WP could get done even faster.

1: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(policy...

2: https://en.wikipedia.org/wiki/Wikipedia_talk:Notability_(geo... - and I see some of the editors who participated in NSPORT-related deletions this summer also in there advocating to change GEOROAD and make it more explicitly defer to GNG, and therefore easier to delete articles.


> Because there's very little effort required to flag articles for deletion, and the burden to keep is often on the contributors doing research on a 7-day timeframe if there are even just one or two editors supporting deletion, most times the contributors eventually give up running the research treadmill and the content gets deleted anyway.

Pretty much sums up the problem with every community-run moderated site. It directly leads to the downfall because contributors get discouraged and leave, and eventually all that's left is people with the same view as the heavy-handed moderators. Having hours of work wiped out by editors/moderators who've spent less than a minute on it sucks.

Stack Overflow is another notable example of this.


As much as I think that Wikipedia notability guidelines are way too strict, I'm wondering -- is it necessarily a bad thing for editor communities to split off and create separate specialized wikis? As long as the other wikis are also under a free license (here, CC BY-SA 4.0), you can always import back the articles into Wikipedia. So maybe it can be a useful way for communities to "incubate" articles?


Indeed it's not a bad thing at all. I remember in the early days of wikipedia it seemed like every single Pokemon had their own article while actually notable real life topics had scant information.

Migrating all that cruft to a separate pokemon wiki was an improvement for everyone, no matter how "notable" you think Beedrill might be, it doesn't need its own separate wikipedia article. With a separate wiki, Pokemon fans can go into as much depth and lore as they like.

Personally I think the criteria for fictional things should be even stronger. There ought to be an article about the work of fiction itself, but not articles about fictional characters or events unless they're notable outside of the work of fiction.

This keeps wikipedia about facts, not fictional canon.

e.g. Pikachu deserves a separate article, Charmander does not.

To elaborate further, the Charizard article has a "Physical characteristics" section.

It's an anime / computer game. It's not a physical being, so any "physical characteristics" is not factual information, it's fictional information. Wikipedia does a poor job at separation of fact and fiction in articles about fictional beings.


> no matter how "notable" you think Beedrill might be, it doesn't need its own separate wikipedia article

This has always been my sticking point with deletionist thinking. Why doesn't it need its own separate wikipedia article? What's the harm in it? Are we worried that people will start treating Charizard as a real creature?

Where I see the value of notability criteria it is mostly in preventing vanity articles. Beedrill, presumably, is of general enough interest that people are willing to contribute and reference the information. Why isn't that enough?


I wish they'd give all the fantasy stuff (character in tv show, animated character, etc) something like a different background color or theme that essentially says "this is for people who want to document fantasy worlds".


What purpose would the theme change have? Is there a risk of confusion?


Wouldn't it make sense for Pokemon to have its own wiki? Sure have an article about the existence/history of the games and anime in Wikipedia, but such details as describing the various creatures could go elsewhere. That's how it works for other games. For example, there are articles about the various games in the Fallout series in Wikipedia, but the various creatures and locations don't have their own pages there. But there's an entire (in fact several) wikis dedicated to everything Fallout.


I mean, maybe it would make sense. But having the pages on wikipedia has a bunch of huge advantages -- not least of which is piggybacking on the ad-free nature of wikipedia. The size of the editorial community is much larger too.

The reason that there are wikis dedicated to "everything Fallout" is because of these deletionist sentiments. Most of these things started on wikipedia and had to migrate off because of the constant barrage of deletion fights.


Well, I don't like "deletionist sentiment" when the issue is "notability" -- obscure moths or Bulgarian poets have a legitimate reason to be in Wikipedia even if non-entomologists and non-Bulgarians may not care about them. But there is a real argument that fictional beings and places don't belong in a serious encyclopedia (even if the works they are from do exist and should be covered).


> But there is a real argument that fictional beings and places don't belong in a serious encyclopedia

I'll bite, though -- why the passive voice? What is the argument? The first blush here is that these topics are non-serious and make Wikipedia seem less serious. That's clearly a strawman though -- what's the deeper argument? I mean, for Britannica, you only have so much print space you can use, and an article on Beedrill is a waste of paper. But Wikipedia is not printed; and while space is scarce in theory we shouldn't be rationing until the need is apparent.


I think fiction and non-fiction are worth keeping separate on a philosophical level. It isn't about saving space, it is about keeping reality and fantasy isolated from each other which is more important now than ever in the "post-truth" society.


I'm sure there are some very good reasons for it, but why exactly can't there be articles for very niche things, like an individual article for each Pokemon, on Wikipedia? I find the idea of having a complete tome of everything to be an incredibly neat idea!

At a guess, it's probably something like practical restraints around "not enough people monitoring for quality", or the fact that the hard drive space to save all of this information is not free and unlimited, or that simply, it might be better served by niche communities who will be devoted to caring far more about such specific topics?

I just find it frustrating that there isn't a kind of...ultra, super mega colossal set of all human knowledge of everything stored under a single digital roof. I suppose that ideal itself probably isn't practical for the reasons mentioned...it just seems so neat in concept. Just one place for everything.


Wikipedia articles are kinda meant to be a broad view of a subject that is approachable to a general reader with no prior knowledge. You can write an article about Pikachu or Squirtle that is relevant to this type of reader. Can you really write such an article about Dartrix or Groudon?

Granted, this is also a problem with the obscure moth species articles. I think the moth species articles survive because no one really cares enough to start the crusade against them. When we used to have every Pokemon, it was a common line in deletion discussions to say "well if every Pokemon has an article, why can't <my obscure topic> have one too?" -- I think some people eventually got fed up and decided it was worth putting in the work of figuring out what the notability standards should be. What I've learned from a long time of editing on Wikipedia is that often, things are the way they are not because it's the best way, but because the project has a lot of inertia -- it's a lot of work to make a big change happen.

The concept you long for sounds a little bit like Wikidata. It's much less in-depth than Wikipedia, and just describes its subjects as structured data instead of with prose, but the notability bar on Wikidata is much much lower. Every Pokemon, every scientific article, every book, every village, every athlete, etc is generally in scope.


Please don't give the deletionists ideas. I treasure the ability to look up Victaphanta compacta and find real information, regardless of how approachable the topic of the Otways Black Snail is.


> because I don't want it that way.

- some editor probably

In my handful of encounters trying to contribute to wikipedia it's always been such a frustrating experience.

"not enough people monitoring for quality" is one way to put it, but I've often found it to be one very zealous person monitoring for their idea of quality. It ends up quite frustrating, especially if you're a domain expert.

I've corrected articles where things I've written have been cited and had the changes reverted. It was enough to just give up.


> I've corrected articles where things I've written have been cited and had the changes reverted. It was enough to just give up.

I've heard about this happening enough that I stopped treating WP with any credibility whatsoever even for what should be cut & dry fact (aka: non-controversial/political topics). I've heard of people who were being quoted updating the context to more accurately reflect what they were saying and having the changes reverted. As if the person who said the thing being quoted doesn't know what they meant. Often because it didn't meet some guideline or another but more often than not because one overzealous editor has decided that the page being edited is "their page".

I learn the truth more from perusing the edit history or talk pages than from ever reading the page itself. Also despite claims of neutrality it's amazing how often pro-communist articles are heavily maintained almost exclusively by diehard self-proclaimed Marxists making politically biased edits.

Exhibit A: https://en.wikipedia.org/wiki/Talk:Holodomor


Kings of their own tiny virtual mountains. I tried once to update the Wikipedia article about my own military unit, just to update for our new location after a move and some other details about our heraldry. Basically, I was told/ordered to update it by the CO. No luck. Changes were repeatedly reversed by whatever kid/editor didn't want anyone else in their sandbox. It remains incorrect to this day.


>As if the person who said the thing being quoted doesn't know what they meant.

What they meant at the time they said it and what they want it to mean later upon reflection, certainly could be two completely different things and should be scrutinized.


For political things sure. Now imagine you're explaining how something works on a technical level - like the physics of how induction heating works or the summary of a study where they've twisted your summary to claim the opposite of what the study and your summary actually claims. You go to correct their misinterpretation of your study and are told you are wrong and your edit reverted.


It's really not a bad idea, but the problem is structuring.

It's odd but I come back to Cantrill's "Fork Yeah" talk here.

So the basic problem is one of resources: forked wikis still have hosting costs, so they need either a patron community (often just one person) or a for-profit company (with lots of crappy ads), while Wikipedia by all accounts is rolling in the dough with very little accountability (i.e. when professors and teachers remind students "don't believe 100% of what you read on Wikipedia," the-Wiki-community is blamed moreso than Wikipedia-the-nonprofit). If you fork Wikipedia then you lose out on the resources.

Without that, you do have a "forkophobic" culture which is why you get this "governance orgy" -- the notability guidelines and so forth. But the difference is, Cantrill's software examples expect the software to have some sort of editorial control, so if the Linux or Apache foundations take on some project it's because it's used by thousands or millions of people and they don't really take kindly to "oh yeah upstream was vandalized by someone who came in and just made every request to the Apache server return 'HTTP 499 BUTTS BUTTS BUTTS'."

In Wikipedia you get this strange direct democracy by "whoever happened to show up." Deletion votes are often done with like, 20 votes or less of just random passersby. Worse, those random passersby are usually the people who visited the article in the first place and saw that it was up for deletion, so they'll say things that are nonsensical like "oh, he's a very notable figure in the XYZ community, Googling him turns up 40,000 results so clearly he is notable."


> In Wikipedia you get this strange direct democracy by "whoever happened to show up." Deletion votes are often done with like, 20 votes or less of just random passersby.

20! That'd be a luxury.

Here are some of those footballer AfDs from this month that passed with 1 vote on top of the nomination. Without passing any judgment on whether the subjects are notable, see if you notice any participation trends:

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...


> Deletion votes are often done with like, 20 votes or less of just random passersby.

You don't even need that. IIRC, the "PROD" process can get an article deleted with no votes at all. All you have to do is tag the article and if no one removes it for a week, it will be deleted with no further discussion.

The caveats are you can only PROD something once, and I believe no discussion is required to undelete a PROD'ed article (if anyone remembers it well enough to want to bring it back).


It's not a bad thing in concept, and probably won't be a bad thing in execution since Roads has a pretty solid core of editors moving. They've already implemented technical improvements on AARoads, like better and newer road maps. The potential is there.

It's often a bad thing in practice, much like how splitting off entirely separate social networks for specific subjects is often a bad idea. If there aren't enough people or activity, interest wanes. If the project relies on one or two heavily active editors, it collapses when they're unavailable. If you never hit critical mass for SEO or can't/don't know how to promote your content, all the work is done in a vacuum and everyone sees the Wikipedia stubs over your stuff anyway.

Fandom (as in the company), as awful as it is, largely exists by virtue of participation in the subjects where there's enough interest to facilitate it; the mass wiki-farm infrastructure gives them enough of a SEO boost to dominate even some of the wikis that fork off of Fandom to escape its policies.

> you can always import back the articles into Wikipedia. So maybe it can be a useful way for communities to "incubate" articles?

Developing an article on a separate wiki has additional technical and administrative overhead.

CC BY-SA requires attribution. Wikipedia attributes contributions by article history. The only way to import article history is via Special:Import. You have to be an admin or have import rights to use Special:Import on Wikipedia.

If you can clear all those hurdles, then in theory you can import changes with the required attribution from another wiki. But if the imported changes conflict with changes made in parallel on Wikipedia, resolving them can be very contentious, especially if any of the editors involved disagree on the resolution. Or the import might fail, because Special:Import isn't particularly robust.

And as the forked wiki admin, do you decide to set up scheduled imports _from_ Wikipedia to keep up with upstream? These forks often split because Wikipedia is deficient in some way beyond just editing the text. AARoads' fork's use of improved maps, for instance, would probably break the article on Special:Import because it looks like they use different wiki templates and modules.


If the Wikimedia foundation spent its immense donation income on supporting those specialized wikis - something far more appropriate to the spirit in which it was given than what it actually gets spent on - then I'd agree with you. But in practice these other wikis are generally struggling to keep the lights on and resort to advertising, and then the censorship that follows from that.


Roads fork? Trivial.


I wrote an article on a somewhat niche person:

https://en.wikipedia.org/wiki/The_Mexican_Runner

I had to fight a bit with other editors to convince them that this guy was notable. He had coverage in Polygon and Kotaku, so that seemed enough to me, but it was a difficult task at first.

Is some dude who pushes buttons very quickly notable?


The primary thing that establishes general notability on Wikipedia for biographies is significant coverage from reliable secondary sources independent from both each other and the subject.

If someone gets significant coverage from multiple reliable secondary sources for taking a shit, they can have a Wikipedia article. They don't have to win an award for it (though it'd help) or contribute to society (though it'd help).

The action itself might be broadly, by common sense, not notable. Only the notoriety of the person _in published sources_ is relevant.


That's a good read. I'm glad it wasn't deleted.


Thanks for this! I watched some of NESMania back in the day. TMR is great!

A bit weird that the article includes details of his mother's medical condition though!


He's public about it, and the citation for that is an NPR interview, which seemed to me like a reputable source.


Ahhh okay. I had no idea he was interviewed by NPR!


>reliable mainstream sources at the national level

And does anyone actually believe this is applied consistently?

I'm willing to bet that there are countless Wikipedia articles about people in technical, academic, local political, etc., etc. roles/positions that wouldn't meet that criterion.


Of course it’s not 100% consistent, just like any rule in any project where tens of thousands of people work together. That’s what makes Wikipedia great: you don’t have to get your article validated by some random committee of 20 people; you just read the rules and try to respect them. It won’t be perfect, but you can always ask someone with more experience to review the article.


No Wikipedia policy is applied consistently. They're applied when editors want to apply them to justify a decision. That's how every individual decision on Wikipedia works; precedent doesn't even apply.


nope, but that's the neat part (well, neat in the "morbidly scary" sort of way). They don't need to. If a topic is niche enough there won't be enough wiki editors well versed enough to argue for nor against the obviously asinine requirements. they can quash women's football for lacking a national mainstream source but then pull out some esoteric source for their own pet project uninhibited (again, assuming the topic is niche enough to go under the "mainstream editor" radar).

Classic lawful evil behavior.


Thank you for this answer. The only thing I'd add is that Wikipedia is not a monolithic entity, and whilst you're probably referring to English Wikipedia (not unreasonable, given this is HN), my understanding is that other versions of Wikipedia have their own, sometimes quite different notability requirements.


Wikipedia's notability guidelines have been too tight for a very long time. I remember being flabbergasted that highly popular webcomics couldn't get a wiki article.


I remember reading an accusation from someone upset about a page deletion that the notability guidelines are so tight in order to drive traffic to Fandom. At the time I thought it was absurd, but I find myself wondering about that a lot.


WMF has had nothing to do with Wikia/Fandom for years (apparently they shared some hosting costs back in 2009). Jimmy Wales also hasn't exerted control over Wikipedia policy in a long time, and as of this year no longer even nominally has the right to overrule ArbCom.

So at the very least that's not actively motivating the continued state of the notability guidelines, though I don't know if it's possible the original motivation was as you suggest.


Yes, and that was what I myself thought as I scrolled past.

"Jimmy Wales also hasn't exerted control over Wikipedia policy in a long time, and as of this year no longer even nominally has the right to overrule ArbCom."

Except these two facts niggle.

He's on the board, and will always be the founder. He is SV aristocracy in a town where the money you earn is less important than the people you know and the things you have built. He will always have influence and it's naive to believe he wouldn't. So, it is not entirely accurate to say he has no control, just less control than he did. Is it too little control? Perhaps, perhaps not.

If I wanted to construct a set of policies that drove traffic to another site, I would make them quite like Wikipedia has now e.g. Wikipedia is not a gamers manual. Then I would target for deletion much of the nerd content, quite like now. This is quite a coincidence.

I'm not attached to this theory. Also, I don't like spreading negativity. But to me it seems at least somewhat plausible, and that possibility disturbs me a little.


I don't believe this is actually enforced. I created a small website called "Man of the Hour", which filtered Wikipedia with the Wikidata code Q8441[1] (man - male adult human), selected a random entry every hour, and displayed it.

I would go there a few times a day, and noticed that almost every single article it gave me was some random amateur athlete, with only a results page as a source.

[1] https://www.wikidata.org/wiki/Q8441


I gave an example of enforcement. It takes an editor to flag the article for deletion. Chances are, like the women's footballer stubs up until an editor decided to go through all of them, there hasn't been an editor motivated to do so.


To say the quiet part out loud: It only takes one misogynist to go out of their way to flag hundreds of pages of women athletes. No one is out there flagging men's pages to delete for the same reasons.

I was at an early Wikipedia meetup and a woman editor told me that women porn stars were more likely to get (and keep) a profile page than a woman sci-fi author. It's pretty clear that Wikipedia's quality has been diminished by this level of misogyny.


There's a part of that, but the women's stubs are also unjustifiable because of the foundational mandates around sourcing and notability.

Once upon a time, subject-specific notability guidelines carved out exceptions by arbitrarily defining typical notability for a given subject at a lower standard than general notability. Those exceptions are being rolled back toward the higher bar of general notability.

When a women's sport gets 1/10th the mainstream coverage of the men's equivalent, there's no policy justification to have articles under general notability. Even if Wikipedia was capable of banning every editor whose focus is in deleting articles about women (and it's sincerely not capable of banning any of them), the core policies themselves would justify not creating articles — especially for almost all who played in the decades before the recent boom in coverage and investment — because the coverage wasn't and isn't there.


The issue here is not specifically notability or a lack of articles. In most cases someone could find the sources, but they would need to do so really quickly, while flagging hundreds of pages is trivial for an editor with a grudge.

It’s that time pressure gives a great deal of power to anyone with an agenda of any kind.


The comparison wasn't notable male athletes versus non-notable female athletes. It was non-notable male athletes versus non-notable female athletes. The observation that male sports is more popular, while true, is a non-sequitur.


Wikipedia's misogyny is built into its policies, notably the pernicious effects of https://en.wikipedia.org/wiki/MOS:IDINFO and how they punish anyone who dissents from this.

So for example, per this policy, it's impossible to rewrite https://en.wikipedia.org/wiki/Dana_Rivers to point out the angle of male violence against women, despite there being sources that exist discussing this, because it's forbidden by policy to refer to this male as a male or use sources that describe him as a man. Instead, the pretence that this is the crime of a woman has to be maintained.


So you think that because wikipedia guidelines are intolerant of intolerance and don't fit with your personal (bigoted) opinions on trans people that means they are misogynist?


But she is a trans woman, right? How can you say she is male?


Toxic masculinity at play.


Why did the notability-obsessed editor take it on themselves to flag the World Cup footballers? Sounds like a dick move.


I was looking up Gall Wasps last week... :)


For me one of the best kept secrets of Wikipedia is the Navboxes collapsed at the bottom of every article. They are very good for getting a bird's eye view of complicated topics and especially helpful for seeing how something fits within a complex hierarchy. Some of them are like miniature works of art that I'd want to hang up on my wall one day:

https://imgur.com/gallery/ILp6TtA

I get why they have to be collapsed by default, but I use a userjs to move them to the top of the page, uncollapsed, so I can use them to navigate (I set their CSS zoom to 0.3 so they don't take up too much space)


I am interested that you find these attractive as "art".

I cannot see it, not at all.


to me, reading these overviews is a very rewarding experience, as every article title in the category gives me a rewarding rush of memories and context all related to a common theme. and if I don't know about a topic, it's an effective way to find the gaps in my knowledge. maybe it's not a traditional form of art, but I can see how it's a unique way to interact with a large domain of content.


I can as well.


I’m shocked at the method for picking a random page. It’s guaranteed to produce permanent biases in the randomization and lots of path dependence in how individual articles’ chances change over time. Picking a random int from 1 to N is not rocket science.

This makes me wonder if there isn’t some intentional weighting going on, e.g. certain favored pages get space in front of them cleared out so they’re more likely to pop up.


It is shocking, but what's more shocking is that doing it correctly is kind of rocket science. MySQL simply isn't built for picking a truly random row performantly (I'm not sure if any of the common relational databases are?).

If you do something naive like "ORDER BY RAND() LIMIT 1" performance is worse than abysmal. Similarly, doing a "LIMIT 1 OFFSET RAND() * row_count" is comparably abysmal. While if you do something performant like "WHERE id >= RAND() * max_id ORDER BY id LIMIT 1", you encounter the same problem where the id gaps left by deleted articles make certain articles more likely to be chosen.
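
To make the gap bias concrete, here is a rough Python simulation of the "WHERE id >= RAND() * max_id ORDER BY id LIMIT 1" pattern. It is only a sketch: the function name `pick_with_gaps` and the tiny id list are made up for illustration, and nothing here touches a real database.

```python
import bisect
import random
from collections import Counter


def pick_with_gaps(sorted_ids, max_id, rng):
    """Mimic: WHERE id >= RAND() * max_id ORDER BY id LIMIT 1.

    sorted_ids is an ascending list of surviving ids whose last
    element equals max_id (hypothetical data, not a real table).
    """
    target = rng.random() * max_id          # float in [0, max_id)
    pos = bisect.bisect_left(sorted_ids, target)
    return sorted_ids[pos]                  # first id >= target


if __name__ == "__main__":
    # Ids 4..9 were "deleted", so id 10 sits behind a large gap.
    surviving = [1, 2, 3, 10]
    rng = random.Random(0)
    counts = Counter(pick_with_gaps(surviving, 10, rng) for _ in range(10_000))
    for article_id in surviving:
        print(article_id, counts[article_id])
```

With ids 4 through 9 missing, any random target between 3 and 10 lands on id 10, so it should come up far more often than ids 1 to 3, each of which only covers a span of width 1.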

There are only two "correct" solutions. The first is to pick ids at random and then retry when an id isn't a valid article, but the worry here is with how sparse the ids might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal. While the second is to maintain a separate column/table that includes all valid articles in a sequential integer sequence, and then pick a random integer guaranteed to be valid. But the performance problem now is that every time you delete an article, you have to recalculate and rewrite, on average, half the entire column of integers.
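
The first option (pick an id, retry on a miss) can be sketched in a few lines of Python. This is an illustration under assumptions: `pick_by_retry` is a hypothetical name, the set of valid ids stands in for an indexed lookup, and the attempt cap is arbitrary.

```python
import random


def pick_by_retry(valid_ids, max_id, rng, max_attempts=1000):
    """Draw ids uniformly from 1..max_id until one is a valid article.

    Returns (article_id, attempts_used). Every valid id is equally
    likely, but the number of attempts is itself random.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = rng.randint(1, max_id)  # inclusive on both ends
        if candidate in valid_ids:
            return candidate, attempt
    raise RuntimeError("no valid id found within max_attempts")


if __name__ == "__main__":
    rng = random.Random(1)
    print(pick_by_retry({2, 5, 9}, 10, rng))
```

The result is unbiased, and the expected number of attempts is max_id divided by the number of valid ids, which is where the worry about sparse ids and unpredictable retries comes from.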

So for something that's really just a toy feature, the way Wikipedia implements it is kind of "good enough". And downsides could be mitigated by recalculating each article's random number every so often, e.g. as a daily or weekly or slowly-but-constantly-crawling batch job, or every time a page gets edited.

You could also dramatically improve it (but without making it quite perfect) by using the process as it exists, but rather than picking the article with the closest random number, select the ~50 immediately lower and higher than it, and then pick randomly from those ~100. It's a complete and total hack, but in practice it'll still be extremely performant.
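
A rough Python sketch of that window hack, next to the plain "closest random number" lookup it modifies. Everything here is an assumption for illustration: the per-article random keys are simulated in memory, the names `build_index`, `pick_nearest` and `pick_from_window` are made up, and the wrap-around in `pick_nearest` is just one plausible way to handle a draw above the largest key.

```python
import bisect
import random


def build_index(article_ids, rng):
    """Give each article a random key in [0, 1) and sort by key."""
    return sorted((rng.random(), article_id) for article_id in article_ids)


def pick_nearest(index, rng):
    """Take the first article whose key is >= a fresh random number."""
    pos = bisect.bisect_left(index, (rng.random(),))
    return index[pos % len(index)][1]       # wrap to the start if past the end


def pick_from_window(index, rng, half_width=50):
    """Find the same position, then choose uniformly among its neighbours."""
    pos = bisect.bisect_left(index, (rng.random(),))
    lo = max(0, pos - half_width)
    hi = min(len(index), pos + half_width)
    return rng.choice(index[lo:hi])[1]


if __name__ == "__main__":
    rng = random.Random(42)
    index = build_index(range(1000), rng)
    print(pick_nearest(index, rng), pick_from_window(index, rng))
```

Averaging over about 100 neighbouring gaps tends to flatten the differences between articles, though articles near either end of the key range still see slightly different odds, so it is an approximation rather than a uniform pick.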


If it were up to me, I'd implement something sort of like an online[0] Fisher-Yates shuffle:

Add a column to the article database for "random_index". This gives each article a unique integer index within a consecutive range of 1 to N. To pick a random article, just pick a random number in that range using your favorite random number generator and then look up the row with that random_index value.

The tricky part, obviously, is maintaining the invariant that random index values are (1) randomly assigned and (2) contiguous. To maintain those, I believe it's enough to:

1. Whenever a new article is created, give it random_index N+1. Then choose a random number in the range (1, N+1). Swap the random_index of the new article and the article whose random_index is the number you just rolled. This is the Fisher-Yates step that gives you property (1).

2. Whenever an article is deleted or otherwise removed from the set of articles that can be randomly selected, look up the article with random_index N. Set that article's random_index to the random_index of the article about to be deleted. This gives you property (2).

[0]: https://en.wikipedia.org/wiki/Online_algorithm
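The two steps above can be sketched in memory like this. It is only an illustration: the class and method names are invented, two dicts stand in for the indexed `random_index` column, and it ignores concurrency.

```python
import random

class RandomIndex:
    def __init__(self):
        self.by_index = {}  # random_index -> article
        self.index_of = {}  # article -> random_index

    def _swap(self, a, b):
        art_a, art_b = self.by_index[a], self.by_index[b]
        self.by_index[a], self.by_index[b] = art_b, art_a
        self.index_of[art_a], self.index_of[art_b] = b, a

    def add(self, article, rng):
        n = len(self.by_index) + 1
        self.by_index[n] = article
        self.index_of[article] = n
        # Step 1: swap the newcomer with a random slot in 1..n.
        self._swap(n, rng.randint(1, n))

    def remove(self, article):
        n = len(self.by_index)
        hole = self.index_of.pop(article)
        last = self.by_index.pop(n)
        if hole != n:
            # Step 2: move the article at index N into the hole.
            self.by_index[hole] = last
            self.index_of[last] = hole

    def pick(self, rng):
        return self.by_index[rng.randint(1, len(self.by_index))]
```

Both `add` and `remove` touch at most two rows, so the indexes stay contiguous without any bulk renumbering.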


Oh wow, yup that will do it. That was my second suggestion for what is "correct" but I never heard of your #2 suggestion, so I thought there would still be performance issues because you'd have to shift all the values above. I'm kind of annoyed now I didn't know about this before... :)

Yes, this should definitely be considered the correct, definitive, performant solution then. Thank you!

I think you'll need a full table lock in the case of a race condition trying to do a remove and add at the same time, but that's probably not usually going to be an issue?


Actually, now that I think about it, step 1 isn't necessary. You don't actually need to shuffle the indexes since you always choose a random one anyway. It should be enough to just do the swap on deletes in order to keep the integers contiguous.

> I think you'll need a full table lock in case you're ever trying to do a remove and add at the same time

Yeah, I think you need some sort of locking to ensure N can't be concurrently modified but I'm not a DB person, so I don't know how to go about that.
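For what it's worth, one way to get that guarantee is to do the whole delete-and-swap inside a single write transaction. Below is a sketch using Python's built-in `sqlite3` purely as a stand-in; the `articles` table and its columns are hypothetical. On MySQL one would presumably reach for row locks or a table lock instead, which I haven't tested.

```python
import sqlite3

def delete_article(conn, article_id):
    """Delete one article and move the highest random_index into the hole."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading N
    try:
        row = conn.execute(
            "SELECT random_index FROM articles WHERE id = ?", (article_id,)
        ).fetchone()
        if row is None:
            conn.execute("ROLLBACK")
            return False
        hole = row[0]
        n = conn.execute("SELECT MAX(random_index) FROM articles").fetchone()[0]
        conn.execute("DELETE FROM articles WHERE id = ?", (article_id,))
        # No-op when the deleted article already held index N.
        conn.execute(
            "UPDATE articles SET random_index = ? WHERE random_index = ?",
            (hole, n),
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None lets us issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, random_index INTEGER UNIQUE)"
)
conn.executemany("INSERT INTO articles VALUES (?, ?)", [(i, i) for i in range(1, 6)])
delete_article(conn, 2)
```

Because the read of N and the update happen under the same lock, a concurrent insert can't hand out index N+1 halfway through the swap.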


Yup, agreed on step 1. Basic autoincrement is perfectly fine, since you pick a random number at query time.


What an incredibly simple solution for something that was just deemed rocket science.

Also I liked your book.


Thanks to both of you for pointing out some quite good solutions to this problem.


This is the nicest solution I've seen so far.

I believe that for the operation of randomly sampling a single row, only the consecutiveness invariant is needed. The random permutation invariant is unnecessary, because the choice from the range (1, N+1) is random anyway.

However, I can imagine the random permutation invariant being useful for other operations (although I can't immediately think of one).

EDIT: the random permutation invariant could maybe be useful for a list that can be displayed to the user. I can imagine some cases where it is nice to have an order that doesn't change a lot while you don't want it to depend on any property in particular (like the order in which the rows were added).


What we actually have though are thousands (potentially millions) of well-fed human brains which are volunteering their downtime in a particular moment of their life to receive factual information of any kind (of the kind they expect to see on Wikipedia) in the hopes of getting something good. A system that chose the best possible article to present to them in that moment to bend the course of their life in a good direction or satisfy some need (or attempt/pretend to satisfy, as much as we can by shooting photons at their retinas) would be a better engineered system than one that randomly selects from its entire catalog of options correctly and efficiently.


2 is exactly how removal from sets is O(1) whereas lists are O(N)


you probably don't even need 2. If your select with a random number N returns null then increment to N+1 and select again. You'll find one eventually.


I think that’s got a similar bias as the original, unfortunately. All deleted indices between K and N (K<N) will result in picking N, and the number of such deletions would vary wildly between individual articles.
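To put numbers on it, here is a small exact calculation of each surviving index's pick probability under the "increment on a miss" rule. The function name is made up, and it assumes no wrap-around past N:

```python
from fractions import Fraction

def probe_forward_probabilities(n, deleted):
    """Exact pick probability per surviving index in 1..n when a miss
    on a deleted index moves on to index + 1."""
    probs = {}
    run = 0  # deleted indices seen since the last survivor
    for i in range(1, n + 1):
        if i in deleted:
            run += 1
        else:
            probs[i] = Fraction(run + 1, n)
            run = 0
    # Any trailing deleted run has no survivor after it to land on.
    return probs, Fraction(run, n)

# Hypothetical example: indices 4, 5 and 6 were deleted out of 10.
probs, leftover = probe_forward_probabilities(10, {4, 5, 6})
```

In this example index 7 absorbs the three deleted slots before it and ends up four times as likely as any other article.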


> but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal.

Is this really a big deal? Surely for a SQL database, looking up the existence of a few IDs, just about the simplest possible set operation, is an operation measured in a fraction of a millisecond. Wikipedia is not that sparse, and a lookup of a few IDs in a call would easily bound it to a worst case of <10 calls and still just a millisecond or two.


It's not a big deal because there is no need to retry 20 times. The probability of getting a deleted article several times in a row is very low, so you could limit to e.g. 5 tries, and if all are deleted fall back to the next non-deleted article (or even a hard-coded article). The bias would be negligible assuming the proportion of deleted articles is low; to guarantee that it is low, one can periodically renumber to eliminate deleted articles (this can be done efficiently using the trick suggested by @munificent; but the naive O(n) approach would probably be good enough).
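A quick sketch of the capped version, assuming independent uniform draws; the names and the default of 5 tries are just for illustration:

```python
import bisect
import random

def fallback_probability(deleted_fraction, max_tries=5):
    """Chance that every one of max_tries independent draws hits a deleted id."""
    return deleted_fraction ** max_tries

def pick_with_bounded_retries(sorted_ids, id_set, rng, max_tries=5):
    """Try a few uniform draws; if all miss, take the next surviving id."""
    max_id = sorted_ids[-1]
    for _ in range(max_tries):
        candidate = rng.randint(1, max_id)
        if candidate in id_set:  # stands in for an indexed lookup
            return candidate
    # Fallback is slightly biased toward ids that follow a gap.
    candidate = rng.randint(1, max_id)
    return sorted_ids[bisect.bisect_left(sorted_ids, candidate)]
```

With 10% of ids deleted, the biased fallback would only be reached about once in 100,000 picks, and it gets rarer still as the cap goes up.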


Oracle has `SAMPLE(N)` which can come after a table name to get N records from the table. I believe it's built specifically to be performant for this kind of thing. That said, it's the only case I've heard of and only because of an aside in a book a while ago.

https://stackoverflow.com/a/22822473/9939508


Why does the randomization have to happen in the database query? Assuming there aren't any large gaps in the distribution of IDs, and if you know the max_id, couldn't you pick a random number between min_id and max_id and get the record whose ID matches that random number?


Precisely because of the gaps. Tons of Wikipedia article ID's aren't valid for random selection because they've been deleted, because they're a disambiguation page, because they're a redirect, or they're a talk page or user page or whatever else.

My comment covered your suggestion already -- that's why I wrote "you encounter the same problem where the id gaps in deleted articles make certain articles more likely to be chosen".


Can't you just query the ID column and grab the entire list of valid IDs, put that into an array, store it, and pick random IDs from that?


That requires setting up an entirely different service and somehow keeping it in perfect sync with the database, along with all of the memory it requires.

And you've still got to decide how you're going to pick random ID's from an array of tens of millions of elements that are constantly having elements deleted from the middle. Once you've figured out how to do that efficiently, you might as well skip all the trouble and just use that algorithm on the database itself.


>And you've still got to decide how you're going to pick random ID's from an array of tens of millions of elements that are constantly having elements deleted from the middle. Once you've

How so? You have an array of only valid IDs [1,2,3]


Oh I thought this was a one-time thing, like a research paper. Doing it continuously in real time is much harder.


How would you do it given that you can't know ahead of time which articles have been deleted? If you pick a random int from 1 to N, there's a probability that you'd get a bad result. If the proportion of bad pages is greater than good pages (a not-unlikely hypothesis given spam) then repeatedly picking a random int from 1 to N will take a few rerolls.

I agree that rerolling is probably a fine strategy, but it's not obvious that everyone would be willing to just trust probability like that. In fact, if you set up your own wiki with 99 bad pages and 1 good page, hitting the random article button won't work most of the time.

You could of course pick a random number from 1 to M, where M is the number of valid articles. But that runs into the same problem of how do you map the 1-to-M number back to an actual article id?

This method may be biased, but it's constant performance.

(I still prefer just keeping all valid articles in memory and choosing a random one of those. There are only 7M as of October, so including the deleted pages, there's probably ~70M, which is on the same order as HN's total item count. Heck, even creating one file per valid article on disk and doing a random search with "find . -type f | shuf -n 1" probably wouldn't be terrible if you cache the result every few seconds, though that'd also be biased in its own way.)


If they went to this much trouble it's presumably because selecting a random row is an actually slow operation for their database (and this is a pure gimmick feature so the weighting not being equal is irrelevant). MariaDB suggests that "ORDER BY RAND() LIMIT 1" involves a full table scan.


Worse yet, they have had to struggle with MySQL’s idiosyncrasies since at least 3.22. There will be a lot of leftovers from those grim times that used to be “real clever hacks” back in the day.


If I had a database with articles (deleted or not), each with an ID, and knowing the ID range at each app server - and I want to roll a random non-deleted article:

I'd have 1) a remote cache for deleted articles 2) the application servers adding to the remote cache whenever they roll a deleted article. The remote cache can be data-center local and backed by durable storage, with a long TTL. It's possible to not roll un-deleted articles until the cache TTL, but that's ok. Data-center locality ensures some level of fault tolerance and low latency. Durable storage means restarts don't have as much of a cold-cache problem. The ID range maximum water-mark can be updated asynchronously. It's ok if new articles don't show up for a short while.
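Roughly what I have in mind, as an in-process sketch. The class, the `exists` callback and the explicit `now` argument are all hypothetical stand-ins for the remote cache, the database lookup and the clock, and it assumes at least one live article exists:

```python
import random

class DeletedIdCache:
    """Remembers ids known to be deleted until their TTL runs out."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.expires_at = {}  # article id -> expiry time

    def mark_deleted(self, article_id, now):
        self.expires_at[article_id] = now + self.ttl

    def is_known_deleted(self, article_id, now):
        expiry = self.expires_at.get(article_id)
        if expiry is None:
            return False
        if expiry <= now:
            # Entry expired: forget it so an un-deleted article can reappear.
            del self.expires_at[article_id]
            return False
        return True

def roll_article(max_id, exists, cache, rng, now):
    """Roll ids until one is live, skipping ids the cache already rejected."""
    while True:
        candidate = rng.randint(1, max_id)
        if cache.is_known_deleted(candidate, now):
            continue  # no database round trip needed
        if exists(candidate):
            return candidate
        cache.mark_deleted(candidate, now)
```

The database only sees a miss once per deleted id per TTL window; every later roll of that id is rejected by the cache.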


I also thought that the different weights might have been intentional since it reminds me of arithmetic coding[1]. Turns out it was just an accident.

I suppose the current implementation makes sense since the weights would be roughly equal as the number of articles increases, and any individual user is unlikely to care about fairness as long as they get some random article.

[1] https://en.wikipedia.org/wiki/Arithmetic_coding


It is so fantastic that we have a single resource where I can find information on an unremarkable Peruvian moth, a random small town in Iran, or a type of reading material designed for Sunday reading in Victorian England. Before the advent of the internet this stuff wouldn’t be worth the paper it is published on. But now the cost of storing it in the cloud is effectively zero, so we get this amazing long tail of information that is just incredibly valuable. Thank you to the people who contribute and maintain this stuff!

Edit: what would be really cool is if I could leave a note on a page saying "hey, I visited this page and I'm looking to connect with other people interested in the Peruvian Foovius Barivius Moth"... ping me


For as much as I love Wikipedia as it is, I yearn for a return to the days of their old inclusionist[1] policy.

[1] https://gwern.net/inclusionism


I came across a huge article on some magazine marked for deletion the other day. The amount of work that had gone into the article was extensive, and it made me sad. While my little article about a small pile of sand in the middle of nowhere has ironically been translated into nine languages:

https://en.wikipedia.org/wiki/Qeqertaq_Avannarleq


I agree. Why throw away perfectly good work out of some bizarre idea about "notability?" What's notable now is not what will be notable in the future.

Stop the destruction! End this dumb rule and make Wikipedia truly open to all.


"Notability" is a weapon used by many editors to get rid of entries that go against their views. I've seen it used to keep off pages for highly-selling authors who hold contrary views to certain viewpoints even though they are better-known and more highly cited than authors who advocate to those certain viewpoints who get to keep their pages. No one should be able to gatekeep in this way, and there should not be anyone who can claim a page and make it theirs, keeping off any and all edits or challenges by others.


I've seen it used against... games in a particular language.

The same game can be extremely notable and well known to everyone in one language, while in another its article gets deleted without a trace.

Sometimes it feels like "racism" based on countries (countrism?)


> Sometimes it feels like "racism" based on countries (countrism?)

I'd call it national chauvinism.


jingoism


A lot of languages span several races or countries.


As far as I understand, this idea of notability also adversely impacts science publishing: work isn't published by reputable journals unless it's notable, which skews incentives and also means there's probably some useful information out there left for someone else to rediscover.


I wish there was an easier way to find deleted articles. You can see all versions of an article but if it's gone there seems to be no way to see anything.


This is by design. A lot of articles, especially recently created ones, get deleted because they consist of material which is legally problematic (e.g. text which is a copyright violation, which is libelous, which violates someone's privacy, etc).


Doesn't that use a different deletion mechanism? Afaik you can still see "normally" deleted edits and articles if you have the revision link, but those that have been deleted for legal reasons are wiped from history and their revision or edit links are blanked out.


> Doesn't that use a different deletion mechanism?

Generally not. There is a separate process called "oversight" [1] which behaves the way you're describing, but it can only be invoked by a very small number of privileged users, and is only used in extreme cases.

[1]: https://meta.wikimedia.org/wiki/Oversight_policy


There used to be deletionpedia.org but it has gone dormant (again)


Probably because you could exploit this. Do this a bunch in some way that leans towards whatever your agenda is and eventually one of those articles will get a bunch of hits and attention before anyone considers revising it.


Wikipedia should allow more articles to have sections for viewpoints from different biases and agendas.


Nothing is destroyed; the person who wrote the article can ask an admin to get its text back so they can publish it elsewhere. I don’t get this "Wikipedia should host everything" mentality: if you disagree with the rules, just publish your work somewhere else. The number of articles needs to be manageable, so rules are needed.


surely if an article exists and is shown to be being poorly maintained, then perhaps it's worthy of dormancy or maybe deletion. why do it pre-emptively?


It's interesting, I always theoretically appreciated that wikipedia could be inaccurate and that the strict requirements could be a bad thing, but I never saw it in action. Until I was in Tunisia and wanted to visit all the star wars film sites that I could.

There's a place called Ksar Ouled Soltane [0] and I couldn't quite figure out if it was a film site or not. The locals assured me it was, but there were conflicting reports online. I saw that it was listed on Wikipedia, but when I went through the history, I saw that someone tried to remove it [1]. Unfortunately their removal was reverted and their source [2] wasn't considered good enough.

That said, I read their article and I'm much more inclined to believe them than the wikipedia editors.

It was the first time I ever encountered something like this in the wild.

[0] https://en.wikipedia.org/wiki/Ksar_Ouled_Soltane

[1] https://en.wikipedia.org/wiki/User_talk:Dbavandy

[2] https://galaxytours.com/research/ksar-ouled-soltane-debunkin...


I have followed Ukraine-Russia related articles for a decade and saw thousands of articles changed over that time span.

One thing that always stood out to me was how till 2014 you could find many references on wikipedia to pre-2014 attempts from Crimea to secede from Ukraine.

"Crimea" (quoted, as it may refer to different groups/leaderships in Crimea) tried to split from Ukraine twice already during Soviet times.

But this whole thing got edited away in years across dozens of articles.

I think there's a significant nuance between the two narratives.


I have to wonder if Billinghurst even looked at the galaxytours website. I suspect not.

It is telling that he never responded to Davin Dbavandy's question about why it is not a reliable source.


I'm much more inclusionist than deletionist. And find there's a lot of both randomness and irony in what qualifies as notable, especially given that notability can be a very local thing. On the other hand, it's probably fair that not everyone who has ever had their name in print or has a website should have a Wikipedia entry.


The ugliest edit war I've seen involved a less-known adult film star.

Her legal name had appeared in some public court filings, and some user had added it to her Wikipedia page with the reference.

Someone else (purportedly one of her co-stars) had requested removal.

A Wikipedia editor then got involved, and refused to budge from the position the original actress was notable enough to merit inclusion of her legal name.

Cue months of back-and-forth deletions and reversions.

Finally, I think a year later, someone quietly edited the legal name out and it stuck.

Sometimes people with a little power are the greatest assholes. Probably because they sought the little power.


What is worse is the people who went thru the trouble to make those edits in the first place will conclude that it is not worth the trouble to fix this. Then go on to not make other edits because 'of that last time'. That sort of fight you basically have to stay at until someone gives up. As neither side is going to budge. Then you can do like what happened here. Let the other side think they won. Wait a year and put what you want and it would probably stand as they moved on ages ago.


there's a certain free-to-play multiplayer game I used to play that is absolutely full of cheaters. on one server some people built a base opposite me, and for a while I sat in my sniper's nest and took pot-shots at them. then suddenly they absolutely got my number (turned on their cheats) and I couldn't walk inside my base without getting headshot through walls. I tried my best to kill them but I realised that they were staying online just for the sake of killing me. so I logged off, came back on after an hour and blew their base to fucking kingdom come. wikipedia seems to be the same. people feed off the conflict, and if you just let the conflict die, eventually you can come and take the spoils unopposed


An actress notable enough to merit her own Wikipedia article didn't want her name known, but her name was a matter of public record?

I don't know about this case, but it seems entirely reasonable to argue for including the name, or not.

Plenty of public personas have aliases they wish to be known by, but their Wikipedia article prominently notes their real name.


surely people should have a right to privacy wrt this kind of thing?


Wikipedia isn't a primary source, and it needs to be verifiable.

So by definition it's not exposing any information you can't find in other public sources.

Whether it's notable and something that should be included is another matter.

In any case, I'm saying that if you look at the articles for Madonna, Bono, Jenna Jameson, Marilyn Manson etc. you'll find their real name displayed prominently.

We can't form our own opinion in this case, as the GP is deliberately keeping the specific article in question hidden.

But in general this seems like a thing two editors could reasonably disagree about.


It's probably the case that "outing" a micro-celebrity or someone who isn't really especially well-known is sort of obnoxious. But, yeah, for especially well-known people in the spotlight (often by their own choice) there's no particular right to keep their legal name a secret.


As a matter of (English) Wikipedia policy the inclusion of personal details ("such as date of birth, home value, traffic citations, vehicle registrations, and home or business addresses") should not be supported by only public records. Proper names for those primarily known under an alias are not explicitly covered under this section, but there would be a reasonable case for exclusion.

For the section in question, see: https://en.wikipedia.org/wiki/Wikipedia:BLPPRIMARY


They need something like "wikipedia adjacent" to allow for inclusionism without authority - it's not like they're limited by the size of a bound encyclopedia.


Wikipedia is an excellent example of something that would benefit from an upstream, less-authoritative, more-comprehensive source.

Unfortunately, it seems so ossified that that work can only be done within Wikipedia (notable article, standards, etc).

A far more reasonable approach would have been to have CommonPedia... and then regularly pull qualifying information into Wikipedia.


> They need something like "wikipedia adjacent" to allow for inclusionism without authority

This is called "The Web". You’re free to write whatever you want, it’s just not on Wikipedia.

> it's not like they're limited by the size of a bound encyclopedia.

No, but articles still require maintenance, so it’s false to say that just because this is "in the cloud" you can have an infinite number of articles.


I genuinely believe that deletionism is rooted in mental disorder. It comes from a similar place as neurotic clean-freak OCD. Deletionists should be given ECT, not admin accounts.


Me likewise – it makes me happy that there's an 800 word article about a pond that used to exist for less than 200 years in my town.


I still miss the Trivia sections.


Re: your edit - I've always thought it'd make for a fun pet project to take the hyper-engagement-driven approach that social media apps take but apply it to Wikipedia. Like a Tiktok-style vertical feed of articles that tries to find the ones you'll engage with most (WikiScroll [0] is already part of the way there). Or a chatroom corresponding to each article that people can banter in, kind of like you described. Or a DM feature but you can only send Wikipedia links. The idea of Wikipedia as a social catalyst is super interesting to me.

[0]: https://wikiscroll.blankenship.io/


> an unremarkable Peruvian moth

Miloš Moth (Jan 3, 2023–May 17, 2023) was a Peruvian moth,


Very shameless plug on your last "Edit:", but I think it fits perfectly one feature that a web extension I made does: https://webcursors.click/.

There's a feature to write a note and leave it on the page for others (with the extension installed) to see and maybe have luck someone contacting you.


The design on your site reminds me of kinopio.club, I like the sort of whimsical style that doesn't compromise on utility.

> There's a feature to write a note and leave it on the page for others

Your site says that "nothing is stored." Does writing a note require everyone to be on the site at the same time? At first I thought it was sort of like the old "Dissenter" extension that added comments sections to every page, but the description seems to indicate a more live chat style.


I’m so thankful for Wikipedia. It’s the best part of the internet in my opinion.


The eye opener for me was in the mid-late aughts. A coworker for whatever reason did a deep dive into the world's knowledge of GI Joe lore. Before that I wouldn't have even considered there *was* lore that was that well fleshed out. And before Wikipedia I assume that knowledge was known by only a small handful of supernerds. And yet, some random person went down the wikipedia rabbit hole for a few hours and knew everything.


> Before the advent of the internet this stuff wouldn’t be worth the paper it is published on.

Are you seriously suggesting that before the internet no one published or wrote about highly niche or uninteresting subjects? That you couldn't find mundane facts from history or nature written down somewhere?

> Edit: what would be really cool is if I could leave a note on a page saying "hey, I visited this page and I'm looking to connect with other people interested in the Peruvian Foovius Barivius Moth"... ping me

No, that wouldn't be cool at all. What we'd then have is a social network of people competing for some kind of points or popularity, or to prove who was the most expert or most interested in the subject. And we would see the quality of the articles degrade, and suddenly no one wants to work on the unpopular stuff. And people like you would be left wondering, "gee, why is it no one wants to write unpopular articles for this encyclopedia?"


> Are you seriously suggesting that before the internet no one published or wrote about highly niche or uninteresting subjects? That you couldn't find mundane facts from history or nature written down somewhere?

Yes and no. I'm suggesting that before Wikipedia, no encyclopedia could have published an article about Jack Wade, Alaska, an unincorporated community of probably < 100 people (I picked this place somewhat randomly off of google maps). If you put even one sentence about every town similar to that in an encyclopedia, you'd have an encyclopedia set that no one would be able to afford.

> What we'd then have is a social network of people competing for some kind of points or popularity, or to prove who was the most expert or most interested in the subject.

This really depends. For example, maybe I'm an American with a grandparent from Volosianka, Lviv Oblast, Ukraine (again, I picked this semi-randomly). Population is about 1500 today. There's probably a handful of people in the world who speak English and who have a particular interest in that town. Maybe my grandpa is from there and I'm looking to learn more about his life and I'd love to chat with someone else who knows more about the place. I think that there are a lot of articles on the long tail of wikipedia where there's probably only a few people in the world who care about the topic and if I'm one of them, I'd be interested in meeting with others. Obviously, there's plenty of topics where this isn't the case. I don't know how you'd implement this feature, because I don't think you want to turn wikipedia into a meeting room for the Israeli-Palestinian conflict [1]. But for these very very niche things, it could serve as an interesting meeting place.

[1] In other words, I am stating a problem here... not a solution


Prior to the Web, the place name would have been listed in a gazetteer (of Alaska, probably) with a small description.

The moth would be described in the original scientific publication (from when it was first discovered) and then in books listing the Moths of Peru etc.


> No, that wouldn't be cool at all.

It also already exists, but in a better form. If you find something is interesting, write about it. Then to connect with others you can find others who have written or researched it and then write them a letter/email/message. They will likely not be an island and you can get connected to a group of people who are genuinely interested in a subject without the degenerative downsides of a social media platform.


Exactly. What does this person think the Wikipedia articles are based upon? There's supposed to be a real source!


Wikipedia's origination was almost 100% self-reported, uncited facts.

The standards came much later.


Fun fact: There are no uninteresting articles on Wikipedia. I can even prove that mathematically. It's a proof by contradiction:

Start with ranking all articles by their interestingness. Then order the list and look at the lowest values. There must be a least interesting article. But the very fact that this is the least interesting one in the entirety of Wikipedia makes it interesting.

In the same way, I assume the least viewed articles in this blog post will have lost their status by now.


:-) and surely enough there is a wikipedia page for it: https://en.wikipedia.org/wiki/Interesting_number_paradox


I'm doubting the validity of the proof. It seems to assume that there is exactly one least interesting article (the least interesting article). But maybe there are hundreds of articles that have the same lowest amount of 'interestingness'. I would say that then those hundreds of articles are uninteresting.


That's because they botched the proof. You need to sort the articles not by interestingness, but some separate canonical property, like creation date. Then you get a (here chronologically) first uninteresting article, which is interesting.


You assume A) that the set of interestingness values is both finite and discrete and B) that its cardinality is extremely low. Both things can be remedied by defining a more comprehensive kind of interestingness that lets you compare more articles.


In the interest of being pedantic, the article actually looks at the least viewed articles in 2021.


Which part of the statement would that affect?


>I assume the least viewed articles in this blog post will have lost their status by now.

No matter how many views they get now, they would still be the least viewed article in the year 2021.


And neither was anything of that sort claimed.


>I assume the least viewed articles in this blog post will have lost their status by now

You didn't directly say it, but as the status given them in the blog post was "least read article of 2021" the statement is wrong. It's more because you worded it ambiguously, but that's why you got a pedantic reply.


The statement still remains valid, since the current year is no longer 2021 and it doesn't claim that the list is absolute or eternal. Neither does the article.


The blog post is specifically about the least read article in 2021, and in that context saying it can lose that status is wrong. The "least read article of 2021" is now eternal.

I don't get why you're so reluctant to admit you could have worded something better. "I assume the least read articles of 2021 mentioned in the blog post are no longer the least read articles this year" would have said the same thing without the ambiguity. It's not that big of a deal.


But it actually was the least read article at that time. Since that time is past and the post was made together with links to the least viewed articles, it's a given that they no longer are today.


This does require that the interestingness of being the least interesting article is large enough to cover the gap between least interesting and next-to-least interesting (plus the interestingness of being next-to-least interesting). Probably true, QED


Then it would be even more interesting, because of the more complex gap dynamics. The more you add to this argument, the more interesting things can potentially get.


In the extreme case of a Wiki with two articles, one about WWII and another about the 14th meeting of your HOA to address hedge height limitations, there will never be enough interesting properties of how uninteresting the HOA article is to achieve interest parity. I'd expect the gap dynamics and further will get geometrically less interesting, such that the sum of all levels will never exceed double the first.


The proof implicitly requires a large set of objects. See also the linked page in one of the other comments. Since Wikipedia contains millions of pages, you can directly convert the other proof.


You have to actually do it, though. Until somebody finds the least interesting article and thinks about it, it isn't interesting. Mathematical properties are eternal and exist whether or not anyone is looking.


You don't have to actually do it. In the same way you don't have to calculate and sort every number's prime factors to prove the fundamental theorem of arithmetic. Or, a closer example: You don't need to calculate Euler's number to prove that it exists. For the articles, you only need to postulate that individual ones can be more or less interesting than other articles.


I think it doesn't prove that there are no uninteresting articles, but rather that sorting itself introduces a side effect that affects the score. A paradox indeed.


No least interesting article, surely?

Unless articles become interesting if they're in the wikipedia category "Uninteresting".


Reminds me of Geoff Marshall's series of Least Used Stations on UK's Network Rail:

* https://www.youtube.com/playlist?list=PLt4q5oaptyI9U2zddss8d...

* https://www.youtube.com/@geofftech2


What an incredibly pleasant channel/video, thanks for the link!


Wikipedia mods are eager to remove articles that they claim to be not significant so the least viewed article will probably change a lot


My article for Carrier_IQ was deleted as non-notable for obvious political reasons, then recreated years later after the spin around the company had been settled. There may also be some intelligence community shenanigans.

It's a great way to delete uncomfortable historic facts. Now it's just a funny little story about a rootkit with no specific political intrigue at all.


Not just that: remember Donna Strickland. They deleted her article as non-notable and then restored it after she won the Nobel Prize in Physics... oops.


> It's a great way to delete uncomfortable historic facts. Now it's just a funny little story about a rootkit with no specific political intrigue at all.

That's not how I'd describe the page today

https://en.wikipedia.org/wiki/Carrier_IQ


Yes, but in my experience, this applies more to people than to some of the topics discussed in the article. I think you'd be hard pressed to find a city, town, village or hamlet in the US without a Wikipedia page (as an example, take a look at Jack Wade, Alaska [1]). And I don't imagine that rare moth species articles get taken down. But you have to have some kind of policy in place so that every individual on the planet doesn't have a Wikipedia page.

[1] On Google Maps view, I can see maybe 8 houses there? https://www.google.com/maps/place/Jack+Wade,+AK+99732/@64.15.... But it's on Wikipedia here: https://en.wikipedia.org/wiki/Jack_Wade,_Alaska


At some point a long time ago someone imported all the US Census data into Wikipedia, creating a page for every town reported.

It strikes me as strange that those are considered "notable enough" mainly because they've been in Wikipedia long enough, but new articles have to fight.

Petty bureaucrats can only show their power by saying no, if they say yes it's as if they weren't there.


The author actually brings this up in the article, and it seems that they are not likely to be deleted (barring vandalism due to their unique status being surfaced, I guess).

Spoiler for the interested, and it could be an artifact of the dataset the author used, but one of the commonalities between the least viewed articles is that their subject matter falls into a category that isn't usually eligible for deletion under Wikipedia content guidelines.


The writer suggests that a page about a moth is unlikely to provide an opportunity for pushing contentious views.

The edits show there was disagreement earlier this year about the wingspan of the Scrobipalpula crustaria.

Is it 11-13mm or 10-13mm?

People feel strongly about these things.


Well, finding and then publishing the name of the least viewed wikipedia article would then drive up the view count, nullifying the initial reason for the search!



This is exactly what I had in mind when I was writing this! Except just observing it would not change it, publishing it would.


He's changed the stats by taking a look at the articles to verify things like disambig status, so you'd have to be quite careful how you did your analysis to ensure no reflexivity.

Personally, I think the reflexivity is part of the fun if you do a followup analysis. For example, I recently scraped WP to find 'the first unused acronym on Wikipedia': https://gwern.net/tla - it turns out to be 'CQK', and I'm looking forward to checking back in 10 years or so to see if anyone wound up using 'CQK' for a company or something, precisely because I wrote it up. We'll see!


What was the initial reason then? I'm not following you. Is there a problem?

TBF, he's analyzing a 2021 dataset, and that will of course not be affected.


No, I'm just making an observation that he will destroy the uniqueness of the object by talking about it.


He is talking about its uniqueness in the 2021 dataset, which is time bound and unaffected by current phenomena.


The quest for finding Hapax Legomena on the Internet suffers from a similar problem, only worse. If you find one, announcing its existence destroys it. See: quizzaciously.


But with 6 million plus articles

6.0e6 has been a brute-force number for decades.

A linear search might have taken less time than reading the article. Almost certainly less time than writing it.

It would not have been as entertaining, of course, or as clever.

Good engineering is like that nearly all the time.


Discussed at the time:

In search of the least viewed article on Wikipedia - https://news.ycombinator.com/item?id=31524943 - May 2022 (68 comments)


It really doesn't matter, but why in the world would you do random page selection like that? Maybe that makes it super-easy from a SQL standpoint? But as demonstrated it does a terrible job of weighting pages evenly, when a tiny bit of code could make them completely equally likely.

And further, there are good arguments that they shouldn't be equally likely. Instead, rank articles by size, or popularity, so people are more likely to see articles that are reasonably interesting and full of content.
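For what it's worth, here is a rough sketch of the difference, assuming the scheme under discussion: every page stores a fixed random key in [0, 1), and a request draws a fresh number and returns the first page whose key is at or above it. The wrap-around to the first page and all of the function names are my own assumptions for illustration, not MediaWiki's actual code.

```python
import bisect
import random


def gap_based_pick(sorted_keys, rng):
    """Return the index of the first key >= a fresh draw, wrapping to index 0."""
    r = rng.random()
    i = bisect.bisect_left(sorted_keys, r)
    return i % len(sorted_keys)  # assumption: wrap around past the last key


def uniform_pick(sorted_keys, rng):
    """Return an index with equal probability for every page."""
    return rng.randrange(len(sorted_keys))


def gap_selection_probabilities(sorted_keys):
    """Exact chance of each index under gap_based_pick: the gap before its key."""
    probs = []
    for i, key in enumerate(sorted_keys):
        if i == 0:
            # The first page also absorbs draws that land after the last key.
            gap = key + (1.0 - sorted_keys[-1])
        else:
            gap = key - sorted_keys[i - 1]
        probs.append(gap)
    return probs


if __name__ == "__main__":
    keys = [0.1, 0.2, 0.9]  # hypothetical per-page random keys, sorted
    # Roughly [0.2, 0.1, 0.7]: far from the 1/3 each that uniform_pick gives.
    print(gap_selection_probabilities(keys))
    print(uniform_pick(keys, random.Random()))
```

The uniform version needs a dense index over the pages (or a count plus an offset lookup), which is presumably where the "easy from a SQL standpoint" trade-off comes in.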


In an attempt to help preserve the article's status as "least viewed", I will not read TFA and remain ignorant of this fact.


Amusingly, one of the identified articles was vandalized soon after this article was published: https://en.wikipedia.org/w/index.php?title=Scrobipalpula_cru...


Schrödinger's Least Viewed Article... Just by observing it, it may change state :)


Came here to say the same, but using [Heisenberg's uncertainty principle](https://en.wikipedia.org/wiki/Uncertainty_principle) instead :-)


If you're interested in a dataset that can run vector search filtered by page views, there is a Wikipedia embeddings database available on the Hugging Face Hub.

https://huggingface.co/NeuML/txtai-wikipedia

Example query.

# Find wikipedia pages for a topic in the bottom 5th percentile of page views

SELECT id, text, score, percentile FROM txtai WHERE similar('topic') AND percentile <= 0.05


Definitely not the least-viewed, but I've referenced "List of pasta" to make breathtakingly-lame jokes more than a few times.

https://en.wikipedia.org/wiki/List_of_pasta


Incredible that even the least viewed 2021 articles are still really well done! One has a photo. They have 5 and 6 citations respectively. They both have multiple sections.

As someone who was already impressed by Wikipedia, wow! Talk about toiling in obscurity


I'd be curious if there was a way to find Wikipedia categories with the largest number of stubs, and if that might potentially be an indicator of topics that have the least amount of information on the site.


Can we fetch say, 10 articles after the server-generated random number, and then choose a random article from that? Intuitively, this should even the selection chance, but the math evades me
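A sketch of the math, under the assumption that the existing scheme picks the first page whose stored random key is at or above the drawn number, so each page's chance equals the gap in front of its key. If you then take the next k pages from that point (wrapping around, which is my assumption) and choose one uniformly, page i is reachable from k different starting pages, so its chance becomes the average of the k gaps ending at i. The function name and the example gaps are made up for illustration.

```python
def windowed_probabilities(gaps, k):
    """Chance of each page when picking uniformly among the k pages after the draw.

    gaps[i] is the chance that page i is the first page at or after the draw.
    Page i ends up in the window when the first hit is i, i-1, ..., i-k+1,
    and is then chosen with probability 1/k.
    """
    n = len(gaps)
    k = min(k, n)  # a window larger than the whole wiki is just the whole wiki
    return [sum(gaps[(i - j) % n] for j in range(k)) / k for i in range(n)]


if __name__ == "__main__":
    gaps = [0.2, 0.1, 0.7]  # hypothetical, very uneven gaps
    print(windowed_probabilities(gaps, 1))  # unchanged: the original scheme
    print(windowed_probabilities(gaps, 2))  # smoother, still not equal
    print(windowed_probabilities(gaps, 3))  # window covers everything: equal
```

So the intuition is right in direction: it is a moving average over the gaps, which evens things out, but it only becomes exactly uniform when the window covers every page.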


We could take the HN route of keeping all wikipedia articles in memory at all times, then just take a random sample of the list. In an alternate universe where Lisp dominated, Wikipedia's random article button is in fact implemented this way.


I'm sure those least viewed ones will now quickly become the most viewed ones for some period of time...

The power of slashdot/HN


Paging Dr. Heisenberg


Why does he need to subsample? Seems like it should be feasible to find the minimum of a 6M array
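The scan itself is indeed trivial. A minimal sketch, assuming you already have a complete list of (title, views) pairs for every article, which is the part this does not address; the function name and sample data are hypothetical.

```python
def least_viewed(view_counts):
    """Return (min_views, titles tied at that count) in a single linear pass."""
    min_views = None
    titles = []
    for title, views in view_counts:
        if min_views is None or views < min_views:
            # New minimum: restart the list of tied titles.
            min_views, titles = views, [title]
        elif views == min_views:
            titles.append(title)
    return min_views, titles


if __name__ == "__main__":
    sample = [("A", 5), ("B", 2), ("C", 2), ("D", 9)]  # made-up counts
    print(least_viewed(sample))
```

Keeping the ties matters here, since with millions of pages the minimum is unlikely to be unique.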


Am I going to get on a list for clicking on a few articles about random towns in Iran?


No, you're already on the same list for posting a comment that mentions that country.


Please try not to skew the results. Poke around with Archive.org plz.


it's kinda sad that the least viewed articles are about moths. i really like moths they are like socially-awkward butterflies


Interesting.

What about smallest number of views V, and word count C.

What is smallest V/C?
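If you had both numbers per article, the ranking is a one-liner. A sketch assuming hypothetical (title, views, word_count) tuples; skipping zero-word pages is my own choice to avoid dividing by zero.

```python
def lowest_views_per_word(articles):
    """Return (ratio, title) for the article with the smallest V/C, or None."""
    candidates = [
        (views / words, title)
        for title, views, words in articles
        if words > 0  # assumption: ignore empty pages rather than divide by zero
    ]
    return min(candidates) if candidates else None


if __name__ == "__main__":
    sample = [("Long stub", 10, 1000), ("Short", 5, 50), ("Empty", 3, 0)]
    print(lowest_views_per_word(sample))
```

Note this rewards long, unread articles, so it would surface a different kind of page than the raw view count does.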


Very unexpected Cow Tools reference


ASCII arrow: own work.

LOL


Contributing to Wikipedia has become really not fun. Or maybe it never was. I tried writing an article about Playwright - perhaps the most common test automation tool these days. It first got rejected and now just has been sitting in review state for 3 months.

https://en.wikipedia.org/wiki/Draft:Playwright_(software)

It is highly demotivating to try to write a quality article for Wikipedia because someone can just reject your days of work in seconds and then leave it in draft forever.


I want to point out that the version that got declined [1] definitely did not have any independent references to support notability [2] (and had it been accepted would probably have been nominated for deletion quickly).

The current version looks fine. Unfortunately reviewing drafts is a very thankless job (~90% of drafts are worthless) so there are never enough reviewers (speaking as a Wikipedia admin who used to review a lot of drafts). The backlog is definitely not something people are happy about, but it isn't easy to solve.

Also, having your drafts reviewed is actually not required. Once you have made 10 edits you can move the draft to the main article space yourself (or directly create articles). The reason why brand new users can't directly create articles is that when they used to be able to, a ~third of all new articles ended up being deleted immediately because they were spam/gibberish/vandalism, which ends up both being a lot of work for reviewers, and very discouraging to those new users.

[1] https://en.wikipedia.org/w/index.php?title=Draft:Playwright_...

[2] https://en.wikipedia.org/wiki/Wikipedia:Notability


It looks fine to me; some of the references look a bit obscure, but I wouldn't call them "unreliable". I'd say "Curb Safe Charmer" has been harsh; there are many, many articles that are supported by worse citations, and are less notable, but exist in mainspace.

[Edit] I wonder if Curb Safe Charmer automated his review using something like this?

https://www.nature.com/articles/d41586-023-02894-x


Go ahead and publish it. "Be bold" - https://en.wikipedia.org/wiki/Wikipedia:Be_bold

I've written dozens of articles starting from a few sentences and slowly expanding it over the course of a few days. I simply make sure the topic is notable enough and that every sentence is cited.


Hey I'm a moderately experienced wikipedia editor. And yeah it's a really annoying thing when you don't get recognized for the hard work put into creating a draft like this. When I'm back at my computer maybe I'll take a look and see how I can help improve your draft.

Generally the hardest and most annoying part about a case like this is wikipedia's desire for reliable sources which generally means newspapers or academic sources. You can have something super well known, around forever, and often mentioned but it can be hard to find sources writing about it directly. For example, I wrote the Wikipedia article for high touch (https://en.m.wikipedia.org/wiki/High-touch) and it was quite annoying to try to find sources. Similarly I wrote one for "smell training" (https://en.m.wikipedia.org/wiki/Smell_training) and the requirements are even higher for anything related to medicine.

Anyway, I'm not gonna say I advise you to do this. But if you're really confident that the article meets the notability guidelines you can also create the article in mainspace (instead of draft). If someone tries to delete it then the guidelines are a little different (vs reviewing drafts) because they have to be sure that there aren't sources to make it notable (not just that you didn't use them). Often times people will actually find sources and add them to the article rather than just complain you don't have good enough ones.

PS: I'm sure someone will come and explain why I'm wrong here but this has generally been my experience.


>wikipedia's desire for reliable sources which generally means newspapers or academic sources

Which is one of the main ironies associated with Wikipedia notability. Something or someone written about in some small-town newspaper or an obscure journal that five people have read can be considered more notable than something with a fairly big online footprint but nothing canonical.

For that matter, there's a ton of pre-web information that just doesn't really much exist outside of primary sources.


One more tip is sometimes you can hop on IRC and ask for live feedback :)


Same here; that's why I don't write articles for Wikipedia anymore. It is always the same old clique that decides what is worth it or not. We are losing information on the resource that was supposed to make it available for all. All of that in the name of ego back scratching.


I think FOSS projects always struggle on notability, with deletionists targeting sources that are relatively minor FOSS magazines, even those with actual print runs. FOSS projects rarely have the marketing team to pay or nag to get a mention in a larger trade mag.

What's interesting is this is more a phenomenon on the English wikipedia, so if you're disappointed that a search for information on a FOSS project is turning up nothing on en.wikipedia.org, try just switching en for fr or de (both relatively large) and then using google translate or your browser's built-in translate [firefox] to get the info you need.

Sad. And yeah, I feel deletionists are out of control on english wikipedia.


> Contributing to Wikipedia has become really not fun. Or maybe it never was. I tried writing an article about Playwright - perhaps the most common test automation tool these days. It first got rejected and now just has been sitting in review state for 3 months.

It only gets worse. Try getting into a content dispute sometime.


That sounds terrible.

I don't know how Wikipedia works -- is there not a way to submit a smaller/stub page or something, and then once it passes review for notability, then you fill out the rest?

Does Wikipedia expect people to put in days of work before an article gets approved? I hope that's not the case.


Looking at the profile of your reviewer, it seems that they take pride in contributing to the deletion of several Wikipedia articles. It's possible that your article was just a collateral victim of this editor's penchant for deletion.


WTF. 4 month review backlog? Why is there even a review system at all? This isn't a legal journal, this is Wikipedia. Your article looks great. Playwright is a very notable piece of software, extensively used.

I hate when things get mature and enshittify.


It really sucks, but the reason why review happened is because there is an onslaught of people who show up to write promotional articles, many of them paid by PR agencies and reputation management firms.

The previous system of doing post-publish review of new articles had an even worse backlog because it let a huge volume of shit into the front door.

The root cause of this is Wikipedia’s huge influence in Google SEO and knowledge graph. As the open web is dying, it’s one of the few reliable ways to dump information straight to the top of Google SERPs. When I write new Wikipedia articles it is often indexed to the first page of results in minutes.


> Why is there even a review system at all?

I don't create articles these days; it's too much like hard work.

You used to be able to create an article by making a redlink wikilink somewhere, and then clicking it - that would take you to an editor. The new article would appear in mainspace. Do you have to "volunteer" to appear in mainspace? Are all new articles deemed to be "draft", and subject to review nowadays? Or are some users allowed to create new articles in mainspace, and others not? If so, what are the criteria?


The criteria for creating an article in mainspace is to have made 10 edits and have an account more than 4 days old.


Thanks! That's a pretty low threshold to surmount. @lucgagan: you've done the time. Try editing a few articles - copyediting is enough. Then delete your draft, and create a new mainspace article.


OK, panic over. Nothing to see here. I wondered why I had never seen this feature in any of my article creations.


It still works like it used to, there’s just a requirement of having made ten edits first before you’re allowed to. That’s a fairly low bar, and it gets rid of a ton of spam that used to be created.


My first wikipedia edit was correcting the elevation of the tallest point in one of Seattle's neighborhoods. It was immediately removed. In fact, the entire section about the neighborhood's elevation was removed. So, the whole reason I was at Wikipedia in the first place was removed. I thought briefly about adding it back in, because I thought it was super valuable, but upon looking at the profile of who made the edit and finding a "U mad I deleting your work? lmao" style comment at the top, I noped outta there. Petty edit wars on the internet over trivial nonsense sounded like a dumb way to spend time.


Ended up publishing to mainspace as per the guidance of a few commenters.

https://en.wikipedia.org/wiki/Playwright_(software)

Thanks!



