20% of requests for Wikimedia Commons are for one image of a flower (wikimedia.org)
1388 points by IfOnlyYouKnew on Feb 8, 2021 | 362 comments



At the height of the browser wars I once woke up to Microsoft hotlinking a small button for downloading our software from the MSN homepage. I tried to reach someone there for hours but nobody cared enough to do something about it. The image was small (no more than a few K), but the millions of requests that page got were enough to totally kill our server.

Finally, I replaced the image on there with a 'Netscape Now' button. Within 15 minutes the matter was resolved.


Back around 2002, I had a pdf icon on my website. It got deep linked by a few others but the number one source of traffic came from the website of a lawyer who specialised in intellectual property. There was something on there about how it was illegal to deep link.

I was tempted to replace it with goatse but I think I just changed it to a screenshot of his website saying that it was illegal to deep link.

It soon got changed.


That's a neat example of recursion :)


I honestly don't understand the issue: he linked to a specific page? What's the problem?


I believe they mean the lawyer hotlinked [1] their image so that every visitor to the lawyer’s page would result in an image download from their server.

[1]: https://en.wikipedia.org/wiki/Inline_linking


He deeplinked to an image file, whilst claiming on his website that such an action was illegal.


Is deeplinking the same as hotlinking?


Even though it's not illegal!


One of my friends used the same strategy to block a DDoS from China: just put "Falun Gong" on there and it was resolved instantly.


I remember someone doing that with the goatse picture. The hotlinker was pissed and all sorts of amusing drama ensued.


That was exactly how I learned what goatse was. My MySpace page was all decked out with images that I was hotlinking from some server... The server owner realized this and replaced all the images with Goatse. One day a friend goes "Hey... uh, what's up with your MySpace page... that's pretty gross". So I went to log in: Goatse. Goatse everywhere (gestures with hand). And my eyes were never the same again ಠ_ಠ

Edit: grammar.


That was popular in the early ebay days when you had to host your own images. A friend had someone selling similar items using his image links. So he changed the images to goatse. Problem solved.


The Tribalwar forums did this to CNN after 9/11; CNN had hotlinked one of those images where people were trying to pick out "demon faces" in the smoke.


It wasn't only CNN. A bunch of big news sites linked directly to the image hosted at Tribalwar. It all started with some news video of one of the WTC towers smoking. Someone on the forum screenshotted the video and asked "what is this?" because the smoke produced this weird devil-like formation. That picture got spread around and soon news sites started writing stories saying that Tribalwar had photoshopped the image and that they were evil and making fun of a tragedy, blah blah blah. So basically the news sites were DDoSing Tribalwar and lying about them to make them look bad in their sensationalist articles. The administrators of the forums sent many emails begging them to stop directly linking to the site and it only got worse and worse. Finally they replaced the image with goatse (with text overlaid giving the true story). If I remember correctly the image was viewed by hundreds of thousands of people, maybe more, before it was finally taken down. That was how Tribalwar goatsed the internet. It really was quite legendary.


According to the article, the IPs downloading this image are mostly from India.

So replace it with the Pakistani flag to solve the problem (or start WW3)


> One of my friends used the same strategy to block a DDoS from China: just put "Falun Gong" on there and it was resolved instantly.

...because attackers from China are horrified at the thought of disrupting Falun Gong?


Because it is one of the things that will get you added to the blocklists that form part of the Great Firewall of China.

It won't stop a hacker who is probably bypassing parts of that anyway, but the more casual requests such as those caused by deep linking will generally stop getting through.


We used something like this technique back in the Flash days. Sites would straight up steal your games, so one defense was to have the game grab its sprites from an endpoint on our own server. Thieving sites would get either no graphics or deliberately corrupted graphics.


The old school response, weaponized without being inappropriate.


That's hilarious!

Did they continue to link to your software after that? (I'm curious - what was your software?)


Yes, they did, they actually thought it was quite funny. They even cached the actual download once they realized we wouldn't be able to deal with that either. The software was the first version of the public peer-to-peer webcam software I wrote:

http://web.archive.org/web/20000510010712/http://www.camarad...


Oh my! This is a blast from the past. I was a kid, probably 10 years old or something, and I had a LEGO MovieMaker webcam. I was trying to set it up as a sort of security/monitoring camera for the back door of the small business my parents ran. I remember using this software and supposedly getting it working.

I invited my parents to come see what I had done, and somehow typed the website wrong and ended up on a Spanish-language porn site. I could not hit the back button fast enough. Possibly one of the most embarrassing memories of my childhood.

I have no idea what my parents thought I was up to.


Heh. Hilarious story, thank you! Camarades.com had just about everything, from people being born to people dying and everything in between. It was a pretty honest (sometimes brutally honest) slice of life.

One of the most popular cams for years was of an old man who was extremely ill and rarely moved, but he had a pretty big fan club, and he thought it was quite funny that he was more famous on what eventually became his deathbed than he had ever been while he was still active. After he died his family asked us to remove all the images and close the account, which of course we did. Makes you wonder if all those people wishing him well over the years kept him going a bit longer. What is interesting is that if you did this today I'm pretty sure the jerks would drown out the nice people by a considerable margin; of course there were jerks back then as well, but on the whole the internet seemed to be a much nicer place to hang out than it is today.


Not sure if you're aware, but it's interesting that you mention Lego as the person you're responding to once accidentally bought literally tons of bulk Lego and later designed an automated Lego sorting machine. It's a fun read:

https://jacquesmattheij.com/sorting-two-metric-tons-of-lego/


Haha I know that pain.

When I was a kid I asked my mom to print me out Grand Theft Auto cheats from Gamewinners.com while she was at work.

Somehow I got the address wrong and she wanted to know why I wanted to print out pages and pages from a site dedicated to men cheating on their wives. Got there in the end though and I still have some of those GTA cheats memorised.


Your mom might have another family in the Greater Toronto Area now, just so you are aware!


My then-wife was watching over my shoulder once as I typed something into the address bar. “Freshmeat.net” auto-completed, drawing a suspicious look from her.


Beautiful. The internet was a truly different place back then...


With 100K visitors/day or so we were in the top 30 websites worldwide in 1998. The really big boosts came from the Space Shuttle webcasts and an Yves St. Laurent fashion show webcast from Paris.

Hard to believe now, when a typical blog post will pick up 30K visitors without too much trouble.


I could listen to stories of the Old Net all day.


Enjoy:

https://jacquesmattheij.com/story-behind-wwcom-camaradescom/

And apologies for the non-working images.


Sweet, looking forward to reading.


Serves you right for hotlinking ;-)


Yes, but at least it was my own domain :)

I didn't see that consequence coming when camarades.com shut down. I really should dig up those images and repair the blog but the todo list isn't really getting any shorter on this end.


How is this on the Wayback machine?!


You can click "about this capture" for more information

> Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.


Fun fact: Amazon's home assistant was named after Alexa Internet. Amazon owns Alexa Internet.


It's not named after it; it's just that Amazon is so massive they have to reuse brand names. AWS has exhausted not only the supply of IPv4 addresses but also the supply of three-letter initialisms.


What's an embargo period?


A period during which someone who knows something or has something does not release it to the public.


As a pioneer of "<something> On Internet", do you regret not turning out like Russ Hanneman? ;) (OR DID YOU???!)

https://www.youtube.com/watch?v=BzAdXyPYKQo&ab_channel=yate5...

https://silicon-valley.fandom.com/wiki/Russ_Hanneman

I'm just glad I didn't turn out like Erlich Bachman! (OR DID I???!)

https://www.reddit.com/r/SiliconValleyHBO/comments/4jmlv9/wh...

https://silicon-valley.fandom.com/wiki/Erlich_Bachman


I've finally gotten around to watching this series, and it's disturbing how many moments I've watched were more familiar than they should have been, and how many characters I could instantly put a real name to....


Around...2005? 2006? I discovered someone had deep-linked to an image on work's webserver, where I was admin (being one of the few who knew Linux).

Instead of just outright replacing the image, I set up rules in Apache to check the referer, and if it was our site, serve the correct image. Anyone else, it served up something...questionable.

Problem solved.
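
For anyone curious what that kind of check looks like, here is a minimal sketch of the same idea in Python (the commenter used Apache rewrite rules; the hostname and file names below are made up):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    ALLOWED_HOST = "www.example.com"  # hypothetical: our own site

    class ImageHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            referer = self.headers.get("Referer", "")
            # Serve the real image to our own pages, something else to hotlinkers.
            name = "real.jpg" if ALLOWED_HOST in referer else "questionable.jpg"
            with open(name, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Type", "image/jpeg")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), ImageHandler).serve_forever()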


In the early 2000s a friend of mine was an active seller on ebay. He shot his own product pictures and professionally designed the description pages. Soon enough his content was stolen by competing shops, hotlinking the images from his server.

Ebay didn't care. So obviously the only option was to create a script that would randomly change the images to something unpleasant (think early 2000s rotten internet content).

Good times.


Wikimedia is unique in running some of the most popular websites with open access to almost all systems. As someone who has never been on the inside of FAANG, I found it rather interesting to browse around the backend infrastructure.

See, for example, their statistics at https://grafana.wikimedia.org/d/000000102/production-logging...


Wikimedia's infrastructure is radically different than most FAANG.

In large part because 99% (+/-) of their traffic is read-only. While Facebook and Google have to do heavy workloads for every click and action taken on their services, Wikimedia can cache basically everything, allowing them to operate on a tiny fraction of the number of machines (and infrastructure) that the rest of the players do.


They also have looser latency SLAs. The only hard requirement is that a user can read back their own writes, but it’s okay if other users are served stale data for a few seconds or minutes even. This makes cache invalidation, one of the most notoriously difficult and expensive operations at large scale, much much easier.


Facebook also has a similar SLA. I've heard that at one point in their architecture (~2010), they literally stored the user's own writes in memcached and then merged them back into the page when rendered. You would see a page consistent with your actions, but if you logged into Facebook as any of your friends your updates might not show up until replication lag passed.


Close, IIRC we cached the fact you had just done a write, and a subsequent read request that arrived on the replica region was then proxied to the primary region instead of serviced locally.


My memory is fuzzy now but this dates back to when there were only two datacenter regions and one of them held all the primary DBs (2011 or so). All write endpoints were served in that region, so if a user routed to the secondary region did a write the request was proxied to the primary region. After doing a write a cookie was set for the user in question which caused any future reads to be proxied to the primary region for a few seconds while the DB replication stream (upon which cache invalidation was piggybacked) caught up, because if they went to the secondary region memcached was now stale.
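
A minimal sketch of that pinning pattern, purely to illustrate the idea described above (not Facebook's actual code; the grace window is an assumed number):

    import time

    REPLICATION_GRACE = 10.0  # seconds; assumed replication-lag window
    last_write_at = {}        # user id -> time of their most recent write (the "cookie")

    def record_write(user_id):
        """Writes always go to the primary region; remember when they happened."""
        last_write_at[user_id] = time.time()

    def region_for_read(user_id):
        """Pin a user's reads to the primary until replication has caught up."""
        if time.time() - last_write_at.get(user_id, 0.0) < REPLICATION_GRACE:
            return "primary"  # replica caches may still be stale for this user
        return "replica"      # safe to serve locally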

It hasn’t been this way since around 2013 but again I am fuzzy on how. I think that’s when most such data was switched to TAO, which has local read what you wrote consistency. As long as users landed in the same cluster (and thus TAO cluster) what they wrote was visible to them, even if the DB write hadn’t yet replicated to their region.

FlightTracker postdates my time at FB (ended 2018ish) so I’m not sure how that is used. These systems evolved a lot over time as requirements changed.

I don’t remember anything about writes being batched in memcached and merged in on page load.


Dirty bits at scale


I’m guessing this is ECC memory, so likely correcting for bad data.


I think they meant dirty bit as in “a flag that means update needed,” not as in “bit flipped due to glitch.”


Pretty clever. Is that still how it works?


Pretty sure this paper describes what they're doing now: https://research.fb.com/publications/flighttracker-consisten...


I'm not sure if FlightTracker completely replaced the need for the internal consistency inside Tao. You can read about that here: https://www.usenix.org/system/files/conference/atc13/atc13-b...


Interesting that this sounds very similar to how multiplayer games do it.


This is indeed correct. Wikimedia overall uses less than 2000 bare-metal servers, so yes the infrastructure is tiny compared to those.

What can be interesting, I think, is that you have a completely open infrastructure that has to solve problems on a global traffic scale.

If people are interested in knowing more, I suggest you also take a peek at the wikimedia techblog, specifically to the SRE category https://techblog.wikimedia.org/category/site-reliability-eng... and the performance one https://techblog.wikimedia.org/category/performance/


Search is also largely read-only. The advantage Wikipedia has is that its traffic overwhelmingly goes to the head of the page distribution, so simple caching solutions work very well. Google has a pretty extreme long-tail distribution (~15% of daily queries have never been seen before), and so needs to do a lot of computation per query.


> Google has a pretty extreme long-tail distribution (~15% of daily queries have never been seen before)

Do you have a source for this?

I'd be willing to bet that the ONLY reason why 15% of their daily queries "haven't been seen before" is because they add unneeded complexity like fingerprinting. You're making it seem like they've never seen a query for "cute animals" before when obviously they have. They choose to do a lot of extra legwork because of who you are.

So your claim that 15% of their queries have "never been seen before" is probably inaccurate. I'd be willing to bet that "15% of their queries are unique because of the user, location, or other external factor separate from the query itself."

They've seen your query before. They've just never seen you make this query from this device on this side of town before.


If you took into account user, location, etc. 15% seems too low. I almost never search for the exact same thing twice in the same location.

15% of the queries themselves are unique. https://blog.google/products/search/our-latest-quality-impro...

https://www.google.com/search/howsearchworks/responses/

I work for Google (and used to work on Search).


I'd be interested in seeing how polluted that 15% of new queries is with people blasting malformed URLs or FQDNs into the omnibox of Chrome.


What's so unbelievable about 15%? I personally think it is way lower than I expected. We're clearly not googling in the same way.


I agree with you. Also in my experience less tech-savvy people tend to overcomplicate their queries instead of just entering the relevant keywords which I'm sure accounts for many uniques.


The point is not that. It’s that when you search for “cute animals”, Google shouldn’t be storing that you searched for that, or even care. Your location is arguably relevant, but it could be made coarse enough (except when searching for directions) to allow at least some caching.


Hey Igor! Hate to be a bore, but I wanted to provide feedback that your comment may unintentionally come across as aggressive. OP has pretty relevant work experience that I know I’d love to hear more about, but there’s not really any room for them to respond.

I know many folks IRL who work at big tech who have no interest in posting here because the community comes off as very unwelcoming. That’s a shame, because they have insight that would be great to hear. Regardless of anyone’s opinion of their employer.

Apologies in advance if your intent was purely about the topic. I just thought I read something in your tone that might hinder discourse rather than encourage it. I wanted to point it out, in case it was unintentional.


Agreed about the tone. The comment could have been less argumentative — instead of "that's not the point," they could have said "that's not the only reason."

On the other hand, if I'm not responding, it's not because I find HN too abrasive — it's because I am afraid of leaking non-public information. That's why whenever I talk about Google, I try to cite a Google blog post or other authoritative source, or talk about my own personal experience; hence, "I rarely search for the same query twice."


I’m gonna have to disagree with the negative comments above concerning Igor's tone. He made his point with clear, respectful language that I would be happy to entertain at work, at the bar, at worship, or while on a (previous to covid) group run or golf outing. So, to me, it looks like instead of an ‘agree to disagree’ while respecting each other, you disrespect Igor by dismissing his arguments due to his tone, which handily allows you to ignore his content, such as it is. Therefore, in my judgement, you guys are being unfair to Igor while also being disingenuous about your reason for policing his tone.


> disrespect Igor by dismissing his arguments due to his tone, which handily allows you to ignore his content, such as it is

I didn't dismiss his argument; I said that he was correct right after he posted: https://news.ycombinator.com/item?id=26073488

"That's not the point" can be interpreted as respectful, but it also can be interpreted as argumentative. I chose to assume good intentions, but I offered a different phrasing that would have a higher chance of not being misinterpreted: i.e. using "yes and" instead of "no but": https://www.theheretic.org/2017/yes-and-vs-no-but/


I apologize for the tone. The start of my comment was clumsily worded and it wasn’t my intention to have it come off as argumentative. The way I read the GP comment to mine was talking about how Google’s tracking of its users’ telemetry was what was contributing to the uniqueness of requests. Your comment to me boiled down to the fact that of course most requests are unique because of tracking location data and the user account. There seemed to be a disconnect because your comment took for granted that user location and account were a part of the search query while the person you were replying to specifically challenged that notion (again, in my reading of both). I tried to post a concise bridge between the two concepts, and of course we all see how well I did with that :)

Having said that, I do think this is clearly a sensitive issue, not a purely technical one. I can appreciate the nuance of working for Google and doing excellent work while seeing the company criticized left and right for its business model. I think given the community, while there is opposition to how Google may at certain points conduct itself as a corporation, there is no lack of respect for any individual working there. I certainly view my comment and the discussion of privacy as having 50% to do with Google’s strategy and 50% to do with the technical aspects of whether you can build a search engine that holds user privacy as a core priority, rather than trying to launch an ad hominem on you or anyone. And I saw your other comment that agreed with me and the GP comment, so I think, my first sentence aside, we are on the same page :)


Thanks for responding constructively, Igor.


To me, Igor's comment is also misplaced. He injects activism into a technical discussion (sadly this happens very often here on HN). We all know by now that the bigcorps are to a large degree based on data collection. We do not need to be reminded about it each and every day. We are adults; if we don't like it we use alternatives.


Yeah, this is a fair point. My larger point was mostly that HN misses out on some valuable comments by insiders because those people are disincentivized by some of the rhetoric and tone when an article on big tech is popular. I didn’t think the comment I replied to was particularly aggressive - it was just something that came to mind when I read it. OP was actually very kind and constructive in their response - a good ending and constructive discussion for us all!


This is right on the money — getting search results for queries that are too personalized to e.g. location means that you can't cache those search results (or if you did cache them, their entries would be useless).


Right, you can cache that query. That doesn't mean that you can cache "two bunnies playing in the snow r/aww reddit".


I think you mean "15% seems too high". Any easy way to think about this is the following: even if search the entire internet you will almost never see the same sentence twice, assuming it's has a certain number of words. There is a combinatorial explosion in possible sentences to write. Search queries are essentially just sentences without stopwords.


Removing stop words is what old school users of IT systems do, because that's what we learned worked best at the time.

Internet users who came online later, from GenZ to many boomers, will often just write conversational sentences and questions.


I don't understand how you compute that estimate.

I doubt you store the history of all searches ever? People don't need a google account to query the engine, others disable history, etc.

Are you saying you still have all searches ever made ever? Because you would need this to say a query hasn't been made before wouldn't you?


Why would you not store every search ever? It's only a few petabytes, and you can find out all sorts of useful info from it.


I don't know how they did it but I suspect that it wouldn't be very hard to model the distribution by sampling a few million queries and extrapolate from that.


You'd only need to store the list of unique searches, but even if that's true and the 15% number is true, that must be a huge amount of data.


https://blog.google/products/search/our-latest-quality-impro...

"There are trillions of searches on Google every year. In fact, 15 percent of searches we see every day are new"


It would still be helpful to know what ‘new’ means.

Does it mean literally the text string typed into the box by the user is new?

Or does it mean the text string combined with a bunch of other inferred parameters we don’t know about is new?


At the lower bound, that's 150 billion "new" searches per year. There are approximately 50,000 unique English words, not including names or misspellings. If Google searches were on average for 3 words, it would take 833 years at that rate to go through all the combinations.

Alternatively, if we assume that Google has already recorded 20 trillion unique search queries (~1 trillion new ones per year for 20 years), the odds that a query composed of 3 correctly spelled English words that are not names has been seen before is 1.6%. Even if we restrict queries to those using the most common 1000 words, there's a 50/50 chance of a query composed of 4 words being unique.

Of course people do not just type random words into the search bar and some terms will be searched many thousands or even millions of times, but still if anything the fact that 85% of searches aren't unique seems surprising.
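
A quick back-of-envelope check of the first figure, using only the numbers already assumed in this comment:

    words = 50_000            # rough count of English words, per the comment
    new_per_year = 150e9      # 15% of ~1 trillion searches a year
    combos = words ** 3       # three-word queries, order matters, repeats allowed
    print(combos / new_per_year)  # ~833 years to exhaust every combination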


> Does it mean literally the text string typed into the box by the user is new?

I guess that could be the case. Many could be related to things that are in the news. Like, 'the cw powerpuff girls' for the new show that was announced. No one was searching for that until the announcement, probably.


Right - but without clarification from Google, we really don’t know what it means.


New for the day or new for the history of the engine?


I think GP post has a point. I've noticed people use Google really differently from how I do. E.g. I would go search for "figure concave" while my brother would search a longer phrase.

Also, speaking of people's behaviour, it would not make sense to search every day for "cute animals", but the volume of searches done for new things people discover as they get older would make more sense. I mean, just look at search trends for things like "hydroxychloroquine" for example (and that's not to mention people who get it wrong, i.e. other factors for differing search queries too).

Also, other languages can change the queries depending on how you phrase the sentence too. Add to that the people using other ways to search instead of just visiting google.com and I think you can get pretty close to 10%.

If fingerprinting is the reason, 15% would be too low a figure, I surmise. Were that the case, I think that would make it probably 20-25% of searches rather than 15%.

It could very well be that they do classify fingerprinted search differently only in some countries and not others? That would/might explain the 15% figure.

I might be wrong and under-estimated fingerprinting techniques for Google. If they have really good fingerprinting techniques, that would reduce the estimate I have in mind to a better number (close to 15, maybe?)


So consider your hydroxychloroquine example again this way;

Nobody has ever searched for hydroxychloroquine before today. Today is the day the word is hypothetically invented. Today 2 million people will search for hydroxychloroquine. But only one of them was the first to do it.

What I know about pop-culture and viral internet culture is telling me that 15% of 1 trillion searches being unique is shady math.

So I am not fully convinced that the 15% claim is completely transparent.


It's a guess, but my thinking is that previously most people who searched for the term hydroxychloroquine were mainly scientists and other people related to the field, not the general population. Suddenly covid happens and now large numbers of people learn about this new drug they had never heard of before, and they are gonna search, I presume, for wildly different things like: "how does it work?" "does it cause some disease?" "insert something political here about hydroxychloroquine" "did aliens make hydroxychloroquine?" and many more things I lack the imagination to come up with, and that's only about hydroxychloroquine. I doubt the 15% number is about single-word cases; it's more about combinations of words, and that seems reasonable. Inventing new words daily seems unlikely; chaining them, on the other hand, seems plausible.


The vast majority of people don't search for [hydroxychloroquine]. They search for [Is hydroxychloroquine effective in treating COVID-19?] or [What is the first drug that was approved to treat COVID-19?] or [What methods do we currently have to treat COVID-19?]. You can see these on the search results page as the "Common questions related to..." widget. How else do you think Google gets that data?

The folks who use keyword-based searches are largely those who got on the Internet before ~2007. Tech-savvy, relatively well-off, usually Millennial or Gen-X, plugged into trends. This happens to be the demographic dominant at Hacker News. But there's a much larger demographic who just types in whatever they're thinking of, in natural language, and expects to get answers.

Come to think of it, this is also the demographic that doesn't use tabbed browsing, and uses whichever browser ships with their OEM, and often doesn't realize that there's a separate program called a "browser" running when they click on the "Internet", and issues a Google Search for [google] (#3 query in 2010) when they want to get to Google even though they're on Google already but don't realize it, and doesn't know what a URL is. When a big-tech company makes a brain-dead usability decision you don't like, first consider how that usability choice might appear to your grandmother and it might not seem so brain-dead.


> So your claim that 15% of their queries have "never been seen before" is probably inaccurate.

I'm not sure, on my productive days maybe >50% of my Google searches are not very cachable. (for example, I just googled "htop namespace", "htop novel bytes", "htop pss", "htop nightly build ubuntu 14.04")


https://blog.google/products/search/our-latest-quality-impro...

They briefly mention the statistic in the last paragraph.


It's somewhat analogous to the claim that almost every spoken sentence had never been spoken before in the history of language.


Well, you'd be wrong


Meh, it happens.


Wikimedia also has less incentive/drive to meticulously track every interaction on their pages. The level of tracking present on Facebook and Google has to be extremely computationally intensive.


I agree. Another (no contradiction) way of looking at this is that Wikimedia infrastructure is radically different because Wikimedia is radically different.

They need it to be a certain way in order to operate. The limitations and advantages of how software gets made. Why it gets made. The way the software works. How and why product decisions were made over the last 2 decades. What resources they have/had available. It's all a totally different game. Not surprising that different soil and a different climate grow different plants.

One of Google's early coups, when they were a strategic step ahead of the boomers, was bankrolling Gmail, YouTube and such. Gmail offered free giant inboxes. They got all the customers. This cost billions (maybe 100s of millions), but storage costs go down every year while the value of ads/data/lock-in and such goes up every year. Similar logic for YouTube: (1) buy a leading video-sharing site; (2) bankroll HD streaming because you have the deepest pockets; (3) own online free TV entirely.

That's who Google is, good or bad. How funding works. What products get built. What infrastructure is necessary, possible, affordable. All interlinked. Wikipedia & Google were founded at the same time. Within 5 years (circa 2006) Google was buying charters and fiefdoms. Wikimedia, meanwhile, was starting to take flak for raising 3 or 4 million in donations.

It's kinda crazy that Wikipedia is comparable in scale to FAANGs when you consider these disparities.


You can do quite a bit of processing per page load without issue. Facebook and Google just take it rather past that point into near absurdity, while still being highly profitable.


To be fair, there's a bit of a combinatoric effect of scale * features going on there. I'm sure you could build most of a Facebook equiv. 100x-1000x cheaper if it only served one city instead of the whole planet.


The effects of scale are less combinatoric than you might think. Most people on my Facebook feed are from the same city anyway, even though Facebook is global.


The effects and scale of sales (ads) are very combinatoric, though.


Yeah why do they keep spending billions to build new datacenters when they could just stop being absurd instead?

The contempt on here is crazy sometimes.


The idea of marginal value/marginal cost is that companies will generally continue spending one billion dollars to add size and complexity, as long as they get back a bit more than a billion dollars in revenue.

So it wouldn't necessarily be contradictory if most of their core functionality could be replicated very simply, yet the actual product is immensely complicated. I forget where I first read this point, but probably on HN.


Or maybe you're just reading too much into "absurd" which can just be a colorful word for "an extremely huge amount"


I don't think that Facebook/Google developers are foolish or incompetent. That would be contempt. Instead, I think that Facebook and Google as conglomerate entities are fundamentally opposed to my right to privacy. That they make decisions to rationally follow their self-interest does not excuse the absurd lengths to which they go to stalk the general population's activities.


> I don't think that Facebook/Google developers are foolish or incompetent.

Nobody in this thread is saying that. Parent to you said:

> they could just stop being absurd instead [of building more DCs]

implying FB could build fewer DCs by scaling down some of their per-page complexity/"absurdity". Basically saying their needs are artificial, or born of requirements that aren't real.

> conglomerate entities are fundamentally opposed to my right to privacy

That's a common view, but it's not on topic to this thread. This thread is mostly about the tech itself and how WikiMedia scales versus how the bigger techs scale. It has an interesting diversion into some of the reasons why their scaling needs are different.

You could instead continue the thread stating that they could save a lot of money and complexity while also tearing down some of their reputation for being slow and privacy-hostile by removing some of the very features these DCs support (perhaps) without ruining the net bottom line.

This continues the thread and allows the conversation to continue to what the ROI actually is on the sort of complexity that benefits the company but not the user.


I was the one saying absurdity and I think you’re missing the context. Work out how much processing power is worth even just another 1 cent per thousand page loads and perfectly rational behavior starts to look crazy to the little guys.

Let’s suppose the Facebook cluster spends the equivalent of 1 full second of 1 full CPU core per request. That’s a lot of processing power, and for most small-scale architectures it would likely add wildly unacceptable latency per page load. Further, as small-scale traffic is very spiky, even low-traffic sites would be expensive to host, making it a ludicrous amount of processing power.

However, Google has enough traffic to smooth things out, it’s splitting that across multiple computers, and much of it happens after the request so latency isn’t an issue, and it isn’t paying retail so processing power is little more than just hardware costs and electricity. Estimate the rough order of magnitude they’re paying for 1 second of 1 core per request and it’s cheap enough to be a rounding error.
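
A rough, loudly-assumed back-of-envelope version of that estimate (the per-core-hour cost and yearly search volume are illustrative assumptions, not quoted figures):

    searches_per_year = 2e12     # "trillions" per the Google post cited earlier
    cpu_seconds_each = 1.0       # the hypothetical figure from this comment
    cost_per_core_hour = 0.01    # assumed wholesale cost in USD

    core_hours = searches_per_year * cpu_seconds_each / 3600
    print(f"{core_hours:.1e} core-hours, ${core_hours * cost_per_core_hour:,.0f} per year")
    # ~5.6e8 core-hours, i.e. single-digit millions of dollars at these assumptions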


Every request at FB is handled in a new container. This isn’t absurd, it’s actually pretty neat :)

Edit: I don’t know what I’m talking about. Happy Monday!


What? Are you calling the context of a HHVM request a container just to confuse people?

Also, there's way more than just the web tier out there.


Wasn’t my intention to confuse, just repeating something I’ve been told by FB folks.

Everyone, please listen to Rachel and never ever me.


Wow that sounds interesting, does anyone know if this is true?


I'm not on the team that handles this, but I highly doubt that this is the case.


is not neat... is freakish


Does it hurt their caching if you browse Wikipedia when signed in?

I recall reading HackerNews used to have that problem, unsure if it still does.


Looking at the source of a Wikipedia page it has my username appearing 6 times so I guess it must reduce caching a bit. Though I guess they could cache the user info bits and the rest of the page and just splice them together.


In interviews with Jimmy Wales, he seems somewhat regretful of not having made Wikipedia a for-profit. At the least, he's fairly adamant that Wikipedia could have been Wikipedia as a for-profit.

The way he structured wikipedia, from back-end infrastructure to ownership/governance structure was just the logical way of doing the project. Times were different. Online culture was different.

I don't want to overinterpret the man, or put words in his mouth... but... I got the impression that Wales thinks that if he was starting Wikipedia now, he'd just do it as a startup and also succeed.

To me, this is almost sad. Besides being an awesome encyclopedia, Wikipedia is an existence proof for something of scale outside the norm. Something that isn't a corporation. A lot of things are determined by the structure of an organization.

For example, take the current postpostmodern war over truth and stuff: platforming/deplatforming, freedom of speech, censorship, bias, manipulation, narrative = power issues, etc. Wikipedia is at the very centre of all these problems. Whatever difficulties Twitter is experiencing should be 100X worse for Wikipedia. Meanwhile, Wikipedia is withstanding them far better, and with far more integrity. I don't think this is a coincidence.

Dunking on Wikipedia's budget/spending is popular. Meanwhile, Wikipedia uses <1% of the resources/budget of Twitter. They are operating at >100X efficiency compared to a realistic for-profit equivalent. That's a flying shuttle.

We know that Wikipedia, Linux & the World Wide Web are possible because they exist. We literally wouldn't know otherwise. Theory couldn't have gotten us to this knowledge. Each is existence proof for other ways of doing things. They aren't necessarily roadmaps, but I'm a big believer in existence proofs. What Jimmy made is 100X better, more important and non-inevitable than what Zuck made. The thought that he wants to be Zuck bums me out.


It would succeed the same way Quora does. Much less open, much less universal, much more user-hostile, with an almost aggressive way of dealing with logged-out users.

In terms of financial and organisational success it would probably largely beat what it is now. In terms of benefit to humanity, it would be much worse.

Company + for-profit + laws means access to information has to be much more tailored to the laws of each place. "Let's remove the Tiananmen article or lose your Chinese license" kind of things.

I for one am glad for the current Wikipedia we have, despite its numerous flaws. I still donate every year, although I wish Wales would stop having it spend its money the same way a startup or FAANG does.


That's one option, though I wouldn't necessarily use Quora as a mainline example. They're kind of a $gme for rich people. I think highly enough of Jimmy to bet on him doing a much better job than that.

Stackoverflow is a decent example. Very capable founding team. They explicitly tried to be like a commercial wikimedia. They do embrace quite a lot of openness, notably creative commons... learning from wikimedia successes.

RE "I wish Wales would:" Another consequence for how wikipedia is structured is that Wales isn't the Zuckerberg of Wikimedia. Power is a lot more dispersed.

RE spending/flaws and such: I feel like wikimedia is held to an extremely unfair standard. Who/what should we compare them to?

Wikimedia spends $70m per year. This is probably less than Quora or Stack Exchange. FB & Twitter (IMO more comparable in terms of scale/importance) spend $55bn & $3bn. Twitter spends 45X more than Wikimedia; Facebook spends almost 1,000X as much. The bang-for-buck is insane.

Also in terms of flaws in rules/judgement calls. A lot of people are highly critical of wikipedia's "deletionism" related MOs. What articles/edits stay in. How good the rules & procedures are for this. What "camp" has power, and how they treat the other camp. I get that this stuff is contentious.

Meanwhile on Twitter or Facebook, the rule is "I decide." "But it gets us clicks" is the killer argument. Nothing is transparent. Wikimedia is doing a much better job, respecting user & editor rights far more, being a lot less self righteous. Of course it's not perfect, but come on. The "norm" is Facebook's content policy, Twitter's safety department, or Apple's App store approval room. Wikimedia is the one example of being better than that... and for that everyone is always yelling at them.


Quora IMO is a horrible website and I rarely find actually good advice on it. At this point I actively avoid clicking on its links because of how aggressive they are towards non-logged-in users.


Yeah, the Web was quite impressive (though we already had the Minitel), but it was Wikipedia that really blew my mind (even though we already had Encarta). (In fact I consider Wikipedia to be the Web's "killer app", even more than Google and other search engines were.)


Out of all the "killer apps" for the web... wikipedia is the one that implement the www most faithfully. Hypertext articles. Most apps got the web to do x. Wikipedia is what it was made to do.


Yeah. I was about to add that it pretty much has been Tim Berners-Lee's vision coming to fruition, but the fact that Wikipedia is centralized has stopped me. But then isn't the Web itself technically 'centralized' on the Internet ? And isn't Wikipedia a great example of pseudonymous strangers (= social decentralization) collaborating with each other ?


Is it really that centralized? Citations and footnote links are a pretty important part of what wikipedia is. I mean, I very rarely click through to see source material, but when I do it's noticeable how much more powerful wikipedia is than a standard encyclopedia.


I can also imagine that he'd say that just because it makes him look/feel better, i.e. it's more of a sacrifice if he gave it for free while he also could've been a billionaire, than if this was the only way Wikipedia could ever have been a success.

Then again, WikiTribune was a for-profit.


I have, and this is still fascinating. Got any more links you'd suggest?



Just added this comment on the issue:

Hi all, I've been doing a bit of research into possible apps that could be causing this and found two potential culprits that I am currently investigating.

The first is Mitron TV, an Indian TikTok alternative which was made available again on the app store June 6th (https://indianexpress.com/article/technology/tech-news-techn...).

The second is Say Namaste, an Indian Zoom alternative which was launched on the app stores June 9th (https://indianexpress.com/article/technology/tech-news-techn...).

Both fall into the timeline of the huge increases, have millions of users, and may be using '1280px-AsterNovi-belgii-flower-1mb.jpg' to check the user's internet connection - especially Say Namaste, to ensure video connectivity. I've reached out to some developers at both companies and will report back. Let me know your thoughts.

EDIT: I have also noticed the dates match the reopening after lockdown for the whole of India: "This first phase of reopening was termed as "Unlock 1.0"[13] and permitted shopping malls, religious places, hotels and restaurants to reopen from *8 June*." (https://en.wikipedia.org/wiki/COVID-19_lockdown_in_India#Unl... )

Tom


Based on this, I just reversed both Android apps and am not seeing strings related to wikimedia nor asternovi. This doesn't mean it's not obfuscated somehow though. The only app I've found the strings in so far is the "ravn" app proposed by @taviso. As mentioned in the twitter thread though it doesn't seem to have the install base to cause this traffic--


Thanks batch12. As I noted in my edit, it could also be related to a check-in app used at public spaces in India, as the traffic increases from the 8th of June, which matches when the India-wide lockdown began to lift. Perhaps reversing QR-code check-in apps used in India could be useful?


Could be-- I checked about 50 apps from alternative lists that popped up after the ban with no luck except for that one I mentioned before.

Looks like they posted shortly after yours on the ticket that they found the culprit. Guess we'll find out tomorrow if we were on the right path.


Yeah hopefully they have a bit of a write up too about how they worked it out - interesting problem to solve!


I took a look at the APK and noticed this in the manifest: "com.blockeq.stellarwallet.WalletApplication". Stellar Lumens is a fairly popular cryptocurrency. I wonder if the app has built-in support for crypto transactions. If not, maybe it's malware to mine crypto coins.

https://i.imgur.com/o8DllVd.png


It is a crypto chat application:

>Ravn is your portal to the most private messenger as well as Korrax our proprietary token. Stay up to date with Korrax and other Cryptos and join the crypto group chats.

>Messages, images and docs are never stored on a server (after delivery), they’re only locally stored on your own phone. Ravn is not tied to your phone number or email, you only sign up with a username that isn’t searchable or discoverable.


Stupid question: how did you reverse the app in Android Studio?


I downloaded the APK and then used "Profile or Debug APK" under file in Android Studio and ctrl/cmd+shift+f to search for strings.

I don't know much about Android development or APKs, but it's not exactly "reversing." From what I understand, the profile/debug converts the .dex files from the APK to .smali, which is human-readable.
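
For what it's worth, the crude string-scan part can also be done without Android Studio, since an APK is just a zip archive. A rough sketch (the filename is hypothetical, and obfuscated or encrypted strings will not show up):

    import zipfile

    NEEDLES = [b"wikimedia", b"AsterNovi"]  # strings of interest from the ticket

    def scan_apk(path):
        """Naive byte search across every file packed inside the APK."""
        with zipfile.ZipFile(path) as apk:
            for name in apk.namelist():
                data = apk.read(name)
                for needle in NEEDLES:
                    if needle in data:
                        print(f"{needle.decode()} found in {name}")

    scan_apk("suspect.apk")  # hypothetical filename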


You can use the "Analyse APK" feature, but you probably rather want to use tools like jadx or apktool that provides fairly good decompilation.


As far as I know, this is also an image commonly used in machine learning tutorials for image classification of species of flowers. I don't know if the tutorials use the MediaWiki source directly, though. I do recognize this image; I think it's in the scikit-learn O'Reilly book.


I had some random images on a web server years ago - and noticed that something like 99% of my traffic was one image - and searching through referers I realized I was the #1 hit on Google Images for robot attack cat.

Simpler times.


I had a similar issue. Some 15+ years ago, an image from my blog showed up for people who searched the phrase 'Peanubutter Sex'. The image had nothing to do with peanut butter or with sex. My blog is SFW. It was some screenshot of KDE IIRC.

For almost a week it remained the most requested image, and the post on which it appeared the most popular.

It did make me uncomfortable, though, fearing that my rankings would plummet or something.

My takeaway is nothing new: there are weirdos online.


So where did the connection between "Peanubutter Sex" and your blog come from? Did you ever find out?


I did not.

My blog did have the word "Peanutbutter" on one or two posts. And the word "sex" on another. Maybe at some point both words showed up close to one another when experimenting with some "random articles" sidebar or some "you may also like" list.


Can we see the image? :D


https://imgur.com/8MMET5V - now there are companies that can host things FOR my servants!


Does anyone know how imgur is able to afford that?


Low cost CDNs like Cloudflare.


Imgur uses Fastly, which I don't think can be classified as low cost.


ads


Even though ad-blockers are ever more common? (I didn't even think about ads, since I never saw them on imgur!)


They have an app which is ad-supported, and they are trying to become a social network like reddit by requiring you to log in for more features, and adding upvotes and other interactivity.


It's the same way Google pulled in $180B last year and Facebook made $86B.


AFAIK both of these are using methods that are either deemed acceptable by adblockers (like how Google was using plain blue/black text instead of stroboscopic GIF banners), and/or adblockers have trouble with them because they come from the same source?


Yes, but only 10 million times.


You've got to have hotlink protection on when you are hosting memes. I've learned this the hard way, too.


Remember when some sites would send you a shock image instead of the one you were expecting if it detected hot linking? I don't miss that.


There's a site that's occasionally posted in comments here which is apparently run by someone who hates HN because they serve some image I can't remember to readers from here.


Yeah it was posted not long ago. When it sees HN as the referrer it shows a picture of a testicle in an egg cup.


The "traditional" way of fixing this would be a goatse.cx redirect of the image.

I'm sure there is a more enlightened fix.


...or sending that image[1] jwz sends back upon detecting HN in the referer. I bet they'll find the app in a matter of hours, or at least reduce the traffic drastically.

1. https://www.jwz.org NSFW!


Just learned that this person owns DNA Lounge (and pizza?), and is a founder (early contributor?) of Netscape and Mozilla.org. I've lived and worked in that particular area of SF for years and haven't known this.


One of my company's clients has a beautiful office right above DNA Lounge (well, across the street or just adjacent - it's been a while and I've only been there once). They told me they can hear sound checks from their rooftop patio.


Also, jwz is responsible for xscreensaver.


Netscape used to display a spinning compass when you put about:jwz in the address bar

other good ones were about:1994 and about:mozilla

hey, about:mozilla still works in firefox


about:robots also works in Firefox, I know it's been there for a long time but I have no idea if it was ever in Netscape.


about:robots is from the early Firefox releases. Pretty sure it is from Firefox 3.0 development, as you can find the same robot in images when searching for Firefox Gran Paradiso Robot.

https://www.google.com/search?q=firefox+gran+paradiso+robot&...


there used to be Linux-based public terminals in DNA Lounge too, IIRC


This makes me wonder why the hell referer headers are still sent by major browsers, especially to third parties. I can’t think of a single reason that benefits the user.


Originally it probably just sounded like a cool feature to see what blog linked to you. Now it's been around for so long that so much has been programmed to actually use it. If you turn it off you get every anti-bot script blowing up on you.

I think browsers did drop the path from it at least.


For one thing, examining referer is a common way that a server determines a request is not a hotlink. Sure you can do something more complicated with cookies or whatever, but lots of sites are just using referer and they'll break if the client doesn't send it.


But for that it's enough to send it for same-origin requests. No need to send it cross-origin, except for tracking purposes.


That'd still break the distinction between hotlinking and the user using a bookmark or copy/paste to directly open the URL in question.


Letting the sites distinguish between the two does not seem to be in the interest of the user.


Well, it'd mean that any site blocking hotlinking would also automatically block direct bookmarks/URL entry, too, which isn't really in the "interest of the user" either, I'd say.


If Chrome suddenly stopped sending referrer headers, let's be real here, 99% of websites would be fixed within a couple of days at most.


if you are making any sort of content or running a website, it is really useful to know how people found you.


All I get is a scrolling hex editor looking thing. Maybe that redirect has been disabled?


You aren't sending a referrer header (a good thing).


Try from a new profile or incognito.

I saw the described image, but after I visited the site directly I couldn't see it any more when redirected via Hacker News. Saw it again when I opened an incognito tab.


Yep, jwz has had a change of heart and sees today's HN as a born again breath of fresh air.


I’m seeing the nut sundae on iOS mobile so I wouldn’t get too happy yet...


For those reticent to click on their work computers but morbidly curious, can someone describe the image?


It's a motivational-poster-type image with a white egg holder in the foreground, but instead of an egg, it's holding one exquisitely detailed hairy, caucasian ball[1]. At the top, the title is "HACKER NEWS" and the bottom text is "A DDoS OF FINANCE-OBSESSED MAN-CHILDREN AND BROGRAMMERS"

1. Is there a collective biological term for the scrotum and its contents that is not general like "genitals" is?


I think he's the only one that uses that? Barely even worth mentioning in comparison.


A permanent redirect to a non-image page (owned by Wikimedia) may achieve the same thing. Either the calling system can't support an HTML response, or it's a webview, in which case you could either report an error or provide a notice. Maybe even ask for donations :)


Or just downsample the image to a reasonable size and deal with it. Nothing inherently wrong with having a popular image.


Yes, there is when you are hotlinking. Hotlinking in general is considered theft: you are using someone else's bandwidth and could even DDoS the host if you are not caching the response.


> Hotlinking in general is considered theft

This is a pretty puzzling idea to me. How could linking something be theft?

To explore this, I shall try a metaphor. Imagine you're on a big social media website (let's call it Programmer Olds) which has an oddity in that 99% of its users use adblock. You then post a link to another small (ad-supported) website on your Programmer Olds page, causing a large number of people to click through and download the page, using large amounts of bandwidth (for no monetary gain to the site) and possibly DDoSing the site.

Have you committed theft?


> This is a pretty puzzling idea to me.

That's because you're responding to an entirely different issue. "Hotlinking" isn't linking to something, it's including a resource that is hosted elsewhere. It's putting <img src="https://concordDance.whatever/images/big_image.jpg"> on my website without asking you. Now if my site ends up on the front page of HN, that could cause a lot of traffic to your site, potentially overwhelming your server or increasing your hosting bill. It's not nice, and rightfully frowned upon.


But from a loss and gain perspective it seems equivalent.

In both cases the site loses bandwidth for no gain due to your actions.


> causing a large number of people to click through and download the page using large amounts of bandwidth (for no monetary gain to the site)

The difference here is that while a lot of users use adblock, there are some that don't. These users can still see the ads. Additionally even though it's a small website, it may lead to new readers that stick around or the content itself may even be sponsored.

The equivalent to hotlinking a picture would be like taking the content of a blog post without really linking to the source, because there's no chance of conversions there. If you're linking to the site itself then there's a reasonable chance that users can convert.

So I suggest that it's theft just because the chance of readers being converted is nil while you're using their bandwidth.


Let's say I own a restaurant. Someone comes in and wants a panini. I don't have a panini press, but the restaurant next door does.

If I tell the customer they can go next door to get a panini, I'm not stealing anything. Maybe that restaurant is packed right now and they'd rather not have an extra customer, but there is a reasonable expectation that they would generally want customers, or at least have a means of turning away unwanted customers otherwise.

On the other hand if I break into my neighbor's restaurant, make a panini, then bring it back to my restaurant to serve and make money off of, all without permission from the neighbor, I am most definitely stealing. Even if I doubt the neighbor will mind because he let me come over and make myself a panini once, I can't unilaterally act off that assumption.


Is adblock a form of theft?


No, it's not universally considered theft. Wikimedia explicitly permits hotlinking[0]. So do xkcd, imgur and tons of other sites.

Of course, if someone doesn't want us to hotlink their assets, then we shouldn't do it.

[0] https://commons.wikimedia.org/wiki/Commons:Reusing_content_o...


it's so easy to mitigate, though, that the fact that one doesn't sorta implies that one might want randos from the internet to use one's resources to view this image.

it's not theft if you leave it out for everyone to use.


My garden doesn't have a fence, doesn't mean you can host your picnic here.


No, but if I wander into your garden and "injure" myself, I can sue you for damages. You can be held negligent for not properly preventing other people from injuring themselves on your property.


Wikimedia has a User-Agent policy which is being violated here. Hence this is the property owner putting up a sign that says "risk of injury", so if you walk in and injure yourself, you only have to blame yourself for being negligent.


The policy describes how Wikimedia will act when encountering clients with certain user-agent headers; it's not a rule for the clients.


It's a policy for how Wikimedia acts when clients lack a User-Agent header; it's therefore effectively a rule for clients, as without a proper UA header they may be blocked indefinitely.


Is this something real (in US, most probably)?


Yes, you can sue anyone for anything. Your suit probably won't prevail, unless you have access to very expensive lawyers and your opponent doesn't.

But you can totally sue anyone for anything, and that makes for entertaining headlines - even if the plaintiff promptly loses.


The problem of course is that the "victim" has a lawyer operating on contingency, whereas you have to pay your own legal costs and generally cannot recover them.


In France (at least), all swimming pools have to be protected by a fence. If you own a pool and don't put a fence around it, you can be held responsible for a child drowning in it.

It is possible this principle applies to other countries and other things than pools.


Here in Russia, if you leave poisonous chemicals like methanol, etc., unmarked, or put a bear trap in your locked house behind a locked fence with a generic warning sign, and then someone dies or gets injured by these, chances are you will go to jail. Idk if this applies to accidental traps like pools or rakes in the grass. Same for taking a knife out of an attacker's hand and stabbing them back. (Yes, our laws protect criminals better than citizens, not joking.)


Interesting. So if I understand this correctly, if someone breaks into your house and gets injured, and they can make a good case for some kind of negligence on your part, then they can successfully sue you?


In Poland, setting marked traps on your own fenced property is illegal, and the owner is responsible for any harm they cause, because there exist legal reasons to enter another person's property - for example, to fight a spreading fire.

However, my favourite example is the law that allows any beekeeper to enter any private property if they are pursuing a fleeing bee swarm.


Leaving a bear trap goes way beyond negligence, it's literally setting a trap. Similar with unmarked dangerous chemicals, they're required to be marked for good reason.


In Greece if a burglar dies while in your house you can be held responsible, even more so if you have set up traps.


If a judge or an expert is sure that you intended this outcome, and that someone is brave (or dead) enough to admit their own crime.


It's illegal to set a trap in your own home in the US as well; this was decided when a property owner, tired of people breaking into his property while he was away, set up a shotgun booby trap that injured a burglar. https://youtu.be/bV9ppvY8Nx4

I wasn't sure if it is the same or similar principle in Russia or a different one that requires active care for a burglar. Unlabeled chemicals causing liability for a burglar seems extreme to me


This is an urban legend in Russia.


Only in your dreams and some dumb countries, not in the rest of the world.


You think this, but how much experience do you have with it? People know that homeowners have insurance. They sue to make the insurance pay out. It happened to my neighbor. So you can make all of the dumb countries comments you want, but it doesn't make it any less real.


I wonder if there's some way to have a frontend cache for that, or a webserver shortcut that looks for that exact URL and blurts out the image?

Or maybe Wikipedia is already mostly static.

Also, I wonder if HN is inadvertently DDoSing the ticket system?
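
For what it's worth, a toy sketch of that kind of shortcut using Python's standard library: the hot file is held in memory and served for its exact URL without touching anything behind it. The file name, path and port are assumptions for the example; a real deployment would more likely do this in the CDN/caching layer.

    # Toy "exact-URL shortcut": keep the one hot image in memory and answer
    # requests for its path directly. All names here are illustrative.
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    HOT_PATH = "/wikipedia/commons/1/16/AsterNovi-belgii-flower-1mb.jpg"
    with open("AsterNovi-belgii-flower-1mb.jpg", "rb") as f:  # local copy, assumed to exist
        HOT_BYTES = f.read()

    class HotFileHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != HOT_PATH:
                self.send_response(404)
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "image/jpeg")
            self.send_header("Content-Length", str(len(HOT_BYTES)))
            self.send_header("Cache-Control", "public, max-age=86400")
            self.end_headers()
            self.wfile.write(HOT_BYTES)

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), HotFileHandler).serve_forever()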


This, perhaps disturbingly, was my first thought upon reading the issue.

Things were done very differently back in the day. This problem would have been fixed real quick.


To the people who didn't grow up with 4chan: do not search for this image, it's pretty disgusting.


4chan didn't even exist yet when goatse emerged


It seems plausible to me that the, ahem, "spread" of the image was greatly increased through the efforts of 4chan.


** Kadmium changes topic to 'Our hearts are extended to the 18 victims of the recent internet fraud'

http://bash.org/?434593


Hey I'm on that website! IRC used to be so fun and weird back in the day. hanging out on slashnet took up most of my free time in junior high.


Back in school, goatse was extremely well known. That was several years before 4chan. I hadn't even heard of goatse in relation to 4chan until now.


Maybe widespread but it was already pretty wide open before there was a gap for 4chan to even exist.


I think it was popularized back in the days of Slashdot.


Does it not date to alt.tasteless on Usenet? (edit: w/r/t goatse)


I was going to suggest Something Awful but you might win, though Wiki pegs it (heh) at 1999...


It's interesting that you equate goatse with 4chan! I'm old :-(


To the people who grew up before 4chan, pls don’t mention tubgirl


I missed the edit window and I’m disappointed in myself for mentioning it by name. Please just don’t Google this unless you’re prepared for an upsetting image, and even then maybe just skip it. You’re probably not as prepared as you think.


I was born after 4chan was created and I found that image on reddit. It's pretty mild; one can tell quickly that it is a doll.


Not having eyelids would certainly make it worse!


lmao


Fairly sure you’d get goatse’d more often on Efnet etc back in the day


goatse signifiantly pre-dates 4chan


s/4chan/slashdot/


what is it?


Big stretched open butthole. Not sure if you need the warning but I’m commenting in case anyone would prefer not to see it despite their curiosity.

Sorry to ruin the fun y’all but there’s images I won’t even mention that I can’t unsee and make me feel seriously ill when I do see them. I don’t want anyone else to feel that way without warning.


What are these images called so we know to avoid them?


Can't speak to the images themselves, but the sites are usually referred to as "shock sites":

https://en.wikipedia.org/wiki/Shock_site


They were known as a "shock site" ( https://en.wikipedia.org/wiki/Shock_site )

The Wikipedia page for https://en.wikipedia.org/wiki/Goatse.cx is text only and without any ascii art.

I'm amused that https://simple.wikipedia.org/wiki/Goatse.cx also exists.


Oh, Goatse is that site.

I remember when I was about 15, before pop-up blockers were really a thing, someone sent me a link to that and it would keep opening popups with that image and you couldn't close all of them :-/

Sometimes people look back at the internet of the 90s with overly rose-coloured glasses IMO.


For some memories... http://www.bash.org/?search=goatse&sort=0&show=25

I am personally most amused by #38659


Hey at least if you were on a 90s Mac your computer was probably unresponsive and you could skip to the inevitable force reboot. And browsers didn’t save sessions so you were in the clear as soon as you got to tabula rasa.


I’m honestly not sure you’re asking in good faith so I’m not going to add more (and if you are asking in good faith you’ve got plenty in responses to go on). Also I never knew the name of the one that’s permanently burned into my brain and I’m so glad I don’t.


There were quite a few, lemonparty and meatspin spring to mind, and the various incarnations of "two x one y".



Brilliant.


If it's just used internally by an app to test connectivity as suggested in another subthread, this wouldn't solve the problem.


A red flower rather than a lavender one.


Why does it need to be fixed? The mission of Wikimedia is to serve educational content.

Edit: this is a bit unfair. If it's a specific app, they should be convinced to cache, just to avoid unfair resource usage, but hotlinking in general should not be seen as a problem.


Presumably they are paying for the servers/bandwidth to support that, and that money is coming from donors.

It's a waste of donors' money if someone is using this image as some kind of "is this thing on" test using hacked computers...


It's both a waste of donor money and a starvation of resources for people actually consulting images on wikimedia commons.


i'm sure the revenue model is robust enough to accommodate spikes in traffic.


For any for-profit entity, hotlinking Commons is unfair. Heck, they have the right to freely redistribute the image as they see fit, instead of consuming resources that are a common good.

But this goes beyond that - it's some blind check of internet connectivity for the app, and doesn't get shown to the user. We're pretty sure of that, given that with the amount of noise that task generated, if there was an app featuring that image at least one of the ~ 90M daily "views" would've been someone reading these posts.

Now, given we want to be nice, we didn't just blindly block the traffic, although making requests without user-agent is against our UA policy https://meta.wikimedia.org/wiki/User-Agent_policy


This is exactly what I used to do about 17 years ago.


After realizing that "wiki[p|m]edia" and "flower" triggered a specific image in my head, I was guessing it would be a yellow flower (the one in the corner of https://www.mediawiki.org/wiki/MediaWiki), but nope, more interesting than that!


Yes OMG! Didn't expect this one


Same here dude. Was sad when it turned out NOT to be the yellow flower!


20% of Wikimedia Commons requests to their Singapore servers (EQSIN), not globally. That's still a lot, of course.


90,000,000 requests a day. That's some flower.


Replace it with an image saying: “If everyone who sees this flower donated 100 rupees to Wikipedia, this fundraiser would be over in 6 hours.”


Forward to all your indian uncles


On whatsapp


Unfortunately, the image was never displayed by the app that was downloading it as an internet test.


Would an AsterNovi-belgii-flower-1mb by any other name smell as sweet?


Let’s all just be grateful it wasn’t AsterNovi-belgii-flower-100mb.


At one point, I had a png file that was tiny in bytes but had dimensions large enough to crash Netscape and IE, and later Firefox.

Today I'm sure it would be fine; instead I'm frustrated by my inability to create webp images taller than 16,000 pixels (I was trying to write a data-saver proxy for reading webtoons).


Given that they are coming from India, it could be 90M individual users each making a single request per day!


Clearly the bestest flower though


Sukhbir Singh just commented: Thank you everyone for the comments and suggestions. I just wanted to share that we have identified the app and will update this task tomorrow. (And yes, it is a mobile app.)


Would be curious to get the full story :-/


it's at the bottom of the linked case.


"Please avoid adding drive-by comments such as "hello from Hacker News" to this task as they are not helpful. Thank you"

Why anyone would do such stuff is, as usual, beyond me...

PS. "First!"


The funny thing is, the first instance of that in the thread wasn't "hello from hacker news". It was a "hello to hackernews" from an engineer on the WikiMedia team. https://phabricator.wikimedia.org/T273741#6813995


And that comment was removed by the author a few minutes ago.


Honestly I think it was fine, I don't think it reached annoying levels. I enjoyed reading the thread here and I wouldn't have found it without that comment!


999 out of 1,000 people know better, but when there are thousands of people...


Even more curiously, the person that did that registered an account and put up a profile picture (I assume of himself), just for that comment...


Superficial reversing shows that the ravn app mentioned, com.app.rcn, may use the file as part of a speedtest:

com.app.rcn/smali/com/app/rcn/utils/InternetSpeedCalculator.smali: "hxxps://upload.wikimedia.org/wikipedia/commons/1/16/AsterNovi-belgii-flower-1mb.jpg"

edit: defanged the link to maybe save the wikimedia team some bytes


I'll just be looking forward to the follow-up post on HN announcing when they figure out what the culprit was!

Per the comments, right now the top suspect seems to be the app "Josh" or another TikTok clone, because of how traffic surged immediately after the TikTok ban:

https://twitter.com/bwaber/status/1358915338637873154


Reminds me of the time Netgear routers were hardcoded with the IP address of a NTP server at the University of Wisconsin. https://en.wikipedia.org/wiki/NTP_server_misuse_and_abuse#Ne...


I'm also reminded of Snapchat's accidental NTP pool abuse.

https://community.ntppool.org/t/recent-ntp-pool-traffic-incr...


You gotta respect the suggested approach: take preventive measures by banning requests for this individual image that lack a User-Agent header, and try to identify who might be affected. I'm sure I'm not the only one here who would just treat it as abuse and ban without followup.
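
As a rough sketch of what that mitigation might look like, here it is as generic WSGI middleware; this is not Wikimedia's actual stack, and the path, message and demo app are all assumptions for the example:

    # Sketch: refuse requests for this one image when no User-Agent is sent.
    from wsgiref.simple_server import make_server

    FLOWER_PATH = "/wikipedia/commons/1/16/AsterNovi-belgii-flower-1mb.jpg"

    def ban_blank_ua(app):
        def middleware(environ, start_response):
            if (environ.get("PATH_INFO") == FLOWER_PATH
                    and not environ.get("HTTP_USER_AGENT")):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Requests without a User-Agent are not accepted for this file.\n"]
            return app(environ, start_response)
        return middleware

    def demo_app(environ, start_response):
        # Stand-in for the real file server.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok\n"]

    if __name__ == "__main__":
        make_server("", 8080, ban_blank_ua(demo_app)).serve_forever()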


And now thanks to this HN post, 21% of requests are for that same flower!


I was curious how many requests HN would actually have to muster: 90,000,000 reqs/day is 20%, so the daily total is about 450,000,000, and getting the flower to 21% means (90M + x)/(450M + x) = 0.21, i.e. HN would need to hit it with roughly 5,700,000 extra requests per day.


HN ain't that big


This is happening now! Some suspect a failure in a CV training pipeline. Others suggest an extremely popular app with a hotlinked image.


It can't be a training pipeline, because the IPs are all from around India.

Sample code from Stack Overflow being used by some major app is the most likely candidate. It's also possible that the image fetch call is a vestigial appendix that doesn't even display the image, which will make tracking this down extra challenging.


Perhaps a very inefficient “check if we have working internet access” routine.


i hate it when vestigial appendices go awry. pretty sure the evidence is mounting, however


If it was my site I'd replace the image with goatse and see who complains, but I guess that's a bit drastic for Wikimedia.


Tarpit the image and it will take care of itself.

Same advice I gave a w3c.org admin who was lamenting how much traffic people generate by not caching xml schemas. Yes, you have to serve the requests. But you don't have to try to serve them in 100 ms. If a human is on the other end, 1-2 seconds is just fine. If a human is not, then the human will surely notice when their batch process goes from 3 minutes to 10 minutes because it fetches the same schema 200 times.
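
A toy version of that tarpit, assuming an asyncio front end where idle connections are cheap; the delay, port and payload are made up for the example:

    import asyncio

    DELAY_SECONDS = 2.0                     # slow, but fine for a human
    PAYLOAD = b"<!-- pretend this is the schema or the image -->\n"

    async def handle(reader, writer):
        await reader.readline()             # read the request line; headers ignored for brevity
        await asyncio.sleep(DELAY_SECONDS)  # the tarpit: an async sleep costs almost nothing
        writer.write(b"HTTP/1.1 200 OK\r\n"
                     b"Content-Type: text/plain\r\n"
                     b"Content-Length: " + str(len(PAYLOAD)).encode() + b"\r\n"
                     b"Connection: close\r\n"
                     b"\r\n" + PAYLOAD)
        await writer.drain()
        writer.close()

    async def main():
        server = await asyncio.start_server(handle, "", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())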


Could you end up executing a slowloris-style attack on yourself by doing this?

I guess a couple of seconds won't matter as long as the server isn't already redlining and the tarpitted traffic is a small proportion.


Well any time you start yanking levers and spinning dials you'd better know where the breaking points in your system are.

If you care about the traffic because you're already having trouble with that many simultaneous requests, then you are definitely not going to solve that problem by increasing the response time by a factor of 10.

But an important property of reverse proxies is that once the proxy sees the last byte of the response, the originating server is no longer involved in the transaction. The proxy server is stuck ferrying bits over a slow connection, and hopefully is designed for that sort of workload. If the payload is a static file, as it is in both of these cases, then it should be cheap for the server to retrieve it.


Yes, but slowloris isn't really a big deal if you've got a modern HTTP(S) server with async I/O. It costs nearly nothing to have an idle connection while waiting 3 seconds before sendfile-ing the schema XML.


Can you not run out of sockets though? I know it used to be a thing anyway. Maybe it's handled somehow nowadays.


You can run out of sockets, but that's easy to tune. I don't know the limits on other systems, but FreeBSD lets you set the maximum up to physical pages / 4 with just boot-time settings. So about 1 million sockets per 16 GB of RAM.

Worst case, if you start running out of sockets because you're sleeping, sample the socket count once a second and adjust the sleep time to avoid hitting the cap. You could also use that sampling to drive decisions about keeping HTTP sockets open or closed.

I should add, select on millions of sockets is going to suck, so you'll need kqueue/epoll/whatever your kernel's better-than-select interface is.
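
Sketching that feedback loop on top of the earlier tarpit idea (sample once a second, shrink the delay as open connections climb); the limits and the bare 204 response are invented for the example:

    import asyncio

    MAX_DELAY = 3.0          # invented numbers: full tarpit delay when idle...
    SOFT_LIMIT = 100_000     # ...tapering to zero as open connections approach this

    state = {"open": 0, "delay": MAX_DELAY}

    async def tune_delay():
        # Once a second, shrink the delay as the open-connection count climbs.
        while True:
            load = min(state["open"] / SOFT_LIMIT, 1.0)
            state["delay"] = MAX_DELAY * (1.0 - load)
            await asyncio.sleep(1)

    async def handle(reader, writer):
        state["open"] += 1
        try:
            await reader.readline()
            await asyncio.sleep(state["delay"])
            writer.write(b"HTTP/1.1 204 No Content\r\nConnection: close\r\n\r\n")
            await writer.drain()
        finally:
            state["open"] -= 1
            writer.close()

    async def main():
        tuner = asyncio.create_task(tune_delay())   # keep a reference so it isn't collected
        server = await asyncio.start_server(handle, "", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())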


Even just serving a giant blinking red X gif to 10^-6 of the requests might be sufficient.

Only 10^-6 of the "legitimate" requests would be affected, but a whole lot of the "undesirable" requests would see it...
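
In code that's just a random draw per request; a minimal sketch where the probability and the choice of replacement image are whatever you decide they should be:

    import random

    REPLACE_PROBABILITY = 1e-6   # roughly one request in a million

    def pick_payload(real_image: bytes, notice_image: bytes) -> bytes:
        # Almost everyone gets the real flower; a tiny random fraction gets
        # the attention-grabbing replacement instead.
        return notice_image if random.random() < REPLACE_PROBABILITY else real_image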


Serving a giant blinking red gif to unsuspecting internet users is a bad idea. https://en.wikipedia.org/wiki/Photosensitive_epilepsy

There was a better idea posted in comments - serve a picture with a very short explanation and an email to contact.


There is no reason it needs to blink quickly.


It doesn't have to be quick for bad effects. See https://discussions.apple.com/thread/7908738 for example. I can't find the link now, but there was also an article about one of the older designers (IBM?) making sure the terminal cursor blinks in 1:2 ratio to reduce the problem.

Just don't give random unsuspecting people blinking images as a rule.


The first link says "several flashes per second". I was thinking more like a blink every few seconds, which seems to be annoying but not dangerous.


But he's suggesting that they only be served to 10^-6 people - that's one 1,000,000th of a person - I suspect it will have little effect


I'd give them a pass.


Could it be the "Good Morning"-style greetings on WhatsApp gone viral?

https://www.wsj.com/articles/the-internet-is-filling-up-beca...


No, because the images for those are stored by WhatsApp, not hotlinked from Wikimedia.


That’s a good one! Maybe an app which generates this type of image has this flower as one of its sample images in a list which it preloads on startup.


They did figure it out: a popular chat app in India (which they won't name yet) fetches the image but does not display it.

https://phabricator.wikimedia.org/T273741#6815828



Posting the very link they're trying to reduce traffic on doesn't seem like a very helpful thing to do


They were trying to figure out the root cause of a sudden uptick of millions of requests being made for a given image with no user agent or referer, presumably with a view to notifying the app responsible or figuring out a workaround.

A few thousand requests clearly identifiable as coming from browsers, with a referer header from news.ycombinator.com, would not exactly interfere with this, and in the grand scheme of things aren't a huge burden in terms of network traffic.


I bet everyone on HN would have wanted to take a look eventually.


Huh, I worked on a site with a similar issue in ~2019. A massive flood of traffic for a single site from Indian mobile apps (~15kqps at peak iirc).

I think it ended up being a sort of mobile-based botnet with a bizarre target, which luckily was deduced from some of the headers sent (they all had a random common header).


What's "kqps"? It's obviously kilo-{somethings} per second, but I don't know what the {something} would be

edit: queries?


15000 queries per second


Yep, queries


Saw a story recently that Indians were bringing down the internet because of sending good morning messages: https://www.wsj.com/articles/the-internet-is-filling-up-beca...

I'd bet that this is the flower of the week for them.


Says it's been going since last June.


What's your point?


Example code that gets copy pasted into production app somewhere?


A possibility, and it's linked in a couple of places. They found examples.


And of course, now that the image is linked in the report, I've just added an additional request for it by clicking.


With the amount of traffic it's apparently getting, HN probably won't make a big impact. Plus, most of us aren't in India, and most of us have normal user agents.


“ Thank you everyone for the comments and suggestions. I just wanted to share that we have identified the app and will update this task tomorrow. (And yes, it is a mobile app.)”


>>Thank you everyone for the comments and suggestions. I just wanted to share that we have identified the app and will update this task tomorrow. (And yes, it is a mobile app.)

Looks like we will know soon.


I'm in India now, is it possible for me to install some traffic snooper and monitor if any wikimedia requests go out? I can then install some popular apps and see if anything bites!


mitmproxy :)
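
If you go that route, something like this mitmproxy addon would log every Wikimedia request an app makes, along with its User-Agent. This is just a sketch (file name and the logged fields are my own choices): run it with `mitmdump -s wikimedia_spy.py` and point the phone's proxy at your machine. Caveat: apps that pin certificates won't go through the proxy cleanly.

    # wikimedia_spy.py - sketch of a mitmproxy addon that logs Wikimedia requests.
    import logging
    from mitmproxy import http

    class WikimediaSpy:
        def request(self, flow: http.HTTPFlow) -> None:
            host = flow.request.pretty_host
            if host.endswith("wikimedia.org") or host.endswith("wikipedia.org"):
                logging.info("Wikimedia request: %s (User-Agent: %r)",
                             flow.request.pretty_url,
                             flow.request.headers.get("User-Agent", ""))

    addons = [WikimediaSpy()]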


90M requests daily from India? I wonder if KaiOS is checking whether it's got internet access.


This page someone noticed is very interesting.

https://newshimalaya.com/2021/02/09/%E2%9A%93-t273741-invest...

I was sure I'd seen this website before, and sure enough, it's scraping and rehosting almost everything that's posted on HN...


Nothing unusual here, just run of the mill online copyright violation.


Yeah, here's the unusual one :

http://n-gate.com/


It's not 20% of all requests, it's 20% of media requests to one of the clusters (as stated in the issue description). There are 5 clusters.


I remember that a MediaWiki installation allows a configuration that essentially permits the use of Commons files, albeit in that case the file is downloaded and cached on the wiki's own server [1].

That being said, even though the image isn't hotlinked directly, they expressed concerns about DDoS and the possible costs the Foundation has to incur for each load (they even pointed out that it's "fair and reasonable" to point a donation link to them).

I would be interested to see how the licensing issue will be handled, though. The photographer licensed this photo as GFDL/CC BY-SA 3.0 [2], and hotlinking may break the terms of these licenses.

1: https://www.mediawiki.org/wiki/InstantCommons

2: https://commons.wikimedia.org/wiki/File:AsterNovi-belgii-flo...


Tragedy of the commons


According to the comments, it's probably an Indian TikTok clone that checks internet connection by downloading the picture.


They didn't mention if the 90M requests were unique, perhaps some app doing background refresh and not caching that image?


FTB (from the bug) :

You could even serve another image in its place to this UA, with some text and an email address to contact. You'd probably find out pretty quickly what it is from users of that mysterious thing. A throwaway email address is probably best

Really good idea :-)


Maybe some sort of social network is using it as the default profile pic and isn't caching it.


> I would suggest that we start banning requests for this image without a UA

Why on Earth would you do anything like that, instead of just renaming the image so the URL stops working (and banning the old URL unconditionally?)


My initial thinking is it's in some flower recognition dataset.


Or it's India's canonical non-hotdog.


What's the right way to solve this generally? A CORS policy wouldn't be effective, since it's not a browser requesting the image.


If the image is not displayed and the stack can handle a lot of TCP connections, perhaps a reverse HTTP slowloris attack: send the image response headers as slowly as you can, to keep the TCP connection open and make the receiver waste its time.

If it's a speed test, they will eventually use another image.
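
A toy asyncio sketch of that idea, trickling the response headers out a few bytes at a time; all timings, sizes and the port are invented for the example:

    import asyncio

    HEADERS = (b"HTTP/1.1 200 OK\r\n"
               b"Content-Type: image/jpeg\r\n"
               b"Content-Length: 1048576\r\n"
               b"\r\n")

    async def handle(reader, writer):
        await reader.readline()                  # ignore the request details
        for i in range(0, len(HEADERS), 4):      # dribble the headers out 4 bytes at a time...
            writer.write(HEADERS[i:i + 4])
            await writer.drain()
            await asyncio.sleep(1.0)             # ...with a second between chunks
        writer.close()                           # and never send the body at all

    async def main():
        server = await asyncio.start_server(handle, "", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())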


It's likely that the consumer doesn't even display that image. It's probably a dry 1 MB download test.

Most Indian ISPs, even mobile ones, are cheap enough that 1 MB doesn't matter.


145kB for a connectivity check, ouch. This is a poster child for why many apps guzzle so much data.

(On a 500MB/mo plan you start noticing)


Perhaps they can replace the image with an obviously wrong image (and smaller in size), and then wait for someone to complain


so what is the app? I felt like I read a full blown novel, and the last sentence with the conclusion is missing!


> we have identified the app and will update this task tomorrow

I guess we will find out tomorrow.


Well... posting it here will only increase the number of requests and will make the investigation harder


Title is misleading. It is 20% of requests to the eqsin cluster located at Changi airport.


It isn't _at_ the airport, that's just the closest airport. The WMF names its clusters with the initials of the data center vendor and the closest airport's code.


True. Hahaha


Just replace the image... what app/site is this? Email me at my@mail.com


I wonder how much additional traffic this investigation brought said image.


Heh, a popular consumer electronics product a roommate worked on shipped an update that used example.com as a connectivity test. Apparently they were on pace to rack up $20k/month in server costs. At least their user agent made it obvious who to contact.


> Please avoid adding drive-by comments such as "hello from Hacker News" to this task as they are not helpful. Thank you.

Ironic. Dang could save himself from spam, but not others.


Well I hope they implement some good caching


Now you just need to draw attention to it further by posting it in hacker news. I'm sure none of us are curious to immediately see the picture.


can this be some sort of botnet checking whether a host is connected to the internet or not?


Anyone check if it's steganography?


Nice idea, but if it's getting 90 million requests per day, then either there are a lot of people requesting the same message (so it's not very secret), or the few people requesting it are very forgetful (in that they have to keep re-requesting the same message multiple times per day).

I suppose the contents of the image/"message" could change every day, but presumably that would be very obvious in the edit history of that file[0], unless Wikimedia Commons were suppressing the fact that the file is constantly changing. If they are part of the conspiracy, though, you'd think they would have taken down the task from Phabricator too.

[0] https://commons.wikimedia.org/wiki/File:AsterNovi-belgii-flo...


It seems like it started on 2020-06-09.

https://pageviews.toolforge.org/mediaviews/?project=commons....


Those goddamn white Hare Krishnas are in the airport again, handing out flowers no one wants.

Also: What's your vector, Victor?



