HN is in the same cluster as 2ch, not Techcrunch, on Twitter

bhouston · on Feb 6, 2016

2D projections of complex multidimensional data are extremely unreliable as to what adjacency means. Most adjacencies in particular are an artifact of the chosen projection method.
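
A toy sketch of the effect, with made-up points and the crudest possible projection (dropping the third coordinate, not the method used in the post). The point names and coordinates are purely illustrative assumptions.

```python
import math

# Hypothetical 3-D points, chosen only to illustrate the effect.
points = {
    "A": (0.0, 0.0, 0.0),
    "B": (0.1, 0.0, 5.0),  # far from A, but only along the z axis
    "C": (1.0, 1.0, 0.0),  # genuinely the closest point to A
}

def nearest(name, coords):
    """Return the name of the point closest to `name` (Euclidean distance)."""
    others = (n for n in coords if n != name)
    return min(others, key=lambda n: math.dist(coords[name], coords[n]))

# Project to 2-D by discarding z.
projected = {n: p[:2] for n, p in points.items()}

print(nearest("A", points))     # C: nearest neighbour in the original space
print(nearest("A", projected))  # B: nearest neighbour after projection
```

B only looks adjacent to A because the projection threw away the one axis along which they differ.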

daniel-levin · on Feb 6, 2016

This comment got me thinking: in some applications, Euclidean distance between feature vectors acts as a good proxy for adjacency/similarity. For such applications, an isometry from R^n to R^2 or R^3 should in principle preserve the meaning of adjacency. A quick Google yields techniques for quasi-isometric and isometric dimensionality reduction [0, 1]. This should mitigate artefacts of adjacency, or non-adjacency, as it were. In other words, you might be able to actually pull off good 2D projections of high-dimensional data and still see meaningful relationships.

[0] https://en.wikipedia.org/wiki/Isomap

[1] https://www.aaai.org/Papers/AAAI/2007/AAAI07-083.pdf

ecesena · on Feb 6, 2016

Sammon mapping is another famous example, see [1] for instance for a nice visualization.

[1] http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV09...

frozenport · on Feb 6, 2016

>> Provides us with a measure of the quality of any given transformed dataset. However, we still need to determine the optimal such dataset, in terms of minimising E. Strictly speaking, this is an implementation detail and the Sammon mapping itself is simply defined as the optimal transformation;

Somehow it's technically challenging to verify the content of this article.
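
For reference, the E in that quote is Sammon's stress. A minimal sketch of computing it for a given embedding follows; it covers only the quality measure, not the minimisation step the quote calls an implementation detail. The function name and the sample points are my own, not from the article.

```python
import math
from itertools import combinations

def sammon_stress(high, low):
    """Sammon's error E between original points `high` and their embedding `low`.

    E = (1 / sum(d*_ij)) * sum((d*_ij - d_ij)**2 / d*_ij) over all pairs i < j,
    where d*_ij is the distance in the original space and d_ij the distance
    in the embedding. Assumes no two original points coincide (d*_ij > 0).
    """
    total = 0.0
    scale = 0.0
    for i, j in combinations(range(len(high)), 2):
        d_star = math.dist(high[i], high[j])
        d = math.dist(low[i], low[j])
        scale += d_star
        total += (d_star - d) ** 2 / d_star
    return total / scale

# Three points that already lie in a plane: a faithful 2-D embedding has E = 0.
high = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
faithful = [(0, 0), (3, 0), (0, 4)]
squashed = [(0, 0), (3, 0), (0, 0)]  # third point collapsed onto the first

print(sammon_stress(high, faithful))
print(sammon_stress(high, squashed))
```

The division by d*_ij is what makes the measure weight errors on small original distances more heavily than errors on large ones.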

ecesena · on Feb 6, 2016

I was referencing it mostly for the visualization of the "flower" that fails with pca/linear mapping.

The original Sammon paper is here [1]. That said, from what I know Isomap is a more widespread tool, but I never found such a good visualization.

[1] http://theoval.cmp.uea.ac.uk/~gcc/matlab/sammon/sammon.pdf

rabidsnail · on Feb 6, 2016

For small distances, yes. If you plot a 2D projection of a dataset that doesn't have much structure you're going to be reading patterns into white noise (though this data has some pretty clear clusters, which are probably real). If I were doing something other than writing a fun blog post I would have done cluster analysis with something like DBSCAN.
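
For anyone who hasn't met DBSCAN, here is a minimal O(n^2) sketch of the idea in plain Python. It is not what was run for the post; the sample points and parameter values are made up, and in practice you would reach for a library implementation such as scikit-learn's `DBSCAN`.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
        cluster += 1
    return labels

# Two tight groups plus one stray point (illustrative data only).
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The useful property here is the -1 label: points in sparse regions are reported as noise instead of being forced into a cluster.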

rryan · on Feb 6, 2016

Also, this is t-SNE: https://en.wikipedia.org/wiki/T-distributed_stochastic_neigh...

The S is for "stochastic" -- i.e. you get a different 2D projection every time you run it on the same inputs. Take it with a grain of salt.

thisisdave · on Feb 6, 2016

>The S is for "stochastic" -- i.e. you get a different 2D projection every time you run it on the same inputs.

That's not the part that's "stochastic"; sensitivity to initial conditions is just nonconvex optimization in action. You get the same thing with most other local embeddings.

The stochastic bit is that the model is based on optimizing "the asymmetric probability, pij , that i would pick j as its neighbor"[0]. Those probabilities and the associated positions in 2D space are not estimated stochastically (e.g. with Monte Carlo sampling) or anything, though.

[0] https://www.cs.nyu.edu/~roweis/papers/sne_final.pdf
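
To make that concrete, here is a small sketch of the neighbour probabilities as defined in [0], with the simplifying assumption of one fixed sigma for every point (the paper picks a sigma per point from a target perplexity). The function name and sample points are hypothetical.

```python
import math

def sne_neighbour_probs(points, i, sigma=1.0):
    """p_ij for a fixed i: probability that point i picks each j as its neighbour.

    p_ij = exp(-|x_i - x_j|^2 / (2 sigma^2)) / sum_{k != i} exp(-|x_i - x_k|^2 / (2 sigma^2))
    Assumes at least one other point is close enough that the weights do not all underflow to 0.
    """
    weights = {}
    for j, q in enumerate(points):
        if j == i:
            continue
        sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], q))
        weights[j] = math.exp(-sq_dist / (2 * sigma ** 2))
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
p0 = sne_neighbour_probs(pts, 0)
p1 = sne_neighbour_probs(pts, 1)
print(p0[1], p1[0])  # asymmetric: p_01 and p_10 differ
```

Nothing here is sampled: the same inputs always give the same probabilities. Run-to-run variation in the final picture comes from the random starting layout of the optimisation, which is the point made above.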

personjerry · on Feb 6, 2016

I wonder if I could post a randomly generated graph, label it with HN-interested labels arbitrarily, and get a serious talk started on HN about nonexistent correlations.

hapless · on Feb 6, 2016

TechCrunch reports on us. It is journalism for the spectators. The twitter cluster of people sharing TC links is TC's audience, not participants in TC's subject matter.

Why in blue hell would anyone on HN be sharing TC links? Intuitively it seems more likely that people who share HN links are discussing these matters directly.

bitbckt · on Feb 6, 2016

Interesting parallel observation: when I worked for a regional newspaper some years ago, we rolled out products for the same demo as "mommy blog Twitter". We saw the same sort of isolated behavior - visitors to "mommy blog content" almost never strayed onto our mainstream products.

The same sorts of products delivered to "puppy and kitty" people didn't have the same effect, though the level of vitriol in the comments was similar.

madaxe_again · on Feb 6, 2016

Ditto. Launched (well, we built - client project) a social network for moms nearly a decade ago, and they were Not Interested in anything outside of the core offering - even recipes, which you would have thought would be interesting, weren't - until they rebranded along the lines of "recipes for moms", which changed that interaction overnight.

Some demographics choose tighter filter bubbles for themselves than others, and moms are likely up there, as the single most important thing to mothers tends to be being a mother - it becomes an all-encompassing identity for many.

hkmurakami · on Feb 6, 2016

Considering nicovideo is anti-establishment media (it's owned by Kadokawa, which is an underdog media company with strong subculture roots) and that 2chan "summary sites" double as news sources for the anti-establishment these days, the association seems apt.

newobj · on Feb 5, 2016

This is amazing, one of my favorite articles on HN ever.

I'm really curious what the heck that "eye" is in the bottom right space of the clusters. Some cluster so radically orthogonal to any other content it has an order of magnitude more distance in differentiation?

rabidsnail · on Feb 5, 2016

(original author here) it's a spambot network. If you click the link in that post to the interactive version (this: https://pile-of-junk.s3.amazonaws.com/twitter_scatter_10k.ht...) you can see for yourself.

stephenboyd · on Feb 5, 2016

This is cool. How many sampled tweets did HN links appear in? How many sampled tweets did you have overall?

I'm curious if a sampling error could explain why an English website like HN would get placed with the Japanese language sites. StackOverflow isn't placed by any related sites either.

If the weird results aren't from sampling artifacts, my best guess is that a lot of spambots must be linking to multiple legit sites regardless of relevance.

brownbat · on Feb 6, 2016

I really hope someday we get spambots that start off by trying to make useful contributions. Then later, after building a following, start advertising scams.

I'm confident that, given the right incentives, spam kings could discover conversational AI before any lab.

swerling · on Feb 6, 2016

This is fantastic. Feature request: drag a rectangle over a group of dots, and see them as a text list of websites. As is it's hard to see all the sites that are in a dense dot cluster.

TazeTSchnitzel · on Feb 6, 2016

Quran quotes being grouped with archive.org might be explained by the Internet Archive frequently being used to host Islamist materials.

runn1ng · on Feb 6, 2016

Just today I wondered why so few journalists are picking up on the fact that ISIS is using archive.org almost exclusively for uploading their beheading and other PR videos.

i336_ · on Feb 6, 2016

The interactive version is powered by this dataset - http://pile-of-junk.s3.amazonaws.com/domain_similarity_tsne_... - processed by JavaScript inside the page: https://pile-of-junk.s3.amazonaws.com/twitter_scatter_10k.ht...

wodenokoto · on Feb 6, 2016

> Japanese social media twitter (which I'm labelling as "2ch", though it's not just 2ch) is almost completely distinct from what I'm calling "upstanding japanese twitter" (links to mainstream news sites like news24)

I have no idea what the point of the headline is after reading the above part of the post.

Ezhik · on Feb 6, 2016

That's interesting. Never would've made the connection myself, although now that I think about it, some of the most fascinating discussions I've read on HN involved Japanese work culture.

ChuckMcM · on Feb 5, 2016

This is some fascinating analysis. And like the author, I am amazed that Twitter doesn't crack down harder on its spambots.

n0us · on Feb 5, 2016

I've wondered that as well. I'm not "active" on Twitter but I log on occasionally to see if there are any interesting tweets in my feed. Every time I log on I have a new follower from penny stocks twitter, get rich quick schemes, and various other fake profiles. This seems to stay stable at around 20 fake followers as old ones get erased and new ones follow.

It seems like amateurs are more capable of detecting spam than the entire company, but I sometimes wonder if they know about it and just leave the spam bots alone because once they crack down, new ones will just pop up. Or if they keep them around at a tolerable level that doesn't drive real users away but still allows them to publish a higher "user count".

egypturnash · on Feb 6, 2016

This may also be due in part to more active users of Twitter hitting the "report spam" button on those spam bots. If a spambot tweets at me, I'll go do that. I'm sure I'm not the only one, as I never see a spambot with more than a handful of tweets showing up in my mentions.

So, crowdsource spam detection.

username223 · on Feb 6, 2016

> Or if they keep them around at a tolerable level that doesn't drive real users away but still allows them to publish a higher "user count"

They seem to have figured out that 20 fake accounts is not enough to get you to leave their service.

matheweis · on Feb 6, 2016

Twitter should hire OP, this is some incredible analysis - and I don't think he counts as an amateur.

Also, they are apparently too busy battling ISIS (http://www.theguardian.com/technology/2016/feb/05/twitter-de...) to deal with the spam issue effectively.

jonesb6 · on Feb 6, 2016

Well it's whack-a-mole isn't it? Take down one spam network and another crops up with an entirely different methodology and signature. If I was managing a large social network that suffered from bots I would whack until I came across an opponent that did the least possible damage, then weaken it through things like shadow bans etc to the point where it won't die but will operate with the bare minimum amount of damage to the network.

jerrickhoang · on Feb 5, 2016

I think a more interesting problem is how you can differentiate a spambot from a 'non-spam' bot. I've seen some bots that are really creative and fun on Twitter. I guess it's not really hard to add it to a spam detection ML model.

rabidsnail · on Feb 5, 2016

Non-spam bots generally don't follow each other or link to external websites. (I'm also the author of one of the more popular image bots https://twitter.com/a_quilt_bot)

surfmike · on Feb 5, 2016

what is 2ch?

daodedickinson · on Feb 5, 2016

Japanese predecessor of 4chan.

yawawort · on Feb 6, 2016

What you're thinking of is Futaba (www.2chan.net). 2ch is text only and would be closer to Reddit than 4chan (at least culturally).

Rayearth · on Feb 5, 2016

So HN is close to nico (Japanese youtube) and pixiv (Japanese-centric art and fanart site)? Interesting.

forrestthewoods · on Feb 5, 2016

What are all of the other twitters? There is so much undocumented space! I want to know what it all is!

simcop2387 · on Feb 5, 2016

Is the regex search in the demo not working for anyone else? (Tested both Chrome and Firefox on Win7.)

rabidsnail · on Feb 5, 2016

There's no UI for if there are no matches; it just does nothing. Try searching for \.com or something.

Edit: I patched it so it displays an alert if there are no matches.

simcop2387 · on Feb 6, 2016

I see. That patch makes it a lot nicer to find out that none of the sites I wanted to look for show up in the data :)

kitwalker12 · on Feb 5, 2016

(Update) see rabidsnail's suggestion

not working for me on Chrome or Safari either

gohrt · on Feb 6, 2016

why does the hella.cheap site have an SSL cert with an unknown authority?

tokenizerrr · on Feb 6, 2016

It has a COMODO certificate. If you see otherwise you might be getting MITMd.

schoen · on Feb 6, 2016

It has a valid Comodo certificate but forgot to include the full certificate chain, which is probably now the #1 configuration error (I help do support for Let's Encrypt and about 80% of "my cert doesn't work after issuance" problems are that). These bugs are tricky because most browsers cache intermediate certs and then forgive sites that don't send intermediates that the browser knows about, so you can see an error in one browser or device and not another because of different cert caches!
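
A toy model of why the error is browser-dependent. Certificates are reduced to (subject, issuer) pairs with no cryptography, and every name below is hypothetical; real validation also checks signatures, validity dates, and more.

```python
# Toy model: a certificate is just (subject, issuer); no cryptography involved.
TRUSTED_ROOTS = {"Example Root CA"}  # hypothetical trust store

def chain_validates(sent_certs, cached_intermediates=()):
    """Return True if the leaf (first cert sent) can be linked to a trusted root.

    `sent_certs` is what the server sends; `cached_intermediates` are certs the
    client remembers from earlier connections to other sites.
    """
    known = dict(list(sent_certs) + list(cached_intermediates))
    _, issuer = sent_certs[0]
    seen = set()
    while issuer not in TRUSTED_ROOTS:
        if issuer not in known or issuer in seen:
            return False  # missing link (or a loop): "unknown issuer"
        seen.add(issuer)
        issuer = known[issuer]
    return True

leaf = ("www.example.test", "Example Intermediate CA")
intermediate = ("Example Intermediate CA", "Example Root CA")

print(chain_validates([leaf, intermediate]))                          # True: full chain sent
print(chain_validates([leaf]))                                        # False: chain omitted, cold cache
print(chain_validates([leaf], cached_intermediates=[intermediate]))   # True: chain omitted, warm cache
```

The second and third calls are the same misconfigured server seen by two clients, which is why the bug report often can't be reproduced on the admin's own machine.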

kalleboo · on Feb 6, 2016

I just ran into this today... A site I manage with a Comodo certificate was showing unknown issuer in Firefox and only Firefox, and I've never had it fail before (and we've never had any user reports). Added in the cert chain, error is gone. Dunno if the other browsers had Comodo as trusted or it's common enough that everyone who regularly uses Firefox (I haven't used it in months) has it cached...

kazazes · on Feb 6, 2016

Wouldn't it be more reasonable for browsers to not cache them at all and universally reject missing intermediate certificates? (IIRC, Chrome doesn't mind but Firefox will give you the train conductor.)

schoen · on Feb 6, 2016

It would definitely eventually reduce the frequency of this configuration mistake.

Firefox definitely does cache intermediates (I've seen it do so as recently as today).