Looking at https://explore2.marginalia.nu/search?domain=simonwillison.n... now that's an interesting service. The web has felt isolating since it became commercialized. Bloggers are living in the Google dark ages right now. Having information like this be readily accessible could help us find each other and get the band back together. The open web can be reborn.
Yeah, this is basically what I've been trying to show people for the last few years. There's still so much wild Internet out there if you go looking for it.
Yes, explore2 is just a demo: an unfiltered listing of the output of this algorithm. For better or worse, it has no concept of dead links. If and when I productize it, it needs to hook into the search engine's link database better.
Well I mean I could, but it's easier and more convenient to just put them in a directory on the server than go through all the rigmarole of creating a torrent.
Hey, that's an interesting idea for a side project: an easy way to serve a file via torrent. Maybe a commandline utility that uses WebSeeding so you don't need to run a torrent client? Hmm.
(I did a quick `apt search` to see if something like that is already available, but didn't find anything.)
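Back-of-the-envelope, something like this could work (an untested sketch assuming the third-party `torf` library; the tracker is just an example, and the file would need to be served over plain HTTP anyway, which is exactly what makes the URL usable as a web seed per BEP 19):

```python
# Hypothetical "serve this file as a torrent" helper using torf
# (pip install torf).
from torf import Torrent

def make_webseeded_torrent(path, http_url, out="out.torrent"):
    t = Torrent(
        path=path,
        trackers=["udp://tracker.opentrackr.org:1337/announce"],  # example
        webseeds=[http_url],  # clients can fetch pieces over plain HTTP
    )
    t.generate()  # hash the pieces
    t.write(out)
    return out

make_webseeded_torrent("dataset.tar.gz", "https://example.com/dataset.tar.gz")
```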
In 2012 I was trying to turn my PhD thesis into a product for a better guitar tab and song lyrics search engine. The method was precisely this: use cosine similarity on the content itself (musical instructions parsed from the tabs or the tokens of the lyrics).
This way I wasn't just able to get much better search results than with PageRank; a useful byproduct of the approach was that you could cluster the results and pick a distinct cluster for each subsequent result. With Google you would not just get bad results at number 1: results 1-20 would be near duplicates of just a few distinct efforts.
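For anyone curious, the diversification trick looks roughly like this (a toy sketch; TF-IDF stands in for the parsed tab/lyric features, and the documents and cluster count are made up):

```python
# Rank by cosine similarity to the query, then take the top-scoring hit
# from each cluster so results 1..k aren't near duplicates of each other.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

docs = ["smoke on the water tab", "smoke on the water lyrics",
        "stairway to heaven tab", "hotel california chords"]
query = "smoke on the water"

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
scores = cosine_similarity(vec.transform([query]), X).ravel()
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

seen, diversified = set(), []
for i in scores.argsort()[::-1]:          # best score first
    if clusters[i] not in seen:           # one result per cluster
        diversified.append(docs[i])
        seen.add(clusters[i])
print(diversified)
```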
Unfortunately I was a terrible software engineer back then and had much to learn about making a product.
The author describes calculating cosine similarity of high dimensional vectors. If these are sparse binary vectors why not just store a list of nonzero indexes instead? That way your “similarity” is just the length of the intersection of the two sets of indexes. Maybe I’m missing something.
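For what it's worth, on binary vectors cosine similarity reduces to exactly that intersection, divided by a normalization term, so the set sizes matter too, not just the overlap:

```python
# Cosine similarity of two binary vectors, stored as sets of 1-bit indexes:
# |A ∩ B| / sqrt(|A| * |B|). The raw intersection length alone would
# systematically favor large sets.
import math

def cosine_binary(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

a = {3, 17, 42, 99}          # indexes of the 1-bits in vector A
b = {3, 42, 100}             # indexes of the 1-bits in vector B
print(cosine_binary(a, b))   # 2 / sqrt(4 * 3) ≈ 0.577
```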
A dense (bitmap) representation of the matrix wouldn't fit in memory; it would require about a PB of RAM unless my napkin math is off. The cardinality of this dataset is in the 100s of millions.
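The napkin math, taking the lower end of "hundreds of millions":

```python
n = 100_000_000                # ~10^8 domains
bits = n * n                   # one bit per (domain, domain) cell
print(bits / 8 / 1e15, "PB")   # 1.25 PB, so "about a PB" checks out
```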
(An additional detail: I'm actually using a tiny fixed-width bloom filter to make it go even faster.)
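I don't know the exact construction used there, but the general shape of the trick is something like this (hypothetical sketch; the width and hash count are assumptions):

```python
# Squash each domain's inbound-link set into a fixed-width (here 64-bit)
# bloom filter, so a cheap AND + popcount can reject obviously dissimilar
# pairs before doing any exact set intersection.
import hashlib

WIDTH = 64  # fixed width; the real size is an assumption

def bloom(items, k=2):
    bits = 0
    for item in items:
        for i in range(k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8)
            bits |= 1 << (int.from_bytes(h.digest(), "big") % WIDTH)
    return bits

a = bloom({"example.com", "blog.example.org"})
b = bloom({"example.com", "news.example.net"})
overlap = bin(a & b).count("1")  # popcount of the AND
print(overlap)  # 0 means definitely disjoint; >0 means "maybe similar"
```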
I am surprised nobody has thought about looking into page content itself to help fight spam. If a blog has nothing except paid affiliate links (Amazon, etc.), ads, and popups after page load (newsletter signups, etc.), then it should probably be down-ranked.
I have actually been developing something like that, but it does more, including down-ranking certain categories of sites that contain unnecessary filler, such as some recipe sites.
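A crude sketch of that kind of signal (the affiliate hints and scoring are entirely made up, and it assumes BeautifulSoup):

```python
# Penalize pages whose links are mostly affiliate/tracking links.
from bs4 import BeautifulSoup

AFFILIATE_HINTS = ("amzn.to", "tag=", "affiliate", "utm_campaign")

def affiliate_link_ratio(html: str) -> float:
    soup = BeautifulSoup(html, "html.parser")
    links = [a.get("href", "") for a in soup.find_all("a")]
    if not links:
        return 0.0
    hits = sum(any(h in href for h in AFFILIATE_HINTS) for href in links)
    return hits / len(links)  # 1.0 = every link looks like an ad
```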
This is such a great idea! Often when I find a small blog or site, I want more like it, and this is the perfect tool to discover that. It's a clear and straightforward idea in retrospect, as all really great ideas tend to be!
To be fair, they are two different metrics. PageRank measures how authoritative a page is. The cosine metric measures how similar a page is to another one.
It's not comparing domain names or even content, but the similarity of the sets of domains that link to the given pair of sites. That's sort of the neat part: how well this property correlates with topic similarity.
OK, so certain domains X link to www.whatever.com. Certain domains Y link to whatever.com. The similarity between those is 42% in some metric (like what they link to?)
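Roughly, yes, except the score compares the two sets of linking domains directly rather than what they link to. A toy version (the domains are made up, and cosine over binary sets is my reading of the article's metric; the 42% would just be whatever this returns for the real link sets):

```python
import math

X = {"a.com", "b.com", "c.com", "d.com"}            # link to www.whatever.com
Y = {"b.com", "c.com", "d.com", "e.com", "f.com"}   # link to whatever.com

similarity = len(X & Y) / math.sqrt(len(X) * len(Y))
print(f"{similarity:.0%}")   # 3 / sqrt(20) ≈ 67%
```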
The algorithm does not exist to be manipulated in that way. In fact, the article ends with "This new approach seems remarkably resistant to existing pagerank manipulation techniques". It is my opinion (and I know some people will disagree with me) that SEO is harmful and should not exist. Since you're still new to the industry, it might be worthwhile pivoting to a different occupation.
Hi, I am not that familiar with page ranking and its terminology. I think my original questions may have been misunderstood.
I am writing my own web scraper; that is why I am interested in this topic in the first place.
To distinguish poor pages from better ones, I check the HTML. I think all scrapers need to do that. I rank pages higher if they contain valid titles, og: fields, and so on.
There is nothing wrong with checking that, or with asking what I can do to make my site more scrape-friendly.
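For illustration, that kind of check might look like this (a sketch with made-up weights, assuming BeautifulSoup):

```python
# Reward pages that ship a real <title> and OpenGraph metadata.
from bs4 import BeautifulSoup

def html_quality_score(html: str) -> int:
    soup = BeautifulSoup(html, "html.parser")
    score = 0
    title = soup.find("title")
    if title and title.get_text(strip=True):
        score += 1
    og_tags = [m for m in soup.find_all("meta")
               if str(m.get("property", "")).startswith("og:")]
    if og_tags:
        score += 1
    return score
```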
The way the ranking algorithm works is by comparing the similarity of the inbound links between websites.
So to manipulate the algorithm, you'd need to find an important website, and then find a way of making changes to all the websites that link to that website to add a link to your own website.
PageRank is. This is a modification of PageRank. The original algorithm calculates the eigenvector of the link graph.
This algorithm uses the same method to calculate an eigenvector in an embedding space based on the similarity of the incident vectors of the link graph.
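A minimal sketch of the difference, as I read it (toy graph; numpy power iteration):

```python
# Classic PageRank iterates on the (normalized) link matrix itself; my
# reading of the variant above is that it iterates on the cosine
# similarity matrix of the pages' inbound-link ("incident") vectors.
import numpy as np

# Toy link graph: L[i, j] = 1 if page j links to page i,
# so row i is page i's inbound-link vector.
L = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

rows = L / np.linalg.norm(L, axis=1, keepdims=True)
S = rows @ rows.T                 # cosine similarity between inbound vectors

v = np.ones(len(S)) / len(S)
for _ in range(50):               # power iteration -> dominant eigenvector
    v = S @ v
    v /= np.linalg.norm(v)
print(v)                          # importance scores in the similarity space
```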
Hmmm, odd. I was under the impression they used cosine similarity based on page content. Once upon a time, based on that 'memory', I created a system to bin domain names into categories using cosine similarity. It worked surprisingly well.