Looking at https://explore2.marginalia.nu/search?domain=simonwillison.n... now that's an interesting service. The web has felt isolating since it became commercialized. Bloggers are living in the Google dark ages right now. Having information like this be readily accessible could help us find each other and get the band back together. The open web can be reborn.
Yeah, this is basically what I've been trying to show people for the last few years. There's still so much wild Internet out there if you go looking for it.
Yes, explore2 is just a demo: an unfiltered listing of the output of this algorithm. For better or worse, it has no concept of dead links. If and when I productize it, it needs to hook into the search engine's link database better.
Well I mean I could, but it's easier and more convenient to just put them in a directory on the server than go through all the rigmarole of creating a torrent.
Hey, that's an interesting idea for a side project: an easy way to serve a file via torrent. Maybe a commandline utility that uses WebSeeding so you don't need to run a torrent client? Hmm.
(I did a quick `apt search` to see if something like that is already available, but didn't find anything.)
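Back-of-the-envelope, something like this could work (an untested sketch assuming the third-party `torf` library; the tracker is just an example, and the file would need to be served over plain HTTP anyway, which is exactly what makes the URL usable as a web seed per BEP 19):

```python
# Hypothetical "serve this file as a torrent" helper using torf
# (pip install torf).
from torf import Torrent

def make_webseeded_torrent(path, http_url, out="out.torrent"):
    t = Torrent(
        path=path,
        trackers=["udp://tracker.opentrackr.org:1337/announce"],  # example
        webseeds=[http_url],  # clients can fetch pieces over plain HTTP
    )
    t.generate()  # hash the pieces
    t.write(out)
    return out

make_webseeded_torrent("dataset.tar.gz", "https://example.com/dataset.tar.gz")
```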
In 2012 I was trying to turn my PhD thesis into a product for a better guitar tab and song lyrics search engine. The method was precisely this: use cosine similarity on the content itself (musical instructions parsed from the tabs or the tokens of the lyrics).
This way I wasn't just able to get much better search results than with PageRank; a useful byproduct of the approach was that you could cluster the results and pick a distinct cluster for each subsequent result. With Google you would not just get bad results at number 1: results 1-20 would be near duplicates of just a few distinct efforts.
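For anyone curious, the diversification trick looks roughly like this (a toy sketch; TF-IDF stands in for the parsed tab/lyric features, and the documents and cluster count are made up):

```python
# Rank by cosine similarity to the query, then take the top-scoring hit
# from each cluster so results 1..k aren't near duplicates of each other.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

docs = ["smoke on the water tab", "smoke on the water lyrics",
        "stairway to heaven tab", "hotel california chords"]
query = "smoke on the water"

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
scores = cosine_similarity(vec.transform([query]), X).ravel()
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

seen, diversified = set(), []
for i in scores.argsort()[::-1]:          # best score first
    if clusters[i] not in seen:           # one result per cluster
        diversified.append(docs[i])
        seen.add(clusters[i])
print(diversified)
```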
Unfortunately I was a terrible software engineer back then and had much to learn about making a product.
The author describes calculating cosine similarity of high dimensional vectors. If these are sparse binary vectors why not just store a list of nonzero indexes instead? That way your “similarity” is just the length of the intersection of the two sets of indexes. Maybe I’m missing something.
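For what it's worth, on binary vectors cosine similarity reduces to exactly that intersection, divided by a normalization term, so the set sizes matter too, not just the overlap:

```python
# Cosine similarity of two binary vectors, stored as sets of 1-bit indexes:
# |A ∩ B| / sqrt(|A| * |B|). The raw intersection length alone would
# systematically favor large sets.
import math

def cosine_binary(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

a = {3, 17, 42, 99}          # indexes of the 1-bits in vector A
b = {3, 42, 100}             # indexes of the 1-bits in vector B
print(cosine_binary(a, b))   # 2 / sqrt(4 * 3) ≈ 0.577
```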
A dense (bitmap) representation of the matrix wouldn't fit in memory; it would require about a PB of RAM unless my napkin math is off. The cardinality of this dataset is in the 100s of millions.
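The napkin math, taking the lower end of "hundreds of millions":

```python
n = 100_000_000                # ~10^8 domains
bits = n * n                   # one bit per (domain, domain) cell
print(bits / 8 / 1e15, "PB")   # 1.25 PB, so "about a PB" checks out
```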
(An additional detail: I'm actually using a tiny fixed-width bloom filter to make it go even faster.)
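I don't know the exact construction used there, but the general shape of the trick is something like this (hypothetical sketch; the width and hash count are assumptions):

```python
# Squash each domain's inbound-link set into a fixed-width (here 64-bit)
# bloom filter, so a cheap AND + popcount can reject obviously dissimilar
# pairs before doing any exact set intersection.
import hashlib

WIDTH = 64  # fixed width; the real size is an assumption

def bloom(items, k=2):
    bits = 0
    for item in items:
        for i in range(k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8)
            bits |= 1 << (int.from_bytes(h.digest(), "big") % WIDTH)
    return bits

a = bloom({"example.com", "blog.example.org"})
b = bloom({"example.com", "news.example.net"})
overlap = bin(a & b).count("1")  # popcount of the AND
print(overlap)  # 0 means definitely disjoint; >0 means "maybe similar"
```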
I am surprised nobody has thought about looking into page content itself to help fight spam. If a blog has nothing except paid affiliate links (Amazon, etc.), ads, and popups after page load (newsletter signups, etc.), then it should probably be down-ranked.
I have actually been developing something like that, but it does more, including down-ranking certain categories of sites that contain unnecessary filler, such as some recipe sites.
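A crude sketch of that kind of signal (the affiliate hints and scoring are entirely made up, and it assumes BeautifulSoup):

```python
# Penalize pages whose links are mostly affiliate/tracking links.
from bs4 import BeautifulSoup

AFFILIATE_HINTS = ("amzn.to", "tag=", "affiliate", "utm_campaign")

def affiliate_link_ratio(html: str) -> float:
    soup = BeautifulSoup(html, "html.parser")
    links = [a.get("href", "") for a in soup.find_all("a")]
    if not links:
        return 0.0
    hits = sum(any(h in href for h in AFFILIATE_HINTS) for href in links)
    return hits / len(links)  # 1.0 = every link looks like an ad
```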
This is such a great idea! Often when I find a small blog or site, I want more like it, and this is the perfect tool to discover that. It's a clear and straightforward idea in retrospect, as all really great ideas tend to be!
To be fair, they are two different metrics. PageRank measures how authoritative a page is. The cosine metric measures how similar a page is to another one.
It's not comparing domain names or even content, but the similarity of the sets of domains that link to the given pair of sites. That's sort of the neat part: how well this property correlates with topic similarity.
OK, so certain domains X link to www.whatever.com. Certain domains Y link to whatever.com. The similarity between those is 42% in some metric (like what they link to?)
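Roughly, yes, except the score compares the two sets of linking domains directly rather than what they link to. A toy version (the domains are made up, and cosine over binary sets is my reading of the article's metric; the 42% would just be whatever this returns for the real link sets):

```python
import math

X = {"a.com", "b.com", "c.com", "d.com"}            # link to www.whatever.com
Y = {"b.com", "c.com", "d.com", "e.com", "f.com"}   # link to whatever.com

similarity = len(X & Y) / math.sqrt(len(X) * len(Y))
print(f"{similarity:.0%}")   # 3 / sqrt(20) ≈ 67%
```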
The algorithm does not exist to be manipulated in that way. In fact, the article ends with "This new approach seems remarkably resistant to existing pagerank manipulation techniques". It is my opinion (and I know some people will disagree with me) that SEO is harmful and should not exist. Since you're still new to the industry, it might be worthwhile pivoting to a different occupation.
Hi, I am not that familiar with page ranking and its terminology. I think my original questions may have been misunderstood.
I am writing my own web scraper; that is why I am interested in this topic in the first place.
To distinguish poor pages from better ones, I check the HTML. I think all scrapers need to do that. I rank pages higher if they contain valid titles, og: fields, and so on.
There is nothing wrong with checking that, or with asking what I can do to make my site more scrape-friendly.
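For illustration, that kind of check might look like this (a sketch with made-up weights, assuming BeautifulSoup):

```python
# Reward pages that ship a real <title> and OpenGraph metadata.
from bs4 import BeautifulSoup

def html_quality_score(html: str) -> int:
    soup = BeautifulSoup(html, "html.parser")
    score = 0
    title = soup.find("title")
    if title and title.get_text(strip=True):
        score += 1
    og_tags = [m for m in soup.find_all("meta")
               if str(m.get("property", "")).startswith("og:")]
    if og_tags:
        score += 1
    return score
```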
The way the ranking algorithm works is by comparing the similarity of the inbound links between websites.
So to manipulate the algorithm, you'd need to find an important website, and then find a way of making changes to all the websites that link to that website to add a link to your own website.
PageRank is. This is a modification of PageRank. The original algorithm calculates the eigenvector of the link graph.
This algorithm uses the same method to calculate an eigenvector in an embedding space based on the similarity of the incident vectors of the link graph.
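A minimal sketch of the difference, as I read it (toy graph; numpy power iteration):

```python
# Classic PageRank iterates on the (normalized) link matrix itself; my
# reading of the variant above is that it iterates on the cosine
# similarity matrix of the pages' inbound-link ("incident") vectors.
import numpy as np

# Toy link graph: L[i, j] = 1 if page j links to page i,
# so row i is page i's inbound-link vector.
L = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

rows = L / np.linalg.norm(L, axis=1, keepdims=True)
S = rows @ rows.T                 # cosine similarity between inbound vectors

v = np.ones(len(S)) / len(S)
for _ in range(50):               # power iteration -> dominant eigenvector
    v = S @ v
    v /= np.linalg.norm(v)
print(v)                          # importance scores in the similarity space
```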
Hmmm, odd. I was under the impression they used cosine similarity based on page content. Once upon a time, based on that 'memory', I created a system to bin domain names into categories using cosine similarity. It worked surprisingly well.