Google's PageRank patent has expired (2019) (patents.google.com)
242 points by ddtaylor on June 1, 2022 | hide | past | favorite | 168 comments



I'm not sure how useful this might be to anyone anymore.

Google still uses it for the base ranking. But the results are then run through a variety of add-on ML pipelines, like Vince (authority/brand power), Panda (content quality), Penguin (inbound link quality), and many others that target other attributes (page layout, ad placement, etc). Then there are also more granular weightings for things like "power within a niche", where a new page might do well for plumbing (because of other existing pages on the site) but wouldn't automatically have any authority for medical topics.
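
For intuition, here's a toy Python sketch of that layering idea: a query-independent base score adjusted by per-pipeline multipliers. The pipeline names are taken from the comment above; the structure and weights are invented for illustration and are not Google's actual implementation.

    # Purely illustrative: combine a base link-graph score with hypothetical
    # add-on signal multipliers. Nothing here reflects Google's real pipelines.
    def adjusted_score(base_pagerank: float, signals: dict) -> float:
        # Each "pipeline" contributes a multiplier: < 1 demotes, > 1 promotes.
        multipliers = {
            "brand_authority": 1.0 + 0.5 * signals.get("brand_authority", 0.0),
            "content_quality": 0.5 + signals.get("content_quality", 0.5),
            "link_quality":    0.5 + signals.get("link_quality", 0.5),
            "niche_power":     1.0 + 0.3 * signals.get("niche_power", 0.0),
        }
        score = base_pagerank
        for factor in multipliers.values():
            score *= factor
        return score

    # Example: a page with strong brand signals but mediocre content quality.
    print(adjusted_score(0.004, {"brand_authority": 0.9, "content_quality": 0.4}))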


My search results were a lot better in 2006 when, I assume, they didn't have all these ML pipelines...


My view of it is that they basically lost the SEO spam wars. Without the ML pipelines, the top results would all be dominated by the smaller, highly skilled, SEO manipulators. They didn't find a way to cleanly excise that spam, so they resorted to a very imperfect hammer...giving a lot of SEO weight to large corporate entities. So basically, a different kind of spam dominates now.


Minor rant: I can't seem to contact an actual towing company directly; it's usually some external entity that ranks at the top in Google and then charges you more to connect you with the actual local towing company. It says "Towing company in city name" but it's not local.


This was my experience last year when I wanted to find a locksmith. EVERY result on GMaps that was shown as being in my city was, in fact, some centralized company that seemed to contract out to guys working out of their cars. Every one took my name and said they'd get back to me (which they did).

This is certainly a problem with the locksmith companies, but I think there's also a Maps problem, too: Google enables this kind of commerce, especially businesses that apparently have a physical location but explicitly stated on the phone that I can't stop by there. It reminded me of all of the thousands of Delaware corps housed in a single building.

If Google made it so that you're only listed on the map if you actually have a physical commerce location, that would help a lot -- at least for those kinds of businesses that need it. Towing companies and the like may be an exception?


Oh locksmiths are a fun one.

The type of people who get into locksmithing are like those who do computer security for fun - they like to figure out how things work and how they can be broken. Which means that every locksmith wants to figure out how to game the system. And the first thing that they all figured out is that people tend to select whatever locksmith is closest. So they all went and pretended to be in a million places around the neighborhood in hopes that THEY would get selected.

That was the case a decade ago. And it was a nightmare. Glancing at a locksmith search now, Google somehow cleaned that up a lot. But it doesn't surprise me that whoever is on top now is someone who figured out how to game the current system.


Does Google get paid more if they link you directly or if they link you to the middleman?

> The goals of the advertising business model do not always correspond to providing quality search to users. - Backrub paper


> This was my experience last year when I wanted to find a locksmith. EVERY result on GMaps that was shown as being in my city was, in fact, some centralized company that seemed to contract out to guys working out of their cars. Every one took my name and said they'd get back to me (which they did).

I had the same thing happen just last week! The locksmith who came for me gave me his personal phone number though, and said if I needed help again I could call him directly and he could give a better rate due to not needing to pay the centralized company a portion. Despite often having social anxiety with strangers, I've still found that just being friendly and treating people well (in this case, still giving a tip alongside the cancellation fee after my super finally arrived just moments before the locksmith) ends up getting me better service than any research I've been able to do online. I ended up making a similar connection with the people we hired to install our curtain rods just after we moved in; I was talking to them about how the movers that brought my girlfriend's things over to the apartment ended up damaging her desk in transit, and they said that they also do moving as well as installation/assembly, so next time we end up moving, we will also be able to hire people we know and trust (we've ended up hiring them again for a few odd jobs since the first time, and we almost always end up getting one of the same three people we've met and made connections with).


Not sure what the problem is here? You require that a locksmith sits in a brick and mortar store waiting for phone calls? Their tools fit in a backpack, there's literally no need for a physical location. I guess unless you're getting keys copied but any hardware store can do that.


If it's a Google Maps hit, check Street View and look for an actual storefront? Many mom-and-pop locksmiths have stores with safes etc. on display. Though that won't work so well with tow companies.


This is funny for me since I was in Maps in 2010-2011, and "fake locksmith addresses" were a problem even then.


The same is true for me for contractors of any sort. Top links are frequently referral farms.

For local search, I almost always use google maps instead. Maps search isn't perfect, but it beats web search hands down.


It's really annoying - you have to dig around, and many businesses do not have a web presence at all, so unless you can find something local with ads, you find nothing.

Some tricks I've used in the past include local church bulletins (most parishes have some sort of a website and a weekly bulletin with advertising on the back), or local sport team sponsors, or local bars.

For towing, you can also try calling a local dealership or auto repair.


Or their only web presence at all is a Facebook page. I don't have a Facebook account, so this used to be mildly annoying, but lately I've noticed it won't let me see the page at all. I click the link to visit the business's "website" and I'm greeted with a full-page prompt to sign in to Facebook.


Yeah, their "local search" stuff seems to use yet a different set of criteria. You'll see similar complaints for other niches like locksmiths. Some set of SEO spammers has figured out how to use services like UPS-Store virtual addresses to fool Google.


There's a tonne of low-hanging fruit Google completely ignores; for example, any page with an Amazon referral link is almost certainly spam.

"But wait!", you say, "there are some legit reviewers out there." Yes, there sure are, by my starement is accurate, because for every legit review site with aws referrals, there are tens of thousands of ml created spam referral sites.

And so the real review sites are often lost in the mix regardless, which makes arguments to keep those results pointless.

But google leaves them there, and this is the same sort of site which, if it were an email, would immediately end up in a spam folder.

And beyond this, the other part of the problem is their ridiculous aliasing of search terms, which helps spammy sites come back as a response.

You say Google lost? It's not losing if you just don't care.

Frankly, it's just a return-on-investment thing. As long as only a few people per tens of thousands bolt, why spend the R&D?


> There's a tonne of low-hanging fruit Google completely ignores; for example, any page with an Amazon referral link is almost certainly spam.

This is the root of the problem. If Google targets the low-hanging fruit, the spammers will very quickly find a workaround. Google is trying - and failing - to use more sophisticated signals.

Spam is a virus. Google benefited from the network effect, eradicating all opposition. But, like a European medieval monarch, it now has a poor immune system because of its lack of genetic diversity. The many, many SEO spammers are constantly experimenting, and they only have to find one flaw in Google's algorithm for the spam to win and spread quickly.

Google's monoculture cannot save us, however hard they try.


If you target the spam business model itself, though, the spam can't route around the block.

But yeah, a healthy set of search engines will get us better results.


I put Amazon referral links on my technical blog whenever it's a hardware project. So far I have made < $5 on it so I should probably remove them. But for others I think it helps justify the time I spent writing up the blog and doing the project.


If there is a numbered list with referral links, 99% of the time it's low-effort referral spam trash. That they don't already filter this must be some sort of active sabotage.


Such links are relatively easy to hide from Google (obfuscate behind a redirect or inject with JS on user action), so if Google started using the links as a signal, spammers would hide the links, and invert the signal — only non-spammers who don't want to risk being delisted for cloaking would be left with the links, and penalized for them.


A different, more profitable kind of spam dominates now.


A kind of spam from which Google profits because it often embeds its ads and/or analytics.


Blog farms are way too popular now, too. People have figured out how Google recognizes "meaningful content", and they generate blog posts and pages with AI and ghost writers to pump out vaguely helpful articles.


They downranked pirated-content results in 2018[1], so someone could just use PageRank, include those sites, and have a better search engine for that sort of content. There's also the "controversial twiddler"[2], which makes Google present only mainstream opinions on various topics; I had to search yandex.com to even find a good article on it. There's some weird stuff on the YouTube blacklist, like the Las Vegas mass shooting from a few years ago that everyone forgot about and that they never figured out the motive for. See the leaked "YouTube Blacklist" under the "Censorship" link in the Project Veritas article.[2]

[1]https://fossbytes.com/google-downranks-65000-torrent-sites-i...

[2]https://www.projectveritas.com/news/google-machine-learning-...


>smaller

Perhaps in other words, market forces (I'm thinking ROI) and economies of scale will win.


What I was talking about is that there was once a pretty large community of smaller SEO manipulators that did a lot of experimentation on what really worked. Especially in the "black hat" realm. Finding things like areas on big-brand sites that would accept user-generated content, comments, etc, where you could bypass some checks and insert links to sites.

They would experiment in a pretty deep way, varying things like the rate of new links, type of new links, variations of anchor text / bare links, and so on.

It USED to be very effective.

The larger entities don't really have to be that detailed. If you have that brand power, you can just cut partnership/cross-link deals and pay a little attention to things like anchor text in links, contextual text around the link, etc.

Edit: Ah, yeah, agreed. What got lost in all this was good content that had no big brand behind it. The indiscriminate hammer Google used to kill off small-guy SEO spam also pushed a lot of actually good stuff, stuff that never did any SEO at all, off the first page.


For sure. Way back, any high-PageRank page linking to another would unequivocally boost the target site, whatever the niche, but it's become more nuanced, as you say.

What I meant was that the players who can manipulate the link graph most cost-effectively tend to rank better, and it favours those with deeper pockets. Not so bad for competitive niches and high-volume terms, but it muddies the waters for many other things.


That's because the web was better in 2006.

I think Google has made one change for the worse, though, which is strongly favoring more recent content. Increasingly, I think that change has been a big contributor to the decay of the web since.


Part of the problem is, I think, many searches strongly benefit from up to date content- programming tools, fashion, celebrities, things to do in X area, etc.

It seems that Google has decided that most people want the most updated information when they look for something, which I don't think is entirely unreasonable.

What I would love, however, is a way to turn that off for particular searches. Researching past events, as a trivial example, benefits far more from exact results rather than most recent tangentially related blogspam.


You mention options in passing, but it is to me the root of the problem: Google hates giving up control. Control means ad revenue. So we could have options that would make search extremely efficient for most users, but that would presumably be very hard to monetize in comparison. So we have no options, and everyone gets mediocre to bad results.

Since everyone I know in tech laments Google's decline into uselessness, I'm assuming this is not a sustainable strategy.


> everyone gets mediocre to bad results

Everything I've heard from people I know at Google suggests otherwise. Most searches for most people ... work. I too struggle to get Google to work in specific research cases, and I would like more power-user toggles, but basic searches like "$celebrity_name photos" or "$my_kids_school calendar" or "pizza places near me" just sorta work.


Hard to know anything for sure since results are "tailored". But Google used to be excellent for technical searches, whereas now it is unhelpful at the best of times. I'm guessing this isn't counted in "most searches for most people".


> What I would love, however, is a way to turn that off for particular searches

But you can! In search results, click Tools and switch the Any time dropdown to Custom range... and you can specify a date range in the past. (Apparently, the custom option is hidden in the mobile version?!) I'm not sure how precise and dependable it is but it seems at least partially useful when I search for historical events.
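
If you prefer to build the URL yourself, the custom date range appears to map to the tbs=cdr parameters. Note this is undocumented behavior (an assumption on my part), and Google could change or ignore it at any time:

    https://www.google.com/search?q=texas+blackout&tbs=cdr:1,cd_min:1/1/2021,cd_max:3/1/2021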


I don't actually want to exclude new content, I just don't want to give it priority over older content if the older content is at least equally specific in matching my query.


I think you've forgotten the websites with walls of white-on-white keywords/links at the bottom, trying to game PageRank. Many of these models were marked improvements compared to the generations before.


I don't understand how you're comparing the two.

You think the results were better in 2006, but the corpus was also different.


There was way less SEO back then. If they continued to use the old algorithm, your results would be worse than what you have now.


What are you saying?!

There was loads of SEO back in 2000, even! It brought AltaVista, the number one search engine of the day, to its knees.

Google got started, grew, because it filtered all that SEO spammy junk.

Do you have specific stats to back this up? Number of SEO pages vs good ones?

Or are you just presuming?


You're both right. There was loads back then but there was also way less than now.


PageRank wasn't a pure algorithmic solution in 2006 either.

So, while the problem changed I am not sure if it got harder or Google’s priorities shifted.


Doubt it. Even today, obvious spammy content aggregators show up for many keywords with stuff taken directly from Reddit/Stackoverflow/Superuser/etc.

All of Google's fancy ML could've been replaced with a simple report button - enough people report a site and either trigger a manual review (best option) or just ban them (could be used against competitors... but negative SEO was already a thing and is still widely used).


> My search results were a lot better in 2006 when, I assume, they didn't have all these ML pipelines...

That's like an old person complaining that their body felt a lot better back in 2006 when, I assume, they didn't have to use their walker and glasses all the time...


Search is objectively worse today than it used to be. It's not just that "it's harder to use for old people"; it's just worse.


>Search is objectively worse today than it used to be

How would you demonstrate that search is objectively worse? And how would you then show that it's a result of Google's algorithms, and not a consequence of the content of the Internet changing significantly?


>How would you demonstrate that search is objectively worse?

There's a few ways to do that. The easiest is to point out that almost any lucrative search query has -0- organic results above the page fold on a typical monitor today. It's all ads unless you scroll down.

Then, it's not proof, but how much time do you think Google spends on things that sit below the fold and aren't clicked on much? What would the financial incentive be?


> how much time do you think Google spends on things that sit below the fold and aren't clicked on much? What would the financial incentive be?

Quality organic search results is the reason they can have ads above the fold. There's tremendous financial incentive for them to care about that.


It's a result of people generating content solely to cater to Google's algorithms and accumulate ad and referral revenue. The internet changed significantly because of how Google indexes and ranks content. The proof is in the pudding... results for product searches are dominated by SEO referral link blogspam. Entire careers and businesses that didn't exist in 2006 have been built around this.


> How would you demonstrate that search is objectively worse?

How would you demonstrate that being 50 years old is worse than being 25 years old?

You ask people that are 50 years old or older because they have been on both sides.

> it's a result of Google's algorithms

Well, it's simply in front of your eyes: this [1] was not possible in 2006.

Anyway, the fact that you cannot easily find on Google why Google search results are worse proves that Google search results are worse today than in the past.

[1] https://www.theverge.com/tldr/2020/1/23/21078343/google-ad-d...


I was trying to be clever with my metaphor, but since a lot of people seemed to miss my point, I'll spell it out.

An old person's glasses and walker don't make their body feel worse. They're responses to an underlying change, and in fact make them feel better than they would without them.

Similarly, I'd argue that the ML pipelines and complexities in Google search aren't why search results are worse today. Rather, the web has changed with more SEO spam, walled gardens, content in videos, and search has changed in that you try to find more kinds of information than ever before. It's the underlying changes that make the search seem worse, and all of Google's fancy algorithms are imperfect responses to that. Without them, I'd be surprised if Google's results weren't far worse than in 2006.

The comment I responded to:

> My search results were a lot better in 2006 when, I assume, they didn't have all these ML pipelines...

made it seem like maybe the ML pipelines were somehow causing the decline in quality, rather than simply an imperfect response to changes in the web since then.


It would be cool if we had a snapshot of the web from the time you think it was better and could pit the algorithm of today against the algorithm that was contemporary with the snapshot. It would also be interesting to take the old algorithm and apply it to the web of today.

My bet would be that each algorithm would perform best on the web of its day.


I find that Google search results are still the best of any search engine for specific computer science and other specific tech-related topics, as long as you construct a fairly complex search string. However, for general information on things like world events, local news, national politics, etc. it's become little more than a mirror for corporate and state propaganda outlets. This is likely due to those very ML pipelines mentioned above:

Vince (authority/brand power), Panda (content quality), Penguin (inbound link quality)...

This represents a pretty severe narrowing of results on information and opinions - possibly the worst results are on Youtube searches for newsworthy events. It'd be very interesting to see what kind of content a pure PageRank algorithm-based search engine would generate today, and I'd be very interested in using such a search engine. Now, would it be overrun by SEO? I don't know, but it'd be worth finding out.

I kind of wonder whether Google Scholar is purely PageRank- or citation-count-based; it still gives very useful results with relatively simple query strings.


> I find that Google search results are still the best of any search engine for specific computer science and other specific tech-related topics

even without a Google account?

This is my Google experience from my temporary workplace in Barcelona:

https://i.imgur.com/X2Avw2v.png

Then I click on "I agree", and whatever I search for, I receive Spanish results because I am in Spain.

Not my idea of "the best of any search engine", but ok...


I never log in. I find the key to better results is to keep adding terms to the query and to force the 'verbatim' option under Tools. Restricting to specific domains (.edu etc.) with site: helps a lot, and the NOT option is sometimes helpful (-this -that).

YouTube search is awful. I once went through news queries on Google searches of YouTube (site:youtube.com) looking for popular independent media outlets (BreakingPoints, for example) by excluding -MSNBC, -FOX, -CNBC, etc., and ended up with a string about 25 queries long, and those shows are just banned. They were feeding me unpopular corporate media shows with hardly any views or subscribers at that point, but zero access to independent media. Full-on propaganda manipulation.
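
To make the query-construction tips in the first paragraph concrete, a few example patterns (the sites and terms here are arbitrary examples, not recommendations):

    "exact phrase from the paper" site:arxiv.org
    texas blackout npr 2022 -wikipedia
    intitle:"release notes" site:github.com postgresql
    linker error "undefined reference" -site:pinterest.com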


> That's like an old person complaining that their body felt a lot better back in 2006

It's simply a fact that in 2006 Google search results were better.

Reasons might vary and could well not be Google's fault, but it's not an old man yelling at clouds; it's provably true.

Of course, people who were not using Google before 2006 can't really know, just like people who are not old cannot experience how much better being young is, body-functioning-wise.


Your search results were a lot better in 2006 when there wasn't yet a huge population of sites cranking out SEOed garbage.

Google is trying, they're just losing the arms race of detecting crap content vs generating it.


Why not crowd-source it?

All Google needs is an obvious, one-click "spam" button for logged-in users. Clicking should add a site to the user's spam blacklist (which they should be able to review).

They know about what users search for and how those searches overlap and separate users into groups (not to mention all the individual details they have access to). When sites are marked as spam by enough different types of users, those sites can then be manually reviewed and their content blacklisted by Google preventing the same or very similar sites with a different URL from appearing.

Unfortunately, Google makes a lot of money from these ad/spam sites, so they have a perverse incentive to keep allowing them.


It takes great effort to make a high-ranking spam site, and it's trivial for Google to destroy one. It would be a winning battle.

I think that Google just doesn't care and is happy to be a search engine for Reddit and Wikipedia and to answer questions like "How old is Lady Gaga?".


Man, I have such a different search experience than everyone on HN. Google is amazing for me. Can someone give me a search query they think gives "objectively" bad results, and maybe some links they would expect to see show up in the top that aren't there? Or is it that you search for something and you don't find anything? Or is it just that people don't like that there are ads?

I genuinely don't understand.


I feel the same way. I think one of the biggest startup opportunities now is to create a decent search engine. Google is absolute rubbish, and a new player needs to step in and take away the market share that Google is throwing away by means of a terrible user experience.


I strongly disagree that there's a startup opportunity in Google's relatively poor search performance.

Obsessive programmers like myself really want a search engine that does better and helps us find the weird corners of the web.

Most people do not care. They can type a question into Google and get an answer, and that's all they're looking for.


People love to say that Google has "relatively poor" performance.

Relative to what? Google has better performance than any other search engine. It's relatively poor compared to the imaginary ideal search engine that gives you exactly the result you want for any query regardless of whether the information even exists or not.


I think your startup plan should not be: "Step one: Displace Google Search." Instead, you find the underserved niche, people who care about quality search results (maybe in a specific domain), and you address that. As you conquer the small market you expand.

Google really does seem to be frustratingly bad these days.


And we have had a variety of competing search engines that still aren't as good. It isn't like the idea hasn't been tried: for example DuckDuckGo, or even Bing, and there have been a half-dozen others too.


Indeed. It's an arms race. Search engines versus SEO.


Wholeheartedly agree, but with an important caveat - Google is no longer fighting for the best interests of its users; it's fighting for the best interests of its advertisers. How useful the results are is now secondary to "how much can this search be monetized". The former was important in the early days, when Google needed to be competitive on search.


Most directly, if the search results were perfect you would never need to click on an ad.

The most evil thing Google is doing now is pushing brands to buy ads on their own name so that they appear above the organic #1 search result. It's like the time Facebook decided it wouldn't show your posts to the followers who liked your page unless you paid up.


> The most evil thing Google is doing now is pushing brands to buy ads on their own name

It's not like this is a new thing. The court cases about it are almost 20 years old: https://en.m.wikipedia.org/wiki/Google_v._Louis_Vuitton


That doesn't mean that present-day internet content and the terms you search for today would have returned better results using the Google of 2006.


I think advanced SEO tactics would destroy the results in 2006 though.


Ditto.


I've always thought their main intentions would be topical PageRank and author PageRank. Obviously Google has made many leaps with natural language, but measuring 'intent' and inherent trust is still unsolved. Google+ may have solved it. I remember hearing about an "author rank" type patent.

All the same, the link graph has been truly bastardised because of it being such a prominent facet of ranking for an audience.


I have not used PageRank to rank search results, but to detect monitoring sensors in P2P botnets as part of my master's thesis, and it worked pretty well. It is quite effective at detecting sensors, and the only way to prevent detection is to reduce incoming edges, thereby requiring many sensors to monitor a large botnet.

So the algorithm is still useful, and not only for search results.


please, can you share a link to your thesis?


Sorry for the delay. Disclaimer: I turned the thesis in already but I'm still waiting for my supervisor's feedback. So it might be completely wrong and has not been reviewed until now.

The thesis itself can be found here: https://git.vbrandl.net/vbrandl/masterthesis/raw/branch/mast...

Some groundwork I built upon: "SensorBuster: On Identifying Sensor Nodes in P2P Botnets" https://git.vbrandl.net/vbrandl/masterthesis/raw/branch/mast...


> "power within a niche"

In my country we have a website that has a complete monopoly over sales of used items, services, etc. I'm always amazed that Google is able to determine this and put it in the top 3 results for a wide majority of searches. "laptop mouse" brings up that site, but "kitchen sink" doesn't, presumably because people don't buy used sinks.


If google returned the used item site for "kitchen sink" and nobody clicked on that result, it would be easy for an algorithm to make the connection that this is a bad result that shouldn't be ranked very high.


I never thought about "brand power" being a metric but that definitely makes sense. I'd be very curious to see what search results would look like if that weight was inverted (more "brand power" the lower the rank). I don't think it would be _better_ but it might make exploring lesser known bits of the internet more accessible.


Google was more cagey about Vince and brand power than they were about other updates. The popular theory was that there was some hand-picked set of "popular brand urls" and that pagerank flowing from those had more weight (beyond the already heavy weight they would have because of their link graph).


And yet, the results are worse and worse.


To be fair, they don't seem to teach the next generation about search operators.

I've found just using quotes, OR/AND, and some other basic stuff can often get me what I need.

The place where Google does well is that they're a monopoly, so things like image search that require a lot of resources, they are better at, especially paired with their geographic location. (USA! USA! USA!)

Edit: I hit enter and the post went out while drafting.


I've found search operators and quotes are less and less effective — especially on Google, but on other search engines as well — as time goes on. I think OR/AND were removed a long time ago, and things like + and quotes aren't effective, because they're still subjected to the same processing (stemming, synonym substitution, whatever ML nonsense Google does) as plain queries. So in general, you don't get what you're searching for; you get what most people making a similar query would get.

It's made it harder to search for bits of poetry, quotations, or song lyrics, especially.



>I've found search operators and quotes are less and less effective — especially on Google

On Google, I treat my search terms like a Venn diagram if I want good results.

Eg: want an article about the texas blackouts, but not the ones a decade ago?

Type "texas blackout npr 2022" minus quotes.

But if you do that on other engines, it may be MUCH more literal, the literal intersection of those terms, and I need to do the opposite: use as few terms as possible, possibly paired with using the site: operator, intitle operator, or other things.

>It's made it harder to search for bits of poetry, quotations, or song lyrics, especially.

Yeah to be completely clear, my default is DuckDuckGo, then very rarely I fall back to Google, but often if I'm doing that it's because I didn't want to trouble a librarian -- they talk about privacy, but I had a series of unfortunate events when I told one I want to use books as much as possible because I absolutely don't want some of these tech bros to know what I'm looking up.

(That dichotomy between folks who know information science and those who have critical thinking or coding skills needs to end, now. I'm an alumnus of one of the highest-ranked schools of information science in the world, and I will not be figuratively or literally extorted into a PhD to get roles others get with a bachelor's.)


I had the same idea but the "2022" seems to be the bit which gets nearly always ignored (in my subjective opinion)


Quotes aren't what they once were. They don't guarantee that the term you quoted will be in the results.


I find using the 'verbatim' option under tools is usually necessary to get the exact phrase. However if I also try to use an exact date range (uploaded in past five years, say), then the verbatim option is automatically disabled. Can't do both at the same time, it's really annoying.


>They don't guarantee that the term you quoted will be in the results.

They do guarantee it. Sometimes Google is buggy and messes up in determining what's actually visible on the page compared to what's just somewhere in the HTML.


That's a loose definition of guarantee, I think. They know what's on the page because they include a brief summary in the results, but even there quoted terms don't always appear.

Here's an example someone posted a few months back where google decided the user didn't really want what they said they wanted:

https://news.ycombinator.com/item?id=30132344


The page Google saw might not be the one you get if you click the link they give you. This kind of thing is in the SEO bag of tricks.


I increasingly find that formulating queries as a proper question gives the best results. Often even in the summary on the top.


When was the last time search operators actually mattered on Google?


Try Yandex image search, you'll be surprised how good it is.


Yandex has an excellent reverse image search.


The results seem to bring them enough income so I'd say "worse" depends a lot on the point of view.


Yes, but it's still the king of the jungle, and all the alternatives are mostly shit (DDG and Bing are far worse).

I did learn recently about Kagi on HN, and that is marginally better than Bing, but it still doesn't beat Google IMO.

I just hope there is more and better competition like this for Google soon. That's the only thing that can put Google back in its place and put an end to their evil things.


Are you logged in? Google works best when search history is enabled. Just saying.


But does it make them less money?


"The goals of the advertising business model do not always correspond to providing quality search to users."

http://infolab.stanford.edu/~backrub/google.html


And that's one of the major issues with Google search: one metric that would be really good for a search engine is the inverse of the percentage of advertising-related HTML/JavaScript on a page, because that would minimize the utility of SEO for those link farms that only write text to mine ad views/clicks (like food recipes).

Alas, Google would shoot itself in the foot by promoting pages that consume less of its own advertising products.

If the US government split those two business units into different companies, we could have decent search (Google Search could still profit by placing ads on its results pages).
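
A rough sketch of how one might approximate that ad-ratio metric for a single page (the host list and the scraping approach are invented for illustration, not a production-grade signal):

    # Rough sketch: what fraction of a page's script tags point at ad/analytics
    # hosts? The host list is illustrative, not exhaustive.
    import re
    import urllib.request
    from urllib.parse import urlparse

    AD_HOSTS = ("doubleclick.net", "googlesyndication.com",
                "google-analytics.com", "adsystem.com")

    def ad_script_ratio(url: str) -> float:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        srcs = re.findall(r'<script[^>]+src=["\']([^"\']+)', html, re.IGNORECASE)
        if not srcs:
            return 0.0
        ads = sum(1 for s in srcs if any(h in urlparse(s).netloc for h in AD_HOSTS))
        return ads / len(srcs)

    # A hypothetical ranking signal: higher is better (fewer ad scripts).
    # quality = 1.0 - ad_script_ratio("https://example.com/some-recipe")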


Whatever you do, whichever ML model you develop, a query-independent ordering of all documents will always be necessary since all distributed IR systems will over-retrieve according to some form of term matching and then apply sophisticated scoring to extract the best documents. You can't score trillions of docs for every query.

Google still uses PageRank, but at the risk of stating the obvious, the current PageRank is much more sophisticated than the one found in the bibliography.
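
For reference, the version in the original paper is tiny: a damped power iteration over the link graph. A minimal Python sketch using the usual 0.85 damping factor (real systems handle dangling pages, personalization, spam and scale very differently):

    # Minimal power-iteration PageRank over a toy link graph.
    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if outlinks:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
                else:  # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
            rank = new_rank
        return rank

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))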


> You can't score trillions of docs for every query.

That completely depends on how you model the queries. It can be a TB-sized relation all the way up to needing more bits than there are atoms in the universe.


This is not a generic discussion, we are discussing web search IR in particular, but if you want to be pedantic, the TB sized dataset could be the PageRank values and "scoring" could be the ordering of these values.


The TB dataset is PageRank indexed by search term. You can correlate them more and more and end up with exponentially more data, but even the smallest one (which you'll need a rack of servers to query) is quite useful already.


That doesn't mean it isn't useful, for example for a search engine for an internal wiki where you don't have to worry about SEO or spam. And you said yourself google still uses it, so it would probably still be useful to other search engines, including niche search engines, just not as the only ranking mechanism.


Does Google still even use page rank for individual pages? Seem to me that domains are way more important to the base rank.


Is there a way to figure out how a site (or page) is performing against Vince, Panda and Penguin?


Where can I learn more about all these models you mention?


I'm not sure. Most of that is from information Matt Cutts used to share when he was the public face of Google's search quality team. Since he left in 2014, they've been very quiet about the space. If you search for things like "matt cutts panda", "matt cutts vince", and so on, you'll see some of what he used to share.


A CS YouTube channel (Reducible) just put out a fantastic video about the PageRank algorithm: https://www.youtube.com/watch?v=JGQe4kiPnrU


How did they even get that patent? It's a well known 70s era bibliometric algorithm.


The patent was actually assigned to Stanford University, and Google then licensed it (this is also how most pharmaceutical discovery is patented and licensed). A fundamental problem is that these patents all rely on federal funds from US taxpayers to one extent or another, and while it may make sense for Stanford to hold the patent, there's a strong argument that any US entity should be able to license it (not just one exclusive licensee, in other words).

Prior to the Bayh-Dole legislation in the 1980s, this was the case for university-held patents: they could not be exclusively licensed. Repealing that legislation would be a good idea to avoid the rise of monopolistic behemoths like Google/Alphabet.


Doesn’t answer the question. Rephrased: Why was the patent granted if prior art existed from the 70s?


The patent approval process isn't perfect. Had you violated the patent, and had Google sued you, you could always argue prior art. What would your evidence be?


As I understand it it's much harder and more expensive to win a prior art claim after a patent has been granted.


The patent office could have just missed the prior art; more likely, the claims are narrower than, and represent an improvement upon, said prior art.


The patent doesn't cover just ranking documents based on their citations; it also covers various ways to extract "citations" from a webpage, weighting the pages to determine their importance, and doing all of this efficiently enough that you can process millions of queries per second.


Conan The Librarian should have protected us from the evil algo


> It's a well known 70s era bibliometric algorithm.

Citation needed


I was thinking of the Pinski-Narin method, but there are other earlier uses of eigenvector calculations over very similar data: https://arxiv.org/pdf/1002.2858.pdf


What is the name?


Google won because they started early and had the right algorithm. Back in 1998, Google's PageRank was an innovative algorithm that calculated relevance based on counting backlinks instead of parsing the word counts in embedded HTML text like other search engines. This made Google way better than any other available search engine back then (Lycos, Yahoo, AltaVista, etc.), and within weeks, everyone was switching to Google.

A successful Google competitor just needs to be way better than what’s currently available. Nobody seems to have the next big idea for a better search engine yet.


> way better than what’s currently available

Hand-curated walled garden sub-web that bans SEO spam with an iron hand.


wikipedia? Beyond their remit for some queries for sure, but they fit the mold.


A search engine that only searched sites that wikipedia links to might be a fairly decent source in fact.

If monetisable, it would turn gaming wikipedia into a whole new level of shitshow of course.


Mhmm great idea


> A successful Google competitor just needs to be way better than what’s currently available. Nobody seems to have the next big idea for a better search engine yet.

I would speculate that Apple could roll their own search engine using Siri.

We'll see what happens in either WWDC or in a few years, but it seems to be quite early for that and certainly have a long way to go.



Google is a victim of the googlification of the internet. With “the internet” consisting of about 5-10 walled gardens, each with their own custom search, I find I use general search engines a lot less now anyway.


Can anyone explain why the PageRank patent was valuable when an alternative (Prof Jon Kleinberg's Hypertext Induced Topic Search or HITS algo) was published at the same time and available for use?

https://en.wikipedia.org/wiki/HITS_algorithm


Kleinberg's HITS algorithm actually came first. PageRank simplified HITS and made it more practical, by performing the ranking mechanism in batch mode for the entire web graph. If you read the original HITS paper, you'll see that HITS computes the top two eigenvectors for the cluster of all pages containing the query word (and pages pointing to those pages). Computing new eigenvectors for every search query obviously is much slower than PageRank.
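
To make the contrast concrete, here is a minimal Python sketch of the HITS hub/authority iteration over a query-specific subgraph; in PageRank the analogous iteration runs once, offline, over the whole graph:

    # Minimal HITS: alternate hub and authority updates on a query subgraph.
    def hits(links, iterations=50):
        """links: dict mapping each page in the subgraph to the pages it links to."""
        pages = list(links)
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # Authority: sum of hub scores of the pages linking to you.
            auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
            norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            auth = {p: v / norm for p, v in auth.items()}
            # Hub: sum of authority scores of the pages you link to.
            hub = {p: sum(auth[t] for t in links[p] if t in auth) for p in pages}
            norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            hub = {p: v / norm for p, v in hub.items()}
        return hub, auth

    subgraph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["a", "c"]}
    hubs, authorities = hits(subgraph)
    print(max(authorities, key=authorities.get))  # strongest authority in the subgraph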


The HITS algorithm has one drawback: it is super easy to game. PageRank is not entirely resistant, but it's a little more robust. For an algorithm that is used in the real world, where commercial interests rule, sensitivity to adversarial attacks is very important -- game-theory-inspired algorithms will not be out of place. Note, one can definitely adopt 'hardening' strategies for both HITS and PageRank. Even around 2004, a naive version of the PageRank algorithm, as presented in the tech report, was unusable without tweaks. Had you tried the unmodified version, you would have found that the top-tier pages were mostly porn.

There's also the fact that PageRank was presented as a query independent computation that could be done ahead of time and HITS as a query dependent computation. A resourceful enough person could however modify HITS into topic-wise precomputed HITS scores to be combined at run time based on the query.

OK, so here is how one can game HITS: create a harvester page that points to lots and lots of popular, high-traffic pages on the internet. By virtue of doing this, it can accumulate a lot of Hub score, which it can redirect as Authority score to an intended page.


> The HITS algorithm has one drawback: it is super easy to game. PageRank is not entirely resistant, but it's a little more robust.

> Create a harvester page that points to lots and lots of popular, high-traffic pages on the internet. By virtue of doing this, it can accumulate a lot of Hub score, which it can redirect as Authority score to an intended page.

I'm not sure how HITS is any more "easy to game" than PageRank? As far as I understand it, the differences are almost entirely limited to performance characteristics, not semantics. The example you give doesn't seem to be specific to HITS (as opposed to PageRank) in any way.

(I'm also not sure how "game theory" is relevant here, unless by "game theory" you just mean "the idea that people will try to game it".)


One could pose this as an adversarial game. For the simplistic case, consider two participants -- (i) the ranker, which chooses a ranking scheme (we need to constrain the space of ranking schemes somehow for this to lead to any useful formulation), and (ii) a web page, which tries to outrank other pages by strategically linking to other pages and possibly buying links to itself from other pages. One can give (ii) a budget to add and delete links and pages that it can control. In this framework one then tries to compute what an equilibrium strategy is. The multiplayer version is a lot more complicated.

If you check my original comment, I gave a simple scheme to attack HITS rank. The main drawback is that one can 'harvest' Authority score using 'out-links'. Outlinks are cheap and easy, compared to 'inlinks'. Sybil attack is a little harder for Pagerank.


> If you check my original comment I gave a simple scheme to attack HITS rank. Sybil attack is a little harder for Pagerank.

OK, but how is it harder for PageRank? I can't really see any differences in the semantics of the two algorithms, so I'm not sure what kind of added vulnerability one or the other could have.

> One could pose this as an adversarial game.

Yeah, I appreciate that, that's what I was referring to as "the idea that people will try to game it". It's not really the kind of 'game' that would be considered in game theory, though, because it doesn't have any interesting or emergent properties - the designer's response will just be "oh yeah we should stop people gaming our algorithm".


> OK, but how is it harder for PageRank?

If you are familiar with the algorithms, which I assume you are, you can work it out.

To make my page score high on PageRank, I need to acquire links from pages with high PageRank scores. This is a lot harder, because it depends on (a) in-links and (b) high-PageRank pages. With HITS, it's easy for one page to harvest a high Hub score. All that is needed is to outlink to known good pages (authorities). Providing outlinks is trivial. Once so harvested, one can direct that flow to a designated page to give it a high Authority score.

> It's not really the kind of 'game' that would be considered in game theory

Why not ? Formalize the strategy spaces of both the players and its a very valid game in the Game Theory sense. For the ranker you have to consider some functional space of functions over a graph. For the page player it has a budget of alterations it can make to the graph.


> With HITS, it's easy for one page to harvest a high Hub score. All that is needed is to outlink to known good pages (authorities). Providing outlinks is trivial. Once so harvested, one can direct that flow to a designated page to give it a high Authority score.

Are you saying that you think HITS doesn't recursively score the quality of references by their own scores? That's not true. It does exactly what PageRank does in that respect: a page's score depends on the score of those which reference it, which in turn depends on... etc.

The 'hub' vs 'authority' distinction is interesting but not really relevant here: we're considering a page's 'authority' score, which depends on the 'hub' score of those who outlink to it, and at that point we're just doing PageRank [again, except performance-wise and arguably freshness-wise].

Like I said: the only non-trivial differences between them are implementation / performance-related, not semantic.

> Why not ? Formalize the strategy spaces of both the players and its a very valid game in the Game Theory sense. For the ranker you have to consider some functional space of functions over a graph.

Yes, again: possible to frame it as a formally valid problem if you really want to; still not an interesting one. We're only talking about this because you want to maintain that your earlier statement was true.

"You have to consider some functional space of functions over a graph" gives no detail (besides that, yes, you can model something–maybe documents, maybe people, who knows?–as a graph) and sounds like something written by a person with a gun to their head.

Or maybe I'm wrong and there's a fascinating problem which you just don't want to divulge to me.


> Are you saying that you think HITS doesn't recursively score ...

I doubt that reading comprehension is that hard a skill to master. I don't see where I have said anything about recursion or the lack thereof. If you want to have an imaginary conversation between yourself and what you think I have said, you can continue. I do not need to participate in that. I am sure you alone will suffice.

I think going back to the PageRank and HITS papers carefully and understanding them will be illuminating. You keep saying they have no semantic difference, which couldn't be further from the truth. The scores are the eigenvectors of very different matrices, and HITS scores are straightforward to manipulate. BTW, the papers cite each other, stating in what way the other is different; if their difference were a mere implementation detail with no semantic difference, I doubt they would stand as published papers.

The Hub score is not something interesting that I happened to say but a core part of the HITS paper. It is by virtue of the Hub score that the Authority scores are defined (and vice versa) in the paper. It so happens that it's easy to bump up the Hub score of a page by adding a few strategic outlinks.

> Yes, again: possible to frame it as a formally valid problem if you really want to; still not an interesting one. We're only talking about this because you want to maintain that your earlier statement was true.

That's your opinion. I am just countering your categorical claim that there is no game-theory formulation possible here. If your yardstick for that assertion is your inability to find one that's interesting to you, there is not much I can do about it. All I can say is that such a yardstick is not very popular or useful.

A game-theoretic formulation with a budget-constrained adversary is a very natural setting. Research on link-spam-resistant node ranking algorithms is a thing, as is evaluating how stable the rankings produced by some of these proposed algorithms are to (potentially motivated and adversarial) changes to the links in the graph. SIGIR and WEBKDD proceedings on link analysis and ranking would be a good place to look.

You will need some background in matrix perturbation analysis, especially perturbation analysis of principal eigenvector to understand some of the results. Perturbation analysis of finite Markov chains will also suffice.

https://ai.stanford.edu/~ang/papers/ijcai01-linkanalysis.pdf

is by far one of the easiest papers to read in this area (Andrew Ng has focused on other areas of research since this paper). Note that the stability bounds in that paper can be easily tightened... left as an exercise for you. As you will see in the paper, HITS scores are easier to alter (equivalently stated, they are unstable) compared to Pagerank scores.

> Or maybe I'm wrong and there's a fascinating problem which you just don't want to divulge to me.

This is hardly the forum for extended discussions on a research topic. With the pointers and sketches that I mentioned a competent grad student would be able to fill in/ develop it further.


From your linked Wikipedia page:

"it is executed at query time, not at indexing time, with the associated hit on performance that accompanies query-time processing" - for a search engine that planned to take over the Web that might have been a dealbreaker.


Seems like an obvious optimization though. Is that enough to grant a patent?


Google search has gotten so bad compared to how it used to be that this might actually be useful.


PageRank performs fairly poorly on the web as a whole these days, because the nature of linking has changed since 1998. If you constrain it to the parts of the web that are still basically hypertext documents, it's as good as ever.

That's largely what I've been doing with my search engine.


Have you considered widening the definition of 'hyperlink'? For example, using references in text as well. Or also (lightweight) sentiment analysis. For example, I'm sure many people refer to wikipedia all the time, without giving a link. Or Hacker News. This seems like it should represent a "weak link" between those pages. It would probably be worthwhile to use some kind of normalization on pages/domains to avoid spamming to game this system.

I also think it's worth exploring a simple reputation system. Why not have reputable users evaluate results? (approve/disapprove buttons) I think almost any reputation-free system will eventually fail to bots or cost a huge amount to win the tug of war against SEO. Reputation mostly solves the issue.

Speaking of reputation, I think the great insight would be to apply a ranking algorithms to reputations themselves -- if you can't trust your users, you fall back on the same problem.

To rank reputations, clearly PageRank doesn't work because it values all users equally, which is unfortunately not sybil-resistant. I think one approach is to have an "invite system", where your reputation is associated to who invited you, with earlier users having greater weight somehow (also the administrator can manually assign trustworthy users).

This also suggests a way to formulate distributed trust. You can join a "trust network" by trusting a certain user(s), and then you import users who they trust as well. (I believe this is the rough idea behind Web of Trust, although I believe WoT is not algorithmic -- it should have been!) The problem with this approach for a practical search engine is that you can't aggregate results (you would need to store each user's vote and compute a personal ranking every time) -- so I think in practice a useful compromise is to give the user a choice of a few "trust sets". You trust <Public User A> here? Join this trust set. You trust <Public User B>? Join this other trust set. (Combining a small number of trust sets should be trivial)

As for an algorithm, something like:

-- By trusting other users, up to 100*(1-sqrt(N)/N)% [note] of your trust points will be redistributed (diminishing the impact of your choices).

[note] Obs: Formula arbitrary, chosen to approach 100%

-- The total trust conserves, and phenomena like cyclic trust are not a problem due to conservation.
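
One possible reading of that scheme, as a sketch; the redistribution formula is the one given above, but the propagation loop and the starting values are my own guesses at what was intended:

    import math

    # Sketch: a user who trusts N others keeps a share of their trust points and
    # redistributes up to (1 - sqrt(N)/N) of them among those N users each round.
    def redistribute(trust, trusts, rounds=10):
        """trust: dict user -> points; trusts: dict user -> list of trusted users."""
        for _ in range(rounds):
            new = {u: 0.0 for u in trust}
            for user, points in trust.items():
                targets = trusts.get(user, [])
                n = len(targets)
                if n == 0:
                    new[user] += points
                    continue
                give = 1.0 - math.sqrt(n) / n        # approaches 1.0 as n grows
                new[user] += points * (1.0 - give)
                for t in targets:
                    new[t] += points * give / n
            trust = new                              # total trust is conserved
        return trust

    users = {"admin": 10.0, "alice": 1.0, "bob": 1.0, "newcomer": 0.5}
    print(redistribute(users, {"admin": ["alice", "bob"], "alice": ["bob"]}))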


Does PageRank perform poorly because the nature of linking has changed, or has the nature of linking changed because PageRank is no longer the main arbiter of search results?


From its inception, PageRank performed well even on documents linked without PR in mind, so the latter seems unlikely.


(Founder of Neeva here) Query-independent page and site signals like PageRank (or its site variants) have limited utility in a search ranker. Mostly, they are useful in weeding out bad pages and sites from your index, tie-breaking when you hit shard limits in retrieval and a few other edge cases.

The signals that matter the most:

1. Anchor text (and all variants of smearing and distinguishing between high quality and low quality anchors)
2. In aggregate, which pages got clicked on on any given query (and all smearing variants -- using ngrams, embeddings, ...). At Neeva, we use it for retrieval and scoring.
3. Query understanding signals mined from the query-click bi-partite graph and the query-query session refinement graph.
4. Page summarization signals built on top of 1 and 2 and body text.
5. To a lesser extent, query-independent page quality signals.

Whether you use term-based retrieval or nearest-neighbor (embedding) retrieval, a heuristic combination of signals or LambdaMART, whether you calibrate to human eval or clicks, and whether your topical relevance function is hand-crafted or uses a combination of deep learning and IR signals -- these are all details past that.

tldr; there's a lot of craft in a search ranker, and no one silver bullet. Definitely not just PageRank.
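
Since LambdaMART came up: a minimal sketch of training a LambdaMART-style ranker with the open-source lightgbm package. The features, labels and query groups below are random placeholders, not Neeva's actual signals:

    # Toy LambdaMART-style learning-to-rank with LightGBM; data is synthetic.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.random((1000, 5))          # e.g. anchor match, click prior, quality, ...
    y = rng.integers(0, 4, 1000)       # graded relevance label per (query, doc) pair
    group = [50] * 20                  # 20 queries, 50 candidate docs each

    ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
    ranker.fit(X, y, group=group)

    scores = ranker.predict(X[:50])    # score the candidates for one query
    print(np.argsort(-scores)[:10])    # indices of the top-10 documents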


Cool. Now let's get the Library of Congress or NARA -- which are natural homes for this kind of thing -- to implement the public option free of profit-driven cruft.


That would actually be pretty cool. Too bad government software has a tendency to come out shitty and expensive.


If the government committed to the project being OSS, they might be able to recruit some reasonably skilled programmers.


That's mostly because the bar for entry is very low and the bid goes to either 1) a guaranteed company because of congressional lobbying or 2) the company bidding the lowest price that has even a -bit- of a chance of completing it, rather than companies with proven track records of producing at scale.


It's true in the main, but orgs like TTS (tts.gsa.gov) and 18F (18f.gsa.gov) are really changing things. Library of Congress also has a labs group.


Revolutionary at the time - but have there been any new breakthrough strategies/algorithms for search in the last decade?


In general the biggest breakthrough in that space has been machine learning powered relevance scoring and NLP. PageRank worked by using high level metadata (links between pages). We now have methods that can incorporate metadata information and analysis of the content itself.


Sure, SEO gamification(TM).

You didn't specify that the breakthrough needed to be useful to the users or to the people wanting to monetize search.



This is a video about the "21e8 index". I have never heard of that. There seems to be one scientific paper about it, by Provides Ng, "21E8: Coupling Generative Adversarial Neural Networks (GANS) with Blockchain Applications in Building Information Modelling (BIM) Systems", from 2021, which does not have any other citations. Also, Google does not really provide any information on 21e8.

How is this a representative example?


Proof of work to rank pages? That sounds like the worst idea I’ve ever heard. I’m no fan of Google injecting ads into the top spots of their results but at least you can get to the “real” results by just scrolling down. With this paradigm, every link becomes paid “ad”, and the ranking is completely dominated by companies and individuals with big pockets, not any objective or even subjective measure of the quality. And they’re all ads paid for by wasting energy.


Previous discussion from when this was news:

https://news.ycombinator.com/item?id=20067712


It's becoming clear to me that building a new crawler and indexer is the next moon landing of our time.


I don't think Google uses this anymore; it's just not relevant these days.


I though PageRank rated pages based on their rank smell.

Now that the patent's expired, somebody can create a free open source implementation called PageStink.


Still beyond what readily available tech can do. PageCrank, on the other hand...


Eric Schmidt: However, our report says that it's really important for us to find a way to maintain two generations of semiconductor leadership ahead of China. Now, the history here is important. In the 1980s, we created a group called SEMATECH. We had a bunch of semiconductor manufacturing in America. Eventually that all moved to East Asia, primarily Singapore, and then South Korea and now Taiwan through TSMC. The most important chips are made in Samsung and TSMC, South Korea, and Taiwan. China has had over 30 years to plan to try to catch up. It's really difficult.

Eric Schmidt: We don't want them to catch up. We want to stay ahead. We call for all sorts of techniques to try to make sure that we rebuild a domestic semiconductor and semiconductor manufacturing facility within the United States. This is important, by the way, for our commercial industry as well as for national security for obvious reasons. By the way, chips, I'm not just referring to CPU chips, there's a whole new generation, I'll give you an example, of sensor chips that sense things. It's really important that those be built in America.

https://www.hoover.org/research/pacific-century-eric-schmidt...

What are the chips that "sense things" and that Schmidt wants so much to prevent from being available to China and any other country?

Short answer: electromagnetic sensors used to spy on everybody so Google can show them "relevant ads" and profit. You will find them embedded on your CPU, on the nearest cellphone tower's transmitter and on Starlink satellites.

Slightly longer answer: since the 1980s, Silicon Valley has used semiconductor radars to collect data about what you think (your inner speech) by means of machine learning on data extracted from wireless imaging of your face and body. It has proved very convenient for them, as this enables blackmail, extortion, theft, sabotage and murder like nothing else. They can do this because they design the semiconductors used in your phone, computer, TV and car, and in your telecom supplier's network equipment, which makes it possible to embed silicon trojans everywhere.

Don't underestimate what machine learning can do. e.g. Study shows AI can identify self-reported race from medical images that contain no indications of race detectable by human experts.

https://news.mit.edu/2022/artificial-intelligence-predicts-p...

Also, don't underestimate the number of people Silicon Valley is willing to kill to maintain a monopoly, as you may be the next victim.


They should have copyrighted it. They could have received new protections each time it was tweaked. If it's good for the Big E Mouse Corp, it should be good for the DontBeEvil types as well. Protection in perpetuity.


Great idea. I wonder how that would have played out in court. Any lawyers around?


21e8 replaces pagerank https://youtu.be/6HYdTtIoyts


I'm not sure if it's a joke or if he's for real... It's basically replacing Google with a "search engine" where every placement is bought.

Coming from the SEO world, this looks like a joke.


Selling the search result slots is just Overture all over again.



