>"Here's 100PiB of unlabeled neural net weights. Knock yourselves out."
You need to give the user an explanation of why you blocked their account, but if Google is kind enough to throw the secret neural network on top, some people would be happy to have a look at it and find even more garbage in it.
Which is itself detected by an 80PiB neural network, based on the 60TB of new rules that another neural network spits out every week based on the temperature outside the corner office and the taste of Sundar's coffee that morning.
The coffee roast temperature and grind are decided every day by yet another ML algorithm, as Google effectively has an unlimited army of ML researchers and infinite computing power. A rogue PhD on a cocaine binge unfortunately tuned the parameters too high once, and the results have been getting worse ever since, as a result of Sundar being increasingly disappointed by the coffee but not being able to do anything about it because "it's the algorithm."
Same at FB as far as I could tell while I was there. "The algorithm" is a misnomer, popularized by the press but really kind of silly. There are really thousands of pipelines and models developed by different people running on different slices of the data available. Some are reasonably transparent. Others, based on training, are utterly opaque to humans. Then the weights are all combined to yield what users see. And it all changes every day if not every hour. Even if it could all be explained in a useful way, that explanation would be out of date as soon as it was received.
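To make the "thousands of models blended together" point concrete, here's a toy sketch. Everything in it (the model names, the weights, the features) is invented for illustration; the real systems are obviously nothing this simple. The point it demonstrates is the transitivity of opacity mentioned below: one uninterpretable component makes the whole weighted blend uninterpretable.

```python
# Toy sketch: many independent "models" each score an item, and the final
# result is a weighted blend. All names and numbers here are hypothetical.

def combined_score(item_features, models, blend_weights):
    """Each 'model' is a callable returning a score for the item. The final
    score is a weighted sum, so one opaque model makes the blend opaque."""
    return sum(blend_weights[name] * model(item_features)
               for name, model in models.items())

models = {
    "engagement": lambda f: f["clicks"] / max(f["impressions"], 1),
    "freshness":  lambda f: 1.0 / (1 + f["age_days"]),
    "opaque_nn":  lambda f: 0.37,  # stand-in for an uninterpretable model
}
blend_weights = {"engagement": 0.5, "freshness": 0.3, "opaque_nn": 0.2}

item = {"clicks": 40, "impressions": 100, "age_days": 3}
score = combined_score(item, models, blend_weights)
```

Even in this three-line toy, "explain why this item got this score" bottoms out at "the opaque model said 0.37" for part of the answer; and in the real thing the weights and models change constantly.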
I'm not saying that to defend anyone BTW. This complexity and opacity (which is transitive in the sense that a combined result including even one opaque part itself becomes opaque) is very much the problem. What I'm saying is that it's likely impossible for the companies to comply without making fundamental changes ... which might well be the intent, but if that's the case it should be more explicit.
What needs to be shared is a high-level architecture, not nuts and bolts.
At a broad level:
What are the input sources (IP address, clicks on other websites, etc.) that feed the model?
What is the overall system optimized for? Some combination of engagement, view time, etc.? Just listing them, ideally in order of preference, is good enough.
Alternatively, what does your human management measure and monitor as the business metrics of success?
I want to know what behaviors are used (not necessarily how), and what the feed is trying to optimize for: more engagement, more view time, etc.
This is not adversarial; knowing this helps us modify our behavior to make the model work better for us.
Users already have some sense of this and work around it blindly. For example, YouTube puts heavy emphasis on recent views and searches. I (and I'm sure others) would use a signed-out session to watch content way outside my interest area so my feed isn't polluted with poor recommendations. I may have watched thousands of hours of educational content, but Google would still think that some how-to video I watched once means I need to see only that kind of content.
Sure, Google knows it's me even when I'm signed out, but they don't use that to change my feed. That's the important part, and knowing it helps me improve my user experience.
They haven't talked much detail since Matt Cutts left, but over time they did sort of outline the basics: that the core ranking is still some evolution of PageRank, weighting scored page attributes/metadata and flowing it down/through inbound links as well, but then altered via various waves of ML, like Vince (authority/brand power), Panda (content quality), Penguin (inbound link quality), and many others that targeted other attributes (page layout, ad placement, etc.).
Even if some of that is off, the premise of a chain of some ML, and some not ML, processors means they probably can't really tell you exactly why anything ranks where it does.
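A minimal sketch of that "chain of processors" shape, under the assumptions above: a classic iterative PageRank pass produces base scores, and then a sequence of later "waves" adjusts them. The graph, the damping factor usage, and especially the adjustment functions are all made up for illustration; the named waves (Panda etc.) are only loosely imitated as multipliers.

```python
# Sketch: base PageRank scores, then a chain of post-processing "waves"
# (stand-ins for things like Panda/Penguin). All numbers are invented.

def pagerank(links, damping=0.85, iters=50):
    """links: {page: [outbound pages]}. Classic iterative PageRank."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                continue
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
base = pagerank(links)

# Hypothetical adjustment waves, each multiplying the running score:
waves = [
    lambda p, s: s * (0.5 if p == "b" else 1.0),  # e.g. a quality demotion
    lambda p, s: s * (1.2 if p == "c" else 1.0),  # e.g. an authority boost
]
final = dict(base)
for wave in waves:
    final = {p: wave(p, s) for p, s in final.items()}
```

Even here, "why does page c outrank page b?" has no single answer: it's the base link graph times every wave that touched it, which is the commenter's point about explainability.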
It's clear the public and lawmakers like the idea of knowing how the algorithm works, but what you posted is about as deep as people can reasonably understand at a high level. I don't think they realize how complex a system built over 20 years that's a trillion-dollar company's raison d'être can be.
Those sound like awesome potential features. Allow users to assign 0-100% weights to each of those scoring adjustments during search, and show them the calcs (if you can).
Supposedly there are thousands of different features that are scored, and those are just the rolled-up categories that needed their own separate ML pipeline step.
Like, maybe, for example, a feature is "this site has a favicon.ico that is unique and not used elsewhere" (page quality). Or "this page has ads, but they are below the fold" (page layout). Or "this site has > X amount of inbound links from a hand curated list of 'legitimate branded sites'" (page/site authority).
Google then picks starting weights for all these things and has human reviewers score the quality of the results, the order of ranking, etc., based on a Google-written how-to-score document. Then it tweaks the weights, re-runs the ML pipeline, and has the humans score again, in some iterative loop until the results seem good.
There's a never-acted-on FTC report[1] that describes how they used this system to rank their competition (comparison shopping sites) lower in the search results.
Edit: Note that a lot of detail is missing here. Like topic relevance, where a site may rank well for some niche category it specializes in. But that it wouldn't necessarily rank well for a completely different topic, even with good content, since it has no established signals it should.
I doubt it; they should know what the various algorithms are, especially the most important ones that drive most of the ranking. But their competitive advantage would be on the line.
Google manually adjusts its results for censorship reasons. This is probably why google has gotten so much worse, they don't want information to be freely accessible, they only want things they approve of to be seen.
I reckon you're right, but I doubt that it's manual or under Google's control. Google is too important a tool of control to be left in the hands of Silly Valley idealists.
I've always wondered why Sergey Brin and Larry Page retired when they did; it coincides almost exactly with the beginning of the SERP quality decline. I wonder what sort of conversation they had with intelligence agencies to quietly walk out the door, cash out, and say nothing about the company since.
What happened was they got what they wanted: full control of running the business. Then they quickly learned that was actually a lot of work and not very much fun, made some fairly unpopular decisions (business, product and policy) with a fair amount of public backlash, put Sundar in charge and backed away.