Mathematics at Google (research.google.com)
188 points by PierreMage on Sept 26, 2012 | 50 comments



Seeing PageRank discussed reminds me of a piece of fun trivia. The idea for PageRank came out of the success of the Science Citation Index, which ranks papers according to how often they have been cited. The idea of studying the structure of citations in academia came from people who were inspired by a 1945 essay, As We May Think.

But that essay's main topic was an imagined technology called the memex, to be implemented with an automated indexing system and microfilm. It is the first description of hypertext, which went on to inspire multiple technologies. The second successful consumer application I'm aware of that used hypertext was the web. (The first was HyperCard from Apple.)

Thus Google started as the application of one set of techniques inspired by As We May Think to a technology that was also inspired by As We May Think.

See http://www.theatlantic.com/magazine/archive/1945/07/as-we-ma... for the essay itself. Do keep in mind that it was written more than two years before the transistor was invented, yet the author already had two decades of experience with computing.


Vannevar Bush's ideas about information organization and consumption in the future were eerily accurate. Reading about the history of Memex and the roots of the Information Architecture field in general is something I highly recommend for anyone interested in Information Science, etc.


Recommended book(s)?


You can read Bush's original paper, As We May Think (where he presents the Memex) here: http://www.theatlantic.com/magazine/archive/1945/07/as-we-ma...

Here is a database of his papers: http://findingaids.loc.gov/db/search/xq/searchMfer02.xq?_id=...


bush is mentioned in mirowski's "machine dreams", and it's not very complimentary. that book is one of my favourites, but it's a post-modern, opinionated, soft-science (he's a history / philosophy / economics guy) take on post-war economics (and a whole pile of surrounding subjects). i suspect most people will hate the book as much as i love it, but you might be interested (perhaps you can find a library copy to check out....)


Someone just started a discussion: http://news.ycombinator.com/item?id=4577865


Previous discussion (two years ago, with comments):

http://news.ycombinator.com/item?id=1565764


Look at the slide entitled Gmail (5), and compare the picture with the first graph on my blog post http://jeremykun.wordpress.com/2011/08/11/the-perceptron-and...

It just goes to show, Google steals content without attribution just like everyone else.


I am trying really hard to ignore the Futurama picture. Really hard!


At least I gave the source :) And I don't claim I'm any better.


Correction, you added the source after my comment ;) But indeed, you didn't make that claim.


I noticed that I hadn't added the source, but when I wrote it I thought the source was self-evident from my mention of "Fry" and the animation style.


Wait, they just used your graph and didn't even contact you about it?


Shocking, right? I demand compensation! :)


It's one guy who happens to work for Google, rather than the whole company.


I imagine they have a release process to get something on research.google.com, though. So the company is endorsing it.


What do you expect them to do? Somehow do an image-similarity search throughout the web for every image used in papers by their employees?


Yeah, it's not like they just have a giant image search engine lying around!


YouTube already has copyright-infringement bots that flag copyrighted content and give the rights holder the option to either run ads on it or have it taken down.

Search also puts a lot of effort into identifying duplicated content to punish content farms. I've heard they're making progress on detecting algorithmically spun articles too (http://en.wikipedia.org/wiki/Article_spinning).

Oh, and they have it for images. Here's the search result for the one they used: https://www.google.com/search?tbs=sbi:AMhZZivFQlmjC8rcxWC0MZ...

So yes, it'd be trivial for them to do so.


I would expect them to read it once, and bounce it back to the author saying something along the lines of "Cite your sources, please." That's certainly within Google's power, no?


But if it's on the web, doesn't it belong to Google now?


Aw crap. That was in the ToS wasn't it?


BTW completely feasible with Google's resources. Can be fully automated, too.


No, but you would expect some kind of seriousness... Universities don't need to run image-similarity searches over papers published by their students, right? Plagiarism is a big thing.


Right. They tell them the policy and make sure it is understood, then they trust them because they cannot police every single image.


They do (almost) the same thing at youtube, don't they?


Using the same logic, a company can never be said to "do" anything. Everything is (ultimately) done by a real person.


The difference is in whether or not the individual is representing a company.


The article has a section on the math used in Google Maps, which points to

http://algo2.iti.kit.edu/schultes/hwy/esaHwyHierarchies.pdf

which says: there are 24 million places in the USA, connected by 29 million roads. You need 4 hours 15 minutes to preprocess this information. From then on, it takes only 7 milliseconds to find the shortest path from one place to another by running the multilevel query algorithm, a souped-up version of Dijkstra that runs about 2000 times faster than Dijkstra's shortest-path algorithm.

Is that right? 24 million choose 2 is 288 trillion, so do an all-pairs search, then have a lookup table with 288 trillion entries, store that in HDFS, slap an LRU caching layer atop it, and you wouldn't have to run any graph query algorithm at all, so you should be able to do much better than 7 ms... just thinking out loud.
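
For reference, the baseline that the 2000x speedup is measured against is plain textbook Dijkstra, which fits in a few lines of Python. This is a generic sketch of the classic algorithm, not anything from the paper:

    import heapq

    def dijkstra(graph, source, target):
        """Textbook Dijkstra over an adjacency dict: graph[u] = [(v, weight), ...].
        Returns (distance, path)."""
        dist = {source: 0.0}
        prev = {}
        heap = [(0.0, source)]
        done = set()
        while heap:
            d, u = heapq.heappop(heap)
            if u in done:
                continue
            done.add(u)
            if u == target:
                break
            for v, w in graph.get(u, []):
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    prev[v] = u
                    heapq.heappush(heap, (nd, v))
        if target not in dist:
            return float("inf"), []
        path, node = [target], target
        while node != source:          # walk predecessors back to the source
            node = prev[node]
            path.append(node)
        return dist[target], path[::-1]

    # Tiny toy network: A -> B -> D (cost 3) beats A -> C -> D (cost 5).
    roads = {"A": [("B", 1), ("C", 4)], "B": [("D", 2)], "C": [("D", 1)]}
    print(dijkstra(roads, "A", "D"))   # (3.0, ['A', 'B', 'D'])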


Good luck precomputing all those pairwise shortest paths. Storing the table might not be too bad. But standard algorithms like Floyd-Warshall are O(n^3) in the number of vertices. There are faster algorithms based on fast matrix multiplication but the precomputation time would still be prohibitive. Keeping it up to date would be even worse. The construction of a new highway could require updating the shortest paths for an enormous number of pairs.
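
For concreteness, here's what that O(n^3) precomputation looks like: a minimal Floyd-Warshall sketch in Python, fine for a few thousand vertices and hopeless for 24 million.

    def floyd_warshall(n, edges):
        """All-pairs shortest-path distances for vertices 0..n-1.
        edges: iterable of (u, v, weight) for a directed graph.
        O(n^2) memory and O(n^3) time -- the triple loop below is the killer."""
        INF = float("inf")
        dist = [[0.0 if i == j else INF for j in range(n)] for i in range(n)]
        for u, v, w in edges:
            dist[u][v] = min(dist[u][v], w)
        for k in range(n):              # n iterations ...
            dk = dist[k]
            for i in range(n):          # ... times n ...
                dik = dist[i][k]
                if dik == INF:
                    continue
                di = dist[i]
                for j in range(n):      # ... times n = O(n^3)
                    if dik + dk[j] < di[j]:
                        di[j] = dik + dk[j]
        return dist

    # 4 vertices: the shortest 0 -> 3 route is 0 -> 1 -> 3 with cost 3.
    print(floyd_warshall(4, [(0, 1, 1), (0, 2, 4), (1, 3, 2), (2, 3, 1)])[0][3])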

The economic argument would go like this: The revenue generated from your maps business is proportional to the number of queries actually processed, not the total number of conceivable queries. The queries processed is such a tiny subset of the possible queries that you want your computational expenses to track the former, not the latter.

With a hierarchical shortest paths algorithm, you can still precompute all pairwise shortest paths at the coarser level. For road navigation you might precompute the shortest path between every pair of interstate highway exits in the USA like this paper is doing. In an open-world game like Skyrim you might precompute the shortest paths between all towns and other major hubs and points of interest. That might not yield truly optimal paths. For game use it's close enough and has the benefit of corresponding to how humans naturally navigate.

Sidebar: The old Crash Bandicoot games used precomputed shortest paths in a neat way. Their navigation was based on a triangular mesh, so every triangle had up to three edgewise neighbors. Thus for a navmesh with n triangles, they needed lg(3) n(n-1)/2 <= n(n-1) bits to store the table. For convenience they probably stored this in a redundant form requiring 2n^2 bits = n^2/4 bytes. But with only 64KB of additional memory, this still let them support 512 navmesh triangles per level with lightning-fast path finding.
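
A quick sanity check of that arithmetic in Python (my own back-of-the-envelope, just plugging in n = 512):

    import math

    def navmesh_table_bytes(n):
        """Each of the n*(n-1)/2 unordered pairs only needs to record which of a
        triangle's (at most) 3 neighbors to step to next, i.e. log2(3) bits.
        The convenient redundant layout stores 2 bits per ordered pair instead."""
        pairs = n * (n - 1) // 2
        tight = pairs * math.log2(3) / 8     # bytes, near the information bound
        redundant = 2 * n * n / 8            # bytes, 2 bits per ordered pair
        return tight, redundant

    print(navmesh_table_bytes(512))   # roughly (25917.2, 65536.0): ~25 KB tight, exactly 64 KB redundant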


>For road navigation you might precompute the shortest path between every pair of interstate highway exits in the USA like this paper is doing.

Oh, so they are populating a lookup table, not for the entire node set but just for the tuples denoting interstate highway exits. All I was advocating was a much larger (and more expensive) lookup table :) BTW, you've edited your response like 6 times now... every time I try to reply there's new info in there!


> BTW, you've edited your response like 6 times now... every time I try to reply there's new info in there!

Sorry! The lack of preview for HN comments is hell for my iterative writing style.


Increase the "delay" in your settings to get a delay until the comment is visible to others.


A lot of us write in an offline text editor and paste the results when we are more-or-less finished (the finish line is an asymptote for people like us!).


Dijkstra is cursing us from beyond the grave.


Hm, thanks for the nice remark on the game stuff, I kind of like that.

While I agree with your primary criticism, I think the principle of locality applies well to this problem. While there are 24 million places (supposedly; I never checked that against the paper), I'm positive that most Maps queries come from areas with higher population density. Therefore, it might make sense to use the look-up table approach for densely populated areas to reduce the search time. (OTOH, the computer scientist in me reminds me that the 7 ms might pale in comparison to the latency of mobile phones receiving the map information...)


> Therefore, it might make sense to use the look-up table approach for densely populated areas to reduce the search time.

Yes, that's my favorite way to apply hierarchical path finding. In the two-level path finding system I wrote for my homebrew MUD as a kid, I precomputed all pairwise shortest paths within each zone (a zone had on the order of 100 rooms) and then all pairwise shortest paths between zone-to-zone transitions (a typical zone had around 5 transitions and each transition belongs to two zones). With these tables, here's all you need to do to find a shortest path between a given pair of rooms:

If the two rooms are in the same zone, consult the intra-zone table for their zone and you're done. Otherwise, for both rooms, find the shortest paths to all transitions in their current zone using the intra-zone tables. Let's say there are 5 transitions in each zone. Then for each of the 25 transition pairs, consult the inter-zone table to find the shortest path between them. That gives you a small set of candidate paths (at most 25 in this example) and you pick the shortest among them.

That's 5 + 5 + 25 = 35 fast table lookups and a comparison of 25 numbers to find the shortest path between any pair of rooms.
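
In code, the query side of that scheme is only a few lines. Here's a toy Python sketch with hypothetical table structures (returning distances rather than the actual routes, which the real tables also encoded):

    def shortest_dist(room_a, room_b, zone_of, exits_of, intra, inter):
        """Two-level lookup as described above.
          zone_of[room]       -> zone id
          exits_of[zone]      -> list of that zone's transition rooms
          intra[zone][(x, y)] -> distance between two rooms of one zone
                                 (assume both orderings are stored)
          inter[(t1, t2)]     -> distance between two transition rooms"""
        za, zb = zone_of[room_a], zone_of[room_b]
        if za == zb:
            return intra[za][(room_a, room_b)]                   # one lookup
        da = {t: intra[za][(room_a, t)] for t in exits_of[za]}   # ~5 lookups
        db = {t: intra[zb][(t, room_b)] for t in exits_of[zb]}   # ~5 lookups
        # ~25 inter-zone lookups: one per (exit of zone A, exit of zone B) pair
        return min(da[ta] + inter[(ta, tb)] + db[tb]
                   for ta in da for tb in db)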

The storage cost for this (assuming ~100 zones of ~100 rooms each) is (100 * 5 / 2) * ((100 * 5 / 2) - 1) / 2 bytes ~= 30 kilobytes for the inter-zone table and 100 * (100 * 99 / 2) bytes ~= 500 kilobytes for the intra-zone tables. That wasn't a trivial amount of memory in those days but it wasn't prohibitive either. Compare that to 10,000 * 9,999 / 2 bytes ~= 50 megabytes for directly storing shortest paths between all pairs of rooms. You couldn't store all that in memory, and having to read from disk would more than wipe out the 35:1 advantage in the number of lookups over the two-level approach.

I'm sure you see the analogy with road networks and highways: a transition is like a highway exit, etc. The important point is that there is a way of decomposing the connectivity graph into subgraphs such that only a small number of edges enter and leave each subgraph.

You can see in my calculation that almost all the space is taken up by the intra-zones tables. It would certainly help to compute intra-zone shortest paths on demand with caching, as you suggest.


...and use up tons of hardware when you can get by without doing that?


This is a fantastic publication! I do some part-time mathematics tutoring, and kids are always wondering where math is used in "real life". Since kids are all familiar with Google, this should resonate with them.


My mathematician friend pointed out that all that "research at Google" requires is "experience with large data sets and quantitative analysis". They want statisticians, not mathematicians.


I'm not a Googler, but I do know a bit of math and statistics, and many of the approaches they take can be cleverly reduced to optimization problems or matrix math, both of which are more the domain of applied mathematicians than statisticians.


My day-to-day programming consists of so much process and simple boolean logic that I hardly ever use math more challenging than 1 + 1 and 1 != 0. It's great to see how math can greatly expand the potential of your code.


My experience is that people who are aware of opportunities to use math find them. And people who are not aware conclude that they weren't really there.

For example, are you a web developer? Have you started using A/B testing yet? If yes, that's math. If no, then you are in a position to make your employer a substantial amount of money.
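
To make the A/B example concrete, here's a minimal two-proportion z-test in plain Python (illustrative numbers, nothing from a real site):

    from math import sqrt, erfc

    def ab_test(conv_a, n_a, conv_b, n_b):
        """Two-sided two-proportion z-test: is B's conversion rate really
        different from A's, or is the gap plausibly just noise?"""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = erfc(abs(z) / sqrt(2))     # two-sided tail probability
        return p_a, p_b, z, p_value

    # 200/10000 vs 260/10000 conversions: p-value ~= 0.005, likely a real lift.
    print(ab_test(200, 10000, 260, 10000))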

For another example my very first project in my first programming job required me to build a set of canned reports. There was one user request which turned into, "We want to do this, but don't know if it is possible." I wound up using inclusion-exclusion techniques that I'd learned in grad school to solve that problem! (The problem was how to ship a local database without detailed information they shouldn't have about the rest of the company, and yet still be able to present reports comparing themselves to the rest of the company. And those reports had to drill down. So if you were responsible for 5 sites you'd be able to generate those reports for everyone that you could see, or for each site individually.)


> simple boolean logic

A SAT solver[1] can automatically figure out a range of acceptable solutions for your conditions. 'I want this goal, is there a way to make that happen?' Security people love this. SMT solvers are even crazier[2] and can automatically determine solutions to linear equations or inequalities (this comes in handy when analyzing loops). Z3 has an online version at [3].

Also, if you have a lot of nasty cascaded if-statements, a K-map[4] and espresso[5] can reduce those down to a minimal set of branches. Note that fewer branches means fewer branch prediction penalties, which means faster code!

[1] http://en.wikipedia.org/wiki/Boolean_satisfiability_problem

[2] http://media.tumblr.com/tumblr_m8ywn1iesH1r6uu3b.gif

[3] http://rise4fun.com/z3/tutorial/guide

[4] http://en.wikipedia.org/wiki/Karnaugh_map

[5] http://embedded.eecs.berkeley.edu/pubs/downloads/espresso/in... and ftp://ftp.cs.man.ac.uk/pub/amulet/balsa/other-software/espresso-ab-1.0.tar.gz
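
If you've never tried one, here is roughly the smallest possible example using the z3-solver Python bindings (pip install z3-solver): state some conditions, ask whether they can all hold at once, and get a witness if they can.

    from z3 import Ints, Solver, sat

    x, y = Ints("x y")
    s = Solver()
    # "I want this goal, is there a way to make that happen?"
    s.add(x + 2 * y == 20, x > y, y > 0)
    if s.check() == sat:
        print(s.model())    # e.g. [y = 1, x = 18] -- one satisfying assignment
    else:
        print("unsatisfiable")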


Do they still heavily rely on PageRank? With the amount of traffic data Google has, I would expect more statistical approaches based on what users click (rather than graph algorithms based on how the web is linked) to be the backbone for ranking their results.


PageRank is definitely still used, although it's almost certainly not the sole determinant of how results are ranked. See http://jeremykun.wordpress.com/2011/06/21/googles-page-rank-...
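
For anyone who wants to play with the classic formulation (the one the linked post walks through, obviously not whatever Google runs today), power iteration fits in a few lines:

    def pagerank(links, damping=0.85, iters=50):
        """Textbook PageRank by power iteration.
        links[u] = list of pages u links to; every page must appear as a key."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - damping) / n for p in pages}
            for u in pages:
                out = links[u]
                if not out:                      # dangling page: spread evenly
                    for p in pages:
                        new[p] += damping * rank[u] / n
                else:
                    for v in out:
                        new[v] += damping * rank[u] / len(out)
            rank = new
        return rank

    # A links to B and C, B links to C, C links back to A.
    print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))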


I agree with your statement, but I'm confused by the story in the link... I haven't come across a message board that doesn't use rel="nofollow" in years (and that NYT article is from 2010). How would these negative reviews bolster his PageRank?


Google seems to still count those links, just differently or with less weight.


i'm always disgusted when a googler publishes anything about pagerank. yes, google is still using links as a signal in their search results, but the original pagerank paper is from 1998; we can be sure as hell that they've iterated on/rewritten it a few thousand times since then.

every time a (any) googler now publishes a paper (or presentation) about (the old) pagerank, a horde of overpaid SEOs points to it as proof of the importance of pagerank, and why linking out is bad, and whatever other bullshit they come up with (e.g. the baseless hypothesis about "nofollow" elsewhere in this thread). every time it happens, the SEO bullshit dance starts anew. (in my opinion pagerank is thought-cancer; please see my old TC article for more about this: http://techcrunch.com/2010/07/07/startups-linking-to-your-co...)


It's a closely guarded secret but at the very least we know that every page is annotated with many different signals, which are combined with magical secret sauce.



