How Google Code Search Worked (2012)

burntsushi · on March 12, 2020

I am planning on adding functionality like this to ripgrep. If folks have opinions on how it should work, I'd love to hear from you! https://github.com/BurntSushi/ripgrep/issues/1497

_5qp2 · on March 12, 2020

Thank you for ripgrep! I will mention it in that thread too but I'm AFK so before I forget...

I'm imagining a "drill down" TUI with rg and fzf. fzf can be good for both filenames and other filter-downs. Thinking of breadcrumbs and easily stepping forward or backward, ability to easily bookmark/"pin" parts of search paths as presets for easy reuse later, etc.

EDIT: I recognize this would be outside of the scope of rg itself, I'm voicing it in case it sparks ideas about the functionality you're thinking of adding. I'll think more about it and see if I can explain better

rkhassen9 · on March 12, 2020

Code search was “too good to be true, gotta pinch myself” awesome. I miss it to this day.

Here is how I used it - I’d type in some code I was working on and the search result would show similar code and how it was used. Great for debugging and thinking by looking at similar solutions. Sigh.

tehlike · on March 12, 2020

For me it was f:cc$ f:contentads some keyword I wanted to learn more about. Then a bunch of cross refs.

londons_explore · on March 12, 2020

You can still play with it here: https://cs.chromium.org/

fzzfff · on March 12, 2020

wow this site is so laggy, even on chromium

londons_explore · on March 12, 2020

Google engineers have superfast high end desktops and laptops, and 10 gbit internet connections. They don't tend to optimise their internal tools for low spec machines or internet connections.

_5qp2 · on March 12, 2020

Maybe you saw it but no need to sigh anymore:

https://news.ycombinator.com/item?id=22551856

https://opensource.google

j_knowles · on March 12, 2020

This gives me so much nostalgia!

greenyouse · on March 12, 2020

Weird, I was just looking into Google Code Search this weekend so I could use something like it on my work computer. It's a little surprising that the big git hosting companies don't have a proper code search tool as part of their package. I use Bitbucket right now, but its search is built on Elasticsearch and special characters aren't handled, so regular expressions won't work.

A couple open source projects that I've seen are Hound and Zoekt. Hound actually uses this code search backend with a nice frontend in React. Zoekt is what I was going to use since it scales really well, is faster, and has good search operators for filtering by repo name, language, etc. Google was using Zoekt until recently for code search across all their open source repos.[0]

[0]https://cs.chromium.org/

Cyph0n · on March 12, 2020

We use Opengrok at Cisco. It’s a pretty barebones interface, but it works well.

sebastos · on March 12, 2020

Interesting, because the search function that Cisco's intranet provides for documents and such is perhaps the single most useless piece of technology I've ever encountered. You could search something like "401k plan" and you'd get marketing materials written in Japanese. Utter trash.

dvirsky · on March 12, 2020

The current internal CodeSearch is one of the best tools available for Google engineers. It's really a marvel.

tehlike · on March 12, 2020

Code search, Critique, and Piper/citc/Cider are amazing for development.

Power drill is fantastic for drilling down data. So much money was made thanks to this one.

dvirsky · on March 12, 2020

Piper was actually a big source of frustration for me. Yeah, it's dead simple, but once you have a CL chain, you're entering a world of pain. I switched to Fig a while ago and haven't looked back. For a tiny fix I'll still start editing from CS or a throwaway citc client, but beyond that it's just simpler to use Fig. I've been able to juggle 4-5 CL chains easily and it makes my workflow much easier. Splitting CLs before review is also much simpler with Fig.

ignoramous · on March 12, 2020

Are these code tools built using https://github.com/kythe/kythe ? Any other OSS projects by Google that back these?

cameronbrown · on March 12, 2020

I believe so. Kythe seems to have spawned out of the internal CodeSearch.

_5qp2 · on March 12, 2020

Something that is not by Google but which you would probably like - SourceGraph. Has commercial options but it is OSS.

https://about.sourcegraph.com/

jules · on March 12, 2020

Do you have a list of all major Google tools and why they're better than what's available elsewhere? (if so)

antoinealb · on March 12, 2020

In my opinion, what makes them really great is the tight integration that they have. For example, since the whole company uses one build system and one single repository, you can build a truly awesome IDE that knows about every library in the company and can autocomplete for it. Same for code search, where cross references are accurate and work across languages (for example, a class generated from Protobuf).

londons_explore · on March 12, 2020

Outside Google, the percentage of my coding time I spend hunting through some dependency's source tree for the relevant header files or documentation, or asking "Where on earth is this constant defined?", is huge.

With codesearch, answering those kinds of questions is near instant.

cameronbrown · on March 12, 2020

:) https://github.com/jhuangtw-dev/xg2xg

dvirsky · on March 12, 2020

So CodeSearch, Critique, Borg, Sherlog, Cider come to mind as top notch tools that are not available outside. As far as libraries go, the C++ Fibers thing is incredible and I don't think it's open.

Blaze is amazing (albeit a bit slow) but Bazel should be more or less the same, haven't used it. Dremel, Spanner, Tensorflow, Proto, grpc are all available outside. Abseil (https://abseil.io/) is a great library available to everyone.

_pxkn · on March 12, 2020

Other great articles on regexes from rsc:

https://swtch.com/~rsc/regexp/regexp1.html

https://swtch.com/~rsc/regexp/regexp2.html

ninkendo · on March 12, 2020

I’ve always wondered why regular expressions and full text indexes are the best thing we expect out of a code search engine.

I mean, we’re talking about code here. Text meant to be interpreted and understood by a compiler. Why can’t we do better?

Why can’t I say “show me everyone that’s calling this function”, like an IDE lets me do? Or “show me functions that accept <type> as one of their arguments and return <type>”, in a way that integrates with the real grammar/AST of the language(s) in question, without resorting to clunky regular expressions?

I should be able to write structured queries against a codebase, with regexes being just one part of that query language.
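As a rough illustration of what a query like "show me everyone that's calling this function" could look like for a single language, here is a minimal sketch using Python's standard `ast` module. The `SOURCE` snippet and the `find_callers` helper are hypothetical names invented for this example. It only matches calls made through a bare name within one file, so it is nowhere near a real cross-repository index.

```python
import ast

# Hypothetical sample code to query; not taken from any real project.
SOURCE = '''
def helper(x):
    return x + 1

def a():
    return helper(1)

def b():
    value = helper(2)
    return other(value)
'''

def find_callers(source: str, target: str) -> list[tuple[str, int]]:
    """Return (enclosing function name, line number) for each call to `target`.

    Assumption: only direct calls by bare name (e.g. `helper(...)`) are matched;
    method calls, aliases, and imports from other files are not resolved.
    """
    tree = ast.parse(source)
    hits = []
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == target):
                hits.append((func.name, node.lineno))
    # Sort by line number so results read top to bottom.
    return sorted(hits, key=lambda hit: hit[1])
```

Calling `find_callers(SOURCE, "helper")` returns `[("a", 6), ("b", 9)]`. Doing the same across many files and languages needs real name resolution, which this sketch does not attempt.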

michaelt · on March 12, 2020

Only some languages can readily support such features.

For example C has a preprocessor and linking step driven by a build system. And C has a bunch of different build systems available, some of which are procedural rather than declarative.

Maybe you'll need to support package management - if a function signature calls for a CopyOnWriteArrayList do you need to know what the subclasses and superclasses of that type are? Do you need to resolve all the dependencies to be able to do that?

If you're thinking "No problem, everyone compiles their programs in CI anyway" - are you happy to skip indexing unused code and uncompileable code?

And of course you'll be chasing after language and build tool changes - not only to one language, but every language.

On the other hand, a nice simple grep? Sounds much simpler to me.

londons_explore · on March 12, 2020

Modern Google Codesearch does exactly this.

You can try it out here: https://cs.chromium.org/

In typical google style, the documentation is all google-internal, but by clicking bits of source code you should figure out most of the commands.

Doesn't work well on mobile unless you have a beefy CPU - sorry!

ovi256 · on March 12, 2020

It's worth taking a look at livegrep (Try it here: https://livegrep.com/search/linux) as an alternative to your git provider's code search.

Quoting patio11: "I intend to boot up a livegrep instance on the first day of every startup for the rest of my life. It borders on miraculous."

It is indeed very good.

hnaccy · on March 12, 2020

I'm baffled at how bad GitHub code search is, even for enterprise GitHub deployments. Are there any third-party solutions that are popular or standard?

newusertoday · on March 12, 2020

Try Sourcegraph: https://github.com/sourcegraph/sourcegraph . It is backed by a company, so you can buy enterprise support as well if required.

generatorguy · on March 12, 2020

for some reason I always thought russ cox was 60 years old (even 15 years ago) with a big grey neck beard. boy was I wrong!

kick · on March 12, 2020

He's not that old, but he's not that young, either. He worked on Plan 9 for like a decade before this happened. A few years before he put out this blog post, he finished his PhD thesis. While a bit silly to hire someone with that much experience as an intern, it is how most PhDs are treated.

generatorguy · on March 12, 2020

I was actually thinking of Alan Cox!

kick · on March 12, 2020

Alan Cox is in his fifties!

enneff · on March 12, 2020

Try to imagine my shock and dismay when, after working with Russ for a few years, I discovered that we were the same age.

bminor13 · on March 12, 2020

A minor note in the article reads:

> To minimize I/O and take advantage of operating system caching, csearch uses mmap to map the index into memory and in doing so read directly from the operating system's file cache. This makes csearch run quickly on repeated runs without using a server process.

Does anyone know some resources where I can read more about this technique? (how-to, pros/cons, caveats, etc.) I'm interested in figuring out the best way to have a commandline tool persist state that it can quickly access across multiple runs, but so far a background server process is the only technique I'm familiar with.

akkartik · on March 12, 2020

Yeah, you have to run a server. When a process exits, all its mmap'd pages are reclaimed. Just like any other memory.

bminor13 · on March 12, 2020

But this excerpt says that this technique obviates the need for a server process? Are they saving the contents of memory into files using mmap, and then using this state on every run?

akkartik · on March 12, 2020

I just reread your quoted passage and noticed the reference to the OS's file cache. Ok, this is outside my knowledge :)
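For what it's worth, here is a minimal sketch of the general idea in the quoted passage, as far as I understand it: the state lives in an ordinary file on disk, and each run maps that file read-only instead of talking to a server. After a process exits, the kernel is free to keep the file's pages in its page cache, so a later run that maps the same file can often be served from memory. The file layout and the `write_index` / `lookup` names are made up for this example and are not how csearch stores its index.

```python
import mmap
import os
import tempfile

def write_index(path: str, words: list[str]) -> None:
    """Write a toy 'index': sorted, newline-terminated words (hypothetical format)."""
    with open(path, "wb") as f:
        f.write("\n".join(sorted(words)).encode() + b"\n")

def lookup(path: str, word: str) -> bool:
    """Map the index file read-only and check whether `word` is a full line in it.

    The mapping is backed by the OS file cache, so pages already cached from
    an earlier run do not need to be read from disk again.
    """
    needle = word.encode() + b"\n"
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos = m.find(needle)
            while pos != -1:
                # Accept only matches that start at the beginning of a line.
                if pos == 0 or m[pos - 1:pos] == b"\n":
                    return True
                pos = m.find(needle, pos + 1)
            return False
```

A typical use would be to call `write_index` once (the slow "indexing" step) and then call `lookup` from separate short-lived command-line runs. The main caveats are that the cache is only a best-effort speedup (pages can be evicted under memory pressure) and that the on-disk format has to be something you can search in place.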

chubot · on March 12, 2020

https://github.com/Debian/dcs/

https://codesearch.debian.net/research/bsc-thesis.pdf

YesThatTom2 · on March 12, 2020

Wait... /usr/include on Mac OS Lion included constants for DATAKIT?

Was that a joke? Does someone have a Lion system around that can verify?

mjlee · on March 12, 2020

https://github.com/apple/darwin-xnu/blame/a449c6a3b8014d9406...

It's been there for 11 years in this repo. Along with DECnet. Suppose there's no real drive to remove it.

ausjke · on March 12, 2020

OpenGrok is the best I have used so far. It's Java-based and can search _huge_ code bases (e.g. the Android source code, the Linux kernel, whatever you throw at it).

https://oracle.github.io/opengrok/