How Google Code Search Worked (2012) (swtch.com)
168 points by rsc on March 12, 2020 | 43 comments



I am planning on adding functionality like this to ripgrep. If folks have opinions on how it should work, I'd love to hear from you! https://github.com/BurntSushi/ripgrep/issues/1497


Thank you for ripgrep! I will mention it in that thread too but I'm AFK so before I forget...

I'm imagining a "drill down" TUI with rg and fzf. fzf can be good for both filenames and other filter-downs. Thinking of breadcrumbs and easily stepping forward or backward, ability to easily bookmark/"pin" parts of search paths as presets for easy reuse later, etc.

EDIT: I recognize this would be outside the scope of rg itself; I'm voicing it in case it sparks ideas about the functionality you're thinking of adding. I'll think more about it and see if I can explain it better.


Code search was “too good to be true, gotta pinch myself” awesome. I miss it to this day.

Here is how I used it - I’d type in some code I was working on and the search result would show similar code and how it was used. Great for debugging and thinking by looking at similar solutions. Sigh.


For me it was "f:cc$ f:contentads" followed by some keyword I wanted to learn more about, then a bunch of cross-references.


You can still play with it here: https://cs.chromium.org/


Wow, this site is so laggy, even on Chromium.


Google engineers have superfast high-end desktops and laptops, and 10 Gbit internet connections. They don't tend to optimise their internal tools for low-spec machines or slow internet connections.



This gives me so much nostalgia!


Weird, I was just looking into Google Code Search this weekend so I could use something like it on my work computer. It's a little surprising that the big git hosting companies don't include a proper code search tool as part of their package. I use Bitbucket right now, but its search is built on Elasticsearch and doesn't handle special characters, so regular expressions won't work.

A couple of open source projects that I've seen are Hound and Zoekt. Hound actually uses this code search backend with a nice frontend in React. Zoekt is what I was going to use, since it scales really well, is faster, and has good search operators for filtering by repo name, language, etc. Google was using Zoekt until recently for code search across all their open source repos.[0]

[0]https://cs.chromium.org/


We use OpenGrok at Cisco. It’s a pretty barebones interface, but it works well.


Interesting, because the search function that Cisco's intranet provides for documents and such is perhaps the single most useless piece of technology I've ever encountered. You could search something like "401k plan" and you'd get marketing materials written in Japanese. Utter trash.


The current internal CodeSearch is one of the best tools available for Google engineers. It's really a marvel.


CodeSearch, Critique, and Piper/CitC/Cider are amazing for development.

PowerDrill is fantastic for drilling down into data. So much money was made thanks to this one.


Piper was actually a big source of frustration for me. Yeah, it's dead simple, but once you have a CL chain, you're entering a world of pain. I switched to Fig a while ago and haven't looked back. Apart from tiny fixes, which I'll start editing from CS or a throwaway citc client, it's just simpler to use Fig. I've been able to juggle 4-5 CL chains easily, and it makes my workflow much easier. Splitting CLs before review is also much simpler with Fig.


Are these code tools built using https://github.com/kythe/kythe ? Any other OSS projects by Google that back these?


I believe so. Kythe seems to have spawned out of the internal CodeSearch.


Something that is not by Google but which you would probably like: Sourcegraph. It has commercial options, but it is OSS.

https://about.sourcegraph.com/


Do you have a list of all the major Google tools and, if they are better than what's available elsewhere, why?


In my opinion, what makes them really great is how tightly integrated they are. For example, since the whole company uses one build system and one single repository, you can build a truly awesome IDE that knows about every library in the company and can autocomplete for it. The same goes for code search, where cross-references are accurate and work across languages (for example, for a class generated from Protobuf).


Outside Google, the percentage of my coding time I spend hunting through some dependency's source tree for the relevant header files or documentation, or asking "Where on earth is this constant defined?", is huge.

With codesearch, answering those kinds of questions is near instant.



So CodeSearch, Critique, Borg, Sherlog, Cider come to mind as top notch tools that are not available outside. As far as libraries go, the C++ Fibers thing is incredible and I don't think it's open.

Blaze is amazing (albeit a bit slow), and Bazel should be more or less the same, though I haven't used it. Dremel, Spanner, TensorFlow, Protobuf, and gRPC are all available outside. Abseil (https://abseil.io/) is a great library available to everyone.



I’ve always wondered why regular expressions and full text indexes are the best thing we expect out of a code search engine.

I mean, we’re talking about code here. Text meant to be interpreted and understood by a compiler. Why can’t we do better?

Why can’t I say “show me everyone that’s calling this function”, like an IDE lets me do? Or “show me functions that accept <type> as one of their arguments and return <type>”, in a way that integrates with the real grammar/AST of the language(s) in question, without resorting to clunky regular expressions?

I should be able to write structured queries against a codebase, with regexes being just one part of that query language.
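
As a rough illustration of the kind of structured query described above, here is a minimal sketch for one language only, using Python's standard `ast` module. `find_calls` is a hypothetical helper name made up for this example, and it matches calls by name alone: it does not resolve imports, types, or which definition a name actually refers to, which is where the real difficulty lies.

```python
import ast


def find_calls(source, func_name):
    """Return sorted line numbers where func_name is called in source.

    Matching is by name only (an assumption of this sketch): both
    foo(...) and obj.foo(...) count as calls to "foo".
    """
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            target = node.func
            # Plain call: foo(...)
            if isinstance(target, ast.Name) and target.id == func_name:
                lines.append(node.lineno)
            # Attribute call: obj.foo(...)
            elif isinstance(target, ast.Attribute) and target.attr == func_name:
                lines.append(node.lineno)
    return sorted(lines)
```

Unlike a regex for `f\(`, this skips the definition line and any occurrence of the text inside a string literal, because it works on the parsed syntax tree rather than on raw text.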


Only some languages can readily support such features.

For example C has a preprocessor and linking step driven by a build system. And C has a bunch of different build systems available, some of which are procedural rather than declarative.

Maybe you'll need to support package management - if a function signature calls for a CopyOnWriteArrayList do you need to know what the subclasses and superclasses of that type are? Do you need to resolve all the dependencies to be able to do that?

If you're thinking "No problem, everyone compiles their programs in CI anyway" - are you happy to skip indexing unused code and uncompileable code?

And of course you'll be chasing after language and build tool changes - not only to one language, but every language.

On the other hand, a nice simple grep? Sounds much simpler to me.


Modern Google Codesearch does exactly this.

You can try it out here: https://cs.chromium.org/

In typical Google style, the documentation is all Google-internal, but by clicking on bits of source code you should be able to figure out most of the commands.

Doesn't work well on mobile unless you have a beefy CPU - sorry!


It's worth taking a look at livegrep (Try it here: https://livegrep.com/search/linux) as an alternative to your git provider's code search.

Quoting patio11: "I intend to boot up a livegrep instance on the first day of every startup for the rest of my life. It borders on miraculous."

It is indeed very good.


I'm baffled at how bad GitHub code search is, even for enterprise GitHub deployments. Are there any third-party solutions that are popular or standard?


Try Sourcegraph: https://github.com/sourcegraph/sourcegraph . It is backed by a company, so you can buy enterprise support as well if required.


For some reason I always thought Russ Cox was 60 years old (even 15 years ago) with a big grey neckbeard. Boy, was I wrong!


He's not that old, but he's not that young, either. He worked on Plan 9 for like a decade before this happened. A few years before he put out this blog post, he finished his PhD thesis. While it's a bit silly to hire someone with that much experience as an intern, that is how most PhDs are treated.


I was actually thinking of Alan Cox!


Alan Cox is in his fifties!


Try to imagine my shock and dismay when, after working with Russ for a few years, I discovered that we were the same age.


A minor note in the article reads:

> To minimize I/O and take advantage of operating system caching, csearch uses mmap to map the index into memory and in doing so read directly from the operating system's file cache. This makes csearch run quickly on repeated runs without using a server process.

Does anyone know some resources where I can read more about this technique? (how-to, pros/cons, caveats, etc.) I'm interested in figuring out the best way to have a commandline tool persist state that it can quickly access across multiple runs, but so far a background server process is the only technique I'm familiar with.


Yeah, you have to run a server. When a process exits, all its mmap'd pages are reclaimed. Just like any other memory.


But this excerpt says that this technique obviates the need for a server process? Are they saving the contents of memory into files using mmap, and then using this state on every run?


I just reread your quoted passage and noticed the reference to the OS's file cache. Ok, this is outside my knowledge :)
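
The quoted passage can be illustrated with a small sketch. This is not csearch's actual code or index format; the file layout and the names `write_index` and `lookup` are assumptions made up for the example. Each call maps the file read-only and touches only the bytes it needs. The mapping itself goes away when the process exits, but the file's pages generally stay in the operating system's file cache until memory pressure evicts them, so a later run that maps the same file can usually be served from memory without any server process.

```python
import mmap
import os
import struct
import tempfile


# Hypothetical on-disk layout (an assumption, not csearch's real format):
# a 4-byte little-endian count, followed by that many 4-byte unsigned ints.
def write_index(path, values):
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}I", *values))


def lookup(path, i):
    """Return the i-th value by mapping the file instead of reading it all."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are faulted in on first access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            (count,) = struct.unpack_from("<I", m, 0)
            if not 0 <= i < count:
                raise IndexError(i)
            (value,) = struct.unpack_from("<I", m, 4 + 4 * i)
            return value


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.idx")
    write_index(demo_path, [10, 20, 30])
    print(lookup(demo_path, 1))
```

The persistent state here is just an ordinary file; mmap only changes how it is read. A caveat of the approach is that the tool does not control the cache, so the first run after a reboot, or a run after the pages have been evicted, pays the full disk I/O cost.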



Wait... /usr/include on Mac OS Lion included constants for DATAKIT?

Was that a joke? Does someone have a Lion system around that can verify?


https://github.com/apple/darwin-xnu/blame/a449c6a3b8014d9406...

It's been there for 11 years in this repo. Along with DECnet. Suppose there's no real drive to remove it.


OpenGrok is the best I have used so far: it's Java-based and can search _huge_ code bases (e.g. the Android source code, the Linux kernel, whatever you throw at it).

https://oracle.github.io/opengrok/



