Lucene: The Good Parts

latchkey · on June 5, 2015

Here is a bit of interesting history that I bet few people know about.

Doug Cutting worked on VTwin [1] at Apple long before he wrote Lucene. VTwin was the codename for the search technology Apple was building into the OS. I knew about it because at the time I was building computer labs for my college and bought a lot of Apple hardware (about a million dollars' worth), and my rep added me to the private beta.

Many years later, when I saw the Lucene project being worked on and saw that Doug was behind it, I immediately reached out to him and asked him if he was interested in joining the Jakarta group, as I thought this would be a great addition to our growing community.

Needless to say, the rest is history, but I feel like if nobody had reached out to Doug, he may not have gotten as much exposure and may not have been motivated to start the rest of the amazing projects that he's worked on, including Hadoop.

=)

[1] https://www.google.com/search?q=apple+vtwin+codename

derefr · on June 6, 2015

> SQL was not then, and is still not now, a very good blob or document storage system.

What does that even mean? SQL is a wire protocol for relational queries. Document/blob/key-value stores can be made to speak SQL. Most of them just choose not to implement SQL support for some weird reason.

sk5t · on June 6, 2015

SQL is not a wire protocol, it is a language/grammar/loose-set-of-language-conventions. The wire protocol is up to the server and driver to agree upon.

I'll posit that most non-relational stores resist supporting any SQL because it would so quickly make plain how handicapped are their query and set-based capabilities.

pixelmonkey · on June 6, 2015

(Author here.) As others have pointed out, I am here referring to SQL as a broad term for "SQL RDBMSes". When I say that it is not a good blob or document storage system, what I mean is this: You can plop unstructured data into a SQL RDBMS using something like JSON serialization, but it's not generally a good idea. Document storage requires flexible schemas, so SQL schemas are also not generally a great idea -- you end up with tables with lots of nullable fields that are usually empty.

As an example, imagine a single database where you want to store actual desktop documents, such as the formats supported by Apache Tika: https://tika.apache.org/1.8/formats.html -- If you try to model this using a SQL schema, you'll likely be in for a world of pain. From a UX standpoint, a user just wants to "search across all documents", but you have hundreds of heterogeneous types with varying degrees of field-level compatibility.
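To make the nullable-columns point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The three documents and their field names are made up for illustration, and the table layouts are just one assumed way to model them, not a recommendation. It builds a "wide" table with one column per field seen anywhere, measures how many cells end up NULL, and then contrasts that with the JSON-blob approach, where "search across all documents" degrades to a substring scan.

```python
import json
import sqlite3

# Hypothetical heterogeneous documents; the field names are illustrative only.
DOCS = [
    {"type": "pdf", "title": "Annual report", "page_count": 42},
    {"type": "mp3", "title": "Demo track", "artist": "Someone", "duration_s": 187},
    {"type": "email", "subject": "Hello", "sender": "a@example.com"},
]


def build_wide_table(conn, docs):
    """Create one column per field seen in any document; missing fields become NULL."""
    columns = sorted({key for doc in docs for key in doc})
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f"CREATE TABLE wide_docs ({col_defs})")
    placeholders = ", ".join("?" for _ in columns)
    for doc in docs:
        conn.execute(
            f"INSERT INTO wide_docs VALUES ({placeholders})",
            [doc.get(c) for c in columns],
        )
    return columns


def null_fraction(conn):
    """Return the fraction of cells in wide_docs that are NULL."""
    total = 0
    nulls = 0
    for row in conn.execute("SELECT * FROM wide_docs"):
        total += len(row)
        nulls += sum(1 for value in row if value is None)
    return nulls / total


def build_blob_table(conn, docs):
    """Store each document as a serialized JSON string in a single TEXT column."""
    conn.execute("CREATE TABLE blob_docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO blob_docs (body) VALUES (?)",
        [(json.dumps(doc),) for doc in docs],
    )


def search_blobs(conn, term):
    """Naive cross-document search: a substring scan over the serialized JSON."""
    rows = conn.execute(
        "SELECT id FROM blob_docs WHERE body LIKE ?", (f"%{term}%",)
    )
    return [row[0] for row in rows]
```

With only three document types, more than half of the cells in the wide table are already NULL, and the blob search matches field names as well as values (searching for "title" hits every document that has a title key). Neither is fatal at this size; the sketch only shows where the friction comes from as the number of types grows.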

bunderbunder · on June 6, 2015

What do you mean by "not generally a good idea"? In the past I've had little cause to complain about using XML fields to store and search semi-structured data in MS SQL Server.

I have read that it doesn't scale out as nicely as document stores do, and knowing SQL Server I don't have trouble believing that. Personally I'm a very long way away from needing to worry about scale out[1] in the applications I use a DBMS for, though, so that's never really kept me up at night.

Word I've heard on the street is that the story's similar for PostgreSQL.

[1]: http://yourdatafitsinram.com

Tenhundfeld · on June 6, 2015

I think you're being overly pedantic – or maybe deliberately obtuse. I can't tell.

In discussions of "NoSQL" technologies, it's common to use SQL as a shorthand for traditional, relational databases which generally support SQL. We're talking about databases like MySQL, Oracle, Postgres, MS SQL Server, etc.

Sure, Andrew could have been more precise with his language, but from the context, do you honestly think he's using SQL to mean specifically Structured Query Language, the 4G language developed in the '70s?

altcognito · on June 6, 2015

Is it "overly" pedantic, when making a broad criticism of a given technology, to at least use the word that identifies said technology? RDBS is a really common way to identify this class of products.

Tenhundfeld · on June 18, 2015

You know what. I think you're right. It's not overly pedantic to say that the article shouldn't use SQL to mean RDBS – or RDBMS, as I see it more commonly.

My comment was specifically in response to your approach, not your overall point:

> What does that even mean? SQL is a wire protocol for relational queries.

First, if we're being pedantic, I disagree with your usage of "wire protocol" to describe the SQL language, but I get what you're saying. What I was trying to comment on was your feigned confusion. It was perfectly clear to me what the author meant by "SQL" in the context of the article, and I think you understood too.

If you had said something like, "SQL is a language and bad term to describe Relational Database Management Systems (RDBMS), which is what the author really meant," I wouldn't have commented. I probably would have upvoted actually, because I agree.

Anyway, not a big deal. I don't think we disagree on the substance of your point, just on how you chose to express it in this one example. Have a great day. :)

pixelmonkey · on June 6, 2015

I was just informed today that this article, "Lucene: The Good Parts", which I wrote a few months ago, was also just published in Hacker Monthly for this month's print/digital issue (June 2015):

http://hackermonthly.com/issue-61.html

ddorian43 · on June 5, 2015

I still don't understand why Lucene can't use the original document for aggregations. I understand that it will be slower if the field isn't indexed, but it should be doable, like RDBMSes do.

boomzilla · on June 5, 2015

It can, as long as you store all the fields that will be needed for aggregation. Retrieving stored, non-indexed fields can be slow because it might involve a lot of random seeks, so it might be slower than relational DBs.

What use case do you have in mind?

ddorian43 · on June 5, 2015

I mean in the case of the _source field that Elasticsearch uses. But it seems that is separate. My use case is that I want to do querying + filtering on all fields of the document. In this case you have to:

1. store the field or document so you can get back the value (e.g. when querying the document)

2. index the field for filtering

3. separately index the field for aggregation (doc_values)

While in an RDBMS you only need (1).
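A rough way to picture the three steps above is a toy in-memory index in plain Python. To be clear, this is only a conceptual sketch: the class and method names are hypothetical and this is not Lucene's or Elasticsearch's API or on-disk format. It keeps one structure per step: a stored copy of each document, an inverted index for filtering, and a per-field column for aggregation.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of three per-field structures; all names here are hypothetical."""

    def __init__(self):
        self.stored = {}                   # (1) doc_id -> original document
        self.inverted = defaultdict(set)   # (2) (field, value) -> set of doc_ids
        self.columns = defaultdict(dict)   # (3) field -> {doc_id: value}

    def add(self, doc_id, doc):
        """Write the document into all three structures (values must be hashable)."""
        self.stored[doc_id] = dict(doc)
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)
            self.columns[field][doc_id] = value

    def fetch(self, doc_id):
        """(1) Return the stored original document."""
        return self.stored[doc_id]

    def filter(self, field, value):
        """(2) Exact-match filter answered from the inverted index alone."""
        return set(self.inverted.get((field, value), set()))

    def sum_from_columns(self, field, doc_ids):
        """(3) Aggregate by reading only the one column that is needed."""
        column = self.columns.get(field, {})
        return sum(column[d] for d in doc_ids if d in column)

    def sum_from_stored(self, field, doc_ids):
        """Same aggregation from (1): loads each whole document to read one field."""
        return sum(
            self.stored[d][field] for d in doc_ids if field in self.stored[d]
        )
```

Both sum methods return the same number. The difference is in how much data each one touches: the stored-document path reads every full document for a single field, which is roughly the random-seek cost mentioned earlier in the thread, while the column path reads only the values for that field.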

prasanthv · on June 6, 2015

Great article! Never even knew about Lucene.

bracewel · on June 6, 2015

sigh HTTPS mixed content... really?