Elasticsearch uses Lucene at the core of its search capability. https://lucene.apache.org/core/

Basically it examines each record you want to index and generates a set of features and metadata to aid in rapidly finding what you're looking for. Some of the features listed on that site are:

  ranked searching -- best results returned first
  many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  fielded searching (e.g. title, author, contents)
  sorting by any field
  multiple-index searching with merged results
  allows simultaneous update and searching
  flexible faceting, highlighting, joins and result grouping
  fast, memory-efficient and typo-tolerant suggesters
  pluggable ranking models, including the Vector Space Model and Okapi BM25
  configurable storage engine (codecs)
Basically, a standard non-indexed database search (e.g. using LIKE) is fairly stupid and generally defaults to a full scan of every row in the table. A b-tree index (https://en.wikipedia.org/wiki/B-tree), for columns holding only one word, dramatically reduces the scope of that search and makes such lookups almost instantaneous. However, it doesn't help with multi-word text fields, so the database engine has to fall back to the very slow full-table scan and check each record individually.

Lucene notes the position of each word in a text string and has a number of techniques to figure out which records have words similar to the ones you're looking for, at relatively similar positions (i.e. close to each other, in the same order, etc.), and narrows the search scope to those records. In short, it's not searching the individual records, it's searching its own index of what it already knows about the contents of each record.
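
To make that concrete, here's a rough sketch of what an inverted index boils down to, modelled as a plain SQL table (purely conceptual, not how Lucene actually stores anything; all the names are made up):

  -- one row per (word, record, position)
  CREATE TABLE inverted_index (
    term   TEXT,
    doc_id INTEGER,
    pos    INTEGER
  );
  CREATE INDEX idx_term ON inverted_index(term);

  -- "which records contain both 'disk' and 'error'?"
  SELECT doc_id FROM inverted_index WHERE term = 'disk'
  INTERSECT
  SELECT doc_id FROM inverted_index WHERE term = 'error';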

EDIT: however, for a log search a b-tree would still be fine, because log entries generally share a similar structure. For example, if you're looking for a specific error message, that message isn't going to change dramatically from one moment to the next. So having that table/column indexed with a b-tree would let you search for that specific error string and quickly pull up all the results, regardless of table size.
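
As a rough sketch (the table and column names here are hypothetical):

  -- hypothetical log table with a plain b-tree index on the error column
  CREATE TABLE logs (
    id         INTEGER PRIMARY KEY,
    logged_at  TEXT,
    errorvalue TEXT
  );
  CREATE INDEX idx_logs_errorvalue ON logs(errorvalue);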

Just make sure you set up your SQL query to have a line like:

  WHERE column.errorvalue LIKE 'Generic Error Code 1%'
instead of:

  WHERE column.errorvalue LIKE '%Code 1%'
As soon as you put that first '%' in front of the first letter, SQLite ignores any index you have on that column and does a full table scan, which is very slow. (https://www.sqlite.org/optoverview.html#the_like_optimizatio...)
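
If you want to double-check which plan you're actually getting, EXPLAIN QUERY PLAN will tell you. Continuing the hypothetical logs table above (note the LIKE optimization also has case-sensitivity/collation preconditions described in that link):

  -- with the default case-insensitive LIKE, the optimization needs a NOCASE
  -- index, or you can turn on case-sensitive LIKE and keep the BINARY index
  PRAGMA case_sensitive_like = ON;

  EXPLAIN QUERY PLAN
  SELECT * FROM logs WHERE errorvalue LIKE 'Generic Error Code 1%';
  -- expect: SEARCH using idx_logs_errorvalue

  EXPLAIN QUERY PLAN
  SELECT * FROM logs WHERE errorvalue LIKE '%Code 1%';
  -- expect: full SCAN of logs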

That said, if you think you're going to get to 80GB you might want to look at an alternative to SQLite or, at the very least, version your databases by month and use the ATTACH DATABASE command if you need to mine historical data (or even write a small script to search each database individually). SQLite isn't really designed to spread tables across different disks, and it's no fun to regularly back up 80GB files.
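
A rough sketch of the ATTACH approach (the file and table names here are made up):

  -- open the current month's database, then attach an older month's file
  ATTACH DATABASE 'logs_previous_month.db' AS prev;

  -- query both the main database and the attached one
  SELECT * FROM main.logs WHERE errorvalue LIKE 'Generic Error Code 1%'
  UNION ALL
  SELECT * FROM prev.logs WHERE errorvalue LIKE 'Generic Error Code 1%';

  DETACH DATABASE prev;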



Would any of the various full-text search plugins (e.g. https://www.sqlite.org/fts5.html) make searching your 80GB database reasonable?


@snowflakeonice

What a time to be alive! I've just started populating the sqlite virtual fts table. I will report back with my findings!


Any news? I am very interested.


Okay it finally (30 seconds ago!) finished indexing.

The reason it ran out of disk space is that I included 3 columns to index on (in this case: name, path, filename) and it ballooned my 66GB db to 185GB!

However, every single query afterward was instantaneous. Literally milliseconds to pull 90K results from three full-text columns across 500 million rows. And the search words were anywhere in the column, not just the beginning. Incredible. I'm simply blown away.

All I did was run this single command:

  CREATE VIRTUAL TABLE fts USING fts5(hash UNINDEXED, title, path, filename);

Then I wrote a normal INSERT statement to populate it, just like I would a regular table. It was so painless.
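
In rough outline it was something like this (substitute your own source table name for the placeholder 'files'; your column list may differ):

  -- copy the text columns from the existing table into the FTS index
  INSERT INTO fts (hash, title, path, filename)
    SELECT hash, title, path, filename FROM files;

  -- MATCH searches every indexed column at once
  SELECT hash, title FROM fts WHERE fts MATCH 'searchword';

  -- or restrict the match to a single column
  SELECT hash, title FROM fts WHERE fts MATCH 'filename:searchword';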

Just be aware that each column appears to drastically increase the size of the DB.

I'm so excited! I have so many other databases to try this out on!

EDIT: now I'm going to move it off the SSD to a mechanical drive and see if it still holds the same performance.


Wow, thanks for posting your results. This all sounds very promising. I am thinking about gathering a few pieces of data from custom sensors around the house to figure out what I can do to cut down on energy costs (well, that is the excuse ;)). Instead of using Kibana and the like, I'd rather use a lightweight SQLite DB, and your stats make me hopeful that it can handle this.

Also it is all just very interesting.


Final Comment on this: after moving it to a slow mechanical drive the query speed dropped dramatically, as expected. What was almost instantaneous on the SSD took anywhere from 40 to 120 seconds on the slower drive. However, previously the dumb full table scan took anywhere from 120-240 seconds on the SSD and I never even bothered trying on the slow drive!


Hahaha, it's still running. I ended up running out of drive space on the SSD I started it on, which kicked off a lengthy process of shuffling things around, cleaning up indexes, and VACUUMing two very large databases (60GB and 80GB, which ran all night), and finally (as of this morning) restarting the process of populating the FTS table. I will respond when it's finally done, I promise!



