It seems the whole index is kept in RAM, so the index size is limited by the amount of RAM available. This explains the impressive indexing and search performance (1M blog posts, 500M of data, indexing finished in 28 seconds; 1.65 ms search response time; 19K search QPS).
Persisted data is written to disk only when the program shuts down, and restored from disk when the program restarts ( https://github.com/go-ego/riot/blob/master/docs/zh/persisten... ). This is a limited approach compared to Lucene/Solr/Elasticsearch, which handle high-volume inserts to their indexes with a log-structured merge-tree (LSM) and where the index size is limited only by the available disk space.
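To make the trade-off concrete, here is a toy sketch of the pattern being described: a RAM-resident inverted index that is persisted only at shutdown. This is my own illustration, not riot's actual code or API; the class and method names are hypothetical.

```python
import pickle
from collections import defaultdict

class InMemoryIndex:
    """Toy RAM-resident inverted index: term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # A pure in-memory dict lookup -- this is why an all-RAM
        # index gives millisecond responses and high QPS.
        return self.postings.get(term.lower(), set())

    def save(self, path):
        # Persisted only here (e.g. at shutdown); a crash before
        # this point loses all updates since the last save.
        with open(path, "wb") as f:
            pickle.dump(dict(self.postings), f)

    def load(self, path):
        with open(path, "rb") as f:
            self.postings = defaultdict(set, pickle.load(f))

idx = InMemoryIndex()
idx.add(1, "fast in memory search")
idx.add(2, "memory is limited")
print(idx.search("memory"))  # {1, 2}
```

An LSM-based engine instead appends every insert to a durable, disk-backed structure, which is what removes both the RAM limit and the "save only at shutdown" fragility.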
Very interesting, can you elaborate on it a little more?
I need quick fuzzy search on a low-end embedded device that has limited storage (both RAM and disk). I was thinking about putting the index on a server with plenty of RAM and then querying it over WebSocket or RPC.
There's a very good blog post on the implementation details here: http://alexbowe.com/wavelet-trees/ I had a decent implementation in Python, but it's on my old MacBook that I would need to dig up. If you're interested, you can add me on Telegram: @rightcheek.
To go along with wavelet trees, you may or may not need to know about suffix arrays and optimal suffix array construction. Take a look at this: https://en.wikipedia.org/wiki/Suffix_array That combination is what gives you space efficiency, and the wavelet tree also gives you good rank/select efficiency.
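To illustrate just the suffix-array half of this (the wavelet tree and optimal construction are beyond a comment), here is a deliberately naive sketch: build the array by sorting suffix start positions, then answer substring queries with binary search. Real construction algorithms are O(n) rather than this O(n² log n) toy.

```python
def suffix_array(s):
    """Naive suffix array: sort all suffix start positions lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def contains(s, sa, pattern):
    """Binary search the suffix array for a suffix starting with pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first suffix >= pattern; check it actually matches.
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(pattern)] == pattern

text = "banana"
sa = suffix_array(text)
print(sa)                          # [5, 3, 1, 0, 4, 2]
print(contains(text, sa, "ana"))   # True
print(contains(text, sa, "nab"))   # False
```

The array itself is just n integers over the text, which is where the space efficiency comes from; a wavelet tree over the (compressed) text then supplies the fast rank/select operations mentioned above.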
FWIW, a few years back I was averaging <5ms for a benchmark of live traffic (many thousands of queries per second) on Pinterest queries against their full dataset, on a single EC2 box with a week's worth of customization on top of Lucene.
Trinity - depending on the execution mode - is over 100% faster for certain queries compared to Lucene, and Lucene is already very fast. It all comes down to the postings list codecs and the iterators design/implementation anyway.
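For anyone wondering what "postings list codecs" refers to: engines store each term's sorted doc-id list as gaps between ids, then compress the gaps, e.g. with a variable-byte encoding (Lucene's real codecs are more elaborate, block-based schemes). A toy variable-byte codec, as a rough illustration only:

```python
def varint_encode(doc_ids):
    """Delta-encode sorted doc ids, then variable-byte-encode the gaps."""
    out = bytearray()
    prev = 0
    for n in doc_ids:
        gap = n - prev
        prev = n
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)                      # final byte, continuation bit clear
    return bytes(out)

def varint_decode(data):
    doc_ids, n, shift, prev = [], 0, 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += n                        # undo the delta encoding
            doc_ids.append(prev)
            n, shift = 0, 0
    return doc_ids

docs = [3, 7, 260, 100000]
encoded = varint_encode(docs)
assert varint_decode(encoded) == docs
print(len(encoded), "bytes vs", 4 * len(docs), "for fixed 32-bit ids")
```

Smaller postings mean fewer bytes to scan per query, which is why codec and iterator design dominate query latency.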
Then again, the language is 'Go.' I am not sure that was the best naming choice and I've heard complaints that it was initially difficult to search for it. My understanding is that 'golang' has helped as a query. So, I guess the situation has improved but I understand it was problematic at first.
To be clear, I've never used the language. I actually dislike programming, though I've decided to get back into it because I have a couple of projects I want to poke at. I'm now deciding between Java and Python.
I think Python is the much better choice to start out, but Java is pretty great too. It's hard to find languages that are actually bad choices... maybe COBOL.
I think Python would be better to start with. It is interpreted so you will get sane error messages. The community is omnipresent. If you'd like instant answers, you could just pop in to #python on irc.freenode.net.
Most of the time, all your questions will be answered by a simple Google search, as there are mountains of good questions and answers on Stack Overflow.
IMHO for you, getting back into programming will be as simple as opening the interpreter and starting to type.
I don't think programming has changed, the basic mentality is still the same. And nothing beats good old experience.
The new niche "trends" such as asynchronous programming, actor-based programming, etc. can easily be picked up by lingering on HN for a while :).
I'd say the choice really depends on what you are trying to do. Python, Java (and even Golang) all have their pros and cons. If you are doing personal projects that involve data processing, and maybe a simple webapp, nothing beats Python and its ecosystem. If you are planning to grow a team, and the projects are enterprise-ish, then Java is a good contender. In my case, while I wrote some 30K lines of Golang in my last job a few years back (and still get compliments from the current maintainers to this day), I just tolerated the language and never enjoyed it. Before that, I did a lot of Java, and before that C and Perl. I don't program daily anymore, and only play with data science / ML ideas these days, so I do Python (mostly in Jupyter notebooks), run the scripts on real datasets on remote servers, and use a simple Flask app to display the results/charts.
If you're considering Java, you might consider a Java-derivative like Kotlin. It's similar enough to Java that it's suitable for basically any task where Java will work. The IDE support is great and the learning curve for anyone who writes Java will be quick. After having used Java for close to 20 years and now having tried Kotlin, I see very little reason to start a project from scratch in Java these days.
Rust is out because of their cultlike traits and their need to turn a language into a political statement. Elixir I know nothing about and an absolute necessity is a wide variety of educational tools, existing projects, libraries, and help sites where people are free to tell me when I'm being an idiot.
I'm in the process of building my own search engine (as a learning exercise, but also because it's related to my day job). I've learned that it's one thing to write a full-text search engine, like this one, and it's quite another to do field-specific searches with faceting support and so on, like Algolia and Lucene-based search engines do.
That said, this is clean and simple. I like it. I can definitely learn from this.
Supporting faceted search and other functionality that requires access to per-document field values is just an extension of the core IR functionality.
Tracking (document, field) values can be used for querying by range or by geolocation primitives (that's what Lucene does: it indexes that data into a special tree-like structure, and for each query it builds a custom 'iterator' and uses it along with the other iterators to match documents), and for static ranking of matched documents.
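The "extension over the core IR functionality" point can be sketched in a few lines: keep a column of per-document field values next to the inverted index, match doc ids through the index as usual, then aggregate or filter over the matched ids. All names and data here are hypothetical, and this skips the tree-structured indexes Lucene actually builds for ranges.

```python
from collections import Counter, defaultdict

# Core IR part: inverted index, term -> doc ids.
index = defaultdict(set)
# Extension: per-document field values ("doc values"), addressed by doc id.
doc_values = {}

docs = [
    (1, "red running shoes", {"color": "red", "price": 80}),
    (2, "blue running shoes", {"color": "blue", "price": 60}),
    (3, "red rain jacket", {"color": "red", "price": 120}),
]
for doc_id, text, fields in docs:
    doc_values[doc_id] = fields
    for term in text.split():
        index[term].add(doc_id)

matched = index["red"]                                   # full-text match
facets = Counter(doc_values[d]["color"] for d in matched)  # facet counts
in_range = {d for d in matched if doc_values[d]["price"] <= 100}
print(facets)    # Counter({'red': 2})
print(in_range)  # {1}
```

The same doc-values column also serves static ranking: sort the matched ids by a stored per-document score instead of counting over them.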
BTW, Lucene and Algolia are vastly different in terms of the underlying architecture.
Could you please include a bibliography in your project? That way, others can more easily find which techniques you are using, and of course it can help others (or even your future self) in figuring out how the code works.
It's worth mentioning that the original Sphinxsearch project has been stagnant for the past year. There's a new lively fork - https://manticoresearch.com/
Beyond being a user of both I don't have affiliations with either of them.
Here's what I've gathered from the online discussions:
Possible reasons why the original project stagnated: [0].
Mention of Manticore as a fork project was removed from the Sphinx forum [1], so I can guess that the developers who moved to the new project did not part on good terms with Andrew, the original author.
Here's the last reply in that thread from a person involved in Manticore, posted on the 23rd of Oct 2017:
"
aditirex just replied to 'Sphinx search fork':
===cut===
> But your the people who are already using sphinx, why we should change?
The open-source version of Sphinx received 5 code commits since November last year, from which 3 are related to building stuff. Last release was 12 months ago.
There are also a lot of unresolved reported bugs (many of them are crashes) in the bug tracker.
Andrew said a while ago that the open-source version would only receive fixes (which doesn't seem to happen either).
No one wanted to do the fork, it was the only way several big users saw it in order to continue using Sphinx. Don't ask me how we got into this situation, I'm not the right person to answer to that.
> What are the main benefits rather than using Sphinx?
Manticore is pretty much continuing Sphinx. Last year we had 4 developers + Andrew working on the code, 3 of them are working now on Manticore.
If Sphinx just works for you there is no reason to switch. But we're adding new features, fix existing bugs, the software is tested by some big users before getting released, you get a software that has support from it's developers.
> Is Foolz\SphinxQL\SphinxQL working?
Everything works as before, it's a fork, not a total new software.
"
It hasn't been tied to MySQL in the last 10 years, so I'm wondering if it ever was.
You can connect to Sphinx using a MySQL client, use it as a MySQL storage engine, or use MySQL as a data source. But it's not specifically tied to MySQL.
Sphinx can connect to MySQL or Postgres, but can also read XML and [TC]SV, among others. Does this do something better? The examples seem to be where the user is providing the data to the indexer, which Sphinx can also do.
Does it suffer the same limitation as PostgreSQL's full-text search? i.e., it doesn't use corpus frequencies in its ranking function. (I skimmed the docs but couldn't immediately find my answer.)