Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Riot – Full-text search engine in Go (github.com/go-ego)
221 points by veni0 on Oct 23, 2017 | hide | past | favorite | 66 comments


It seems the whole index is kept in RAM. Thus the index size is limited by the amount of RAM available. This explains the impressive indexing and search performance (1M blog 500M data 28 seconds index finished, 1.65 ms search response time, 19K search QPS) The Persistent storage data is stored to the hard disk solely when the program closes. The data is then restored from the hard disk when the program restarts ( https://github.com/go-ego/riot/blob/master/docs/zh/persisten... ). This is a limited approach compared to Lucene/Solr/Elasticsearch LSM which handle high-volume inserts to its indexes with a log-structured merge-tree (LSM) and where the index size is only limited by the available hard disk space.


1.65ms for what kind of queries? Also, is that 1M blog posts, all weighting 500mb in total size(characters)?


I wonder if they're using succint data structures.

I'm in bioinformatics and the first time I implemented a wavelet tree to reduce the size of genomes in memory.. It was just breathtaking.


Neat, TIL about wavelet trees. https://en.m.wikipedia.org/wiki/Wavelet_Tree


That's a beautiful structure. Thanks for sharing.


very interesting, can you elaborate and on it a little more?

I need quick fuzzy search on a low-end embedded device that has limited storage(both RAM and HDD), was thinking about putting the index on a server with plenty RAM then do websocket or RPC for that.


There's a very good blog post for the implementation details here: http://alexbowe.com/wavelet-trees/ I had a decent implementation in Python, but it's on my old macbook that I would need to dig up. If you're interested you can add me on telegram: @rightcheek.

Now to go with Wavelet trees you may or may not need to know about suffix arrays and optimal suffix array construction. Take a look at this: https://en.wikipedia.org/wiki/Suffix_array This is what's going to give you space efficiency in combination with a wavelet tree. And the wavelet tree also gives you good rank/select efficiency.

Edit: Here's a suffix array construction algorithm implementation I did (not sure if it's fully correct) https://github.com/ethanwillis/comp7295_final/blob/master/sa... It is based on this paper: https://local.ugene.unipro.ru/tracker/secure/attachment/1214...


FWIW, a few years back, I was averaging <5ms for a benchmark of live traffic (many thousands of queries per second) on Pinterest queries against their full dataset on a single EC2 box with a weeks worth of customization on top of Lucene.

The existing stuff isn't slow.


Trinity - depending on the execution mode - is over 100% faster for certain queries compared to Lucene, and Lucene is already very fast. It all comes down to the postings ling codecs, and the iterators design/impl anyway.


The data is a real-time storage disk, the index size is limited by the available hard memory space.


Yep, the current index will be anyway in memory, I am writing a branch of the cache disk.


The document is improving.


Sidebar: I wish open source authors would think a bit harder about naming their projects. Here's some other projects already named riot:

- http://riotjs.com/

- https://riot-os.org/

- https://github.com/vector-im


And riot games. Basically the worst name possible for SEO.


Then again, the language is 'Go.' I am not sure that was the best naming choice and I've heard complaints that it was initially difficult to search for it. My understanding is that 'golang' has helped as a query. So, I guess the situation has improved but I understand it was problematic at first.

To be clear, I've never used the language. I actually dislike programming, though I've decided to get back into it because I have a couple of projects I want to poke at. I'm now deciding between Java and Python.

Maybe I should do an 'ask HN' submission.


"Ask HN" sounds like a great idea.

I think Python is the much better choice to start out, but Java is pretty great too. It's hard to find languages that are actually bad choices... maybe COBOL.


I have done some COBOL and even some PASCAL. Yeah, I'm old.

I hired professionals in 1995. I was done doing any of the coding by 2000. I sold and retired in 2007.

I took only one course in C. Everything else was learned on my own, informally.


I think Python would be better to start with. It is interpreted so you will get sane error messages. The community is omnipresent. If you'd like instant answers, you could just pop in to #python on irc.freenode.net.

Most of the times all your questions will be answered by a simple google search, as there are mountains of good questions and answers on stackoverflow.

IMHO for you, getting back into programming will be as simple as opening the interpreter and starting to type.

I don't think programming has changed, the basic mentality is still the same. And nothing beats good old experience.

The new niche "trends" such as asyncronous programming, actor-based programming etc. could be easily learned by lingering on HN for a while :).

Best of luck getting back on the keyboard!


If you dislike programming, Python may be the saner choice.


Brainfuck is the sanest choice.


I'm not new to programming. I just haven't done much for a whole lot of years. Brainfuck, I'm familiar with it, is never the sanest choice.

I've narrowed it down to Python or Java. I've done C, C++, BASIC, QBASIC, Perl, PHP, and even some COBOL. I've played with a few others.


I'd say the choice really depends on the what you are trying to do. Python, Java (and even Golang) all have their pros and cons. If you are doing personal projects that involve with data processing, and maybe a simple webapp, nothing beats Python and its ecosystem. If you are planning to grow a team, and the projects are enterprise-ish, then Java is a good contender. In my case, while I wrote some 30K lines of Golang in my last job a few years back (and still got compliments from the current maintainers to this day), I just tolerated the language and never enjoyed it. Before that, I did a lot of Java and before that C and Perl. I don't program daily anymore, and only play with data science / ML ideas these days, and so I do Python (mostly in Jupyter notebooks) and run the scripts on real datasets in remote servers, with a simple Flask app to display the results/charts.


If you're considering Java, you might consider a Java-derivative like Kotlin. It's similar enough to Java that it's suitable for basically any task where Java will work. The IDE support is great and the learning curve for anyone who writes Java will be quick. After having used Java for close to 20 years and now having tried Kotlin, I see very little reason to start a project from scratch in Java these days.


Haha, my reply was more of a good nature jab @ GP's comment on Python.

Have you thought of learning a new programming paradigm? Why not check out Elixir or Rust?


Rust is out because of their cultlike traits and their need to turn a language into a political statement. Elixir I know nothing about and an absolute necessity is a wide variety of educational tools, existing projects, libraries, and help sites where people are free to tell me when I'm being an idiot.

I learn best from people with exacting standards.


>Rust is out because of their cultlike traits and their need to turn a language into a political statement.

If you're going to eliminate languages for that, you're basically out of options.


Good names are some well-known things, no way.


Also, I use luci.criosweb.ro/riot


Also, for consideration: Bleve (https://github.com/blevesearch/bleve)

I'm in the process of building my own search engine (as a learning exercise, but also because it's related to my day job). I've learned that it's one thing to write a full-text search engine, like this one, and it's quite another to do field-specific searches with faceting support and so on, like Algolia and Lucene-based search engines do.

That said, this is clean and simple. I like it. I can definitely learn from this.


Supporting faced search and other functionality requiring access to per document field-values is just an extension over the core IR functionality.

Tracking (document, field) values can be used for query by range or by geolocation primitives (that's what Lucene does, where it will index that data into a special tree-like structure, and for each query, it will build a custom 'iterator' and use it along with other iterators to match documents), and for static ranking of matched documents.

BTW, Lucene and Algolia are vastly different in terms of the underlying architecture.


What is Algolia's underlying architecture like? Are there any papers or code?


See links about Algolia arch (and other related material) here: https://github.com/phaistos-networks/Trinity/wiki/IR-Search-...


Thanks for the links, Mark! (Disclaimer, I work for Algolia.)

As Mark mentions on his summary page, the best place for that kind of information is our CTO's "Inside the Engine" series (8 parts).

https://blog.algolia.com/inside-the-algolia-engine-part-1-in...


That's why I mentioned them separately. :)


Could you please include a bibliography in your project? That way, others can more easily find which techniques you are using, and of course it can help others (or even your future self) in figuring out how the code works.


Last time I checked, bleeve was very slow compared to Lucene, I really hope it gets better over time.


I was planning to write one too, wanna team up? :)



Not to mention http://riotjs.com/.


What about https://www.riotgames.com ? More so because they have an engineering blog, which is very interesting : https://engineering.riotgames.com/


Mentioning Pendragon et al. in a discussion about originality? Rich.


I think the company is a bit bigger than one person by now.



Not that I'm against people building tools in their language of choice, but how does this compare to Sphinx (http://sphinxsearch.com/)?


It's worth mentioning that the original Sphinxsearch project has been stagnant for the past year. There's a new lively fork - https://manticoresearch.com/

Beyond being a user of both I don't have affiliations with either of them.


Good to know. Any idea what happened?


That's what I've gathered based on the online discussions:

Possible reasons why the original project stagnated: [0].

Mention of the manticoresearch as a fork project was removed from the sphinx forum [1] - so I can guess that developers who moved to the new project did not part on good terms with Andrew - the original author.

that's the last reply in that thread from person involved in Manticore, posted on the 23rd of Oct 2017:

" aditirex just replied to 'Sphinx search fork':

===cut=== > But your the people who are already using sphinx, why we should change?

The open-source version of Sphinx received 5 code commits since November last year, from which 3 are related to building stuff. Last release was 12 months ago. There are also a lot of unresolved reported bugs (many of them are crashes) in the bug tracker. Andrew said a while ago that the open-source version would only receive fixes (which doesn't seem to happen either). No one wanted to do the fork, it was the only way several big users saw it in order to continue using Sphinx. Don't ask me how we got into this situation, I'm not the right person to answer to that.

> What are the main benefits rather than using Sphinx?

Manticore is pretty much continuing Sphinx. Last year we had 4 developers + Andrew working on the code, 3 of them are working now on Manticore.

If Sphinx just works for you there is no reason to switch. But we're adding new features, fix existing bugs, the software is tested by some big users before getting released, you get a software that has support from it's developers.

> Is Foolz\SphinxQL\SphinxQL working?

Everything works as before, it's a fork, not a total new software. "

[0] http://sphinxsearch.com/blog/2017/07/24/sphinx-2017/ [1] http://webcache.googleusercontent.com/search?q=cache%3Ahttp%...


Last I used Sphinx, it is tied to MySQL. Is that still true?

If so, then Sphinx could be a deal breaker to some.


It hasn't been tied to MySQL in the last 10 years, so I'm wondering if it ever was.

You can connect to Sphinx using a MySQL client, use it as a MySQL storage engine or using MySQL as a data source. But it's not specifically tied to MySQL.


Sphinx can connect to MySQL or Postgres, but can also read XML and [TC]SV, among others. Does this do something better? The examples seem to be where the user is providing the data to the indexer, which Sphinx can also do.


A little late but if you didn't know you could do reasonably fast Full Text Search with SQLite... Now you know:

https://sqlite.org/fts3.html

https://sqlite.org/fts5.html


Does it suffer the same limitation as PostgreSQL's fulltext search? i.e., It doesn't use corpus frequencies in its ranking function. (I skimmed the docs but couldn't immediately find my answer.)


I'm not sure -- but I'm going to guess yes? I tried to look around and couldn't find an answer either...


I mean, we have bleve[0]. What else do you need, really?

[0] https://github.com/blevesearch/bleve


Multi-language and distributed support, simple.


bleve search does support multiple languages.


But like the Chinese must use plug-ins.


can this replace elastic search?

I'm looking for a light weight elastic search alternative.


Ah, this is awesome! So far I had to rely on Lucene++, and it could get complicated at times. Using go is something I had been wishing for.


Yep, go can be a simple deployment.


Also for consideration: https://github.com/coleifer/scout

> RESTful search server written in Python, powered by SQLite.


A comparison with Yacy would be interesting IMHO


I wish I could understand mandarin, such nice projects need some good translation.


Thanks for understanding, the document is improving.


Does anyone know anything like this that will search pdf text, or tiff?


http://tika.apache.org

Which I'm pretty sure can be embedded in Solr, has plugins for Elasticsearch and others.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: