It seems the whole index is kept in RAM, so the index size is limited by the amount of RAM available. This explains the impressive indexing and search performance (1M blog posts, 500M of data, indexing finished in 28 seconds; 1.65 ms search response time; 19K search QPS).
Persisted data is written to disk only when the program shuts down, and restored from disk when the program restarts ( https://github.com/go-ego/riot/blob/master/docs/zh/persisten... ). This is a limited approach compared to Lucene/Solr/Elasticsearch, which handle high-volume inserts to their indexes with a log-structured merge-tree (LSM) and where the index size is limited only by the available disk space.
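To make the trade-off concrete, here is a toy sketch of the pattern being described: a RAM-resident inverted index that is persisted only at shutdown. This is my own illustration, not riot's actual code or API; the class and method names are hypothetical.

```python
import pickle
from collections import defaultdict

class InMemoryIndex:
    """Toy RAM-resident inverted index: term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # A pure in-memory dict lookup -- this is why an all-RAM
        # index gives millisecond responses and high QPS.
        return self.postings.get(term.lower(), set())

    def save(self, path):
        # Persisted only here (e.g. at shutdown); a crash before
        # this point loses all updates since the last save.
        with open(path, "wb") as f:
            pickle.dump(dict(self.postings), f)

    def load(self, path):
        with open(path, "rb") as f:
            self.postings = defaultdict(set, pickle.load(f))

idx = InMemoryIndex()
idx.add(1, "fast in memory search")
idx.add(2, "memory is limited")
print(idx.search("memory"))  # {1, 2}
```

An LSM-based engine instead appends every insert to a durable, disk-backed structure, which is what removes both the RAM limit and the "save only at shutdown" fragility.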
Very interesting, can you elaborate on it a little more?
I need quick fuzzy search on a low-end embedded device that has limited storage (both RAM and disk). I was thinking about putting the index on a server with plenty of RAM and then querying it over WebSocket or RPC.
There's a very good blog post on the implementation details here: http://alexbowe.com/wavelet-trees/ I had a decent implementation in Python, but it's on my old MacBook that I would need to dig up. If you're interested, you can add me on Telegram: @rightcheek.
To go along with wavelet trees, you may or may not need to know about suffix arrays and optimal suffix array construction. Take a look at this: https://en.wikipedia.org/wiki/Suffix_array That combination is what gives you space efficiency, and the wavelet tree also gives you good rank/select efficiency.
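To illustrate just the suffix-array half of this (the wavelet tree and optimal construction are beyond a comment), here is a deliberately naive sketch: build the array by sorting suffix start positions, then answer substring queries with binary search. Real construction algorithms are O(n) rather than this O(n² log n) toy.

```python
def suffix_array(s):
    """Naive suffix array: sort all suffix start positions lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def contains(s, sa, pattern):
    """Binary search the suffix array for a suffix starting with pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first suffix >= pattern; check it actually matches.
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(pattern)] == pattern

text = "banana"
sa = suffix_array(text)
print(sa)                          # [5, 3, 1, 0, 4, 2]
print(contains(text, sa, "ana"))   # True
print(contains(text, sa, "nab"))   # False
```

The array itself is just n integers over the text, which is where the space efficiency comes from; a wavelet tree over the (compressed) text then supplies the fast rank/select operations mentioned above.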
FWIW, a few years back I was averaging <5ms for a benchmark of live traffic (many thousands of queries per second) on Pinterest queries against their full dataset, on a single EC2 box with a week's worth of customization on top of Lucene.
Trinity - depending on the execution mode - is over 100% faster for certain queries compared to Lucene, and Lucene is already very fast. It all comes down to the postings list codecs and the iterators design/implementation anyway.
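For anyone wondering what "postings list codecs" refers to: engines store each term's sorted doc-id list as gaps between ids, then compress the gaps, e.g. with a variable-byte encoding (Lucene's real codecs are more elaborate, block-based schemes). A toy variable-byte codec, as a rough illustration only:

```python
def varint_encode(doc_ids):
    """Delta-encode sorted doc ids, then variable-byte-encode the gaps."""
    out = bytearray()
    prev = 0
    for n in doc_ids:
        gap = n - prev
        prev = n
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)                      # final byte, continuation bit clear
    return bytes(out)

def varint_decode(data):
    doc_ids, n, shift, prev = [], 0, 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += n                        # undo the delta encoding
            doc_ids.append(prev)
            n, shift = 0, 0
    return doc_ids

docs = [3, 7, 260, 100000]
encoded = varint_encode(docs)
assert varint_decode(encoded) == docs
print(len(encoded), "bytes vs", 4 * len(docs), "for fixed 32-bit ids")
```

Smaller postings mean fewer bytes to scan per query, which is why codec and iterator design dominate query latency.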
Then again, the language is 'Go.' I am not sure that was the best naming choice and I've heard complaints that it was initially difficult to search for it. My understanding is that 'golang' has helped as a query. So, I guess the situation has improved but I understand it was problematic at first.
To be clear, I've never used the language. I actually dislike programming, though I've decided to get back into it because I have a couple of projects I want to poke at. I'm now deciding between Java and Python.
I think Python is the much better choice to start out, but Java is pretty great too. It's hard to find languages that are actually bad choices... maybe COBOL.
I think Python would be better to start with. It is interpreted so you will get sane error messages. The community is omnipresent. If you'd like instant answers, you could just pop in to #python on irc.freenode.net.
Most of the time, all your questions will be answered by a simple Google search, as there are mountains of good questions and answers on Stack Overflow.
IMHO for you, getting back into programming will be as simple as opening the interpreter and starting to type.
I don't think programming has changed, the basic mentality is still the same. And nothing beats good old experience.
The new niche "trends" such as asynchronous programming, actor-based programming, etc. can easily be picked up by lingering on HN for a while :).
I'd say the choice really depends on what you are trying to do. Python, Java (and even Golang) all have their pros and cons. If you are doing personal projects that involve data processing, and maybe a simple webapp, nothing beats Python and its ecosystem. If you are planning to grow a team, and the projects are enterprise-ish, then Java is a good contender. In my case, while I wrote some 30K lines of Golang in my last job a few years back (and still get compliments from the current maintainers to this day), I just tolerated the language and never enjoyed it. Before that, I did a lot of Java, and before that C and Perl. I don't program daily anymore, and only play with data science / ML ideas these days, so I do Python (mostly in Jupyter notebooks), run the scripts on real datasets on remote servers, and use a simple Flask app to display the results/charts.
If you're considering Java, you might consider a Java-derivative like Kotlin. It's similar enough to Java that it's suitable for basically any task where Java will work. The IDE support is great and the learning curve for anyone who writes Java will be quick. After having used Java for close to 20 years and now having tried Kotlin, I see very little reason to start a project from scratch in Java these days.
Rust is out because of their cultlike traits and their need to turn a language into a political statement. Elixir I know nothing about and an absolute necessity is a wide variety of educational tools, existing projects, libraries, and help sites where people are free to tell me when I'm being an idiot.
I'm in the process of building my own search engine (as a learning exercise, but also because it's related to my day job). I've learned that it's one thing to write a full-text search engine, like this one, and it's quite another to do field-specific searches with faceting support and so on, like Algolia and Lucene-based search engines do.
That said, this is clean and simple. I like it. I can definitely learn from this.
Supporting faceted search and other functionality that requires access to per-document field values is just an extension of the core IR functionality.
Tracking (document, field) values can be used for querying by range or by geolocation primitives (that's what Lucene does: it indexes that data into a special tree-like structure, and for each query it builds a custom 'iterator' and uses it along with the other iterators to match documents), and for static ranking of matched documents.
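The "extension over the core IR functionality" point can be sketched in a few lines: keep a column of per-document field values next to the inverted index, match doc ids through the index as usual, then aggregate or filter over the matched ids. All names and data here are hypothetical, and this skips the tree-structured indexes Lucene actually builds for ranges.

```python
from collections import Counter, defaultdict

# Core IR part: inverted index, term -> doc ids.
index = defaultdict(set)
# Extension: per-document field values ("doc values"), addressed by doc id.
doc_values = {}

docs = [
    (1, "red running shoes", {"color": "red", "price": 80}),
    (2, "blue running shoes", {"color": "blue", "price": 60}),
    (3, "red rain jacket", {"color": "red", "price": 120}),
]
for doc_id, text, fields in docs:
    doc_values[doc_id] = fields
    for term in text.split():
        index[term].add(doc_id)

matched = index["red"]                                   # full-text match
facets = Counter(doc_values[d]["color"] for d in matched)  # facet counts
in_range = {d for d in matched if doc_values[d]["price"] <= 100}
print(facets)    # Counter({'red': 2})
print(in_range)  # {1}
```

The same doc-values column also serves static ranking: sort the matched ids by a stored per-document score instead of counting over them.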
BTW, Lucene and Algolia are vastly different in terms of the underlying architecture.
Could you please include a bibliography in your project? That way, others can more easily find which techniques you are using, and of course it can help others (or even your future self) in figuring out how the code works.
It's worth mentioning that the original Sphinxsearch project has been stagnant for the past year. There's a new lively fork - https://manticoresearch.com/
Beyond being a user of both I don't have affiliations with either of them.
Here's what I've gathered from the online discussions:
Possible reasons why the original project stagnated: [0].
Mention of Manticore as a fork project was removed from the Sphinx forum [1], so I can guess that the developers who moved to the new project did not part on good terms with Andrew, the original author.
Here's the last reply in that thread from a person involved in Manticore, posted on the 23rd of Oct 2017:
"
aditirex just replied to 'Sphinx search fork':
===cut===
> But your the people who are already using sphinx, why we should change?
The open-source version of Sphinx received 5 code commits since November last year, from which 3 are related to building stuff. Last release was 12 months ago.
There are also a lot of unresolved reported bugs (many of them are crashes) in the bug tracker.
Andrew said a while ago that the open-source version would only receive fixes (which doesn't seem to happen either).
No one wanted to do the fork, it was the only way several big users saw it in order to continue using Sphinx. Don't ask me how we got into this situation, I'm not the right person to answer to that.
> What are the main benefits rather than using Sphinx?
Manticore is pretty much continuing Sphinx. Last year we had 4 developers + Andrew working on the code, 3 of them are working now on Manticore.
If Sphinx just works for you there is no reason to switch. But we're adding new features, fix existing bugs, the software is tested by some big users before getting released, you get a software that has support from it's developers.
> Is Foolz\SphinxQL\SphinxQL working?
Everything works as before, it's a fork, not a total new software.
"
It hasn't been tied to MySQL in the last 10 years, so I'm wondering if it ever was.
You can connect to Sphinx using a MySQL client, use it as a MySQL storage engine, or use MySQL as a data source. But it's not specifically tied to MySQL.
Sphinx can connect to MySQL or Postgres, but can also read XML and [TC]SV, among others. Does this do something better? The examples seem to be where the user is providing the data to the indexer, which Sphinx can also do.
Does it suffer the same limitation as PostgreSQL's full-text search? i.e., it doesn't use corpus frequencies in its ranking function. (I skimmed the docs but couldn't immediately find my answer.)