I've used Xapian extensively, but not this new Xapiand tool, so I can only speak to the actual library. Xapian is a C++ library that accesses index data files directly on disk.
There are bindings for various languages; the Python bindings, say, let you do 'import xapian' and get FFI bindings to the library. You then open your on-disk index files and issue queries.
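For a rough sense of the shape of it, here's a minimal sketch using the Python bindings (the index path "docs" is hypothetical):

    import xapian

    # Open an existing on-disk index read-only.
    db = xapian.Database("docs")

    # Parse a free-text query with English stemming.
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_database(db)
    query = qp.parse_query("full text search")

    # Run a ranked search and walk the top 10 matches.
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    for match in enquire.get_mset(0, 10):
        print(match.rank + 1, match.percent, match.document.get_data())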
Xapian supports many concurrent readers, but only one writer. It's not a server and there are no protocols; maybe that's what this Xapiand tool adds. In general the overhead is very light: just enough RAM to hold the library code, and the OS takes care of all the filesystem-level caching.
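The single-writer side looks much the same (again just a sketch; the path and document text are made up):

    import xapian

    # Only one process may hold the index open for writing at a time.
    db = xapian.WritableDatabase("docs", xapian.DB_CREATE_OR_OPEN)

    # Index a document's text into stemmed terms.
    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("en"))

    doc = xapian.Document()
    doc.set_data("Example article body")  # payload returned with matches
    tg.set_document(doc)
    tg.index_text("Example article body")

    db.add_document(doc)
    db.commit()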
Many of the same concepts found in Lucene (documents, terms, weights, flavors of BM25 relevance ranking, query parse trees, relevancy operators, and so on) apply to Xapian as well.
I love Xapian: the quality of its recall is excellent, and its indexing performance is very hard to fault. There's just a tiny problem - it's stuck with the GPL, despite a long effort to relicense the code going back years.
Is there actually precedent for a GPL library, clearly set up to be consumed via an API, requiring GPL for the whole binary? On the one hand that's clearly not the case for the Linux API; on the other hand, the LGPL exists. But I was curious whether this is actually settled or just too murky to live with.
I know the situation with the Linux kernel. The discussion hinges on what counts as a "derived work". Kernel modules are deeply interlinked with the kernel, particularly since there are no stable APIs inside it. It seems strange that merely linking to a GPL library that defines an API makes you a derived work, but that is the FSF position.
This is why I always do libs as LGPL, but it seems strange to me that it's even needed. If I've defined a proper opaque API, to be consumed by external code I know nothing about, it's strange to then argue that the library's callers are derived works and that the LGPL is explicitly needed.
There are many GPL/LGPL databases with BSD drivers (e.g. ScyllaDB). I don't know about libraries, though; you should be able to use the library externally, I think, just as you do with the database.
Maybe the Xapian library just needs a little push from a larger community to make relicensing faster, and Xapiand could help toward that end by bringing in more people who can help. Xapiand's own source code is licensed as MIT (before compiling), and the Xapian community is already taking big steps toward relicensing.
I checked out Xapiand several months ago after I stumbled across it during a fit of GitHub browsing. It certainly seems fast, and it's very easy to add documents, but there is so little documentation that I was unable to test it out in any significant way. I'm very interested to see where the project goes, especially if Xapian itself switches away from the GPL.
Somewhat unrelated, but I've written a RESTful search server powered by SQLite's full-text search. It's an extremely lightweight Python (Flask) app. Nice for blogs or small projects, if I say so myself!
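The core of such a thing is small enough to sketch (a hypothetical minimal version, assuming an SQLite FTS5 table docs(title, body) and a database file called search.db):

    import sqlite3
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/search")
    def search():
        q = request.args.get("q", "")
        con = sqlite3.connect("search.db")
        # FTS5 MATCH runs the full-text query; bm25() ranks the results
        # (lower scores are better, hence the ascending ORDER BY).
        rows = con.execute(
            "SELECT title, snippet(docs, 1, '<b>', '</b>', '...', 32) "
            "FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT 10",
            (q,),
        ).fetchall()
        con.close()
        return jsonify([{"title": t, "snippet": s} for t, s in rows])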
Still beta; the last release[0] was 0.8, on November 18, 2018.
It is also not clear to me what, if anything, already integrates with this, and therefore how much code I would need to write to try it out and compare it against Elasticsearch.
Well, for one, Java is ridiculously memory hungry. The resource cost of Elasticsearch is the #1 reason I'm not using it. I've seen a few projects that aimed to reimplement the Elasticsearch backend in Rust, but they were incomplete. That would be my ideal solution, personally.
Yes, exactly the same for me. I've tried every JVM configuration I know of, but nothing really worked to make it use a more reasonable amount of memory.
This is a limitation. For instance, if you have billions of docs you need 200 tiny servers and have to deal with the communication/administration/monitoring between all of them. Anything around 32 GB and you'll see perf drops every time the GC works too hard.
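For what it's worth, the usual knobs live in Elasticsearch's jvm.options file; the common advice is to pin min and max heap to the same value and stay under the roughly 32 GB threshold where the JVM loses compressed object pointers (the sizes below are just illustrative):

    # jvm.options: pin the heap, stay under the compressed-oops cutoff
    -Xms16g
    -Xmx16g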
The author can license his own source code however he likes; only when distributing the compiled binaries are you required to provide the whole source under libre (GPL-compatible) terms.
Can we attribute some of this renewed zeal in the search space to the creation of more approachable systems languages (e.g. Go and Rust)? Maybe I just haven't been watching the search space, but I feel it wasn't always this full of new projects putting up good numbers.
There are a lot of differences between Golang and Java. As much as I dislike writing Java when I have a choice, the JVM (with Java or whatever else on top) is a very capable tool... Could you explain what you mean by there being "no pros"?
Are you maybe trying to get at the difficulty of tuning the JVM?
While I definitely agree with you on the broad strokes of the differences between rust/c++/c and java/golang (representing languages without runtimes and those with them respectively), I'd say that golang is a bit more than a java alternative if we consider more than whether a runtime is included or not.
Of course, if the only consideration is whether a runtime is there or not, golang is identical to java but also identical to common lisp or maybe even interpreted languages like python.
I do want to point out that it's possible to write horribly buggy code in c++/c (less so in rust :), which can tank performance/efficiency when compared to a java/golang program. All things considered though, the ceiling on performance and efficiency is of course higher in manual memory management land.
As a native English speaker, the earlier phrase ("tantivy is to toshi as lucene is to elastic search") is easier for me to understand. I find your phrase a bit harder to parse; it looks like just the kind of reorganization other languages do, structure-wise. I don't know how to express it in proper grammatical terms, but the way the prepositions are swapped around makes it seem like native English words with a non-English structure.
It might have to do with the use of Analogy questions in the SAT (a standardized test all but required for high school students wanting to attend good colleges in America), though it looks like they've been removed?[0].
"_____ is to ___ as ____ is to ______" was the verbatim format of those test questions.
> Ranked search (so the most relevant documents are more likely to come near the top of the results list) with built-in support for multiple models from the Probabilistic, Divergence from Randomness, and Language Modelling families of weighting models. Custom user-supplied weighting models are also supported.
Could someone explain in a little more detail what these terms mean?
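In Xapian these show up as pluggable Weight subclasses you hand to the Enquire object; a hedged sketch with the Python bindings (the index path is hypothetical):

    import xapian

    db = xapian.Database("docs")
    enquire = xapian.Enquire(db)
    enquire.set_query(xapian.Query("search"))

    # Probabilistic family: classic BM25, the default ranking.
    enquire.set_weighting_scheme(xapian.BM25Weight())

    # Divergence from Randomness family, e.g. PL2.
    enquire.set_weighting_scheme(xapian.PL2Weight())

    # Language Modelling family.
    enquire.set_weighting_scheme(xapian.LMWeight())

"Ranked search" just means results come back ordered by such a relevance score rather than, say, by date.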
Nobody has. There's no visibility or community around it, which is a constant problem with Yahoo's open source projects. The only thing that really took off was Hadoop, but there was very little else around back then.
Vespa is also far heavier and more complex than any of the other search systems mentioned here.
It has been in production far longer than any other open source solution. It runs at scale across Yahoo, powering even ad systems, with live configuration pushes. Everything you need for a highly available product. It also has capabilities for more complex use cases around AI.
I really hope this is the long-awaited thing. From the number of commits, this project looks huge. Elasticsearch is a real liability for resource-limited deployments. I have seen some smaller projects written in Rust and Go, but they can't compete with Elasticsearch at any level; this looks different, and I hope it is.
I was genuinely wondering recently why the Hacker News site search didn't show a very recent article with very obvious keywords, one that Google, for example, put in first place in its results when I added "HN" to those two keywords in the search field. Is it a matter of indexing, i.e. the article was too recent and wasn't yet in the memory of HN's (Algolia-based, I believe) search tool, while Google simply got it into memory faster? Or is it purely a matter of the algorithms themselves?

The algorithms certainly seem to be at fault when the same happens searching for an old article that should long since be indexed. It seems unnatural. Algolia has a free tier for open-source projects, which is very nice of them, and thanks. I genuinely wonder whether those algorithms are really so complex as to justify the comparatively weaker behavior of HN's internal search.
https://hn.algolia.com/ only searches the text content of the submitted stories (i.e. the title and URL) and the comments.
It doesn't index the actual article content, nor does it take into account links across sites and content like Google does. Algolia (by its own description) is designed to search for things (like products in an ecommerce store) rather than text with concepts, relations, and entities in a knowledge graph like Google.
Maybe too late now, but I'd like to point out that I read the Elasticsearch, Algolia, and Xapiand product descriptions before I made my comment, so I know well what Algolia is for. In the example I gave I was searching for a headline, not for content inside the comments, and it was a headline that was on the first page of results at HN at that moment. So I think I phrased my comment politely toward Algolia, understanding that search has more moving parts than the pattern-matching algorithms of the logical core.
PS: I am sincerely grateful for the information in your comment about entities, concepts, and relations in a knowledge graph. This is exactly the kind of info I was looking for when I made the comment, so it was enlightening to learn, and thank you again.
Google search with site:news.ycombinator.com (and optionally a time limit, which I wish wasn't limited to past hour/day/week/month/year) seems consistently superior to what Algolia provides.
Algolia is a YC company, so I assume that's the main reason it's being used. But that it does such an awful job with such a simply structured site isn't compelling.
Hey latch, I've been working on the Algolia-based HN search and would love to improve it to provide you with a better search experience.
Do you have any specific improvements in mind? Would you mind sharing some non-working queries with us? We can follow up here, and you can also open issues on https://github.com/algolia/hn-search
I have just retested, and the problem I mentioned before doesn't exist anymore. It happened around Christmas time; back then, the search returned not very relevant results. Now it shows in 9th position what Google shows in 1st (exactly the article I was searching for back then, still the first match on Google), but that is a minor difference in the ordering of the first page, so I have to say the search is now working pretty much as expected. Sorry for not retesting before making that comment, and thanks for keeping the search good.
All right, I didn't want to say it, but there: russia hypersonic
Thanks for sharing; this is a good example where the first two hits on Algolia have "better" textual relevancy (the proximity between words is better, because of the "-based" word in the middle) but where the third hit is most probably the one we want to see first, because it has more than 100 points while the other two have 3.
Let me share that with the team and see whether we can try something.
https://xapian.org/history