Recoll: A desktop full-text search tool (lesbonscomptes.com)
111 points by karlicoss on Oct 21, 2021 | 30 comments



Many projects are really missing a section on their homepage that briefly compares them to similar or related projects. I assume there are many related projects here; most big desktop environments (commercial or open source) provide some similar service.

I wonder about e.g. Baloo (Baloo is the file indexing and file search framework for KDE Plasma, https://community.kde.org/Baloo). Is this similar? Or different? How is it different?

Just yesterday my dad complained about a couple of Baloo error messages and crash reports, a full disk, and an Xsession log full of Baloo messages. And he was not really aware of what Baloo actually is.

The problem is similar to the one here: https://askubuntu.com/questions/1214572/how-do-i-stop-and-re...

So I disabled the content indexing, as it was suggested there. The stack trace of the Baloo crashes also suggested that it was related to indexing the content of some documents.

I think this is a problem for many such file indexing services. Even Spotlight on macOS frequently crashes when it tries to index some documents. Does Recoll handle this differently?

If Recoll is in general better than Baloo, maybe it should replace Baloo? Or Baloo can be extended to use Recoll under the hood?


Baloo author here, though I haven't contributed code for over 4 years now.

Baloo used to use Xapian under the hood, like Recoll, when it initially launched, but in 2014 (I think?) I moved away from it to a custom DB built on top of LMDB. That resulted in a massive performance increase. One of the main criteria was for results to always take less than 'x milliseconds' (I can't remember exactly how much) so that KRunner (similar to OS X's Spotlight) would always feel snappy. Another was for Baloo to operate well with minimal memory and CPU, which wasn't the case with Recoll (things might have changed since then).
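
For illustration only (this is not Baloo's actual schema; the key layout and file name here are purely hypothetical), a term-to-document-id posting list stored in LMDB via the py-lmdb bindings looks roughly like this. A lookup is a single key read in a memory-mapped B+-tree, which is what keeps query latency low and predictable:

    import lmdb

    # Open (or create) the index environment; max_dbs allows named sub-databases.
    env = lmdb.open("index-sketch.lmdb", map_size=1 << 30, max_dbs=2)
    postings = env.open_db(b"postings")   # term -> comma-separated document ids

    # Indexing: write the posting list for a term.
    with env.begin(write=True, db=postings) as txn:
        txn.put(b"report", b"12,57,102")

    # Querying: one key lookup, no separate query machinery needed.
    with env.begin(db=postings) as txn:
        ids = txn.get(b"report")          # b"12,57,102" or None if unseen
        print(ids)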

I do remember ensuring that Baloo always did content indexing in another process, though I can't remember at what point it blacklists a file.
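
The general pattern looks roughly like the sketch below (this is not Baloo's code; the extractor command and blacklist file are placeholders). Running extraction in a child process with a timeout means a crash or hang on one bad document only costs that document:

    import subprocess
    from pathlib import Path

    BLACKLIST = Path.home() / ".cache" / "indexer-blacklist.txt"  # hypothetical location

    def extract_text(path: str, timeout: int = 30) -> str:
        """Extract text in a separate process; never let one file kill the indexer."""
        try:
            out = subprocess.run(["pdftotext", path, "-"],   # placeholder extractor
                                 capture_output=True, timeout=timeout, check=True)
            return out.stdout.decode("utf-8", "replace")
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
            # Remember the offender so future runs skip it instead of re-crashing.
            with BLACKLIST.open("a") as f:
                f.write(path + "\n")
            return ""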

Sorry, this isn't the best answer to your question, but most of the details are now quite fuzzy to me, and I haven't kept up with how the project has changed after I stopped contributing to KDE.

In retrospect, though, I definitely should have spent far, far more time ensuring that Baloo doesn't end up in situations like the one you described, and putting far stricter limits on the I/O and CPU it's allowed to use.


I've been running a setup with Recoll and https://github.com/ArchiveBox/ArchiveBox for a few months now [1]. Each morning archivebox scrapes all new links that I've put into my (text-based) notes and saves them as HTML singlefiles. Then Recoll indexes them.

It's very fast and ~4 lines of code. It's surprising how often I rediscover old blog posts & papers that are much better than what Google yields me.
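
Roughly like this (a sketch of the idea rather than the exact script from the linked post; the notes directory, archive directory, and the use of Markdown files are assumptions):

    import re
    import subprocess
    from pathlib import Path

    NOTES = Path.home() / "notes"        # assumed location of the text notes
    ARCHIVE = Path.home() / "archive"    # assumed ArchiveBox data directory
    URL_RE = re.compile(r"https?://\S+")

    # Collect every URL mentioned anywhere in the notes.
    urls = set()
    for note in NOTES.rglob("*.md"):
        urls.update(URL_RE.findall(note.read_text(errors="ignore")))

    # Feed them to ArchiveBox (it skips URLs it has already archived) ...
    subprocess.run(["archivebox", "add"], cwd=ARCHIVE,
                   input="\n".join(sorted(urls)).encode(), check=True)

    # ... then refresh the Recoll index that covers the archive directory.
    subprocess.run(["recollindex"], check=True)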

From my experience Recoll isn't very good at searching for aliases sadly.

[1] https://siboehm.com/articles/21/a-local-search-engine


Tangential but I really love the look of your site, especially the footnotes displayed adjacent to the text. Did you create the layout yourself? I have been looking for a theme like this for a paper discussion website.


Not my site, but it looks like a theme based on Tufte CSS: https://edwardtufte.github.io/tufte-css/


Sounds cool, thanks for the inspiration!


Other "personal search engines"

https://thesephist.com/posts/monocle/ written in the author's own programming language, takes multiple doc sources

https://apse.io interesting because it takes periodic screenshots and performs OCR

https://keminglabs.com/finda/ Rust, fast, seemingly abandoned


I would like to have a personal search engine for the web. It should only index my content on the web (social media, blog posts, comments on Hacker News, etc.)


Other personal search engines: locate and grep.


Recoll isn't just a personal search engine; it does all the things you mentioned and more. It is built on Xapian.


Does it OCR periodic screencaps? I.e. the subtitle text of something you're watching on netflix? There's no saved file to index.


If you have periodic screen caps saved to a folder then it can; however, it makes more sense to write your own plugin, which you can then share with Recoll.
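
As a rough sketch of what such a handler could look like (an assumption about the setup, not code shipped with Recoll): Recoll's simple handler protocol is "take a file name as the argument, print HTML on stdout", and pytesseract/Pillow are used here for the OCR step.

    #!/usr/bin/env python3
    # Hypothetical Recoll input handler: OCR a screenshot and emit its text as HTML.
    import html
    import sys

    import pytesseract               # requires the tesseract binary to be installed
    from PIL import Image

    def main() -> int:
        path = sys.argv[1]           # Recoll passes the file to index as the argument
        text = pytesseract.image_to_string(Image.open(path))
        print("<html><head><title></title></head><body><pre>")
        print(html.escape(text))
        print("</pre></body></html>")
        return 0

    if __name__ == "__main__":
        sys.exit(main())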


iOS now OCRs all photos, wish there was a way to bulk export that text alongside the photos, https://9to5mac.com/2021/09/21/how-iphone-live-text-ocr-work...


This uses Xapian. Some of the Linux mailing lists, like linux-netdev, use Xapian.

I use Xapian on the desktop but I only store text documents, so I have no need for Recoll. It is refreshing to use search that is not tuned for "popularity" and selling online ad services. If search is too easy, then IMO it is not really search. Search should require some skill. It should not be a game where a commercial entity, hoovering up user data and selling ad services, runs the search and tries to guess what the searcher is searching for.
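
For anyone curious what using Xapian directly on plain text looks like, here is a minimal sketch with the Python bindings (the file names and query string are made up; the calls follow Xapian's getting-started guide):

    import xapian

    # Index a few text documents.
    db = xapian.WritableDatabase("./textindex", xapian.DB_CREATE_OR_OPEN)
    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("en"))
    for path in ["notes.txt", "journal.txt"]:          # made-up file names
        text = open(path, errors="ignore").read()
        doc = xapian.Document()
        tg.set_document(doc)
        tg.index_text(text)
        doc.set_data(path)                             # store something to show in results
        db.add_document(doc)

    # Search them.
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    enquire = xapian.Enquire(db)
    enquire.set_query(qp.parse_query("meeting notes"))
    for match in enquire.get_mset(0, 10):
        print(match.rank + 1, match.document.get_data().decode())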

I never tried it, but this looks somewhat similar to Recoll:

https://github.com/kendling/pinot-search


As a consultant, I am routinely given hundreds of new files from new clients which are usually a mix of Word documents and PDFs of policies, procedures, and standards. Then, throughout the course of the engagement, I often have to pick out certain phrases or topics from these large datasets in order to reference them in a report.

Windows File Explorer can do this to some degree. I am not sure if File Explorer looks at the actual content of the documents or just titles. Either way, Recoll is orders of magnitude better and is absolutely worth it for my use case.


Recoll is not just a desktop search engine; it can also be used as an intranet search engine, either 'stand-alone' [1] or as an 'engine' in Searx [2]. I made the Searx engine a few years ago to extend Searx to intranet search. The Qt GUI is not used (and, on Debian, not even installed - only the 'recollcmd' package is needed).

[1] https://github.com/koniu/recoll-webui

[2] https://github.com/searx/searx/pull/1257
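
The server-side use is mostly a thin layer over the recoll Python module that ships with Recoll. A minimal sketch along the lines of what recoll-webui and the Searx engine do (not their actual code; the config directory path is an assumption):

    from recoll import recoll

    # Point at an existing Recoll configuration/index directory.
    db = recoll.connect(confdir="/home/user/.recoll")    # assumed path
    query = db.query()
    nres = query.execute("desktop search")               # any Recoll query string
    print(nres, "results")
    for doc in query.fetchmany(10):
        print(doc.url, "-", doc.title)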


Is there any desktop search engine that can do topic modelling and/or semantic clustering to allow you to discover related documents or do other kinds of browsing and discovery of large sets of documents where you may not immediately know the specific search terms to directly go to what you need?


DEVONthink's See Also feature does this, but it's Apple-only (and rather expensive).

https://www.devontechnologies.com/apps/devonthink

https://download.devontechnologies.com/download/devonthink/3...


Recoll and the underlying Xapian search engine are incredibly powerful. I am surprised that almost all server-side search uses such bloated frameworks. Xapian seemed really easy to theme, but their site and docs need some love. I also don't recall whether they have a way to parse JSON search hints, but if so I'm sure it would be perfect as a lightweight server-side search for static sites. Not that JS search isn't cool, but until browsers implement a native site search feature, it is nice to have a way to search from non-JS browsers and clients.


Recoll can be used server-side; I've been doing this for years now - no GUI needed nor even installed:

https://news.ycombinator.com/item?id=28954294


Wow, you also solved the JSON issue. I've been wanting to implement Xapian as a server-side search server for Hugo. I believe by just creating the template using Hugo, it should be pretty easy to have it search the JSON. I've got quite a bit of work to do though to figure it out.
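
A sketch of that idea, assuming Hugo is configured to emit an index.json with title/permalink/content fields (the field and file names are assumptions, and the Xapian calls mirror the earlier example):

    import json
    import xapian

    db = xapian.WritableDatabase("./site-index", xapian.DB_CREATE_OR_OPEN)
    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("en"))

    # index.json is whatever your Hugo output-format template emits.
    for page in json.load(open("index.json")):
        doc = xapian.Document()
        tg.set_document(doc)
        tg.index_text(page["title"], 5)      # give titles extra weight
        tg.index_text(page["content"])
        doc.set_data(json.dumps({"title": page["title"], "url": page["permalink"]}))
        db.add_document(doc)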


For Windows users, a shout out for Copernic Desktop Search (content is indexed so search results are instant and presented via an appropriate embedded view for some types; you can filter but it doesn't support grep), which I've used for many years.

It's not free, but one of the utils I wouldn't be without.

https://copernic.com/en/desktop/


recoll is a killer app. it should somehow integrate deeper with linux desktops and take personal information management to the next level.

it is incredibly useful in itself, has a CLI/scripting interface etc., but its homemade GUI is quirky and using its output for next steps in workflows can be cumbersome (may need coding). better integration with file managers and other desktop apps would help it really take off


I wouldn't mind paying a few bucks for the Windows version, but it is unclear to me if the Windows version can index PDFs.

It seems to depend on pdftotext on Linux. Can I get that for Windows? If so, is an executable somewhere in my path enough?


Yes, it can. I'm using it personally and can confirm that PDF indexing works without you having to install anything else yourself; it seems like every binary needed for it to run is bundled with the main app.


Pandoc is available for Windows, which can also extract text from PDF files. Not sure if Recoll supports it, though.

However, if you want to OCR a PDF which contains image-based 'pages', then you'll need something like Tesseract OCR.


I ran Recoll on a Linux laptop for a while. It worked pretty well and was very fast because of the index it created.


Doesn’t Spotlight on Mac do this already? What’s the added value?


Spotlight does indeed do this, so Mac users don't need this. Might help Windows and Linux users though. I use voidtools.com's Everything tool to search for filenames quickly on Windows, as it uses the NTFS index to do near-instant searches, since the Windows Indexing Service has been poor since XP... This might help if you are looking for content instead of just filenames.




