This is a great idea. Full-text search for "my knowledgebase" (books I've read, things I've written, etc.) is an area with potential that still seems unfulfilled.

Some ideas:
- Apache PDFBox https://pdfbox.apache.org/
  - command line: https://pdfbox.apache.org/commandline/#extractText
- XPDF has a command-line tool you can use on Windows
  - http://www.foolabs.com/xpdf/
  - pdftotext
- If you're going for accuracy, Tesseract is one of the most accurate OCR engines: https://code.google.com/p/tesseract-ocr/
- Apache Tika is often used the way you suggest: http://tika.apache.org/
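To make the pdftotext route concrete, here is a rough Python sketch of how I'd wire it into a naive search over a folder of PDFs. It assumes `pdftotext` is installed and on your PATH; the function names (`pdftotext_command`, `extract_text`, `search`) are made up for illustration, and the substring match is only a stand-in for a real index.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for pdftotext; '-' as the output file means stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path):
    """Return the text of one PDF, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None
    try:
        result = subprocess.run(
            pdftotext_command(pdf_path), capture_output=True, check=True
        )
    except subprocess.CalledProcessError:
        return None  # unreadable, encrypted, or nonexistent file
    return result.stdout.decode("utf-8", errors="replace")


def search(folder, needle):
    """Yield PDFs under folder whose extracted text contains needle (case-insensitive)."""
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        text = extract_text(pdf)
        if text and needle.lower() in text.lower():
            yield pdf
```

Something like `for hit in search("books", "inverted index"): print(hit)` would then list the matching files. Scanned PDFs without a text layer will come back empty this way, which is where an OCR pass with Tesseract would come in.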