Some ideas:

- Apache PDFBox: https://pdfbox.apache.org/
  - Command line: https://pdfbox.apache.org/commandline/#extractText
- XPDF has a command-line tool, pdftotext, that you can use on Windows: http://www.foolabs.com/xpdf/
- If you're going for accuracy, Tesseract is one of the most accurate OCR engines: https://code.google.com/p/tesseract-ocr/
- Apache Tika is often used the way you suggest: http://tika.apache.org/
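
If you end up scripting the pdftotext route, something like the following might be a starting point. It is only a rough sketch: it assumes pdftotext is installed and on your PATH, and the helper names (`build_pdftotext_cmd`, `extract_text`) are made up for illustration. The `-layout` and `-enc` options and the `-` output argument (write to stdout) come from pdftotext's own usage text.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str, keep_layout: bool = False) -> List[str]:
    """Build the argument list for pdftotext (hypothetical helper).

    The trailing "-" asks pdftotext to write the text to stdout
    instead of creating a .txt file next to the PDF.
    """
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # Ask pdftotext to keep the physical page layout as far as it can.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, keep_layout: bool = False) -> Optional[str]:
    """Return the extracted text, or None if pdftotext is not on PATH."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

You would call it as `extract_text("report.pdf")`, where `report.pdf` is a placeholder file name. Keep in mind that pdftotext only reads an existing text layer, so a scanned PDF will likely come back empty and would need an OCR tool such as Tesseract instead.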