- document conversion (pdftotext, Apache PDFBox, Tabula, etc.)
- OCR (Tesseract, pypdfocr, etc.)
- named-entity recognition (NER), i.e. finding and classifying entities in text (DBpedia Spotlight, Stanford NER via NLTK, spaCy)
- coreference resolution, dependency parsing (spaCy, SyntaxNet)
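
As a rough illustration of how the first two steps might be wired together, here is a minimal Python sketch that shells out to the `pdftotext` and `tesseract` command-line tools. The function names (`pdftotext_command`, `tesseract_command`, `extract_text`) and the routing rule (PDFs go to `pdftotext`, anything else is assumed to be an image for Tesseract) are assumptions made for this example, not part of any of the listed tools.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=True):
    """Build a pdftotext call that writes the extracted text to stdout ("-")."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # keep the physical page layout where possible
    cmd += [str(pdf_path), "-"]
    return cmd


def tesseract_command(image_path, lang="eng"):
    """Build a tesseract call that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def extract_text(path):
    """Hypothetical router: PDFs go to pdftotext, other files to tesseract."""
    path = Path(path)
    if path.suffix.lower() == ".pdf":
        cmd = pdftotext_command(path)
    else:
        cmd = tesseract_command(path)
    # Fail early with a readable message if the external tool is missing.
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} is not installed or not on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

The returned plain text could then be handed to one of the NER or parsing tools listed above. Scanned PDFs without a text layer would need to be rendered to images first, which this sketch does not attempt.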