This looks like a good solution for scrapers where "close enough" is good enough. If you need 100% accuracy, you can always fall back to using Scrapy directly and make your scraping logic as precise as you need. But in many cases you can live with some false positives, and then this tool looks like it will fit the bill.
Does anyone know which machine learning methods are state of the art for data extraction in general (invoices, images, documents, etc.), or where I could get an overview?
Research papers often use the phrase "wrapper induction" when discussing automatic data extraction. Once I discovered that phrase I found several papers. I'm on mobile so can't link any.
Thanks. Some great keywords to investigate. I'm mainly interested in two areas at the moment:
- invoices (I guess NER would be partially an option)
- web scraping (wrapper induction)
After running OCR on documents, I have mostly used regex to extract information from semi-structured documents. One example is invoices (invoice number, total amount, etc.); another is extracting product names, SKU numbers, etc. from various documents.
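For what it's worth, here is a minimal sketch of that kind of regex pass over OCR output, in Python. The field labels, the patterns, the function name `extract_invoice_fields`, and the sample text are all made up for illustration; real invoices vary a lot by vendor and locale, so treat this as a starting point rather than a general solution.

```python
import re
from decimal import Decimal

# Hypothetical patterns: they assume English labels such as "Invoice No",
# "Total Amount" and "SKU", and amounts written like 1,044.98.
INVOICE_NO = re.compile(
    r"\bInvoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)",
    re.IGNORECASE,
)
# \b keeps "Subtotal" from being picked up as the total.
TOTAL = re.compile(
    r"\bTotal(?:\s+Amount)?(?:\s+Due)?\s*[:\-]?\s*[$€£]?\s*([0-9][0-9,]*\.[0-9]{2})",
    re.IGNORECASE,
)
SKU = re.compile(r"\bSKU\s*[:#]?\s*([A-Z0-9\-]+)", re.IGNORECASE)


def extract_invoice_fields(text):
    """Return invoice number, total, and SKUs found in OCR text.

    Fields that are not found are returned as None (or an empty list for SKUs).
    """
    number = INVOICE_NO.search(text)
    total = TOTAL.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        # Strip thousands separators before converting to Decimal.
        "total": Decimal(total.group(1).replace(",", "")) if total else None,
        "skus": SKU.findall(text),
    }


# Made-up sample standing in for OCR output.
SAMPLE = """ACME Corp
Invoice No: INV-2023-0042
SKU: AB-1001  Widget  2 x 19.99
SKU: CD-2002  Gadget  1 x 5.00
Subtotal: 44.98
Total Amount: $1,044.98
"""

if __name__ == "__main__":
    print(extract_invoice_fields(SAMPLE))
```

Returning None for missing fields makes it easy to route the documents the regexes can't handle to manual review or to a different extraction method.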