Scrapely: The brains behind Portia, our visual web scraping tool (scrapinghub.com)
102 points by unsettledtck on July 7, 2016 | 14 comments



I wonder if this was submitted after reading today's submission (https://news.ycombinator.com/item?id=12047234) about (real) Portia spiders and googling the subject.


We did indeed model our Portia on the real spiders. We'd like to think our version is a wee bit cuter...


After finding this picture of a real Portia, I have to disagree.

https://c1.staticflickr.com/7/6096/6306406141_3b237e21ee_b.j...

Look at those big, soulful black eyes...


I didn't know there was a real "portia" spider... the things you learn every day


I wonder if that submission was in any way related to Peter Watts' novel "Echopraxia"...


This looks like a good solution for scrapers where "close enough" is good enough. If you need 100% accuracy, you can always fall back to using Scrapy directly and make your scraping logic as accurate as you need. But in many cases you can live with some false positives, and then this tool looks like it will fit the bill.


We've actually developed a way that you can convert Portia projects into Scrapy spiders: https://blog.scrapinghub.com/2016/06/29/introducing-portia2c...

and since this is all open source, here's a link to GitHub: https://github.com/scrapinghub/portia2code


Does anyone know which machine learning methods are state of the art for data extraction in general (invoices, images, documents, etc.), or where I could get an overview?


Research papers often use the phrase "wrapper induction" when discussing automatic data extraction. Once I discovered that phrase I found several papers. I'm on mobile so can't link any.
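For anyone unfamiliar with the term, here is a toy sketch of the idea in its simplest "left/right delimiter" form: given a few pages where the target value is already labeled, learn the text that always precedes and follows it, then reuse those delimiters on new pages. The function names and the sample pages are made up for illustration, and real systems (Scrapely included) are considerably more robust than this.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, value) training pairs.

    Each value must occur in its page; the first occurrence is used.
    """
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("examples share no usable delimiters")
    return left, right


def apply_wrapper(wrapper, page):
    """Return the text between the learned delimiters, or None if absent."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    if end == -1:
        return None
    return page[start:end]


# Hypothetical training pages with the price labeled by hand.
TRAINING = [
    ('<html><h1>Blue Widget</h1><span class="price">$19.99</span></html>', "19.99"),
    ('<html><h1>Red Lamp</h1><span class="price">$5.00</span></html>', "5.00"),
]
NEW_PAGE = '<html><h1>Green Gizmo</h1><span class="price">$42.50</span></html>'

wrapper = induce_wrapper(TRAINING)
price = apply_wrapper(wrapper, NEW_PAGE)
```

Note that the learned delimiters are only as general as the training examples: two product names that happen to end in the same letters would leak into the left delimiter, which is one reason practical tools need more examples or smarter matching.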


Here is a 2012 survey: http://arxiv.org/abs/1207.0246


We would need more context/information about your specific objectives.

- document conversion (pdftotext, Apache PDFBox, Tabula, etc.)

- OCR (tesseract, pypdfocr, etc.)

- named-entity recognition (NER), i.e. finding and classifying entities in text (DBpedia Spotlight, Stanford NER via NLTK, spaCy)

- coreference resolution, dependency parsing (spaCy, SyntaxNet)


Thanks. Some great keywords to investigate. I'm mainly interested in two areas at the moment: invoices (I guess NER would be partially an option) and web scraping (wrapper induction).


There are many methods for different tasks. What do you mean by 'data extraction', do you have some specific examples?


After the OCR of documents, I have used mostly regex to extract information from semi-structured documents. One example would be invoices (invoice number, total amount etc.), another would be to extract product names, SKU numbers etc. from various documents.
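As a rough illustration of that regex approach, here is a minimal sketch. The OCR text, the field labels, and the patterns are all invented for the example; real invoices vary a lot in layout, so each pattern would need tuning per document family.

```python
import re

# Hypothetical OCR output; labels and layout are assumptions.
OCR_TEXT = """
ACME Corp
Invoice No: INV-2016-0042
Date: 07/07/2016
SKU: AB-1234  Widget, large    2 x 19.99
Total Amount: $1,039.98
"""

# One pattern per field; the first capture group holds the value.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "total_amount": re.compile(
        r"Total(?:\s+Amount)?\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
    "sku": re.compile(r"SKU\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict of field name -> first match, or None if not found."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(OCR_TEXT)
```

Returning None for missing fields rather than raising makes it easy to flag documents that need manual review, which tends to matter when OCR noise breaks a label.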



