Scrapely: The brains behind Portia, our visual web scraping tool

merraksh · on July 7, 2016

I wonder if this was submitted after reading today's submission (https://news.ycombinator.com/item?id=12047234) about (real) Portia spiders and googling the subject.

unsettledtck · on July 7, 2016

We did indeed model our Portia on the real spiders. We'd like to think our version is a wee bit cuter...

david-given · on July 7, 2016

After finding this picture of a real Portia, I have to disagree.

https://c1.staticflickr.com/7/6096/6306406141_3b237e21ee_b.j...

Look at those big, soulful black eyes...

jsargiox · on July 7, 2016

I didn't know there was a real "portia" spider... the things you learn every day

pavel_lishin · on July 7, 2016

I wonder if that submission was in any way related to Peter Watts' novel "Echopraxia"...

rkrzr · on July 7, 2016

This looks like a good solution for scrapers where "close enough" is good enough. If you need 100% accuracy, you can always fall back to using Scrapy directly and make your scraping logic as accurate as you need. But in many cases you can live with some false positives, and then this tool looks like it will fit the bill.

unsettledtck · on July 7, 2016

We've actually developed a way that you can convert Portia projects into Scrapy spiders: https://blog.scrapinghub.com/2016/06/29/introducing-portia2c...

and since this is all open source, here's a link to GitHub: https://github.com/scrapinghub/portia2code

abc03 · on July 7, 2016

Does anyone know what methods are state of the art in machine learning for data extraction in general (invoices, images, documents, etc.), or where I could get an overview?

Buttons840 · on July 7, 2016

Research papers often use the phrase "wrapper induction" when discussing automatic data extraction. Once I discovered that phrase I found several papers. I'm on mobile so can't link any.
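For readers new to the term, here is a toy sketch of the idea behind wrapper induction: learn the text that surrounds one labeled value on an example page, then reuse those delimiters on a similarly structured page. The function names (`learn_wrapper`, `apply_wrapper`), the sample pages, and the fixed 12-character context are all assumptions made for illustration; the systems described in the papers learn far more robust rules from many examples.

```python
from typing import Optional, Tuple

Wrapper = Tuple[str, str]  # (left delimiter, right delimiter)


def learn_wrapper(page: str, value: str, context: int = 12) -> Wrapper:
    """Learn left/right delimiters from one labeled example.

    Hypothetical helper: it keeps `context` characters on each side of the
    first occurrence of `value` in `page`.
    """
    start = page.index(value)  # raises ValueError if the label is absent
    end = start + len(value)
    left = page[max(0, start - context):start]
    right = page[end:end + context]
    return left, right


def apply_wrapper(page: str, wrapper: Wrapper) -> Optional[str]:
    """Extract the text between the learned delimiters, or None if not found."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    if end == -1:
        return None
    return page[start:end]


# Made-up pages that share a template but carry different data.
train_page = '<html><body><h1>Widget</h1><span class="price">$10.00</span></body></html>'
new_page = '<html><body><h1>Gadget</h1><span class="price">$24.50</span></body></html>'

wrapper = learn_wrapper(train_page, "$10.00")
extracted = apply_wrapper(new_page, wrapper)
print(extracted)
```

A fixed-width context like this breaks as soon as the template changes near the value, which is roughly the "close enough" trade-off discussed earlier in the thread.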

rlndmx · on July 7, 2016

Here is a '12 survey: http://arxiv.org/abs/1207.0246

ahljoh · on July 8, 2016

We would need more context/information about your specific objectives.

- document conversion (pdftotext, Apache PDFBox, Tabula, etc.)

- OCR (Tesseract, pypdfocr, etc.)

- Named-Entity Recognition (NER), i.e. finding and recognizing entities in text (DBpedia Spotlight, Stanford NER via NLTK, spaCy)

- coreference resolution, dependency parsing (spaCy, SyntaxNet)

abc03 · on July 8, 2016

Thanks. Some great keywords to investigate. I'm mainly interested in two areas at the moment:

- invoices (I guess NER would partially be an option)

- web scraping (wrapper induction)

kmike84 · on July 7, 2016

There are many methods for different tasks. What do you mean by 'data extraction', do you have some specific examples?

abc03 · on July 7, 2016

After the OCR of documents, I have used mostly regex to extract information from semi-structured documents. One example would be invoices (invoice number, total amount etc.), another would be to extract product names, SKU numbers etc. from various documents.
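A minimal sketch of the regex approach described above, using only Python's standard `re` module. The patterns, the field names, and the sample invoice text are assumptions made for illustration; real OCR output is noisier and would probably need more tolerant patterns.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns: they assume labels such as "Invoice No:" and "Total amount:".
INVOICE_NO = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)",
    re.IGNORECASE,
)
# \b keeps "Subtotal" from matching as "total".
TOTAL = re.compile(
    r"\btotal(?:\s+amount)?(?:\s+due)?\s*[:\-]?\s*[$€£]?\s*([0-9]+(?:[.,][0-9]+)*)",
    re.IGNORECASE,
)


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first invoice number and total found, or None for missing fields."""
    number = INVOICE_NO.search(text)
    total = TOTAL.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "total": total.group(1) if total else None,
    }


# Made-up OCR output for demonstration.
sample = """ACME Corp
Invoice No: INV-2016-0042
Subtotal: $1,000.00
Total amount: $1,234.56
"""

fields = extract_fields(sample)
print(fields)
```

Returning `None` for a missing field makes it easy to route documents that the patterns could not handle to manual review.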