
The crawler is a part of SeekStorm.

"Well-defined": Just a guess: We are doing key text extraction, i.e. we try not to index boilerplate and menu items. As "well-defined" is within a short list item, it might have been accidentally skipped.

So, that is not yet perfect, and considering the diversity in web page structure it probably never will be. But we will try to improve.
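
To illustrate how a short list item can get dropped, here is a minimal sketch of a length and link-density heuristic. This is not SeekStorm's actual extractor; the function names, the `(text, link_text_len)` block format, and the thresholds are all assumptions made up for the example.

```python
# Hypothetical boilerplate filter: NOT the real SeekStorm extractor.
# Each block is (text, number of characters that sit inside links).

def looks_like_boilerplate(block_text, link_text_len=0,
                           min_words=4, max_link_density=0.5):
    """Treat very short or link-heavy blocks as menu/boilerplate text."""
    words = block_text.split()
    if len(words) < min_words:
        # Short blocks are usually menu entries, but a short list item
        # with real content is rejected by the same rule.
        return True
    link_density = link_text_len / max(len(block_text), 1)
    return link_density > max_link_density


def extract_key_text(blocks):
    """Keep only the blocks that do not look like boilerplate."""
    return [text for text, link_len in blocks
            if not looks_like_boilerplate(text, link_len)]


blocks = [
    ("Home", 4),                 # menu link
    ("Well-defined", 0),         # short list item with real content
    ("The crawler fetches pages and extracts the main article text.", 0),
]
kept = extract_key_text(blocks)
```

With these assumed thresholds the "Well-defined" item is filtered out together with the menu link, which is the kind of false positive described above.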




Yes, if the preview is the representation of what you have indexed, then half of that article is missing. You may have identified the weakest link in your stack: the crawler/extractor. That part is notoriously hard to do, and it would be good if you provided more detail, e.g. do you use a headless browser or a simple GET request, do you crawl PDFs, etc. The advanced stack on top of it is of little use if the data does not end up in the index in the first place. I hope you provide an update on this in the future. I'd probably sign up for a plan.


No, the preview is NOT the representation of what we have indexed. The preview is limited to about 200 words to be compliant with fair use legislation. Indexing is limited to 1 MB per document.
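
As a rough sketch of the difference between the two limits: the preview is cut by word count, while the indexed text is cut by size. The function names are hypothetical, and reading "1 MB" as 1,000,000 bytes and "about 200 words" as exactly 200 are assumptions for the example.

```python
# Hypothetical helpers: names and exact limits are assumptions.
MAX_PREVIEW_WORDS = 200        # "about 200 words" in the comment above
MAX_INDEX_BYTES = 1_000_000    # assuming 1 MB means 10^6 bytes


def make_preview(text, max_words=MAX_PREVIEW_WORDS):
    """Return at most max_words words, marking any truncation."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."


def text_to_index(raw, max_bytes=MAX_INDEX_BYTES):
    """Cut the raw document at max_bytes before decoding it."""
    # errors="ignore" drops a multi-byte character split by the cut.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


long_text = "word " * 300
preview = make_preview(long_text)
indexed = text_to_index(long_text.encode("utf-8"))
```

Here the preview holds only the first 200 of 300 words, while the full text still fits under the indexing limit, so a short preview says nothing about how much of the page was indexed.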



