Hacker News new | past | comments | ask | show | jobs | submit login

I have assumed for a while the only way to convert FB -> RSS would be to scrape the home page, but from what I recall the HTML & DOM is all kinds of messed up - intentionally obfuscated to prevent adblocking. From a quick look just now it does seem like it would be a nightmare to try to parse it as-is - and I would guess FB changes a lot of the output regularly anyway to defeat adblockers, making efforts to keep up pretty challenging.



It almost sounds like a problem best solved with OCR, rather than scraping per se. Build a simple model to recognize “posts” from screenshots, and output the rectangular viewport regions of their inner content; then build some GIS-like layered 2D interval tree of all the DOM regions, such that you could ask Puppeteer et al to filter for every DOM node with visibility overlap with that viewport region; extract every single Unicode grapheme-cluster within those nodes separately, annotated with its viewport XY position; and finally, use the same kind of model that lets PDF readers you highlight “text” (i.e. arbitrary bags of absolute-positioned graphemes) in PDFs, to “un-render” the DOM nodes’ bag of positioned graphemes back into a stream of space/line/paragraph-segmented text.


I wrote about trying to do this for Facebook page events as an example. Code sample included:

https://chrishardie.com/2019/10/unlocking-community-events-f...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: