Web scraping with Factor (re-factor.blogspot.com)
79 points by otoburb on April 22, 2014 | 24 comments



This didn't seem interesting to me, because there are so many ways to scrape a site, in every high-level language, that learning one more way to do it is going to bring diminishing returns...and I also didn't know that "Factor" was a language.

The OP's template could have some more info on what Factor is, but there are a few links, including this wiki for it: http://concatenative.org/wiki/view/Factor/Learning

I honestly didn't know what a "concatenative" language was until I saw that "Six programming paradigms that will change how you think about coding" post that fronted HN last week:

http://brikis98.blogspot.in/2014/04/six-programming-paradigm...

So this is just a long way of saying...after just learning about concatenative languages, I'm really interested in what that paradigm brings to a common task...maybe there aren't productivity gains, but I love learning different philosophies of coding, and thanks to the OP for showing one practical example.


I added some information about the Factor language to the blog template - great idea, thanks.

Concatenative languages are quite interesting, and I'd encourage you to try one out. You might find it helps your thinking about certain problems.
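
For anyone who hasn't met the idea before, here is a rough sketch of what "concatenative" means, written in Python rather than Factor. It is only a toy model: the names `push`, `dup`, `add`, `mul` and `run` are made up for illustration and are not Factor's implementation. Each word is a function from stack to stack, and joining two programs end to end composes them.

```python
# Toy model of a concatenative language (illustrative only, not Factor itself).
# Every word is a function from stack to stack; a program is a list of words,
# and concatenating two programs composes their functions.

def push(value):
    """Return a word that pushes a literal onto the stack."""
    def word(stack):
        return stack + [value]
    return word

def dup(stack):
    """( x -- x x )"""
    return stack + [stack[-1]]

def add(stack):
    """( a b -- a+b )"""
    return stack[:-2] + [stack[-2] + stack[-1]]

def mul(stack):
    """( a b -- a*b )"""
    return stack[:-2] + [stack[-2] * stack[-1]]

def run(program, stack=None):
    """Apply each word in order and return the final stack."""
    stack = list(stack or [])
    for word in program:
        stack = word(stack)
    return stack

# Roughly what a Factor-style definition `: square ( x -- y ) dup * ;` expresses.
square = [dup, mul]
three_plus_four = [push(3), push(4), add]

# Concatenating the two programs composes them: (3 + 4) squared.
print(run(three_plus_four + square))  # [49]
```

The point is the last line: there is no argument passing to wire up, so any two programs can be glued together just by putting one after the other.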


In exploring concatenative languages, be sure to investigate the Joy language.

https://web.archive.org/web/20111006171100/http://www.latrob...

...unfortunately, it appears that the webpage of Joy's author has been lost to the web, hence the link to the Internet Archive. And don't forget that there was a very mainstream concatenative language at one point in time: PostScript.


I do regular web scraping using PHP and curl. This seems like an interesting time to learn the Factor programming language.

@otoburb can you recommend a better guide for this?


The Factor documentation is hosted at http://docs.factorcode.org. Feel free to browse! It's the same set of docs that you also have access to locally when you download the Factor binaries[1] or pull/compile from GitHub[2].

The concatenative.org[3] wiki also has similar starting material and pointers.

Factor is a fun language. The blog that I linked to is written by one of the Factor contributors.

[1] http://factorcode.org/

[2] https://github.com/slavapestov/factor

[3] http://concatenative.org/wiki/view/Factor/Learning


Do you know about any "Factor for Forth programmers" tutorial? Factor is similar enough to irritate me with its "for beginners" materials, but different (and larger) enough to make normal manuals useless by themselves.

I'd especially appreciate a bottom-up write-up, starting with the stack and cells (which feel familiar), and then introducing the higher-level abstractions of Factor.


Better to clear up that implied misconception. Factor is much more like Lisp and Haskell (especially point-free style) than Forth. I guess "learn Lisp and/or Haskell and Factor won't seem so foreign" isn't terrific advice. But right now there aren't many newbie guides at all.


I dunno, syntactically it resembles Forth quite a bit, what with : ; for defining words and () for comments and all that. Anyway, I have no problem whatsoever with high-level abstractions in Factor, nor with its concatenative nature, nor with its macros and so on. I know all these features from other languages. What I want is just a description of how these high-level things map to assembly, I guess. For example, I just learned that: "Internally, a quotation is a pair, consisting of an array and a machine code entry point. The array stores the quotation's elements" - this is the kind of definition I want for all the abstractions in Factor. It's probably best to go through Slava Pestov's blog and pick up such scattered descriptions, but I'd really appreciate it if someone prepared a single article with all these definitions.
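
To make that quoted definition concrete, here is a loose Python sketch of "a pair of an array and an entry point". Everything in it is an assumption for illustration: `Quotation`, `compile_quotation`, `call` and the `words` table are hypothetical names, and a Python closure stands in for the machine code entry point that the real VM would hold.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Stack = List[object]

@dataclass
class Quotation:
    # The array half of the pair: the quotation's elements.
    elements: List[object]
    # Stand-in for the machine code entry point (assumption: a Python callable).
    entry_point: Optional[Callable[[Stack], Stack]] = None

def compile_quotation(quot: Quotation, words: Dict[str, Callable[[Stack], Stack]]) -> Quotation:
    """Hypothetical 'compiler': build the entry point once and cache it on the pair."""
    def entry(stack: Stack) -> Stack:
        stack = list(stack)
        for element in quot.elements:
            if isinstance(element, str) and element in words:
                stack = words[element](stack)   # a known word: execute it
            else:
                stack.append(element)           # anything else: push as a literal
        return stack
    quot.entry_point = entry
    return quot

def call(quot: Quotation, stack: Stack, words: Dict[str, Callable[[Stack], Stack]]) -> Stack:
    """Run a quotation, compiling it lazily on first use."""
    if quot.entry_point is None:
        compile_quotation(quot, words)
    return quot.entry_point(stack)

# A tiny, made-up word table for the example.
words = {
    "+": lambda s: s[:-2] + [s[-2] + s[-1]],
    "dup": lambda s: s + [s[-1]],
}

add_two = Quotation([2, "+"])        # roughly the quotation [ 2 + ]
print(call(add_two, [5], words))     # [7]
```

The array stays available as data (so code can inspect or build quotations), while the cached entry point is what actually runs. How Factor really generates and caches that code is not something this sketch claims to show.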


Note that the stack comments aren't actually comments in Factor - they're part of the function definition and are mandatory. The compiler will do a simple check to ensure that all of your stack inputs and outputs match up for each function call.
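
As a rough idea of what such a check involves, here is a simplified Python sketch that infers the net stack effect of a word body and compares it with the declared one. The `EFFECTS` table, `infer_effect` and `check_definition` are hypothetical names, and requiring an exact match is an assumption; Factor's actual stack checker also deals with quotations, branches and more.

```python
# Assumed stack effects for a few primitive words: (inputs, outputs).
EFFECTS = {
    "dup": (1, 2),
    "drop": (1, 0),
    "swap": (2, 2),
    "+": (2, 1),
    "*": (2, 1),
}

def infer_effect(body):
    """Return (inputs, outputs) for a sequence of words and integer literals."""
    needed = 0   # values the body consumes from the caller's stack
    height = 0   # values currently available to the body
    for word in body:
        if isinstance(word, int):
            ins, outs = 0, 1          # a literal just pushes itself
        else:
            ins, outs = EFFECTS[word]
        if height < ins:
            # Not enough values yet: the shortfall must come from the caller.
            needed += ins - height
            height = ins
        height += outs - ins
    return needed, height

def check_definition(name, declared, body):
    """Raise if the declared effect does not match the inferred one."""
    inferred = infer_effect(body)
    if inferred != declared:
        raise ValueError(f"{name}: declared {declared}, inferred {inferred}")
    return inferred

# Roughly the check behind `: square ( x -- y ) dup * ;`
print(check_definition("square", (1, 1), ["dup", "*"]))  # (1, 1)
```

Declaring `square` as taking one value and leaving two would make `check_definition` raise, which is the kind of mismatch the mandatory stack effects are meant to catch early.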


We made stack effects mandatory for most definitions, as their absence appeared to be an area of frustration for new Factor programmers.

However, we have a stack checker that still supports optional stack effects if you yearn for the good ol' days:

https://github.com/slavapestov/factor/issues/887


Factor has OpenGL built into its REPL, so it kind of gives you the mythical modern terminal from one of Gary Bernhardt's talks.


I forgot about that ... Factor impresses me a lot because there's so much packed into it. A one-man effort, it seems ... even more impressive.


Although it feels like a one-programmer effort, there have been a few of us contributing consistently over the years:

https://github.com/slavapestov/factor/graphs/contributors


Thanks for showing me this, I was greatly misguided.


Somewhat related to Factor: whatever happened to Slava Pestov?

I used to follow his blog about Factor with interest some years ago and then all of a sudden... he was gone.


He is working for Google. From the outside, it looks a lot like he disappeared from open source...


You can write an entire web scraper with just a URL using http://scrape.ly

With scrape.ly I can just do this to crawl the entire HackerNews site across pages, grab the URLs, and extract any data from the page it lands on without defining any fields (it discovers them on its own), so it doesn't require you to 'relabel' fields when the site changes layouts. It also generates new IP addresses on the fly so you don't get stuck, and launches multiple threads for you to speed up the process. It works fully with AJAX sites and single-page apps. Flash support is coming too.

    http://scrape.ly/s/{https://news.ycombinator.com/}
    {next:More}{Space Monkey dumps Python for Go}*{fields:'Auto'}

Honest question (I don't mind downvotes if you disagree), but why would you want to waste time writing web scrapers, keeping them running, and fixing the code? Multiply it by 100 or 1000 different websites and it becomes a full-time job. For me, I want to get the data I need with the least possible overhead and as soon as possible, and I don't really want to be bothered with setting up environments and hosting for it to run, or fixing bugs when sites change layout.


This is not a post about web scraping. It's a post about doing something in Factor.


    Web scraping with Factor


Are you familiar with the idea of working through common problems for the sake of pedagogy? For example, someone who wants to demonstrate how a particular programming language can be used might start a blog, and in that blog said person might post articles demonstrating how you could attack a particular problem in that language.

Your criticism of this post comes across as tone-deaf. You might as well have written the editors of Beautiful Code to lecture them about how the chapter on quicksort is horribly misguided and that everything a good software craftsman should ever care to know on the subject can be found at http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.ht...


Honestly, I meant no harm. I saw that we were talking about web scraping in other languages like PHP and Python, and I wanted to add to the idea above that Factor doesn't really provide any more value than an implementation of the job in another language would. They all share the same overhead associated with web scraping, which rests on the shoulders of the developer. All in all, I wanted to highlight that one shouldn't put so much effort into creating web scrapers, and suggested a different tool that is specialized for the same job mentioned in the article.


Or if you're a Python enthusiast, then shameless self link: http://jakeaustwick.me/python-web-scraping-resource/


Pretty good, but that's an awful lot of reading and a lot of work just to grab some data from a simple website.




