gavino's comments

Sure! The backend is actually pretty straightforward: it's a Next.js app deployed on Now, with a few added endpoints to handle the incoming GraphQL queries.

Then, to turn the query into digestible output, I used the GraphQL schema builder, which accepts HTML nodes from the requested page and grabs the right variables.


I remember seeing GDOM a while back when I first started this project, but forgot to write it down as a source of inspiration. I'm gonna add all of these as alternatives, because they're all great :D


So happy to read that :) (and so glad it's served as a source of inspiration for your project, keep up the good work!)


Troy and Abed scraping websites!! :D


in the morning!


Nah, I don't really plan on turning it into a company. I'd gladly accept any PR to swap out cheerio, I haven't touched that part in close to a year :D


That link doesn't work for me, but this one does https://www.textile.photos/ :)


thanks! updated in comment too


Location: Reno, NV

Remote: Yes (tentatively)

Willing to relocate: Yes

Technologies: javascript, react.js, angular, node.js, rails, html, css (scss, less), react native

Résumé/CV: On request

Email: gavin@gavin.codes

Portfolio: http://www.gavin.codes

Github: http://www.github.com/gavindinubilo


It doesn't currently do that, but I think it'd be an interesting challenge to try. It's definitely possible.


Yep. Have a look at phantomjs [1], or other phantomjs wrappers like casperjs [2].

[1] https://www.npmjs.com/package/phantomjs

[2] https://www.npmjs.com/package/casperjs


> interesting challenge

Understatement of the year.

You'd need to either re-implement an entire browser stack or run a headless version of Gecko or WebKit server-side.

The former entails millions of man-hours of work. The latter opens up your server to all sorts of exploits. Overall a really bad idea.

Besides, single page applications are the worst junk in the entire Web 2.0 cesspool. If you really need to scrape them, they usually come with their own JSON API which you can just piggyback.


> entails millions of man-hours of work

Overstatement of the year.

Why on Earth would the OP start from scratch? Besides, though not a solo and OSS effort, Apifier does this; certainly without "millions" of hours having been spent on it.


I had been trying to figure out what was causing this issue, so thanks for pointing it out. I've pushed a quick fix: the response now tells you whether the JSON is invalid or a CSS selector wasn't found on the provided URL.


Ah, yep, you're right, forgot to change the URL. Updated now. Thanks for letting me know.


And to get the HN post titles:

curl -d url=https://news.ycombinator.com/ -d json_data='{"title":[{"elem":".title > a","value":"text"}]}' http://www.jamapi.xyz/

This is cool :)

EDIT:

Incidentally, you don't really need to have that "index" key inside the values of an array, because in an array the order is preserved anyway. Unless I've misunderstood what it means?


Titles and links grouped together:

curl -X POST http://www.jamapi.xyz/ -d url=http://news.ycombinator.com -d json_data='{"title": "title","paragraphs": [{ "elem": "td.title a", "value": "text", "location": "href"}]}'

Use the http URL to call www.jamapi.xyz, because calling it over https gives me an error: SSL_ERROR_BAD_CERT_DOMAIN
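For anyone who would rather script this than use curl, here is a rough Python sketch of the same request. It only builds the form-encoded POST and does not send it. The helper name `build_jam_request` is made up for illustration, and the field names (`url`, `json_data`) and the endpoint are taken from the curl commands in this thread, so treat it as an assumption that the service still accepts them.

```python
import json
from urllib.parse import urlencode, parse_qs
from urllib.request import Request


def build_jam_request(page_url, selectors, endpoint="http://www.jamapi.xyz/"):
    """Build (but do not send) a form-encoded POST like the curl examples.

    Hypothetical helper: the field names mirror the curl -d arguments above.
    """
    body = urlencode({
        "url": page_url,
        "json_data": json.dumps(selectors),
    }).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(endpoint, data=body, headers=headers, method="POST")


# Same selector description as the "titles and links" curl command.
selectors = {
    "title": "title",
    "paragraphs": [
        {"elem": "td.title a", "value": "text", "location": "href"}
    ],
}

request = build_jam_request("http://news.ycombinator.com", selectors)

# Decode the body again to check what would go over the wire.
fields = parse_qs(request.data.decode("utf-8"))
print(fields["url"][0])
print(fields["json_data"][0])
```

To actually send it you would pass `request` to `urllib.request.urlopen`, using the http endpoint given the certificate error mentioned above.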


Regarding the "index" key, there are some JSON parsers for languages like Swift that will rearrange your JSON. By adding the index key, you'll still be able to sort after parsing.

Also, thanks, it's really cool to see people liking this :)
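As a small illustration of the "index" idea, here is a Python sketch that restores the original order after parsing. The response shape (a "title" array of objects with "index" and "value" keys) is an assumption made up for this example, not the documented output of the API, and the items are shuffled on purpose to simulate a parser that rearranged them.

```python
import json

# Hypothetical response body, deliberately out of order.
raw = (
    '{"title": ['
    '{"index": 2, "value": "Third"}, '
    '{"index": 0, "value": "First"}, '
    '{"index": 1, "value": "Second"}'
    ']}'
)

parsed = json.loads(raw)

# Sort on the explicit "index" key rather than trusting array position.
ordered = sorted(parsed["title"], key=lambda item: item["index"])
values = [item["value"] for item in ordered]

print(values)  # ['First', 'Second', 'Third']
```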


They might rearrange keys in a JSON object, but the elements of an array should keep their order, according to the spec [1]. If Swift does this (which I can't really check), then it would be a bug.

[1] http://www.json.org/: An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).


Yes, the order of elements in an array should always be preserved. For example, we might be expecting the first element to be a name, the second to be a date of birth, etc. We should use an object for that, but that's for reasons of readability, extensibility, etc. rather than array semantics being unsuitable.

Also, jq has a `--sort-keys` option which tries to make the output as reproducible/canonical as possible. From the manual:

> The keys are sorted "alphabetically", by unicode codepoint order. This is not an order that makes particular sense in any particular language, but you can count on it being the same for any two objects with the same set of keys, regardless of locale settings.

It would be strange for a JSON tool to go to such lengths to normalise data, if array order were unpredictable.
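The same distinction can be sketched in Python, whose standard `json` module has a `sort_keys` flag. This is only an analogue of jq's `--sort-keys`, not jq itself, and the sample documents are made up. Object keys get normalised, while array elements stay exactly where they were.

```python
import json

# Two documents with the same content but different key order.
doc_a = '{"name": "Ada", "tags": ["z", "a", "m"]}'
doc_b = '{"tags": ["z", "a", "m"], "name": "Ada"}'

# sort_keys=True orders object keys; it never reorders array elements.
canon_a = json.dumps(json.loads(doc_a), sort_keys=True)
canon_b = json.dumps(json.loads(doc_b), sort_keys=True)

print(canon_a)             # {"name": "Ada", "tags": ["z", "a", "m"]}
print(canon_a == canon_b)  # True

# The array round-trips in its original order.
tags = json.loads(canon_a)["tags"]
print(tags)                # ['z', 'a', 'm']
```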

