I recently realized that Deno is actually an amazing tool for scripts and web scraping. It runs TS natively, so I can take advantage of TypeScript's amazing type system when interacting with complex APIs. Being able to import libraries from a URL also makes it a breeze to use any library (I've yet to run into a node module that Deno's std/node library wasn't able to polyfill).
I can basically write a .ts file anywhere and run it from the CLI with deno. I definitely prefer it over Python at this point.
I use Deno for linting; having a single binary there instead of pulling a linter via npm is a different level of simple. I like your idea, and though I rarely use JS for scripts, I'm sure it'll come in handy.
I haven't done anything like that yet, but I'm sure there would be. Most web-scraping libraries have good `@types` definitions now, though, which really comes in handy for at least using the tools.
You can probably try it in a few minutes if you already have Deno installed. Make a new file; here's a quick example script:
```ts
/* run with `deno run --allow-net demo.ts` */
import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';

const API_URL = 'https://subsidytracker.goodjobsfirst.org/parent/tesla-inc';

const main = async (args: string[]) => {
  const resp = await fetch(API_URL);
  const text = await resp.text();
  const html = new DOMParser().parseFromString(text, 'text/html');
  console.log(
    'Tesla subsidies:',
    [
      ...html.querySelectorAll('table:first-of-type > tbody:first-of-type > tr')
    ].map(tr => {
      const [name, value] = [...tr.querySelectorAll('td')];
      return {
        name: name.textContent,
        value: value.textContent
      };
    })
  );
};

main(Deno.args);
```
I'd recommend Lambda Soup (https://aantron.github.io/lambdasoup) over BeautifulSoup :) A compiled binary and a language with types help me write parsing/scraping scripts.
From the looks of it, htmlq doesn’t have anything comparable to pup’s JSON output. That JSON is cumbersome to work with, but combined with jq it allows one to extend the shell hackery just a little bit beyond what CSS can do.
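For illustration, a rough sketch of that workflow against the HN front page; the selector and the exact keys in pup's `json{}` output are assumptions, so the jq paths may need adjusting:

```sh
# pup's json{} display filter emits the matched nodes as JSON,
# which jq can then reshape into pairings that CSS alone can't express
# (assumed keys on each matched node: "text" and "href")
curl -s https://news.ycombinator.com/ \
  | pup 'span.titleline > a json{}' \
  | jq -r '.[] | "\(.text) -> \(.href)"'
```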
Hey, I'm the author of fq. It can convert to/from HTML and JSON (in two different modes): use `-d html`, or the `fromhtml`, `fromxml` and `toxml` functions. For example:
```
$ curl -s https://news.ycombinator.com/ | fq -r -d html 'grep_by(."@class"=="titleline").a."#text"'
Inkbase: Programmable Ink
New details on commercial spyware vendor Variston
How We Built Fly Postgres
...
```

```
$ curl -s https://news.ycombinator.com/ | fq -r -d html '{hosts: {host: [grep_by(."@class"=="titleline").a."@href" | fromurl.host]}} | toxml({indent:2})'
<hosts>
  <host>www.inkandswitch.com</host>
  <host>blog.google</host>
  <host>fly.io</host>
  ...
</hosts>
```
Like others have said, pup seems to be abandoned. It's surprising that there isn't a well-supported standard tool for working with HTML in the terminal, like jq for JSON.
Last time I looked at using pup or similar, I wanted to extract two values for each element. For example, let's say I have the following HTML:
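(The example markup is missing above; as a hypothetical stand-in, imagine a list where each item carries both a name and a link. Here's a sketch of how pup's `json{}` output plus jq could pair the two values per element; the nested `children`/`text`/`href` keys are assumptions about the shape of pup's JSON, so check the real output first:)

```sh
# hypothetical markup; goal: one {name, href} object per <li>
echo '<ul>
  <li><span class="name">Foo</span><a href="/foo">details</a></li>
  <li><span class="name">Bar</span><a href="/bar">details</a></li>
</ul>' \
  | pup 'li json{}' \
  | jq '[.[] | {name: .children[0].text, href: .children[1].href}]'
```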
Is there a tool like this that can modify part of the HTML without changing the formatting or inserting new tags unnecessarily? I know it's a difficult problem given the complexity of HTML parsing, but I'd like to be able to work with parts of pages programmatically without messing up the hand-written formatting in other parts.
This is the best approach if they don't need to consider client-side JS mutations. Not sure why I interpreted it that way in my original reply. Thanks!
Anything out there that lets one programmatically query the style on an element? I get a DOM object from lxml.html.fromstring(), but I haven't figured out how to get element styles, so that I can display something in bold or apply a terminal color.
If those styles are defined inline it should be straightforward, but most likely there are external styles being applied, which might be a nontrivial problem to solve.
However, we do have document.defaultView.getComputedStyle, which will probably cover most of your needs. Not sure if that API is available with this tool, though.
Good luck with that - you'd need to have a fully standards-compliant HTML/CSS/JS engine in order to get the same output, and at that point that's literally just a browser (and would require a massive investment of time for maintenance).
A driver for an existing browser is the only reasonable option for the foreseeable future, with how complex modern web development has become.
You need something to replace the curl part of this, not pup. You can pipe JS into Chrome via the CLI with the --repl flag, dump the full DOM, and then pipe that into pup.
You'll be stuck with GET requests only though, and very simple ones, unless you get creative with the JS you pipe in.
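For the simple GET case, a related trick is headless Chrome's --dump-dom, which prints the DOM after scripts have run; the binary name and available flags vary by platform and Chrome version, so treat this as a sketch:

```sh
# render the page in a headless browser, then query the post-JS DOM with pup
# (the binary may be google-chrome, chromium, chrome, etc. on your system)
google-chrome --headless --disable-gpu --dump-dom 'https://example.com/' \
  | pup 'title text{}'
```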
You can use XSLT for this kind of operation (https://en.wikipedia.org/wiki/XSLT), but it is often overkill. I would love a language that makes it easy to write transductions of HTML documents, but I think combining simplicity and expressiveness for this kind of task is a bit painful. I usually end up using pandoc filters or a dedicated JavaScript script that transforms the DOM and outputs a new HTML file.
Pup – Like Jq, but for HTML - https://news.ycombinator.com/item?id=24797697 - Oct 2020 (2 comments)
Show HN: Pup – A command-line HTML parser - https://news.ycombinator.com/item?id=8312249 - Sept 2014 (27 comments)
Random bit of history: that Show HN was a very early choice for what is now called the second-chance pool:
Ask HN: Why did three HN stories jump 100 ranking points in 5 mins? - https://news.ycombinator.com/item?id=8313505 - Sept 2014 (6 comments)