Hacker News
Pup: Parsing HTML at the command line (github.com/ericchiang)
185 points by tosh on Nov 30, 2022 | 39 comments



Related:

Pup – Like Jq, but for HTML - https://news.ycombinator.com/item?id=24797697 - Oct 2020 (2 comments)

Show HN: Pup – A command-line HTML parser - https://news.ycombinator.com/item?id=8312249 - Sept 2014 (27 comments)

Random bit of history: that Show HN was a very early choice for what is now called the second-chance pool:

Ask HN: Why did three HN stories jump 100 ranking points in 5 mins? - https://news.ycombinator.com/item?id=8313505 - Sept 2014 (6 comments)


I use this extensively in bash scripts where I need to scrape HTML reliably and consistently. Cannot recommend it highly enough.

In fact, all of the source data for a project of mine, Baytyab[1] (couplet-finder), was scraped using bash + pup.

[1]: https://baytyab.com
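For anyone who hasn't tried it, the basic pattern is just curl piped into a CSS selector plus one of pup's display functions. A minimal sketch (the URL and selectors here are made up for illustration):

  # pull every link target out of a page
  curl -s https://example.com | pup 'a attr{href}'

  # or just the visible text of the headings
  curl -s https://example.com | pup 'h2 text{}'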


Would you recommend it over BeautifulSoup?


I'd pick any CLI I can use directly in bash over having to write Python to use a lib, any day of the week.


I had a recent realization that Deno is actually an amazing tool for scripts and web scraping. It runs TS natively, so I can take advantage of TypeScript's amazing type system when interacting with complex APIs. Being able to import libraries from a URL also makes it a breeze to use any library (I've yet to run into a node module that Deno's std/node library wasn't able to polyfill).

I can basically write a .ts file anywhere and run it from the CLI with deno. Definitely prefer it over Python at this point.


I use Deno for linting; having a binary there instead of using NPM to get a linter is a different level of simple. I like your idea; though I rarely use JS for scripts, I'm sure it'll come in handy.
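For reference, the linter is just a subcommand of the same deno binary (main.ts here is a made-up file name):

  deno lint            # lint everything under the current directory
  deno lint main.ts    # or a single file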


Is there any situation where you end up using the type system for interacting with APIs but also have to webscrape on the same service?

I am interested in hearing more about this as maybe I should switch from Node to Deno for a scraping project I have.


I haven't done something like that yet, but I'm sure there would be. Most web-scraping libraries have good `@types` definitions now, though, which really comes in handy for at least using the tools.

You can probably try it in a few minutes if you already have Deno installed. Make a new file. Here's a quick example script:

```ts

  /* run with `deno run --allow-net demo.ts` */

  import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';

  const API_URL = 'https://subsidytracker.goodjobsfirst.org/parent/tesla-inc';

  const main = async (args: string[]) => {
    const resp = await fetch(API_URL);
    const text = await resp.text();
    // parseFromString can return null, so assert it succeeded before querying
    const html = new DOMParser().parseFromString(text, 'text/html')!;

    console.log(
      'Tesla subsidies:',
      [
        ...html.querySelectorAll('table:first-of-type > tbody:first-of-type > tr')
      ].map(tr => {
        const [name, value] = [...tr.querySelectorAll('td')];

        return {
          name: name.textContent,
          value: value.textContent
        };
      })
    );
  };

  main(Deno.args);
```

Save that file and run:

  deno run --allow-net demo.ts


Lambda Soup (https://aantron.github.io/lambdasoup) is my recommendation over BeautifulSoup :) A compiled binary and a language with types help me write parsing/scraping scripts.


It looks like the project became inactive for a bit and there are alternatives such as htmlq, etc. https://github.com/ericchiang/pup/issues/150
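Going from memory of its README, htmlq usage looks very close to pup; a rough sketch (the URL is illustrative, and the flag names may have changed since I last looked):

  # print the matching elements
  curl -s https://example.com | htmlq 'a'

  # print only an attribute, or only the text content
  curl -s https://example.com | htmlq --attribute href 'a'
  curl -s https://example.com | htmlq --text 'h1'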


From the looks of it, htmlq doesn’t have anything comparable to pup’s JSON output. That JSON is cumbersome to work with, but combined with jq it allows one to extend the shell hackery just a little bit beyond what CSS can do.
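Roughly, the combination looks like this (a sketch from memory of pup's JSON shape: each matched element becomes an object carrying its tag, attributes, text, and children, so the exact jq path may need adjusting):

  curl -s https://news.ycombinator.com/ \
    | pup '.titleline a json{}' \
    | jq -r '.[].text'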


Hey, I'm the author of fq. It can convert to/from HTML and JSON (in two different modes). Use -d html, or the fromhtml, fromxml and toxml functions. Example:

  $ curl -s https://news.ycombinator.com/ | fq -r -d html 'grep_by(."@class"=="titleline").a."#text"'
  Inkbase: Programmable Ink
  New details on commercial spyware vendor Variston
  How We Built Fly Postgres
  ...
  $ curl -s https://news.ycombinator.com/ | fq -r -d html '{hosts: {host: [grep_by(."@class"=="titleline").a."@href" | fromurl.host]}} | toxml({indent:2})'
  <hosts>
    <host>www.inkandswitch.com</host>
    <host>blog.google</host>
    <host>fly.io</host>
    ...
  </hosts>
See https://github.com/wader/fq/blob/master/doc/formats.md#xml and https://github.com/wader/fq/blob/master/doc/formats.md#html for examples and documentation.



Like others have said, pup seems to be abandoned. It's surprising that there isn't a well-supported standard tool for working with HTML in the terminal, like jq for JSON.

Last time I looked at using pup or similar, I wanted to extract two values for each element. For example, let's say I have the following HTML:

  <div class="image">
    <p>Sunset in Hawaii</p>
    <img src="../randomstring123.jpg">
  </div>
  <div class="image">
   ...
Now I'd like to get both the image description and the source for each similar image in the page, preferably so I can pipe it to curl and do

  curl -o "$description.jpg" "$url"
I couldn't find an easy way of doing it, so I used Python instead.


Is there a tool like this that can modify part of the HTML without changing the formatting or inserting new tags unnecessarily? I know it's a difficult problem given the complexity of HTML parsing, but I'd like to be able to work with parts of pages programmatically without messing up the hand-written formatting in other parts.


Probably not without constructing a DOM. Which nowadays, in practice, means using Blink or WebKit.


https://github.com/cloudflare/lol-html is an HTML rewriter that explicitly does not construct a DOM or an AST, but streams nodes through ASAP.

https://blog.cloudflare.com/html-parsing-1/

https://blog.cloudflare.com/html-parsing-2/


This is the best approach if they don't need to consider client-side JS mutations. Not sure why I interpreted it that way in my original reply. Thanks!


I'd go so far as to assume it's an AI problem.


Enjoying pup recently myself. Used it for a little demo/post on shell packaging approaches in Nix this summer https://t-ravis.com/post/shell/no_look_no_leap_shell_with_ni...


Anything out there that lets one programmatically query the style on an element? I get a DOM object from lxml.html.fromstring(), but haven't figured out how to get element styles, so I can display something in bold or apply a terminal color.


If those styles are defined inline, it should be straightforward, but most likely there are external styles being applied, which might be a nontrivial problem to solve.

However, we do have document.defaultView.getComputedStyle, which will probably cover most of your needs. Not sure if that API is available with this tool, though.


I'm looking for something like this but for scraping SPAs and JS-rich web content. A single static binary like this, not ChromeDriver or Selenium, etc.


Good luck with that - you'd need to have a fully standards-compliant HTML/CSS/JS engine in order to get the same output, and at that point that's literally just a browser (and would require a massive investment of time for maintenance).

A driver for an existing browser is the only reasonable option for the foreseeable future, with how complex modern web development has become.


You need something to replace the curl part of this, not pup. You can pipe JS into Chrome via the CLI with the --repl flag, dump the full DOM, and then pipe that into pup.

You'll be stuck with GET requests only though, and very simple ones, unless you get creative with the JS you pipe in.
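If an interactive REPL isn't needed and the goal is just the post-JS DOM, headless Chrome's --dump-dom flag is enough to feed pup; a sketch (the binary name varies by platform, and the URL and selector are illustrative):

  # render the page, client-side JS included, dump the resulting DOM,
  # then query it with pup as usual
  chromium --headless --dump-dom 'https://example.com' \
    | pup 'title text{}'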


Is there a tool to transform HTML?

Simple example: select all img tags without alt text and insert given text into the alt attribute. Or change the domain on every a href.


You may use XSLT for this kind of operation (https://en.wikipedia.org/wiki/XSLT), but it is often overkill. I would love a language that makes it easy to write transductions of HTML documents, but I think combining simplicity and expressiveness for this kind of task is a bit painful. I usually end up using pandoc filters or a dedicated JavaScript script that transforms the DOM and outputs a new HTML file.
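As a concrete sketch of the XSLT route for the img/alt example above (the file name and placeholder text are made up, and I haven't tested this exact stylesheet): an identity transform that copies everything through, plus one override for alt-less images.

  <!-- add-alt.xsl -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html"/>

    <!-- identity template: copy everything through unchanged -->
    <xsl:template match="@*|node()">
      <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>

    <!-- img elements with no alt attribute get a placeholder one -->
    <xsl:template match="img[not(@alt)]">
      <xsl:copy>
        <xsl:apply-templates select="@*"/>
        <xsl:attribute name="alt">decorative image</xsl:attribute>
      </xsl:copy>
    </xsl:template>
  </xsl:stylesheet>

Run it through libxml2's HTML parser with xsltproc:

  xsltproc --html add-alt.xsl page.html > out.html

Note that this reserializes the document, so it runs into exactly the formatting-preservation problem discussed upthread.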


While this is nice, it's three years old.

Direct installation of brew scripts isn't supported anymore. `go get` installs aren't either.

It needs an update.


It may need an update, but not for installation:

  go install github.com/ericchiang/pup@latest


You could probably sprinkle a few more uses of cat into your command lines, but otherwise it's fine.


Will check it out, but would have preferred XPath selectors instead of CSS.


Why? CSS selectors are the normal web developer way to select content from a document. Even JavaScript adopted the approach.


XPath supports more complex queries. In JavaScript, XPath is available as document.evaluate.


xmllint can do that:

  curl example.org | xmllint --html --xpath '//some/xpath/selector' -


Only if it's good old HTML 4. libxml2's parser doesn't grok HTML 5.


I'm using xmlstarlet in Alpine as a bare-minimum way to scrape a webpage in a CI pipeline.


This has support for both XPath and CSS selectors: https://github.com/ludovicianul/hq





