Hacker News
Pup: Parsing HTML at the command line (github.com/ericchiang)
185 points by tosh on Nov 30, 2022 | 39 comments



Related:

Pup – Like Jq, but for HTML - https://news.ycombinator.com/item?id=24797697 - Oct 2020 (2 comments)

Show HN: Pup – A command-line HTML parser - https://news.ycombinator.com/item?id=8312249 - Sept 2014 (27 comments)

Random bit of history: that Show HN was a very early choice for what is now called the second-chance pool:

Ask HN: Why did three HN stories jump 100 ranking points in 5 mins? - https://news.ycombinator.com/item?id=8313505 - Sept 2014 (6 comments)


I use this extensively in bash scripts where I need to scrape HTML reliably and consistently. Cannot recommend it highly enough.

In fact, all of the source data for a project of mine, Baytyab[1] (couplet-finder), was scraped using bash + pup.

[1]: https://baytyab.com
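For anyone who hasn't tried it, the basic pattern is just curl piped into a CSS selector plus one of pup's display functions. A minimal sketch (the URL and selectors here are made up for illustration):

  # pull every link target out of a page
  curl -s https://example.com | pup 'a attr{href}'

  # or just the visible text of the headings
  curl -s https://example.com | pup 'h2 text{}'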


Would you recommend it over BeautifulSoup?


I'd pick any CLI I can use directly in bash over having to write Python to use a lib, any day of the week.


I had a recent realization that Deno is actually an amazing tool for scripts and web scraping. It runs TS natively, so I can take advantage of TypeScript's amazing type system when interacting with complex APIs. Being able to import libraries from a URL also makes it a breeze to use any library (I've yet to run into a node module that Deno's std/node library wasn't able to polyfill).

I can basically write a .ts file anywhere and run it from the CLI with deno. Definitely prefer it over Python at this point.


I use Deno for linting; having a binary there instead of using NPM to get a linter is a different level of simple. I like your idea; though I rarely use JS for scripts, I'm sure it'll come in handy.
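For reference, the linter is just a subcommand of the same deno binary (main.ts here is a made-up file name):

  deno lint            # lint everything under the current directory
  deno lint main.ts    # or a single file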


Is there any situation where you end up using the type system for interacting with APIs but also have to webscrape on the same service?

I am interested in hearing more about this as maybe I should switch from Node to Deno for a scraping project I have.


I haven't done something like that yet, but I'm sure there would be. Most web-scraping libraries have good `@types` definitions now, though, which really comes in handy for at least using the tools.

You can probably try it in a few minutes if you already have Deno installed. Make a new file. Here's a quick example script:

```ts

  /* run with `deno run --allow-net demo.ts` */

  import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';

  const API_URL = 'https://subsidytracker.goodjobsfirst.org/parent/tesla-inc';

  const main = async (args: string[]) => {
    const resp = await fetch(API_URL);
    const text = await resp.text();
    // parseFromString can return null, so assert it succeeded before querying
    const html = new DOMParser().parseFromString(text, 'text/html')!;

    console.log(
      'Tesla subsidies:',
      [
        ...html.querySelectorAll('table:first-of-type > tbody:first-of-type > tr')
      ].map(tr => {
        const [name, value] = [...tr.querySelectorAll('td')];

        return {
          name: name.textContent,
          value: value.textContent
        };
      })
    );
  };

  main(Deno.args);
```

Save that file and run:

  deno run --allow-net demo.ts


Lambda Soup (https://aantron.github.io/lambdasoup) is my recommendation over BeautifulSoup :) A compiled binary and a language with types help me write parsing/scraping scripts.


It looks like the project became inactive for a bit and there are alternatives such as htmlq, etc. https://github.com/ericchiang/pup/issues/150
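Going from memory of its README, htmlq usage looks very close to pup; a rough sketch (the URL is illustrative, and the flag names may have changed since I last looked):

  # print the matching elements
  curl -s https://example.com | htmlq 'a'

  # print only an attribute, or only the text content
  curl -s https://example.com | htmlq --attribute href 'a'
  curl -s https://example.com | htmlq --text 'h1'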


From the looks of it, htmlq doesn’t have anything comparable to pup’s JSON output. That JSON is cumbersome to work with, but combined with jq it allows one to extend the shell hackery just a little bit beyond what CSS can do.
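Roughly, the combination looks like this (a sketch from memory of pup's JSON shape: each matched element becomes an object carrying its tag, attributes, text, and children, so the exact jq path may need adjusting):

  curl -s https://news.ycombinator.com/ \
    | pup '.titleline a json{}' \
    | jq -r '.[].text'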


Hey, I'm the author of fq. It can convert to/from HTML and JSON (in two different modes). Use -d html, or the fromhtml, fromxml and toxml functions. Example:

  $ curl -s https://news.ycombinator.com/ | fq -r -d html 'grep_by(."@class"=="titleline").a."#text"'
  Inkbase: Programmable Ink
  New details on commercial spyware vendor Variston
  How We Built Fly Postgres
  ...
  $ curl -s https://news.ycombinator.com/ | fq -r -d html '{hosts: {host: [grep_by(."@class"=="titleline").a."@href" | fromurl.host]}} | toxml({indent:2})'
  <hosts>
    <host>www.inkandswitch.com</host>
    <host>blog.google</host>
    <host>fly.io</host>
    ...
  </hosts>
See https://github.com/wader/fq/blob/master/doc/formats.md#xml and https://github.com/wader/fq/blob/master/doc/formats.md#html for examples and documentation.



Like others have said, pup seems to be abandoned. It's surprising that there isn't a well-supported standard tool for working with HTML in the terminal, like jq for JSON.

Last time I looked at using pup or similar, I wanted to extract two values for each element. For example, let's say I have the following HTML:

  <div class="image">
    <p>Sunset in Hawaii</p>
    <img src="../randomstring123.jpg">
  </div>
  <div class="image">
   ...
Now I'd like to get both the image description and the source for each similar image in the page, preferably so I can pipe it to curl and do

  curl -o "$description.jpg" "$url"
I couldn't find an easy way of doing it, so I used Python instead.


Is there a tool like this that can modify part of the HTML without changing the formatting or inserting new tags unnecessarily? I know it's a difficult problem given the complexity of HTML parsing, but I'd like to be able to work with parts of pages programmatically without messing up the hand-written formatting in other parts.


Probably not without constructing a DOM. Which nowadays, in practice, means using Blink or WebKit.


https://github.com/cloudflare/lol-html is an HTML rewriter that explicitly does not construct a DOM or an AST, but streams nodes through ASAP.

https://blog.cloudflare.com/html-parsing-1/

https://blog.cloudflare.com/html-parsing-2/


This is the best approach if they don't need to consider client-side JS mutations. Not sure why I interpreted it that way in my original reply. Thanks!


I'd go so far as to assume it's an AI problem.


Enjoying pup recently myself. Used it for a little demo/post on shell packaging approaches in Nix this summer https://t-ravis.com/post/shell/no_look_no_leap_shell_with_ni...


Anything out there that lets one programmatically query the style on an element? I get a DOM object from lxml.html.fromstring(), but haven't figured out how to get element styles, so I can display something in bold or apply a terminal color.


If those styles are defined inline, it should be straightforward, but most likely there are external styles being applied, which might be a nontrivial problem to solve.

However, we do have document.defaultView.getComputedStyle, which will probably cover most of your needs. Not sure if that API is available with this tool, though.


I'm looking for something like this but for scraping SPAs and JS-rich web content. A single static binary like this, not ChromeDriver or Selenium, etc.


Good luck with that - you'd need to have a fully standards-compliant HTML/CSS/JS engine in order to get the same output, and at that point that's literally just a browser (and would require a massive investment of time for maintenance).

A driver for an existing browser is the only reasonable option for the foreseeable future, with how complex modern web development has become.


You need something to replace the curl part of this, not pup. You can pipe JS into Chrome via the CLI with the --repl flag, dump the full DOM, and then pipe that into pup.

You'll be stuck with GET requests only though, and very simple ones, unless you get creative with the JS you pipe in.
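If an interactive REPL isn't needed and the goal is just the post-JS DOM, headless Chrome's --dump-dom flag is enough to feed pup; a sketch (the binary name varies by platform, and the URL and selector are illustrative):

  # render the page, client-side JS included, dump the resulting DOM,
  # then query it with pup as usual
  chromium --headless --dump-dom 'https://example.com' \
    | pup 'title text{}'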


Is there a tool to transform HTML?

Simple example: select all img tags without alt text and insert given text into the alt attribute. Or change the domain on every a href.


You may use XSLT for this kind of operation (https://en.wikipedia.org/wiki/XSLT), but it is often overkill. I would love a language that makes it easy to write transductions of HTML documents, but I think combining simplicity and expressiveness for this kind of task is a bit painful. I usually end up using pandoc filters or a dedicated JavaScript script that transforms the DOM and outputs a new HTML file.
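As a concrete sketch of the XSLT route for the img/alt example above (the file name and placeholder text are made up, and I haven't tested this exact stylesheet): an identity transform that copies everything through, plus one override for alt-less images.

  <!-- add-alt.xsl -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html"/>

    <!-- identity template: copy everything through unchanged -->
    <xsl:template match="@*|node()">
      <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>

    <!-- img elements with no alt attribute get a placeholder one -->
    <xsl:template match="img[not(@alt)]">
      <xsl:copy>
        <xsl:apply-templates select="@*"/>
        <xsl:attribute name="alt">decorative image</xsl:attribute>
      </xsl:copy>
    </xsl:template>
  </xsl:stylesheet>

Run it through libxml2's HTML parser with xsltproc:

  xsltproc --html add-alt.xsl page.html > out.html

Note that this reserializes the document, so it runs into exactly the formatting-preservation problem discussed upthread.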


While this is nice, it's three years old.

Direct installation of brew scripts isn't supported anymore. `go get` installs aren't either.

It needs an update.


It may need an update, but not for installation:

  go install github.com/ericchiang/pup@latest


You could probably sprinkle a few more uses of cat into your command lines, but otherwise it's fine.


Will check it out, but would have preferred XPath selectors instead of CSS.


Why? CSS selectors are the normal web developer way to select content from a document. Even JavaScript adopted the approach.


XPath supports more complex queries. In JavaScript, XPath is available as document.evaluate.


xmllint can do that:

  curl example.org | xmllint --html --xpath '//some/xpath/selector' -


Only if it's good old HTML 4. libxml2's parser doesn't grok HTML 5.


I'm using xmlstarlet in Alpine as a bare-minimum way to scrape a webpage in a CI pipeline.


This has support for both XPath and CSS selectors: https://github.com/ludovicianul/hq





