pup: Command Line HTML Parsing

pup's three-sentence description nails it:

pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.

Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

If you have used jq, this will feel familiar, just applied to HTML. With the option to output the results as JSON, you could find yourself using pup in conjunction with jq.

There are several examples in the README; here are a few of my own.

List all of the IMG tags from cnn.com:
[shell]
curl -s https://www.cnn.com/ | pup 'img'
[/shell]

Now get it back as a JSON array:
[shell]
curl -s https://www.cnn.com/ | pup 'img json{}'
[/shell]
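
To give a rough idea of the pup and jq pairing, here is a minimal sketch that pulls just the src values out of that JSON. It assumes jq is installed, and it feeds pup a small made-up HTML snippet instead of a live page so the result is repeatable. In pup's JSON output each attribute shows up as a key on the element's object, which is what the jq filter relies on.

[shell]
# Made-up HTML snippet, purely for illustration
html='<img src="/a.png" alt="A"><img src="/b.png" alt="B">'

# pup emits a JSON array of img elements; jq prints each src,
# skipping any image that has no src attribute
srcs=$(printf '%s' "$html" | pup 'img json{}' | jq -r '.[] | .src // empty')
echo "$srcs"
[/shell]

Swap the printf for the curl command above to try it against a real page.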

List all of the LINK tags that have a rel="preload" attribute:

[shell]
curl -s https://www.cnn.com/ | pup 'link[rel="preload"]'
[/shell]

I could see pup and jq becoming standard utilities on Unix-like systems.