My company is investing heavily in interactive runbooks (their own impl), and while I kinda dislike the idea of runbooks in general, this is a nice implementation.
Interactive runbooks exist in some sort of tech-debt limbo where you have enough time to write a nice little runbook but apparently not enough time to properly automate whatever the task is. Ideally automation is easy and all change flows through an SDLC, and you get all the nice things that come from that (testing, for one). Instead companies build these cybernetic business processes via ad hoc notebooks to manually poke and prod production. This kinda made sense when they were low-effort, throwaway things for rare events or short-term triage, but now they're kinda-sorta code, just without tests or reusability (or maybe you go even further down the crazy well and write libraries and tests for your notebooks, at which point you're really investing in the cybernetic operational model, and like, why?).
Runbooks are about eliminating toil incrementally.
If you or your team find there's some process that needs to be done regularly, you're right, it should be automated. But there are often barriers to doing that immediately, and you may not even agree on what all the steps are. Here's how you do it then:
1. Write the runbook.
2. Execute the runbook multiple times.
3. Modify and improve the runbook.
4. Automate parts of the runbook.
5. When the entire thing is automated, hook it up to a metric to run automatically.
6. Delete your manual runbook - keep this new piece of code documented with its history and steps of improvement and eventual automation.
This method means you get a consistent process almost immediately, you don't need to fight product for a big chunk of time (you can spend 30 minutes here and there on a single step), the team has better conversations about ad-hoc processes, and you eventually automate it.
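As a rough sketch of what step 5 can end up looking like: once every step is scripted, you hang the whole thing off a metric instead of a human. Everything here (the metric endpoint, the threshold, the drain_and_requeue.sh script) is a hypothetical placeholder, not a prescription:

```python
import json
import subprocess
import urllib.request

METRIC_URL = "https://metrics.internal.example.com/api/queue_depth"  # placeholder endpoint
THRESHOLD = 1000  # placeholder threshold


def current_queue_depth() -> int:
    # Pull the metric the runbook used to ask a human to go and check.
    with urllib.request.urlopen(METRIC_URL) as resp:
        return int(json.load(resp)["value"])


def run_automated_runbook() -> None:
    # The former manual steps, now a single version-controlled script.
    subprocess.run(["./drain_and_requeue.sh"], check=True)  # placeholder script


if __name__ == "__main__":
    if current_queue_depth() > THRESHOLD:
        run_automated_runbook()
```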
In my job we have to do code releases to customers. This means we take a repo, do some tidying up like removing internal comments and inlining some dependencies, make sure the README is in good shape, then make sure several people are notified and can approve it.
I wrote a scrappy runbook with like 10 bullet points because this was being done really inconsistently. A month later it had been improved a ton by engineers, with detailed command lines and rollback steps. Now they're starting to automate parts of it. We don't have explicit time for this, but we got a consistent process straight off the bat, engineer buy-in to the process and to improving it, and bit by bit we will get to a point where it becomes a one-liner, without us feeling we had to carve out a chunk of time specially for it.
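For a flavour of what "automate parts of it" can mean, one bullet like "remove internal comments" might become a small script. The INTERNAL marker convention and the directory name below are made up for illustration; the real runbook would use whatever convention the team already has:

```python
from pathlib import Path

MARKER = "# INTERNAL:"  # hypothetical convention for comments that must not ship


def strip_internal_comments(root: str) -> None:
    # Walk the release copy of the repo and drop any line carrying the marker.
    for path in Path(root).rglob("*.py"):
        lines = path.read_text().splitlines(keepends=True)
        kept = [line for line in lines if MARKER not in line]
        if len(kept) != len(lines):
            path.write_text("".join(kept))
            print(f"cleaned {path}")


if __name__ == "__main__":
    strip_internal_comments("release_copy")  # placeholder directory
```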
Step 0. Explain the process, end-to-end, in plain written English.
It's amazing how many people jump directly from "oh, here's a problem" to "let's automate it" without a suitable definition of the problem. A runbook, as you've described it above, is a perfect intermediate step. You can refine the process, add error handling and special cases, tune it based on empirical data, and adjust the process on the fly until it's boring. Then you can automate it :-)
I worked at Rundeck on the core engineering team through the PagerDuty acquisition. I left my job at the start of the year to bootstrap a reboot, StepWise, that embraces and leans into that iterative approach to automation.
We are looking for early adopters to try it out and give feedback here soon. Would love to get your thoughts: https://stepwisehq.com
Joined your waitlist. Looks and feels very intuitive and promising. Even more refined and polished than Rundeck and your competitors. Good job and thanks for sharing.
> Interactive Runbooks exist in some sort of tech-debt limbo where you have enough time to write a nice little runbook but apparently not enough time to properly automate whatever the task is
I think this is a really common place to fall into. Automating 100% of conditions is hard, but having a Google Doc full of bash commands is kinda meh.
One of my previous teams decided on a "Maintainer Lambda" which would help with some tasks we couldn't automate fully. The Lambda could either take an action (trigger a DB backup, trigger a bounce, etc.) or perform some query (find the FQDN of the server where the ${INPUT} metric is in alarm). It'd help with calling complex APIs (e.g. AWS APIs) repeatably and testably, and it'd handle any permissions and input validation. The Lambda had unit tests, and it could run automatically under certain limited conditions. Over time, we added to it and let it handle more tasks, as long as it logged its actions in a ticket (e.g. drain server, then auto-bounce on mem-full alarm).
We'd embed cURL scripts into the runbook doc to trigger the functions manually. But the biggest win was that whenever we got an automated ticket, for things like alarming metrics, it'd auto-run and parse the ticket, spitting out useful commands into the comments that the on-call could run (e.g. copy-pasteable SSH-to-host commands based on which server triggered the alarm). It's basically a reverse runbook. Instead of looking up what to do, we let the ticket tell us what to do, then slowly automate those steps.
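To make the shape of it concrete, here's a stripped-down, hypothetical sketch of that kind of handler. The action names, event shape, and the ticket-logging helper are invented for illustration; only the boto3 calls are real AWS APIs:

```python
import boto3

ALLOWED_ACTIONS = {"db_backup", "reboot_host"}


def handler(event, context):
    action = event.get("action")
    target = event.get("target", "")
    # Input validation lives in one place instead of in each operator's shell history.
    if action not in ALLOWED_ACTIONS or not target:
        return {"ok": False, "error": f"unsupported action or missing target: {event}"}

    if action == "db_backup":
        rds = boto3.client("rds")
        resp = rds.create_db_snapshot(
            DBInstanceIdentifier=target,
            DBSnapshotIdentifier=f"maintainer-{context.aws_request_id}",
        )
        result = resp["DBSnapshot"]["DBSnapshotIdentifier"]
    else:  # reboot_host
        ec2 = boto3.client("ec2")
        ec2.reboot_instances(InstanceIds=[target])
        result = f"rebooted {target}"

    # log_to_ticket(...) is a stand-in for whatever ticketing integration you use,
    # so every action leaves an audit trail in the ticket.
    # log_to_ticket(event.get("ticket_id"), action, result)
    return {"ok": True, "result": result}
```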
I have a bunch of things I want to use them for, amongst them:
* tutorials and other show and tell type stuff.
* troubleshooting: have a collaboratively-authored runbook that queries APIs, loads/aggregates data in some way and allows people to interactively examine response data (like a debug terminal sort of thing but with graphing etc)
It's like a specialised case of org-babel for the masses.
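On the troubleshooting idea above, a cell in such a runbook could be as small as the following sketch: pull data from an API, aggregate it, and let people poke at the result interactively. The endpoint and JSON shape are placeholders I've made up:

```python
import json
import urllib.request
from collections import Counter

STATUS_URL = "https://api.internal.example.com/v1/jobs?window=1h"  # placeholder endpoint

# Fetch the raw response data that on-call folks would otherwise curl and eyeball.
with urllib.request.urlopen(STATUS_URL) as resp:
    jobs = json.load(resp)

# Aggregate by state so the interesting buckets jump out; a graphing cell could follow.
by_state = Counter(job["state"] for job in jobs)
for state, count in by_state.most_common():
    print(f"{state:>10}: {count}")
```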
Anyone with emacs can do this today, for free. The huge caveat is that it requires emacs.
Where this tool shines is that it's a superset of markdown, and thus not tied to any particular editor.
I haven't played with any of the alternative/independent implementations of Org mode but there are several out there¹. It would be cool to use one as the basis for a tool like Runme, or a collection of plugins for other editors.
Also, you often don't need to save your Jupyter notebook in the unergonomic JSON format; instead save it as a script containing your code cells (with the markdown cells as comments). Then you can run it directly from the command line in whatever language your notebook uses: Python, bash... To save notebooks as plain-text files you pip install jupytext, and it magically works: Jupyter will interpret your .py files as notebooks, and so on.
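For example, a notebook saved in jupytext's "percent" format is just a script with cell markers in comments, so it stays grep-able and diff-able. A minimal sketch (the content is made up; only the cell-marker convention is jupytext's):

```python
# analysis.py -- a notebook stored as a plain script (jupytext "percent" format).
# `python analysis.py` still works from the command line, and Jupyter (with
# jupytext installed) opens it as a notebook.

# %% [markdown]
# # Disk usage triage
# Free-form notes live in markdown cells like this one.

# %%
import shutil

total, used, free = shutil.disk_usage("/")
print(f"free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")

# %%
# Each further cell holds the next step of the runbook.
print("ok" if free / total > 0.1 else "investigate")
```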
> The JupyterHub app offered via the GitLab Kubernetes integration now ships with Nurtch’s Rubix library, providing a simple way to create DevOps runbooks
Although this is another case of the "don't use the word 'simple' in your description unless it actually is" rule.
This looks very similar to LiveBook¹. It is purely Elixir/BEAM based, but is quite polished and seems like a perfect workflow tool that is also able to expose these workflows (simply called livebooks) as web apps that some functional, non-technical person can execute on his/her own.
If you like this, you may also like Speedrun. It's markdown for building tools straight into your GitHub READMEs, wikis and documentation, so your users can do what they came to do with a click. https://speedrun.nobackspacecrew.com
I made a VSCode extension to do this but never published it. All it required was that md code snippets had a language annotation/label, and that you either had a default interpreter pre-installed or could configure an interpreter.
I still have no clue why I did it NOR why I never published it.
While in principle I agree with the usefulness of this approach, I wonder how it actually benefits orgs and projects.
It seems to me that extensive runbooks maintained separately from the project's scripts are a massive smell, whether they duplicate actions that already exist in those scripts or describe steps that exist only in the runbooks.
I like the idea of this conceptually but it feels that I could do all this with github actions. Is it the interactivity and ease of changing the script that is the selling point here?
We've seen that it's not unusual in operations for folks to walk through runbooks running commands that require small (but important) amounts of human input. Runme makes that experience considerably better by utilizing the markdown that's already there and allowing for behavioral configuration (user prompts, process execution, language interpreter, etc.). But you can also name and run groups of cells/commands to streamline your workflows.
Yes, most likely. Many organisations are still tied to manual/semi-manual deployments due to company policy or regulatory compliance (banks for example).
I wouldn't squirrel my documentation away in an ipynb file. So even though I'd like to use notebooks as runbooks, Jupyter just isn't the right solution for that. This could be a great tool for when you want to augment markdown documentation with runnable snippets. I really like this idea; it gives off literate programming[0] vibes.
> I wouldn't squirrel my documentation away in an ipynb file.
This! ipynb files are a nightmare to deal with. Fortunately, you can use Jupyter without ever seeing a .ipynb file, by storing your notebooks in a different format (e.g. as programs or as markdown files). Bonus points: you'll hear fewer complaints from your local graybeards if they can edit and run notebooks from their favorite text editor.
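For anyone converting an existing notebook, a small sketch using jupytext's Python API (pip install jupytext), if I remember its API right; the file names are placeholders:

```python
import jupytext

# Load the existing .ipynb and re-save it as a plain-text "percent" script.
nb = jupytext.read("existing_runbook.ipynb")
jupytext.write(nb, "existing_runbook.py", fmt="py:percent")
```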
Another is that it looks like runme is written in Go instead of python. It's a single executable I can drop onto any new machine instead of futzing around getting a particular python version installed before doing any work.
I'm on the Runme team, and one of the primary goals of Runme is to use a shared markdown parsing engine so that there is consistent behavior between the CLI/GH Action and the VS Code notebook.
There are actually many differences. We focus more on operational tasks, and allow cells to contain code from a number of programming languages in addition to python. In fact most folks are using Runme to make executing runbooks full of bash commands a better experience.
Ok, so from the front page I got the impression it was only bash. I'm hitting cloud APIs via python so Jupyter works for that, and I'm using the python to ssh into systems and run bash there. The two missing things from Jupyter are 1. ssh-ing to remote hosts and running shell scripts (we do it with a bunch of code) and 2. running cells concurrently. But python is a hard requirement.
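For context, the "bunch of code" for item 1 boils down to something like this stdlib-only sketch (the host name and command are placeholders, not our actual tooling):

```python
import subprocess


def run_remote(host: str, command: str) -> str:
    """Run a shell command on a remote host over ssh and return its stdout."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(run_remote("app-server-01.example.com", "uptime"))  # placeholder host
```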
I've just added the currently supported languages you can execute to runme.dev in the features section (JavaScript, TypeScript, Shell, Lua, Perl, Python and Ruby for now). The Runme CLI (https://github.com/stateful/runme) also supports running cells in parallel with `runme run -all -p`.
We'd be interested to know how we can help Runme satisfy your use cases, join us on Discord! https://discord.gg/runme
I wonder if you'd get some mileage from my https://p3rl.org/Object::Remote module (or at least stealing the ideas therein ;) - it effectively applies https://p3rl.org/App::FatPacker on-demand over the wire so the remote code can use any (pure perl) module you've got installed locally without needing to be able to write to the disk.
(at https://shadow.cat/ we use it to investigate things on new customers' platforms with only an ssh login and a perl core install as requirements, as well as an ad-hoc "I need to run this on lots of machines without being intrusive" tool)
VS Code can attach to VMs via SSH and the notebooks will transparently run on the remote host. Python works out of the box as mentioned. I’m a co-creator of Runme btw.