Write the Docs Portland 2017

Do you know a runbook from a flipbook?
How sysadmins use documentation

Andrea Longo
15 May 2017



Hi! I’m a nerd who likes servers. A lot.

Hello! My name is Andrea. I’m a developer who has done a bunch of different things, but one common thread has been looking at the tech world from the perspective of a system administrator. Whether it’s writing code or documentation, resolving bug reports or support tickets, a lot of my career has been making sure the folks who keep things running on the server side can get that done.

Old school ops is not dead. It’s not even resting.

What I’d like to share today is what happens in that Land of Ops, and how it intersects with our roles as writers. Much has changed since my days of 12-hour midnight shifts—the original DevOps was making devs sit in the Network Operations Center, training the folks who are now responsible for that nifty thing we built. But underneath all the Happy Little Clouds that enable our online lives, there are still a few actual physical computers. Someone has to install and maintain them. They go down. Notifications get sent, and someone decides it’s time to send a tech to push a button.

Managed services: So I signed up for this thing (AWS, OWA email, Azure, etc.)

Lots of developers have signed up for things. Managed infrastructure services are a very good thing, because running servers isn’t easy. Securing data really isn’t easy. The rise of “Everything as a Service” has certainly changed what many sysadmins do on a daily basis, shifting from running cable and hauling around the crash cart to integration and automation. There is code to be written, even if no user will ever see it. DevOps in many ways exists because Engineering and Operations need to understand the world each side lives in. Old-school ops has not gone away. It’s in banks, brick and mortar retail, transportation, and other organizations that make up a great part of the economy. Some have shifted to new ways of managing technology: a project that fifteen years ago would be deployed in a company-owned datacenter might now spin up as many AWS instances as demand requires. Others, like government labs, never will.

How do you really learn to do this stuff? By doing it. And getting it wrong. And doing it again.

System administration as a career path isn’t historically something you got into sitting in a classroom. Tribal knowledge is passed from elder to younger by example, observation, and the trial-by-fire known as “On-Call.” That weekend you spent with two senior colleagues, tcpdump, and the pizza delivery guy is a stronger professional bonding activity than any ten evenings at the pub.

Read All The Things!! This software has great documentation? Lucky You!! But your job is to assume the worst. So you still ask around.

Organizations using applications they didn’t build have to rely heavily on documentation, whether from an external vendor or written in-house. Vendor documentation, what the maker of the software thinks will be most helpful to the largest number of people, is hopefully good at answering “How.” It might not be so great at answering “Why?” for your situation. But—Surprise!—lots of common server software doesn’t have great documentation. Blogs and online forums may be what you have.

Then Write Down All The Things!! That’s how a lot of that information got where you could find it. Some other sysadmin figured it out.

Internal documentation is what we, in our organizations, know about how our process works. Ops organizations have a long history of documentation that is written by sysadmins, for sysadmins. In larger organizations, these may be compiled by a writing team into more structured form. But that tribal knowledge is often only hard-won from practitioners themselves. In an organization whose core business is not the making of tech products, those docs may never be seen outside the technology group.

Planning vs Responding: making a thing happen and keeping a thing happening require different kinds of information.

A hands-on sysadmin typically has two broad categories of tasks: designing and implementing a solution, and making sure that solution keeps working. The goal is to have more of the former to ensure less of the latter, but even the most smoothly running process still needs to be responsive to business requests.

#%!& Implementation how does it work??? What does it do? Will it help me? How do I make it do what I want? Planning needs time to think.

Planning and implementation integrate, and sometimes build, tools and processes. The types of questions documentation needs to answer here are often similar to those of developers, such as “What is the syntax for this scripting language?” But it’s just as much “How does this package address my requirement?” and “Can it fit in my organization’s existing process?”

Incident response: It did what?? DON’T PANIC! No, really, don’t panic because we have a procedure for that. Probably. We wrote one last time, right?

Incident response, on the other hand, shouldn’t involve a lot of contemplation. A ticket comes in, and someone decides how urgent it is and who should handle it. For on-call situations, the person who first responds is typically tasked with only triage and mitigation. Root cause analysis comes later, maybe by somebody else. The critical questions to answer are “What is this?”, “How is it affecting the system?”, and “What is the best thing to do about it right now?” For some context, let me tell a story about an enterprise software deployment.

TalkyCo: Your Cat Friendly Telecom. (Photo of businesswoman.)

Here’s Carla. She’s the Director of Infrastructure at TalkyCo, a mobile telecom company. Her team keeps internal services working, from email to bandwidth to call center systems. They don’t directly handle end-user help desk issues, but they do manage the process of deploying, upgrading, and monitoring employee machines and devices. Customer Care said subscribers are asking about the photos of adoptable cats pre-installed on TalkyCo-branded smartphones. Some agents are informally sharing information to respond to subscriber inquiries, but it would be nice to have a system with consistent, updated information.

Eric brought some things home. (Photo of conference giveaway items and man wearing glasses.)

She asked Eric, a database architect, to look for products that might help while at the conference he’s attending next week. He finds Categorio, a company with a system to manage cat-related information and share it across large organizations.

Categorio: Up & Running book cover

The product has a database of cats that can be adopted at shelters around the country. Eric wonders how well that works in practice, so he reads a few blogs. Overall it seems to match the marketing material and people like it. There’s a tutorial on the website. There’s even an “Up and Running” book. Eric downloads the enterprise trial. His first task: figure out how to install the thing.

Does this software address my problem?

At this stage, Eric is doing a high-level evaluation. He has some basic information, but is looking for independent confirmation that this product meets his needs. The blogs say it’s working for other people, and the introductory documentation from the website walks him through getting the trial working. That’s enough to move forward, and Carla approves a budget for a pilot.

TalkyCo and Categorio plan a project. (Photo of woman drawing a diagram on a whiteboard)

Eric requests an on-site meeting to discuss a pilot deployment. At the meeting, TalkyCo folks and Categorio folks sketch out what a pilot would look like: one server node for about 25 Customer Care agents in the Thornton, Colorado call center. Back at his desk, Eric sets up housekeeping: new labels and tags in the issue tracking system, a new section in the internal knowledge base, and a wiki. He turns the meeting notes into a proper network diagram, and writes up what kind of server it will need. All of this information goes on the wiki, along with links to end user documentation for Customer Care.

Informal documentation: making sure details aren’t lost. Ticket systems are nifty. Wikis are awesome. Giant email threads: not so much.

Issue tracking tickets aren’t structured documents, but are still a useful source of information. After issues are resolved, some of them will be made into more detailed knowledge base articles. The team wiki is more informal, almost like a scrapbook. Eric’s wiki even has a page for random feature ideas that functions like a persistent brainstorming session.

Internal docs socialize local best practices. (Photo of woman reading a book in front of a laptop.)

Eric installs Categorio on the server. He asks Hsin-yi, the team’s networking intern, to add it to the monitoring system and make a simple dashboard. He points her to the knowledge base to learn more about what services should be monitored for different types of servers, and what tools are used for that. Hsin-yi also creates a wiki page describing what is shown on the dashboard and what events will trigger warning and error alerts. Eric starts a Categorio-specific runbook.
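Monitoring integrations like Hsin-yi’s often come down to comparing a metric reading against warning and error thresholds, then deciding what level of alert to raise. A minimal sketch in Python—the metric names and threshold values here are made up for illustration, not taken from any particular monitoring product:

```python
# Classify a metric reading against warning/error thresholds.
# Metric names and limits are hypothetical examples.
THRESHOLDS = {
    "disk_used_pct": {"warning": 80, "error": 95},
    "request_latency_ms": {"warning": 250, "error": 1000},
}

def alert_level(metric, value):
    """Return 'ok', 'warning', or 'error' for one metric reading."""
    limits = THRESHOLDS[metric]
    if value >= limits["error"]:
        return "error"
    if value >= limits["warning"]:
        return "warning"
    return "ok"
```

Writing down which metrics matter for each server type, and why these thresholds were chosen, is exactly the kind of information that belongs in the knowledge base next to the dashboard description.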

The runbook: helping frazzled humans remember stuff on the first try.

A runbook contains step-by-step instructions, clear enough for an engineer on call to handle an issue without needing to look up additional information. Hostnames, file names, escalation criteria—if it’s needed to perform a task, it’s included in the procedure. Every major area of the system can have a runbook, with tasks as diverse as how to failover a storage volume to how to handle a power outage.
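To make that concrete, here’s what a single runbook entry might look like. Everything in it—hostnames, paths, the escalation rule—is hypothetical, but notice that each step is explicit enough to follow at 3 a.m. without looking anything else up:

```
PROCEDURE: Fail over storage volume to standby
  1. Confirm the alert on the "Storage - Primary" dashboard panel.
  2. ssh admin@storage-primary.example.com
  3. Run: /opt/storage/bin/failover --volume <name> --dry-run
  4. If the dry run is clean, re-run without --dry-run.
  5. Verify clients reconnect within 5 minutes.
  ESCALATE to the storage on-call lead if step 4 fails or
  clients do not reconnect.
```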

This Could Be You. I hope not. (Photo of airport emergency checklist in red text: Alert 3 Check List. When you call an ALERT 3 you are essentially shutting down the airport.)

Some things are still important enough to print out and tape to your desk. Here’s an example from a mission-critical, highly available service in another industry, one that you may have used to attend this event. If someone has to use that Red Phone, everybody at your airport is about to have a Really Bad Day. You might not have time to find, or be able to access, an online document. Time-sensitive details have to be available when and where they are needed. ProTip: tie a flashlight to the binder with the power outage procedures. And check the battery.

Let’s make a map! (Diagram of network server nodes.)

Now that the pilot is up and running, it’s time to plan a larger deployment. Eric develops a more robust plan, with test and production systems, replicas, and a server at each call center. He evaluates the impact of the new software on the existing network, and how to push client installs to the agent devices. He brings in Kathy, a tech writer, to create a basic operations manual. The knowledge base and wiki content is a good start, but it doesn’t have a lot of structure. Also, the Categorio-provided docs need some focused attention before being used by a specific audience.

Growing project needs food! Um, to adhere to internal process. Configuration management. Naming conventions. Deployment window schedules. No Problem! Internal process guides, we haz them.

Here we are deep in implementation. There are scripts to write, procedures to define, and all of it needs to fit into the larger organizational standards. Eric knows the drill by now, but less experienced staff benefit from guides on naming schemes, test tools, configuration management procedures, and other company standards. We saw this when Hsin-yi was integrating with the monitoring system. Some things that the team writes up for their own use include dependency lists for installed packages, platform details, and workarounds from the vendor. Answers to questions are collected in the knowledge base and procedures are added to the runbook. Work in progress is posted on the wiki. Everything important to operating and using the new system should be recorded somewhere other than email.
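Company standards like naming schemes are easiest to follow—and enforce—when they’re written down precisely enough to check automatically. A sketch in Python, using a hypothetical TalkyCo convention of site code, role, and a two-digit index:

```python
import re

# Hypothetical naming convention: <site>-<role>-<2-digit index>,
# e.g. "thr-catdb-01" for the first Categorio database host in Thornton.
# Site codes here (Thornton, Jacksonville, Corporate) are made up.
HOSTNAME_RE = re.compile(r"^(thr|jax|cor)-[a-z]+-\d{2}$")

def valid_hostname(name):
    """Check a proposed hostname against the team's naming scheme."""
    return bool(HOSTNAME_RE.match(name))
```

A check like this can run in a provisioning script, so a misnamed host is caught before it ever reaches the monitoring system.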

Oops! (Photo of broken mannequin on the ground.)

Eric pushes out packages to the test machines, and finds that the secondary node, on a different network, didn’t successfully install. It seems to be something about automated vs manual install. Categorio Support says because of how his network is configured, he should use a custom script. He could write that himself, or he could have professional services write it. Since the contract came with some professional services time, Eric has them do it.

Capturing history for parts that appear out of nowhere. We didn’t make it. But we still get to track and manage it.

Now Eric has vendor-supplied customizations that need to be tracked. The script goes into the repository, and in the commit message Eric writes up where it came from, why it was needed, and how to use it. For now, he adds a note with a link to that commit on the wiki, but later it will get incorporated into an install guide.

I thought you tested that? (Photo of airplane seat-back display showing Linux system messages.)

The team continues to test the system. Things that don’t behave as expected are candidates for procedures in the runbook, with sample alerts, snippets from logfiles or images of dashboards—anything that would help quickly identify and resolve emergencies. More detailed root cause analysis goes on the wiki and, if appropriate, the knowledge base. Kathy has completed the first versions of the manuals, based on what the team learned from the test rollout. Eric schedules a conference call with everyone in the on-call rotation to go over what Categorio is, the basics of what to expect and, most importantly, where to find the docs. By the way, the in-flight entertainment on United 757s runs Linux. Now you know.

Yay! We did it! (Photo of party balloons.)

Testing has gone well, and now the team is ready to roll it out to production. First the other agents in the Thornton call center, and then the other two facilities. Eric invites the deployment team to lunch. Everybody eats delivery spring rolls, and enjoys the celebration.

♪We've only just begun...♫ Now we get to see how this thing really works. I hope somebody is writing this down.

In the process of getting to a successful production deployment, a lot of artifacts were created. The team tried to anticipate as many questions and situations as they could, but there was always the understanding that more experience with the system would bring new things to learn and document.

Two months later...

Agents have up-to-date cat info for subscribers, and they really like that the initial screen shows the nearest animal shelter along with the account details. That feature came out of a suggestion from the wiki brainstorming page. Managers can monitor Categorio activity and handle customer escalations right from their phones. Customers are happy, agents are happy, and cats who have found new families are happy.

Life is good. (Photo of kitten.)

It’s a holiday weekend, and the call centers are in a good mood because subscribers are off having fun instead of calling about their bills. But Saturday afternoon, an agent in Jacksonville views a Categorio cat record only to find a picture of something that is not a cat.

WAT! (Photo of pangolin.)

How did a pangolin get in here? Within minutes, all the agents at that facility are reporting animals that are not cats. Plus, the network is being swamped with incoming requests to the Jacksonville Categorio node. African rhinos. Tree frogs. Sea turtles. All these exotic and endangered animals that are not even remotely domestic cats. There’s even... well, see for yourself.

Photo of soft toy dinosaurs.

While the Customer Care manager is still opening the trouble ticket, the network attack triggers a critical alert. Meena, the engineer on call, hands her eldest the bbq tongs and goes inside to log in. She sees there’s also a ticket from Customer Care, and starts investigating. These not-cats are new records created by client software from outside the TalkyCo network. The metadata seems correct, but what are these images? And the Jacksonville network is slow as molasses. Meena decides the best thing to do is switch Jacksonville from using its local node to the primary at Corporate. The runbook has a procedure to do exactly that, and she confirms the cats are back to normal. Good. Now for the network.

What is this stuff? (Photo of personal computers and cables on the floor.)

The dashboard is showing the server getting hit from IP addresses all over the world, but only on the Categorio port. Clients are supposed to authenticate with the server, but somehow these rogue clients are able to do that. Meena blocks that port at the firewall to temporarily shut them down. Now that she’s stabilized the immediate emergency, she adds her findings to the monitoring ticket and sends Eric a message asking him if he can take a look.

Now fix stuff proper-like. (Photo of man sweeping a parking lot.)

Eric looks in the Categorio logs and sees the bogus clients all look like they are authenticated, but with meaningless users and all the exact same software version. And it doesn’t match any of the versions agents are actually using. It looks like someone has a botnet with a cracked client to create new cat entries. He updates the ticket with the recommendation to allow only known TalkyCo address ranges. Eric uses Meena’s incident summary to compose an email for Categorio Support. The workaround is ok, but it really needs a bug fix. And there’s still the problem of weird records in the database.
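The tell in Eric’s investigation—client versions that were never actually deployed—is a check you can script. A sketch in Python, assuming the log entries have already been parsed into dictionaries (the version numbers are hypothetical):

```python
from collections import Counter

# Client versions the team actually pushed to agent devices
# (hypothetical values, normally read from the deployment manifest).
DEPLOYED_VERSIONS = {"2.3.1", "2.3.2"}

def suspicious_versions(log_entries):
    """Given parsed auth log entries ({'user': ..., 'version': ...}),
    return the client versions that were never deployed, mapped to
    how many requests reported each one."""
    counts = Counter(entry["version"] for entry in log_entries)
    return {v: n for v, n in counts.items() if v not in DEPLOYED_VERSIONS}
```

Keeping the list of deployed versions somewhere findable—like the wiki page Eric maintains—is what makes a comparison like this possible during an incident.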

Photo of people around a race car covered in fake fur to look like a pony.

Support replies, apologizing profusely for the hack. They were already working on a fix. Two days later, they have a server hotfix, updated client versions, and a script to clean up the database. Awesome! Eric documents this script just like the previous one, and notes the new installed version numbers. Plus, removing malicious records from the database is absolutely a runbook task. Emergency resolved, data collected, reports written, life moves on.

Wasn’t it great we had that written down somewhere? Tested procedures mean faster response.

Much of the information Eric referenced as part of this investigation came from things already written down, like the version numbers for the client software. That came from a manifest used by the configuration management system to push packages out. There was a copy of it (and where to find the current version) on the wiki. Not every organization will have all of these parts for every service, and some may have more. Large companies that have been running the same software for decades have tons of internal operations documents, occasionally long after the original vendor has given up.
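A manifest like that earns its keep when you compare it against what is actually installed. A minimal sketch in Python—the package name is hypothetical, and both mappings would normally come from the configuration management system and the host itself:

```python
def version_drift(manifest, installed):
    """Compare a config-management manifest ({package: version})
    against versions actually installed on a host.
    Returns {package: (expected, found)} for every mismatch;
    found is None when the package is missing entirely."""
    drift = {}
    for pkg, expected in manifest.items():
        found = installed.get(pkg)
        if found != expected:
            drift[pkg] = (expected, found)
    return drift
```

An empty result means the host matches the manifest; anything else is exactly the kind of detail worth pasting into the incident ticket.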

Design and implement: how do I make this work for us, for real. Incident response: tell me what to do RIGHT NOW!

Despite much work to build easily managed and self-repairing systems, system administration is still very interrupt-driven. The best documentation provides what someone needs to know, at the right level and in the right context.

Reference docs provide detail for thinking and learning. Humans are all different. Some like theory. Some like examples. Some like videos. Some would rather eat glass than watch a video.

Integrating a new thing is a lot of up-front learning, and different people learn in different ways. Some prefer tutorials and cookbooks, others in-depth theory. One common frustration I hear is there might be lots of beginner materials, and even highly detailed expert-level information, but not much in the middle. That’s where a lot of “How do I make this work for real?” happens.

Runbooks and checklists don’t leave room for much thinking. Because you won’t have it either. Do. It. Now. Please. Faster.

Incident response relies heavily on team-produced documentation. Sometimes it only exists because the person who knew everything wasn’t available that day, and it didn’t go well. Details depend on local practices, and may contain confidential information.

Ops is changing, and not changing. Nothing is perfect. Everything will fail sometime. Good docs save lives. Like those of Infrastructure engineers who like having a life too.

System administration is much less ad-hoc than it used to be. Ops teams that burn out their people can take their companies down with them. Those who write budgets don’t like to hear “Well, that took longer than we thought.” But physical devices and services you don’t control still exist. Things happen. Planning to manage is good. Planning for when you can’t manage is necessary. The documentation used and created by sysadmins reflects this reality.

Thank you! And here’s a real flipbook

Thank you for joining me on this tour, and I hope this provides useful insight into what Ops organizations do, and how sysadmins see the world. And if you are wondering what a flipbook is, it’s a low tech animation drawn on the pages of a physical book. You quickly flip through the pages to see the movie.
NASA Deep Space 2 Mars Microprobe
Thank you!