Can writing documentation beef up your troubleshooting skills?
This week in episode 315, David Klee returns to explore the connection between effective troubleshooting and documentation. We’ll discuss appropriate levels of detail for documentation and explore it as a skill-building exercise. Listen closely to hear why good documentation can make all the difference in a regulatory compliance audit as well as in emergency situations. Also, we’ll talk through some interview questions you can ask to determine the value of good documentation within an organization.
Original Recording Date: 01-20-2025
Topics – An Exploration of Troubleshooting, Pre-requisites for Effective Troubleshooting, What Should Be Documented, Forms of Documentation and Emergency Preparedness, Interview Questions and Employer Perceptions
2:32 – An Exploration of Troubleshooting
- David Klee is a returning guest and the owner and chief architect at Heraflux Technologies. If you missed the previous discussions with David, you can find them below:
- Episode 119 – Tinkering into Specialty with David Klee (1/2)
- Episode 120 – A Time to Build with David Klee (2/2)
- Episode 309 – The Consulting Life: Managing Travel and Becoming a Better Communicator with David Klee (1/2)
- Episode 310 – Finding a Better Way: Contracting, Independence, and a Consultant’s Reputation with David Klee (2/2)
- David approached us about an idea for another topic to explore. After many years in the industry (11 of them as a business owner), David began to think about patterns he has seen and what has made him and many others successful.
- “What has actually made this work? And it’s the art and the science and the luck of troubleshooting…. What makes some of the best technologists arguably some of the best troubleshooters in the world, and then how do you apply that to life? …There’s a lot more than just knowing a technical feature or two or being able to Google faster than the person next to you. I have a lot of fun with this topic.” – David Klee, framing our discussion
- Philosophically, David believes troubleshooting is as much an art as it is a science. There is a foundation one needs to be a good troubleshooter, and David tells us this stems from our childhood curiosity about why things do what they do.
- David tells the story of learning to use a screwdriver at age 5, taking the family’s VCR apart, and successfully putting it back together again (which may or may not have landed him in trouble).
- Over time some people have a constant need to know why something is what it is / why it works the way it does. David sees this present in some people but not all people.
- “When you look at those that are truly great at an industry…they want to know why, and they don’t stop until they know why.” – David Klee
- David mentions the Dunning-Kruger Effect, which speaks to breaking up the things we know and don’t know into 4 quadrants:
- Unknown unknowns are the things that get people into trouble because they think they know them but do not.
- Known unknowns – David considers this area enlightenment in IT and a way to know where the boundaries are
- “Unknown knowns are the things that I consider you a master at a technology or a topic of anything because what you know becomes so integrated into your frame of reference and your being that you don’t know that you know it. You just do it. And, when you hit that point of a mastery of something…you may not be able to explain how you do it or you may not be able to tell somebody the steps to do it. But it’s just muscle memory. It’s just go. You do, and it works…. The truly good educators are the ones that can actually take what they know and dial it to the level of the people that they’re talking to. Some experts cannot do that, but they are so good at what they do. Others can. It’s fascinating…. It’s the unknown unknowns that gets people into trouble. It’s the unknown knowns that really separates people.” – David Klee
- We did not mention known knowns, but it would be the final quadrant.
- John says it’s the idea that you can master a skill or process but not have mastery of teaching or explaining that skill or process. Doing and teaching could overlap, but they do not always overlap.
- David comes from a family of teachers, actually. His parents were traveling road musicians who fell into education, but they have always continued some sort of musical pursuit on their own.
- “It’s neat…to be able to explain to somebody how something works and why. I love it.” – David Klee
- When Nick thinks about troubleshooting, he thinks about both high pressure and low-pressure situations when we’re trying to figure out why something is not doing what it’s supposed to do.
- David says we’re trying to determine why there is an unexpected outcome and what we need to do to get to the expected outcome.
- “It’s a formal methodology or informal methodology for understanding why something does not have an expected outcome and working through the process that is an iterative process – either elimination or identification. And you end up with essentially identification, review, remediate, rinse and repeat until you get the desired outcome. That’s about as formal of a definition as I can give you.” – David Klee
- John thinks this may disguise the art in the troubleshooting process of knowing what issues may be more likely than others.
- People might discover something is not working and change 10 things. If something then starts working again, how do we know which change (or combination of changes) actually resolved the problem? We are far less likely to undo the changes once something begins working again.
- John mentions being good at troubleshooting in areas in which he has lost the fear of something going wrong.
- While John feels comfortable troubleshooting computer systems and software, he’s not good at troubleshooting car problems due to limited knowledge and a feeling of high stakes. Someone with a better knowledge of cars may perceive the stakes to be far lower when making a recommendation for fixing problems.
- David says it depends on what you are troubleshooting. There is a risk qualification element that needs to be considered with the process used in troubleshooting.
- David shares the example of troubleshooting a payment processing system with a group of folks who didn’t know what they didn’t know. The process they had developed for troubleshooting ran the risk of preventing payment processing for the entire company. David describes determining the need to speak to the group of people who built the system in order to troubleshoot the system safely.
11:43 – Pre-requisites for Effective Troubleshooting
- Nick mentions we highlighted a pre-requisite for troubleshooting being knowledge of the systems we’re troubleshooting. What is the correlation between how good a troubleshooter one can be and how well one knows the systems involved in troubleshooting?
- If we know our systems well, we know what is / is not possible within a given set of constraints. One example is knowing the ramifications of changing different database settings.
- “You know what’s going to happen because you know the platform and you know your environment, and you know how they come together. If you know this stuff you can resolve these issues a whole lot quicker.” – David Klee
- Knowledge of the platform and environment would mean we know the systems which interact with the one we are troubleshooting, the impact of the outage, the right person to call for help, and the questions you need to ask them.
- It can be much harder when you inherit a system someone else built and there is no documentation on why it was set up the way it was or how other systems communicate with it. Likely you also don’t know what types of changes have been made to it over time (whether they were band-aid type fixes or some other kind).
- John mentions we’re highlighting domain knowledge of a system and its specific failure modes combined with what has happened in the past to diagnose and fix those things. A resilient system should have these things documented.
- David says the flip side of this is being someone coming in from the outside who has never seen this machine before. Think about the scenario in which you are asked to troubleshoot a system which people with all the domain knowledge can’t fix. As a consultant he runs into this pretty regularly. It can be challenging, but David says it keeps him sharp.
- Someone troubleshooting a system like this has to keep track of what’s already been done, what should have been done, and what questions need to be asked to extract domain knowledge from others when information hasn’t been documented.
- One must also know the platform well enough to successfully understand a system’s current state (which might be different than what people tell you).
- “Perception of a system’s state might be entirely different than the reality of the system’s state. That’s a hard, hard art to master right there.” – David Klee
- John says someone who doesn’t know a system may have a better chance of doing effective diagnosis. The person who knows a system well is going to make assumptions someone who doesn’t know a system would likely not make (i.e. the database is running great, etc.).
- David stresses the importance of quantifying performance when we’re troubleshooting. Preconceived notions about an environment might lead to subjective explanations.
- When he walks into an environment to troubleshoot a problem, David wants to look at the raw data. This data can help provide the true nature of a system’s state and perhaps prevent finger pointing between teams.
- “Show me the data. Show me why you think this. And most of the time, people cannot produce that data.” – David Klee
- Even trend information on how past issues of a specific kind were resolved counts as data and may provide a nice starting point for troubleshooting.
- David tells the story of a database administrator and a storage administrator getting into a shouting match over a specific problem. Each of them wanted to be right, but neither had data to back up their claims. In the end, both were right – the problem was somewhere in between the database and the storage, in the network and operating system layers. Listen as David describes it in detail.
- “But it’s ‘I’m right. You’re wrong.’ There was no ‘I understand that my telemetry is showing me this, but your telemetry is showing you something different.’ Put the data together, and draw a line between them. It’s the why is this showing 2 different things.” – David Klee on troubleshooting telemetry data from different systems
18:29 – What Should Be Documented
- What type of documentation would be helpful to have in situations like the one David described (the database and storage administrators getting into an argument)?
- David says it would have been ideal to have a diagram of the entire environment that highlights the data communication flow between systems.
- “If we were able to literally have every single hop there, then you essentially start at both ends, and you start collecting the data until you meet in the middle. If you know the pieces involved, you can collect the system state and the telemetry behind it. That’s the easy part. You just have to know how to draw that line.” – David Klee
- Should each hop in the flow of data be instrumented from the beginning or only when there is a problem?
- David says you need data to baseline for good performance, and when there is a problem, you have to compare the telemetry for each part of the path to that of the baseline.
- David feels like he spends 25% of his week benchmarking and baselining things for people and has developed methodologies for different types of systems across technology stacks.
- “How can you tell me it’s running slow if we don’t know how it was running when everything is fine? You have no objectivity to gauge it’s slow.” – David Klee
- John highlights the challenges of diagramming these types of systems or applications. We need to represent physical connections, virtual connections, and even API calls for example. There are many layers involved, each of which can change.
- David thinks of a system he might be troubleshooting as an ecosystem rather than something static. He gives the example of a desktop computer and how a single software update can change everything.
- “To me it’s document what’s in your domain. Document it the best you can. Imagine you get hit by the beer truck, and somebody else has to come along and follow you…. I want you to know everything there is to know about why this machine was setup the way it was, what it took to get this thing running stable including custom tweaks, the raw architecture behind it, the configuration, everything I can possibly think of…mostly because I’m probably going to be the one to upgrade this thing in 4 or 5 or 6 years. I want to know – what did I do to stabilize this thing? Why is it setup the way it’s setup? And if somebody else needs to come along and support this…” – David Klee, on the purpose of good documentation
- David gets calls from customers during problem situations asking why certain configurations were made, and when this happens, he will send them the same, extremely thorough set of documents he produced and shared with them when the system was originally built.
- David has high expectations for what good looks like when it comes to documentation. The output from a SQL Server and infrastructure health check provided to customers will be around 250 pages on average. The spirit behind this is so customers have the what and the why.
- David highlights some big successes from producing thorough documentation for customers.
- David’s company saved a customer over $30 million in SQL Server licensing through effective tuning, and because of this work, the firm later won a massive SQL Server migration project.
- “They liked the documentation. They liked the why and not just the what.” – David Klee, on thorough systems documentation as a differentiator of his business
- John says people often don’t want to document things for the next person, but many times the next person who comes along is you 6 months later.
- These things need to be documented well or perhaps put into a knowledge management system.
- David documents things so he isn’t forced to recall them from memory months or years in the future. He could be documenting a quick change or something that took 40 hours to find in the process of solving a problem. When making a change to something weird or nuanced, David will document it and make sure he and his customers have multiple copies of it.
- “There are things all over the place with the kind of tuning that I do. If you didn’t know it was there, you’d have no clue. You’d just have weird symptoms here or there, but these things are so nuanced.” – David Klee
- How can we balance thorough documentation with the need to make progress and not impede it?
- An organization has to be on board and allow technologists the time to document properly. If this doesn’t happen, the entire IT organization suffers. The technologists who made changes to solve a problem will forget what they did very quickly when forced to move on to the next fire immediately.
- “I think that’s why I run into a lot of the states that I do out there…. Something broke. Nobody’s had the time to think about it or look at it or document it or review it…. Here, you figure it out.” – David Klee
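The baselining idea from this section (record how each part of the path behaves when everything is fine, then compare against it during a problem) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hop names, the latency numbers, and the 3-standard-deviation threshold are all hypothetical and are not David's methodology.

```python
from statistics import mean, stdev


def find_deviations(baseline, current, threshold=3.0):
    """Return the hops whose current reading sits more than `threshold`
    standard deviations away from that hop's baseline mean."""
    flagged = {}
    for hop, samples in baseline.items():
        if hop not in current or len(samples) < 2:
            continue  # nothing to compare, or too few samples for a spread
        mu = mean(samples)
        sigma = stdev(samples)
        if sigma == 0:
            continue  # flat baseline: a z-score is undefined, skipped in this sketch
        z = (current[hop] - mu) / sigma
        if abs(z) > threshold:
            flagged[hop] = z
    return flagged


# Hypothetical latency samples in milliseconds, collected while healthy.
baseline = {
    "database": [2.0, 2.2, 1.9, 2.1],
    "network": [0.5, 0.6, 0.5, 0.4],
    "storage": [1.0, 1.1, 0.9, 1.0],
}
# Hypothetical readings taken while users report slowness.
current = {"database": 2.1, "network": 4.0, "storage": 1.05}

flagged = find_deviations(baseline, current)
for hop, z in flagged.items():
    print(f"{hop}: {z:.1f} standard deviations from baseline")
```

With these made-up numbers only the network hop stands out, which mirrors the story above: the database and storage readings each look normal, and the objective comparison points at the layer in between.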
25:25 – Forms of Documentation and Emergency Preparedness
- Is the documentation we’re talking about something kept in a change control system, a wiki, an asset management system, or just some large document somewhere?
- If enough history is provided, David tells us the format doesn’t matter.
- Some places use a formalized change control process with tickets and platforms like ServiceNow. It’s a process that works for those organizations.
- Some organizations treat infrastructure as code and leverage JIRA tickets for tracking changes.
- “As long as I have a list of what has changed and when and why and a reference document that shows why the system is configured the way it is, any nonstandard change, any reason why the system is in use, anything. What’s it talking to? What’s placed on it? What firewall exclusions, routes…? If I have that, I know 90% of what I need. It’s that last 10% that’s always…specific to a given machine. But if I know the why, the what, and the where…you can figure out the how.” – David Klee
- Make sure your documentation is retrievable even if a system or platform or datacenter is offline! For example, David has one customer who prints out the documentation once per month and puts it into a fireproof safe.
- There are many ways to ensure the documentation is retrievable in a critical situation. We need to make sure it’s available offsite somewhere (digital or printed copy).
- Who should have access to the documentation?
- David says more than one person for sure. Things can happen to people like getting hit by a beer truck or being in a natural disaster.
- “If you have one copy…single point of failure; I don’t believe in that. Two, three, four copies – park it on a USB drive at a bank deposit box. Park it in a public cloud that key members of IT can get into. The odds that that goes down…slim.” – David Klee
- Nick mentions that access to documentation (i.e. the run books) would need to be part of onboarding new team members and offboarding departing ones.
- David tells the story of helping a trucking company build a DR plan several years ago. The company was in tornado alley and had around 800 virtual machines. Due to regulations in the trucking industry, there is a requirement for constant telemetry feeding back to corporate systems from the trucks themselves.
- “I don’t believe in just testing a handful of pieces of DR every once in a while. We fail over and run from DR for 1 week out of every month…. They fail over the first Friday night of the month. They fail back on the second Friday of the month. Half of IT gets off the 3rd Friday of the month, and the other half gets off the last Friday of the month. I think it’s great. They love it…. Anybody in the room can fail over the entire company with the run books that are provided and maintained by every member of IT.” – David Klee, speaking to a DR plan for a company he helped architect
- In this scenario, the company’s CIO chooses 4 random members of IT staff who cannot be part of the DR exercise (i.e. simulating that they died). If any member of the team has to call one of those 4 team members during the fail over or fail back, the DR exercise fails.
- The company we’re talking about doesn’t just say they have DR. They demonstrate it to insurance companies and auditors, and their insurance costs much less as a result. The full failover to / from DR takes about 43 minutes. The process took about a year to get right because of so many moving parts, but it works very well.
- How detailed should a company’s disaster recovery or business continuity plan be for the purpose of audits?
- David says there are varying degrees of detail. Some auditors may be checking a box, only looking for backups and offsite copies of data.
- Good auditors would ask to see the detailed process of how things failed over and how long it took.
- Some companies do disaster recovery and only fail over one system (maybe even without all the dependencies).
- “The good auditors are the ones that ask how long, when, not just what. Those are the auditors that most people in IT hate.” – David Klee
- David shares the story of a database administrator friend of his who was, in an audit, asked about the disaster recovery process and if he could demonstrate it.
- The auditor then noticed a 400-page book about SQL Server backup and recovery that David’s friend had written.
- David’s friend mentioned the book was the genericized process, but he then produced a specific document of the process at that specific company. It answered the auditor’s questions in 5 minutes.
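Earlier in this section David says the format of the change history matters less than capturing what changed, when, and why. As a minimal sketch of that idea, the record below keeps those fields together in plain Python. The field names, the server name, and the sample entries are hypothetical and are not taken from any tool mentioned in the episode.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChangeRecord:
    when: date                 # when the change was made
    system: str                # which machine or service it touched
    what: str                  # what was changed
    why: str                   # the reason, which is the part most often skipped
    nonstandard: bool = False  # flag custom tweaks a successor would not expect


def history_for(records, system):
    """Return one system's changes in chronological order."""
    return sorted((r for r in records if r.system == system),
                  key=lambda r: r.when)


def missing_why(records):
    """Return the changes that were logged without a reason."""
    return [r for r in records if not r.why.strip()]


# Hypothetical entries for illustration only.
records = [
    ChangeRecord(date(2024, 6, 3), "sql-prod-01",
                 "Opened a firewall exclusion for the reporting server",
                 "Nightly report job could not connect"),
    ChangeRecord(date(2024, 2, 14), "sql-prod-01",
                 "Changed a parallelism setting", "", nonstandard=True),
]

for r in history_for(records, "sql-prod-01"):
    print(r.when, r.what, "| why:", r.why or "(not recorded)")
```

A check like `missing_why` is one way to catch the gap described above, where a fix is made under pressure and the reason is never written down.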
33:12 – Interview Questions and Employer Perceptions
- John is wondering if we may have uncovered some good screening questions for job applicants to use in interviews related to this topic.
- We could ask about the company’s knowledge management strategy, the way they document how systems work, or the level of importance placed on documenting disaster recovery / failover processes.
- David says the employer should have an immediate answer for this. If they don’t, it’s a red flag and may mean you are the one who has to do whatever it takes to get stuff up and running again.
- Different parts of the business might document things in different ways (all of which could be effective), and processes might have different levels of importance when it comes to business resilience. John gives the example of documenting an employee onboarding process and where that ranks in overall priority compared to other things.
- David shares the story of a company whose disaster recovery plan includes helping the families of the IT professionals who need to engage because of an emergency situation. This includes transportation, housing, and much more.
- What types of questions might David ask a prospective job candidate on this topic?
- David would ask how someone documents why something works.
- David also asks for a 5-minute technical presentation covering a facet of what they are working on and why they enjoy it.
- “It’s an interesting twist because it tells me…can you talk to somebody who knows something about what you’re doing? Can you convey it in a way that people can understand? And it helps me get into their brain. Why do you like doing this?” – David Klee
- These types of questions help David understand whether someone enjoys working in the technology field or is in it solely for the money.
- Do most employers see value in a prospective employee having experience writing detailed documentation or disaster recovery plans, or is it a mixed bag?
- “The company should love it. I can’t say they always do.” – David Klee
- Some employers may think you are too deep in the weeds or that you spend too much time on paperwork and process to effectively get things done.
- Companies could be solely focused on getting things done, which can be a problem.
- Companies too focused on process may be very inefficient.
- David says it’s an interesting balancing act and sees this play out differently inside different organizations. The approach may depend on the type of business and what they are trying to do.
- “If a person’s process behind this stuff doesn’t line up with the company, you may not be a good fit.” – David Klee
- John mentioned the good and the bad of systems being designed to prevent change.
- “Database technologies are very evolutionary. I see people that can’t embrace positive change to be as big of a detriment as people that embrace negative change too haphazardly.” – David Klee
- David highlights an example. This company needed a more highly available database environment but was too reluctant to migrate to one. The change was too much for them to embrace even though it would provide a great benefit.
- This isn’t about a poor value statement for the change. No one wants to put their job on the line if something doesn’t work. David mentions a mandate from the top of a company for availability, but no one at lower levels is willing to make changes to achieve it. This is a case of a mandate not being enforced.
- “It’s we’re willing to sacrifice what we know to move into unknown territory carefully, cautiously, one piece at a time…and they can’t start the process.” – David Klee
- John says this may be due to cultural or political undercurrents not visible to someone on the outside of the system.
- David references a previous conversation we had on the show about consulting and the level of exposure to politics. David reiterates one of the reasons he loves consulting – because he does not do well with politics.
- Are consultants brought in because of company politics? The fun part of consulting according to David is when a company brings you in to tell them what they need to do. When a company brings you in and tells you what to do as a consultant, you have the ability to say no if asked to take an improper or incorrect approach (another reason David loves consulting).
- In environments where detailed documentation is seen as valuable, can this get someone a promotion or perhaps even save their job?
- The answer is 100% yes.
- David gives the example of a company which had a security incident last year. The database administrator was seen as someone who always said no to things, wanting to look at code before it was released to production or have changes happen during normal working hours, etc.
- After the company was hit by a ransomware incident, the database administrator (or DBA) recovered the machines in 7 hours. It took 2 weeks for all other systems that used the database to get back up and running again.
- The database servers had proper change control, use of service accounts, firewalling, etc. and were the most resilient because of that.
- “Data was up all because…questioned everything, didn’t trust a bit. I trust you, and I trust your intentions. But prove it.” – David Klee, on the mindset needed for resilient systems
Mentioned in the Outro
- When creating system or change documentation, remember that one person you could be writing the documentation for is you in the future. You can also take the attitude of providing the right level of depth in documentation so that others can fix the problem without needing to call you.
- Having appropriate levels of documentation in a place where everyone can find it can make it easier for team members to rotate in and out of certain areas and support taking uninterrupted vacations.
- If your company or team isn’t documenting systems or changes at a deep level, maybe you can be the one to start the trend or help operationalize it for your team. Try speaking with your manager or team lead about the value of better documentation and ideas for getting there (maybe differently than it has been done in the past). Even making small improvements is progress, and it could be the kind of progress that helps you advance to team lead someday.
- For additional interview question ideas related to documentation and knowledge management, check out Episode 293 – Enterprise Knowledge Management: A Consultative Approach to Solving the Right Problems with Abby Clobridge (2/2).
- Troubleshooting is about drawing a line between two points and checking telemetry at every point in between, but when we are troubleshooting with others, it is an opportunity to show empathy, to collaborate effectively to solve a problem, and to learn from colleagues.
- Maybe you can learn from the way colleagues on other teams document their changes and see how it compares to the way your team is doing it.
- Documentation lends itself well to ensuring we are prepared for a business emergency.
Contact the Hosts
- The hosts of Nerd Journey are John White and Nick Korte.
- E-mail: nerdjourneypodcast@gmail.com
- DM us on Twitter/X @NerdJourney
- Connect with John on LinkedIn or DM him on Twitter/X @vJourneyman
- Connect with Nick on LinkedIn or DM him on Twitter/X @NetworkNerd_
- If you’ve been impacted by a layoff or need advice, check out our Layoff Resources Page.