Effective troubleshooting is not about knowing every command. It is about having a methodology that stops you from thrashing and helps you move from observation to root cause as efficiently as possible. After two-plus decades of on-call work — from bare-metal RHEL servers to containerised microservices — the discipline of observe first, hypothesize second, test third saves more time than any individual tool. This page captures the workflows I actually follow during incidents, not an exhaustive reference.

General Troubleshooting Methodology

1. Observe: establish ground truth before touching anything

Resist the urge to restart services immediately. Gather data first.
# What is happening right now?
uptime                        # load average trend (1/5/15 min)
w                             # who is logged in + what they're running
dmesg -T | tail -50           # recent kernel messages
journalctl -xe --no-pager | tail -100   # systemd journal

# System-wide snapshot
vmstat 1 5                    # CPU, memory, I/O every second for 5s
iostat -xz 1 5                # per-device I/O stats
free -m                       # memory overview
df -hT                        # disk usage
ip -br addr show              # interface status
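
It can help to save these outputs to files rather than only reading them on screen, so they can be compared before and after a change. A minimal sketch, assuming a hypothetical helper named `snap` and an output directory of your choosing (neither is part of any standard tool):

```shell
#!/bin/sh
# snap DIR NAME CMD [ARGS...]
# Runs CMD and saves its stdout and stderr to DIR/NAME.txt.
# If CMD is not installed, records that fact instead of failing.
# (The helper name and file layout are illustrative assumptions.)
snap() {
    snap_dir=$1; snap_name=$2; shift 2
    mkdir -p "$snap_dir" || return 1
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "skipped: $1 not installed" > "$snap_dir/$snap_name.txt"
        return 0
    fi
    "$@" > "$snap_dir/$snap_name.txt" 2>&1
}

# Example usage (commented out so sourcing this file has no side effects):
# out="/tmp/incident-$(date +%Y%m%dT%H%M%S)"
# snap "$out" uptime uptime
# snap "$out" vmstat vmstat 1 5
# snap "$out" df     df -hT
# snap "$out" dmesg  sh -c 'dmesg -T | tail -50'
```

Taking one snapshot before a change and one after gives you two directories that can be compared with `diff -r`.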

2. Hypothesize: form a specific, testable statement

A hypothesis is not “something is wrong with the database.” It is: “PostgreSQL is not accepting new connections because the connection pool is exhausted.” Write it down. If you cannot state it precisely, you need more observation.

3. Test: one change at a time

Change exactly one variable per test. Document what you changed, when, and what the result was. In a production incident this log is your audit trail.
# Common test tools
strace -p <pid> -e trace=network    # system calls for a process
ltrace -p <pid>                      # library calls
lsof -p <pid>                        # open files + sockets for a process
tcpdump -i eth0 -n port 5432        # capture traffic on a port
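
The "document what you changed, when, and what the result was" habit is easier to keep if logging takes one command. A minimal sketch, assuming a hypothetical helper named `log_note` and a log path you pick yourself; the example messages are made up for illustration:

```shell
#!/bin/sh
# log_note FILE MESSAGE...
# Appends one line to FILE: UTC timestamp, user name, message (tab-separated).
# (Helper name and line format are illustrative assumptions.)
log_note() {
    note_file=$1; shift
    note_user=$(id -un 2>/dev/null || echo unknown)
    printf '%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$note_user" "$*" >> "$note_file"
}

# Example usage (hypothetical messages):
# log_note /tmp/incident.log "HYPOTHESIS: connection pool is exhausted"
# log_note /tmp/incident.log "CHANGE: restarted the pooler"
# log_note /tmp/incident.log "RESULT: new connections accepted"
```

Because each entry is a single timestamped line, the file can be pasted directly into a post-incident review.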

4. Fix: targeted and reversible when possible

Apply the minimal fix. If you must restart a service, know why you’re doing it and what you expect to change. Document the fix and the outcome.

5. Validate and monitor

# Confirm the fix held
journalctl -u app.service -n 50 --no-pager
tail -f /var/log/app/app.log
watch -n 2 'ss -tlnp | grep 8080'    # watch port status every 2s
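
When validation has to run unattended (for example at the end of a remediation script), a polling loop can stand in for `watch`. A minimal sketch, assuming a hypothetical helper named `wait_until`; the timeout is approximate because it counts sleep intervals, not the time the check itself takes:

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds. Returns 0 on success, 1 once roughly
# TIMEOUT_SECONDS of waiting has accumulated without a success.
# (Helper name is an illustrative assumption.)
wait_until() {
    wu_timeout=$1; wu_interval=$2; shift 2
    wu_elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        wu_elapsed=$((wu_elapsed + wu_interval))
        if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
            return 1
        fi
        sleep "$wu_interval"
    done
    return 0
}

# Example usage: wait up to 60s for something to listen on port 8080
# wait_until 60 2 sh -c "ss -tln | grep -q ':8080 '" && echo "port is up"
```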

High CPU Investigation

# Top CPU processes right now
ps aux --sort=-%cpu | head -15

# Interactive live view (sort by CPU with 'P', memory with 'M')
top

# htop with tree view (shows parent-child relationships)
htop -t -d 3   # -t = tree view; -d is in tenths of a second, so 3 = 0.3s

# Per-core CPU breakdown (press '1' in top to toggle)
mpstat -P ALL 2 3    # 3 samples, 2-second interval, all CPUs
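A load average of 1.0 on a single-core machine means the CPU is fully loaded. On a 16-core machine, 1.0 means virtually idle. Always divide load average by CPU core count to get a meaningful utilisation ratio; check nproc or lscpu for the core count. On Linux the load average also counts tasks in uninterruptible sleep (usually waiting on I/O), so a high ratio does not always mean the CPU is the bottleneck.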
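
The division described above can be scripted. A minimal sketch, assuming a hypothetical helper named `load_per_core`; treat the result as a rough indicator rather than a precise utilisation figure:

```shell
#!/bin/sh
# load_per_core LOAD CORES
# Prints LOAD divided by CORES with two decimals.
# Returns non-zero if CORES is not a positive number.
# (Helper name is an illustrative assumption.)
load_per_core() {
    LC_ALL=C awk -v l="$1" -v c="$2" 'BEGIN {
        if (c + 0 <= 0) exit 1
        printf "%.2f\n", l / c
    }'
}

# Example usage on a live Linux system (1-minute load average):
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A value near or above 1.00 means there is roughly one runnable or blocked task per core, which is the point at which queuing starts.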

Memory Investigation

free -m             # RAM + swap (avoid relying on "used" — includes cache)
# Key field: "available" = what apps can actually use without swapping

cat /proc/meminfo   # full details
# MemAvailable, Cached, Buffers, SwapCached, AnonPages, Mapped

# Is the system swapping?
vmstat 1 5
# si (swap in) / so (swap out) > 0 = swapping — investigate further
swapon --show       # swap devices and usage

# Top memory consumers
ps aux --sort=-%mem | head -15
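
Since "available" is the field that matters, it can be useful to reduce it to a single percentage for alerting or quick comparison. A minimal sketch, assuming a hypothetical helper named `mem_available_pct` that reads a file in `/proc/meminfo` format:

```shell
#!/bin/sh
# mem_available_pct [FILE]
# Prints MemAvailable as a percentage of MemTotal (one decimal).
# FILE defaults to /proc/meminfo; any file in the same format works.
# (Helper name is an illustrative assumption.)
mem_available_pct() {
    LC_ALL=C awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%.1f\n", avail * 100 / total
            else exit 1
        }' "${1:-/proc/meminfo}"
}

# Example usage:
# echo "available: $(mem_available_pct)%"
```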

Disk Space Issues

# Where is disk being used?
df -hT                                  # overview of all filesystems
df -i                                   # inode usage (100% inodes = no new files)

# Drill down by directory size
du -sh /var/*          | sort -rh | head -20
du -sh /home/*         | sort -rh | head -10
du -sh /opt/*          | sort -rh | head -10

# Find the largest directories under /var, up to 3 levels deep
du -h --max-depth=3 /var 2>/dev/null | sort -rh | head -30

# Find individual files larger than 500 MB
find / -xdev -size +500M -type f -exec ls -lh {} \; 2>/dev/null

# Find the 20 largest files anywhere
find / -xdev -type f -printf '%s %p\n' 2>/dev/null \
    | sort -rn | head -20 \
    | awk '{size = $1; sub(/^[0-9]+ /, ""); printf "%.1f MB  %s\n", size/1048576, $0}'   # keeps paths with spaces intact
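
Before drilling down, it can save time to list only the filesystems that are actually close to full. A minimal sketch, assuming a hypothetical helper named `full_filesystems` that reads `df -P` output on stdin; mount points containing spaces are not handled:

```shell
#!/bin/sh
# full_filesystems THRESHOLD
# Reads `df -P` output on stdin and prints "mountpoint use%" for every
# filesystem whose usage is at or above THRESHOLD percent.
# (Helper name is an illustrative assumption.)
full_filesystems() {
    LC_ALL=C awk -v t="$1" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= t + 0) print $6, $5
        }'
}

# Example usage:
# df -P | full_filesystems 90
```

The `-P` flag keeps each filesystem on one line, which is what makes column-based parsing safe here.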

Service Debugging with systemd

systemctl status nginx               # status + recent journal lines
systemctl start  nginx               # start
systemctl stop   nginx               # stop
systemctl restart nginx              # stop + start
systemctl reload nginx               # run the unit's ExecReload= (for nginx: re-read config, no downtime)
systemctl enable  nginx              # enable at boot
systemctl disable nginx              # disable at boot
systemctl is-active  nginx           # returns 0 if active
systemctl is-enabled nginx           # returns 0 if enabled

# List failed units
systemctl --failed

# Check if a service keeps restarting
systemctl status nginx | grep -E "Active:|restart"
systemctl show nginx -p NRestarts    # automatic restart counter (newer systemd versions)
When a service fails to start, always check journalctl -u service-name -n 50 --no-pager before anything else. The error is almost always in the last few lines. systemctl status shows only a handful of journal lines and may ellipsize long ones; journalctl gives the full output.

Log Analysis Patterns

# Count error rate per minute from structured logs
grep "ERROR" /var/log/app/app.log \
    | awk '{print $1, $2}' \
    | cut -d: -f1-2 \
    | sort | uniq -c | sort -rn

# Top 20 client IPs by request count (nginx access log)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

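# Count slow requests (>1 s) in nginx. Assumes $request_time is the LAST field
# of your log_format; it is not part of the default "combined" format.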
awk '$NF > 1.0 {print $0}' /var/log/nginx/access.log | wc -l

# Watch log file and alert on pattern
tail -F /var/log/app/app.log | grep --line-buffered "CRITICAL" | while read -r line; do
    echo "ALERT: ${line}" | mail -s "Critical error" ops@example.com
done
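
The per-minute error count above depends on the log's timestamp layout. A minimal sketch of the same pipeline wrapped in a function, assuming every line starts with `YYYY-MM-DD HH:MM:SS` (an assumption about your log format) and using a hypothetical function name:

```shell
#!/bin/sh
# errors_per_minute FILE
# Prints "count date HH:MM" for each minute containing ERROR lines,
# busiest minute first.
# Assumes lines begin with "YYYY-MM-DD HH:MM:SS" (format is an assumption).
errors_per_minute() {
    grep "ERROR" "$1" \
        | awk '{ print $1, $2 }' \
        | cut -d: -f1-2 \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2, $3 }'   # strip the padding uniq -c adds
}

# Example usage:
# errors_per_minute /var/log/app/app.log | head -10
```

If your timestamps use a different layout (ISO 8601 with a `T`, syslog-style month names), the `awk` and `cut` stages need adjusting to isolate the minute.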

OOM Killer Investigation

# Was the OOM killer invoked recently?
dmesg -T | grep -i "oom\|killed process\|out of memory"

# In the journal
journalctl -k --since "24 hours ago" | grep -i "oom\|killed"

# The OOM killer log lines look like this (exact wording varies by kernel version):
# "Out of memory: Kill process <pid> (<name>) score <N> or sacrifice child"   (older kernels)
# "Killed process <pid> (<name>) total-vm:<X>kB, anon-rss:<Y>kB"
Setting oom_score_adj to -1000 for a process exempts it from the OOM killer entirely. Use this only for truly critical processes (database, monitoring agent). Protecting the wrong process means something else gets killed instead, and if no killable process remains the kernel panics.
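
To see at a glance which processes were killed, the "Killed process" lines can be reduced to PID and name. A minimal sketch, assuming a hypothetical helper named `oom_victims` and the log wording shown above; kernels that phrase the message differently will need a different pattern:

```shell
#!/bin/sh
# oom_victims
# Reads kernel log text on stdin and prints "pid name" for every
# "Killed process <pid> (<name>)" line it finds.
# (Helper name is an illustrative assumption.)
oom_victims() {
    sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}

# Example usage:
# dmesg -T | oom_victims
# journalctl -k --since "24 hours ago" | oom_victims
```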

Network Connectivity Debugging

1. Verify local interface and address

ip -br addr show      # quick overview of all interfaces and their IPs
ip link show          # are all expected interfaces UP?

2. Test default gateway reachability

GATEWAY=$(ip route show default | awk '/default/{print $3; exit}')
echo "Default gateway: ${GATEWAY}"
ping -c 4 "${GATEWAY}"

3. Test external IP connectivity (bypassing DNS)

ping -c 4 8.8.8.8       # Google DNS by IP — tests raw routing
ping -c 4 1.1.1.1        # Cloudflare DNS

# If gateway pings but 8.8.8.8 doesn't → routing or upstream issue

4. Test DNS resolution

dig +short google.com
# If this fails but 8.8.8.8 pings → DNS issue
# Try alternate resolver:
dig @8.8.8.8 +short google.com

5. Test application port specifically

# Is the remote port open?
nc -zv remote-host 443
telnet remote-host 443

# Test HTTP response (not just TCP)
curl -sv --max-time 10 https://remote-host/health

# Check local firewall isn't blocking outbound
iptables -L OUTPUT -n -v

6. Capture traffic if still unclear

# Capture all traffic on port 5432 (PostgreSQL)
tcpdump -i eth0 -n -w /tmp/capture.pcap port 5432

# Quick human-readable capture (no write to file)
tcpdump -i eth0 -n -A -c 50 port 80    # options before the filter expression

# Filter by host
tcpdump -i any -n host 10.0.0.5

# Analyse the pcap offline
tcpdump -r /tmp/capture.pcap -n | head -50
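
The layered order of these steps (interface, gateway, external IP, DNS, port) matters because the first failing layer is usually where the fault is. A minimal sketch of a runner that stops at the first failure, assuming a hypothetical helper named `run_layers` and a `label::command` argument convention invented for this example:

```shell
#!/bin/sh
# run_layers "label::command" ...
# Runs each command in order via sh -c. Prints "ok" or "FAIL" with the
# label and stops at the first failure, returning 1.
# (Helper name and the "label::command" convention are assumptions.)
run_layers() {
    for rl_entry in "$@"; do
        rl_label=${rl_entry%%::*}
        rl_cmd=${rl_entry#*::}
        if sh -c "$rl_cmd" >/dev/null 2>&1; then
            printf 'ok    %s\n' "$rl_label"
        else
            printf 'FAIL  %s\n' "$rl_label"
            return 1
        fi
    done
}

# Example usage, mirroring the steps above (remote-host is a placeholder):
# GATEWAY=$(ip route show default | awk '/default/{print $3; exit}')
# run_layers \
#     "gateway::ping -c 2 $GATEWAY" \
#     "external IP::ping -c 2 8.8.8.8" \
#     "DNS::dig +short google.com | grep -q ." \
#     "TCP 443::nc -z -w 5 remote-host 443"
```

The DNS check pipes through `grep -q .` because `dig` can exit successfully even when it returns no answer.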

Quick-Reference Incident Checklist

# 1. What is the service status?
systemctl status app.service

# 2. What does the log say?
journalctl -u app.service -n 100 --no-pager

# 3. Is the port open?
ss -tlnp | grep '<port>'

# 4. Any OOM kills?
dmesg -T | grep -i "oom\|killed" | tail -10

# 5. Disk full?
df -h

# 6. Out of inodes?
df -i

# 7. Memory pressure?
free -m

# 8. System load?
uptime; vmstat 1 3

# Second checklist: performance degradation
# 1. Load average vs core count
uptime; nproc

# 2. Is it CPU or I/O wait?
vmstat 1 5    # high wa% = I/O bound; high us% = CPU bound

# 3. Identify the culprit process
ps aux --sort=-%cpu | head -10
ps aux --sort=-%mem | head -10

# 4. Check I/O
iostat -xz 1 3
iotop -b -n 3 -o    # top I/O consumers (requires iotop)

# 5. Network latency?
mtr --report --report-cycles 10 8.8.8.8
ss -tnp | grep ESTABLISHED | wc -l    # connection count

# 6. Database connections?
# (PostgreSQL)
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

Linux Essentials

Core commands used throughout these troubleshooting workflows.

Bash Scripting

Automate repetitive diagnostic and remediation steps.

Networking

Deep-dive on network diagnostics, firewall, and DNS.

DevOps Overview

Monitoring with Prometheus/Grafana and the ELK stack.
Last modified on June 9, 2026