Systematic Linux Troubleshooting: Workflows and Tools
Proven Linux troubleshooting methodology: CPU, memory, disk, service failures, log analysis, OOM killer, and network debugging for real incidents.
Effective troubleshooting is not about knowing every command. It is about having a methodology that stops you from thrashing and helps you move from observation to root cause as efficiently as possible. After two-plus decades of on-call work — from bare-metal RHEL servers to containerised microservices — the discipline of observe first, hypothesize second, test third saves more time than any individual tool. This page captures the workflows I actually follow during incidents, not an exhaustive reference.
1
Observe — establish ground truth before touching anything
Resist the urge to restart services immediately. Gather data first.
# What is happening right now?
uptime                                  # load average trend (1/5/15 min)
w                                       # who is logged in + what they're running
dmesg -T | tail -50                     # recent kernel messages
journalctl -xe --no-pager | tail -100   # systemd journal

# System-wide snapshot
vmstat 1 5         # CPU, memory, I/O every second for 5s
iostat -xz 1 5     # per-device I/O stats
free -m            # memory overview
df -hT             # disk usage
ip -br addr show   # interface status
2
Hypothesize — form a specific, testable statement
A hypothesis is not “something is wrong with the database.” It is: “PostgreSQL is not accepting new connections because the connection pool is exhausted.” Write it down. If you cannot state it precisely, you need more observation.
3
Test — one change at a time
Change exactly one variable per test. Document what you changed, when, and what the result was. In a production incident this log is your audit trail.
# Common test tools
strace -p <pid> -e trace=network   # system calls for a process
ltrace -p <pid>                    # library calls
lsof -p <pid>                      # open files + sockets for a process
tcpdump -i eth0 -n port 5432       # capture traffic on a port
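The audit trail described in this step can be kept with a few lines of shell. This is a minimal sketch: the function name `log_change` and the `INCIDENT_LOG` variable are illustrative choices, not part of any standard tool.

```shell
# Append one timestamped entry per change to an incident log.
# INCIDENT_LOG and log_change are hypothetical names used for illustration.
INCIDENT_LOG="${INCIDENT_LOG:-$HOME/incident.log}"

log_change() {
    # usage: log_change "<what changed>" "<observed result>"
    printf '%s | change: %s | result: %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "${2:-pending}" >> "$INCIDENT_LOG"
}
```

Calling `log_change "restarted app.service" "port 8080 listening again"` after each test keeps the what, when, and outcome in one place.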
4
Fix — targeted and reversible when possible
Apply the minimal fix. If you must restart a service, know why you’re doing it and what you expect to change. Document the fix and the outcome.
5
Validate and monitor
# Confirm the fix held
journalctl -u app.service -n 50 --no-pager
tail -f /var/log/app/app.log
watch -n 2 'ss -tlnp | grep 8080'   # watch port status every 2s
# Top CPU processes right now
ps aux --sort=-%cpu | head -15
# Interactive live view (sort by CPU with 'P', memory with 'M')
top
# htop with tree view (shows parent-child relationships)
htop -t -d 3        # -t = tree view; -d is in tenths of a second, so this refreshes every 0.3s
# Per-core CPU breakdown (press '1' in top to toggle)
mpstat -P ALL 2 3   # 3 samples, 2-second interval, all CPUs
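A load average of 1.0 on a single-core machine means the CPU is fully loaded. On a 16-core machine, 1.0 means virtually idle. Always divide load average by the core count (from nproc or lscpu) to get a meaningful ratio. On Linux the load average also counts tasks in uninterruptible sleep, typically waiting on disk I/O, so a high ratio can point to I/O pressure as well as CPU saturation.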
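The division above is easy to script. A minimal sketch, assuming a Linux host with `/proc/loadavg` and `nproc` available; the function name `load_per_core` is illustrative, and both arguments are optional overrides.

```shell
# Print the 1-minute load average divided by the core count.
# Defaults come from /proc/loadavg and nproc; pass values explicitly to override.
load_per_core() {
    load="${1:-$(cut -d' ' -f1 /proc/loadavg)}"
    cores="${2:-$(nproc)}"
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}
```

With the numbers from the text, `load_per_core 1.0 16` prints `0.06`. Values near or above 1.00 are the ones worth investigating.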
free -m             # RAM + swap (avoid relying on "used": it includes cache)
# Key field: "available" = what apps can actually use without swapping
cat /proc/meminfo   # full details
# MemAvailable, Cached, Buffers, SwapCached, AnonPages, Mapped
# Is the system swapping?
vmstat 1 5
# si (swap in) / so (swap out) > 0 = swapping, investigate further
swapon --show       # swap devices and usage
# Top memory consumers
ps aux --sort=-%mem | head -15
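Since "available" is the field that matters, it can help to reduce it to a single percentage. The sketch below reads `MemTotal` and `MemAvailable` from `/proc/meminfo`; the function name and the optional file argument are illustrative additions.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# The optional argument is an alternative meminfo-format file (default /proc/meminfo).
mem_available_pct() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.0f\n", avail * 100 / total }' "$meminfo"
}
```

A low single-digit result combined with non-zero si/so in vmstat is a reasonable signal of real memory pressure.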
# Where is disk being used?
df -hT   # overview of all filesystems
df -i    # inode usage (100% inodes = no new files)
# Drill down by directory size
du -sh /var/* | sort -rh | head -20
du -sh /home/* | sort -rh | head -10
du -sh /opt/* | sort -rh | head -10
# Find the largest directories under /var, up to 3 levels deep
du -h --max-depth=3 /var 2>/dev/null | sort -rh | head -30
# Find individual files larger than 500 MB
find / -xdev -size +500M -type f -exec ls -lh {} \; 2>/dev/null
# Find the 20 largest files on the root filesystem (paths containing spaces are cut at the first space)
find / -xdev -type f -printf '%s %p\n' 2>/dev/null \
  | sort -rn | head -20 | awk '{printf "%.1f MB %s\n", $1/1048576, $2}'
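Before drilling down with du, it can be quicker to let a filter pick out the filesystems that are actually close to full. This is a sketch that assumes the POSIX `df -P` column layout (usage percentage in column 5, mount point in column 6); the name `flag_full` and the default threshold of 90 are arbitrary.

```shell
# Read `df -P` output on stdin and print mount points at or above a usage threshold.
# Mount points containing spaces are not handled in this sketch.
flag_full() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5; sub(/%/, "", use)
        if (use + 0 >= t + 0) print $6, $5
    }'
}
```

Typical use is `df -P | flag_full 85`, which prints one line per filesystem at 85% or more.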
# A process can hold a file descriptor open after deletion.
# The disk space is not freed until the process closes the file or exits.
# Find deleted files still held open (the key indicator)
lsof +L1
# +L1 = show files with link count < 1 (deleted but still open)
# Show only the header and the deleted entries
lsof +L1 | awk 'NR==1 || $NF ~ /deleted/ {print}'
# Identify the process and size (PID, SIZE/OFF, file name)
lsof +L1 | awk '{print $2, $7, $10}' | sort -k1 -n
# Options:
# 1. Restart the process: space is freed immediately
# 2. If you cannot restart, truncate the file through its descriptor:
#    > /proc/<pid>/fd/<fd_num>   # truncates without restarting
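To see which process is holding the most deleted-but-open space, the lsof output can be totalled per PID. A rough sketch, assuming lsof's default column order with `+L1` (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME); the function name is illustrative.

```shell
# Read `lsof +L1` output on stdin and total the SIZE/OFF column per PID.
# Rows whose SIZE/OFF is not a plain number (for example 0t offsets) are skipped.
# A file opened through several descriptors in one process is counted once per descriptor.
deleted_bytes_by_pid() {
    awk 'NR > 1 && $7 ~ /^[0-9]+$/ { bytes[$2] += $7; cmd[$2] = $1 }
         END { for (pid in bytes)
                   printf "%d %s %.1f MB\n", pid, cmd[pid], bytes[pid] / 1048576 }' |
        sort -k3 -rn
}
```

Run it as `lsof +L1 | deleted_bytes_by_pid`; the first line is the process whose restart or truncation would free the most space.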
# Check logrotate status
cat /var/lib/logrotate/status
logrotate -d /etc/logrotate.conf   # dry-run to test config
# Manually rotate logs now
logrotate -f /etc/logrotate.d/nginx
# Emergency: compress large log file in place
gzip /var/log/app/big.log          # creates big.log.gz, removes original
# Truncate an actively-written log (safe for processes that keep the FD open):
> /var/log/app/big.log             # truncate to zero without removing file
# or equivalently:
truncate -s 0 /var/log/app/big.log
# Check for large journal files (systemd)
journalctl --disk-usage
journalctl --vacuum-size=500M      # keep only 500 MB of journal
journalctl --vacuum-time=7d        # keep only last 7 days
systemctl status nginx       # status + recent journal lines
systemctl start nginx        # start
systemctl stop nginx         # stop
systemctl restart nginx      # stop + start
systemctl reload nginx       # send SIGHUP (reload config, no downtime)
systemctl enable nginx       # enable at boot
systemctl disable nginx      # disable at boot
systemctl is-active nginx    # returns 0 if active
systemctl is-enabled nginx   # returns 0 if enabled
# List failed units
systemctl --failed
# Check if a service keeps restarting
systemctl status nginx | grep -E "Active:|restart"
When a service fails to start, always check journalctl -u service-name -n 50 --no-pager before anything else. The error is almost always in the last few lines. systemctl status truncates long messages — journalctl gives the full output.
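When several units have failed at once, the same habit can be applied in a loop. A minimal sketch, assuming a systemd host; the function name `show_failed_units` and the default of 20 lines are illustrative.

```shell
# For every failed unit, print the last N journal lines (default 20).
show_failed_units() {
    lines="${1:-20}"
    # --plain and --no-legend give one unit per line with the name in column 1
    systemctl --failed --no-legend --plain | awk '{ print $1 }' |
    while read -r unit; do
        echo "=== $unit ==="
        journalctl -u "$unit" -n "$lines" --no-pager
    done
}
```

`show_failed_units 50` mirrors the `-n 50` used in the text.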
# Was the OOM killer invoked recently?
dmesg -T | grep -i "oom\|killed process\|out of memory"
# In the journal
journalctl -k --since "24 hours ago" | grep -i "oom\|killed"
# The OOM killer log line contains:
# "Out of memory: Kill process <pid> (<name>) score <N> or sacrifice child"
# "Killed process <pid> (<name>) total-vm:<X>kB, anon-rss:<Y>kB"
oom_score_adj = -1000 on a service means the OOM killer will never touch it. Use this only for truly critical processes (database, monitoring agent). Protecting the wrong process means something else gets killed instead, and if no killable process is left the kernel panics.
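Before changing anything, it is worth checking what a process is currently set to. The sketch below reads the two relevant files under `/proc/<pid>/`; the function name and the optional second argument (an alternative proc root, useful mainly for testing) are illustrative.

```shell
# Print the OOM adjustment and the kernel's current badness score for a PID.
# The second argument is an optional proc root (default /proc).
oom_info() {
    pid="$1"
    proc="${2:-/proc}"
    printf 'pid=%s oom_score_adj=%s oom_score=%s\n' "$pid" \
        "$(cat "$proc/$pid/oom_score_adj")" "$(cat "$proc/$pid/oom_score")"
}
```

`oom_info $$` shows the values for the current shell. For a systemd-managed service, the adjustment is normally set with the `OOMScoreAdjust=` directive in the unit file rather than by writing to /proc by hand.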
ping -c 4 8.8.8.8   # Google DNS by IP: tests raw routing
ping -c 4 1.1.1.1   # Cloudflare DNS
# If gateway pings but 8.8.8.8 doesn't → routing or upstream issue
4
Test DNS resolution
dig +short google.com
# If this fails but 8.8.8.8 pings → DNS issue
# Try alternate resolver:
dig @8.8.8.8 +short google.com
5
Test application port specifically
# Is the remote port open?
nc -zv remote-host 443
telnet remote-host 443
# Test HTTP response (not just TCP)
curl -sv --max-time 10 https://remote-host/health
# Check local firewall isn't blocking outbound
iptables -L OUTPUT -n -v
6
Capture traffic if still unclear
# Capture all traffic on port 5432 (PostgreSQL)
tcpdump -i eth0 -n -w /tmp/capture.pcap port 5432
# Quick human-readable capture (no write to file)
tcpdump -i eth0 -n -A port 80 -c 50
# Filter by host
tcpdump -i any host 10.0.0.5 -n
# Analyse the pcap offline
tcpdump -r /tmp/capture.pcap -n | head -50
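The routing and DNS checks from the earlier steps follow a fixed order, so they can be wrapped in a function that reports the first layer that fails. This is only a sketch: `check_layers`, its default arguments, and the message wording are illustrative, and it assumes `ping` and `dig` are installed.

```shell
# Walk the layers in order and report the first one that fails.
# Return codes: 0 = both layers work, 1 = routing problem, 2 = DNS problem.
check_layers() {
    target_ip="${1:-8.8.8.8}"     # public resolver, reached by IP
    name="${2:-google.com}"       # name to resolve
    if ! ping -c 2 -W 2 "$target_ip" >/dev/null 2>&1; then
        echo "routing: cannot reach $target_ip"
        return 1
    fi
    # dig can exit 0 with an empty answer, so test the output rather than the status
    if [ -z "$(dig +short "$name" 2>/dev/null)" ]; then
        if [ -n "$(dig @"$target_ip" +short "$name" 2>/dev/null)" ]; then
            echo "dns: local resolver fails, $target_ip answers"
        else
            echo "dns: no resolver answers for $name"
        fi
        return 2
    fi
    echo "ok: routing and DNS both work"
}
```

If it reports a DNS problem while the alternate resolver answers, the local resolver configuration is the next thing to inspect; port tests and packet captures only make sense once this returns ok.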
# 1. What is the service status?
systemctl status app.service
# 2. What does the log say?
journalctl -u app.service -n 100 --no-pager
# 3. Is the port open?
ss -tlnp | grep '<port>'
# 4. Any OOM kills?
dmesg -T | grep -i "oom\|killed" | tail -10
# 5. Disk full?
df -h
# 6. Out of inodes?
df -i
# 7. Memory pressure?
free -m
# 8. System load?
uptime; vmstat 1 3
Application is slow — first 5 minutes
# 1. Load average vs core count
uptime; nproc
# 2. Is it CPU or I/O wait?
vmstat 1 5                          # high wa% = I/O bound; high us% = CPU bound
# 3. Identify the culprit process
ps aux --sort=-%cpu | head -10
ps aux --sort=-%mem | head -10
# 4. Check I/O
iostat -xz 1 3
iotop -b -n 3 -o                    # top I/O consumers (requires iotop)
# 5. Network latency?
mtr --report --report-cycles 10 8.8.8.8
ss -Htn state established | wc -l   # established TCP connection count (ss prints the state as ESTAB, so grep ESTABLISHED finds nothing)
# 6. Database connections?
# (PostgreSQL)
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
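The CPU versus I/O question in step 2 can be answered mechanically from the vmstat output. A rough sketch: the function name `classify_vmstat` and the 20% thresholds are arbitrary illustrations, not established cut-offs. Columns are located by header name because newer vmstat versions may add columns.

```shell
# Read `vmstat <interval> <count>` output on stdin, average the us and wa columns,
# and print a rough verdict. The first data row (averages since boot) is skipped.
classify_vmstat() {
    awk '
        $0 ~ / us / && $0 ~ / wa / {            # header row carrying the column names
            for (i = 1; i <= NF; i++) {
                if ($i == "us") u = i
                if ($i == "wa") w = i
            }
            next
        }
        u && $1 ~ /^[0-9]+$/ {                  # data rows
            if (++rows == 1) next               # skip the since-boot averages
            us += $u; wa += $w; n++
        }
        END {
            if (!n) { print "no samples"; exit }
            us /= n; wa /= n
            if (wa >= 20 && wa >= us) verdict = "I/O bound"
            else if (us >= 20)        verdict = "CPU bound"
            else                      verdict = "no clear bottleneck"
            printf "us=%.0f%% wa=%.0f%% -> %s\n", us, wa, verdict
        }'
}
```

Use it as `vmstat 1 5 | classify_vmstat`, then move on to the per-process commands in step 3 or the I/O tools in step 4 depending on the verdict.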