Systematic Linux Troubleshooting: Workflows and Tools

Effective troubleshooting is not about knowing every command. It is about having a methodology that stops you from thrashing and helps you move from observation to root cause as efficiently as possible. After two-plus decades of on-call work — from bare-metal RHEL servers to containerised microservices — the discipline of observe first, hypothesize second, test third saves more time than any individual tool. This page captures the workflows I actually follow during incidents, not an exhaustive reference.

General Troubleshooting Methodology

Observe — establish ground truth before touching anything

Resist the urge to restart services immediately. Gather data first.

# What is happening right now?
uptime                        # load average trend (1/5/15 min)
w                             # who is logged in + what they're running
dmesg -T | tail -50           # recent kernel messages
journalctl -xe --no-pager | tail -100   # systemd journal

# System-wide snapshot
vmstat 1 5                    # CPU, memory, I/O: 5 samples, 1s apart (first line = average since boot)
iostat -xz 1 5                # per-device I/O stats
free -m                       # memory overview
df -hT                        # disk usage
ip -br addr show              # interface status
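
During an incident it helps to keep the raw output of these commands rather than just reading it off the terminal. The following is a minimal sketch of one way to do that, not a prescribed tool: the `collect` function name and the `REPORT` path under /tmp are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append labelled, timestamped command output to a report file.
# REPORT and its default location are assumptions; point it wherever suits you.
REPORT="${REPORT:-/tmp/incident-$(date +%Y%m%dT%H%M%S).txt}"

collect() {
    # Usage: collect <command> [args...]
    {
        printf '===== %s | %s =====\n' "$(date '+%F %T')" "$*"
        # Keep going if a tool is missing or fails; record the exit status instead.
        "$@" 2>&1 || printf '[exit status %d]\n' "$?"
        printf '\n'
    } >> "$REPORT"
}

# Example (uncomment during an incident):
# collect uptime
# collect free -m
# collect df -hT
```

Because every entry carries a timestamp and the exact command line, the report can later be compared against snapshots taken after the fix.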

Hypothesize — form a specific, testable statement

A hypothesis is not “something is wrong with the database.” It is: “PostgreSQL is not accepting new connections because the connection pool is exhausted.” Write it down. If you cannot state it precisely, you need more observation.

Test — one change at a time

Change exactly one variable per test. Document what you changed, when, and what the result was. In a production incident this log is your audit trail.
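
A change log does not need tooling; a tab-separated text file is enough. The sketch below assumes a hypothetical `log_change` helper and a `CHANGELOG` file in the home directory, neither of which comes from any standard package.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one line per action with UTC timestamp, operator,
# what was changed, and the observed result.
CHANGELOG="${CHANGELOG:-${HOME:-/tmp}/incident-changes.log}"

log_change() {
    # Usage: log_change "<what changed>" ["<observed result>"]
    # The result defaults to "pending" so an entry can be written before the outcome is known.
    printf '%s\t%s\t%s\t%s\n' \
        "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${USER:-unknown}" "$1" "${2:-pending}" \
        >> "$CHANGELOG"
}

# Example:
# log_change "restarted app.service" "port 8080 listening again"
```

Writing the entry before making the change, then adding the result afterwards, keeps the timeline honest when several people are working the same incident.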

# Common test tools
strace -p <pid> -e trace=network    # network-related system calls of a running process
ltrace -p <pid>                      # library calls
lsof -p <pid>                        # open files + sockets for a process
tcpdump -i eth0 -n port 5432        # capture traffic on a port

Fix — targeted and reversible when possible

Apply the minimal fix. If you must restart a service, know why you’re doing it and what you expect to change. Document the fix and the outcome.

Validate and monitor

# Confirm the fix held
journalctl -u app.service -n 50 --no-pager
tail -f /var/log/app/app.log
watch -n 2 'ss -tlnp | grep 8080'    # watch port status every 2s
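
Watching by eye works for a few minutes; for a scripted check, a bounded retry loop is often more practical. This is a sketch under the assumption that "fixed" can be expressed as a command that exits 0; `wait_for` is a hypothetical name.

```shell
#!/usr/bin/env bash
wait_for() {
    # Usage: wait_for <timeout_seconds> <interval_seconds> <command> [args...]
    # Returns 0 as soon as the command succeeds, 1 if the timeout elapses first.
    local timeout=$1 interval=$2 elapsed=0
    shift 2
    while ! "$@" >/dev/null 2>&1; do
        elapsed=$((elapsed + interval))
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
    done
    return 0
}

# Example: wait up to 60s for something to listen on TCP 8080
# wait_for 60 2 sh -c "ss -tln | grep -q ':8080 '"
```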

High CPU Investigation

# Top CPU processes right now
ps aux --sort=-%cpu | head -15

# Interactive live view (sort by CPU with 'P', memory with 'M')
top

# htop with tree view (shows parent-child relationships)
htop -t -d 3   # -t starts in tree view; -d is in tenths of a second (3 = 0.3s)

# Per-core CPU breakdown (press '1' in top to toggle)
mpstat -P ALL 2 3    # 3 samples, 2-second interval, all CPUs

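A load average of 1.0 on a single-core machine means the CPU is fully loaded. On a 16-core machine, 1.0 means virtually idle. Divide the load average by the core count (nproc or lscpu) to get a per-core ratio. On Linux the load average also counts tasks in uninterruptible sleep (usually waiting on I/O), so a high ratio does not always mean the CPUs are busy; confirm with vmstat or mpstat.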
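
The division is simple enough to do in your head, but a small function keeps it consistent in scripts. A minimal sketch; `load_per_core` is a hypothetical name.

```shell
#!/usr/bin/env bash
load_per_core() {
    # Usage: load_per_core <load_average> <core_count>
    # Prints the load average divided by the number of cores, two decimals.
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live values on Linux: 1-minute load from /proc/loadavg, cores from nproc
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

For the two cases in the text, `load_per_core 1.0 1` prints 1.00 and `load_per_core 1.0 16` prints 0.06.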

Memory Investigation

free -m             # RAM + swap (avoid relying on "used" — includes cache)
# Key field: "available" = what apps can actually use without swapping

cat /proc/meminfo   # full details
# MemAvailable, Cached, Buffers, SwapCached, AnonPages, Mapped

# Is the system swapping?
vmstat 1 5
# si (swap in) / so (swap out) > 0 = swapping — investigate further
swapon --show       # swap devices and usage

# Top memory consumers
ps aux --sort=-%mem | head -15
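
Since "available" is the field that matters, it can be useful to express it as a share of total RAM. The sketch below reads the MemTotal and MemAvailable lines from /proc/meminfo; the function name `mem_available_pct` and the optional file argument (there only so the parsing can be tried on a saved copy) are assumptions.

```shell
#!/usr/bin/env bash
mem_available_pct() {
    # Usage: mem_available_pct [meminfo_file]   (defaults to /proc/meminfo)
    # Prints MemAvailable as a percentage of MemTotal, one decimal.
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total; else exit 1 }' \
        "${1:-/proc/meminfo}"
}

# Example:
# mem_available_pct
```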

Disk Space Issues

Find space consumers

# Where is disk being used?
df -hT                                  # overview of all filesystems
df -i                                   # inode usage (100% inodes = no new files)
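
To pick out only the filesystems that are close to full, the POSIX output format of df (`df -P`) is easy to filter. A sketch, assuming a hypothetical `full_filesystems` function and an arbitrary default threshold of 90%; mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
full_filesystems() {
    # Usage: df -P | full_filesystems [threshold_percent]
    # Reads POSIX-format df output on stdin and prints "<use%> <mount point>"
    # for every filesystem at or above the threshold (default 90).
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5; sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print pct "% " $6
    }'
}

# Examples:
# df -P  | full_filesystems 85     # block usage
# df -Pi | full_filesystems 85     # inode usage (same column layout)
```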

# Drill down by directory size
du -sh /var/*          | sort -rh | head -20
du -sh /home/*         | sort -rh | head -10
du -sh /opt/*          | sort -rh | head -10

# Find the largest directories under /var, up to 3 levels deep
du -h --max-depth=3 /var 2>/dev/null | sort -rh | head -30

# Find individual files larger than 500 MB
find / -xdev -size +500M -type f -exec ls -lh {} \; 2>/dev/null

# Find the 20 largest files anywhere
find / -xdev -type f -printf '%s %p\n' 2>/dev/null \
    | sort -rn | head -20 | awk '{s = $1; sub(/^[0-9]+ /, ""); printf "%.1f MB  %s\n", s / 1048576, $0}'   # keeps paths with spaces intact

Deleted files still open

# A process can hold a file descriptor open after deletion.
# The disk space is not freed until the process closes/restarts.

# Find deleted files still held open (the key indicator)
lsof +L1
# +L1 = show files with link count < 1 (deleted but still open)

# Formatted output showing size
lsof +L1 | awk 'NR==1 || $NF ~ /deleted/ {print}'

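# PID, command, size in bytes and path, largest first
# (with +L1 the columns are COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME)
lsof +L1 | awk 'NR>1 {print $2, $1, $7, $10}' | sort -k3 -rn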

# Options:
# 1. Restart the process — space freed immediately
# 2. If you cannot restart, truncate the file descriptor:
#    > /proc/<pid>/fd/<fd_num>    # truncates without restarting
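
To find the descriptor number needed for that truncation without lsof, the links under /proc/<pid>/fd can be read directly: the kernel appends " (deleted)" to the target of a descriptor whose file has been removed. A sketch, assuming a hypothetical `deleted_fds` function; the optional second argument exists only so the function can be tried against a fake directory tree, and reading another user's process needs root.

```shell
#!/usr/bin/env bash
deleted_fds() {
    # Usage: deleted_fds <pid> [proc_root]
    # Prints "<fd> <original path>" for each descriptor whose target was deleted.
    local pid=$1 root=${2:-/proc} link target
    for link in "$root/$pid"/fd/*; do
        [ -L "$link" ] || continue
        target=$(readlink "$link") || continue
        case $target in
            *' (deleted)') printf '%s %s\n' "${link##*/}" "${target% (deleted)}" ;;
        esac
    done
}

# Example:
# deleted_fds 1234
```

The first column of the output is the value to substitute for <fd_num> above.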

Log rotation

# Check logrotate status
cat /var/lib/logrotate/status
logrotate -d /etc/logrotate.conf    # dry-run to test config

# Manually rotate logs now
logrotate -f /etc/logrotate.d/nginx

# Emergency: compress a large log that nothing is writing to any more
# (gzip needs free space for the .gz before it removes the original)
gzip /var/log/app/big.log           # creates big.log.gz, removes original

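# Truncate an actively written log; the writer keeps its FD, so no restart is needed.
# If the writer did not open the file in append mode it continues at its old offset
# and the file turns sparse (large apparent size, little real usage).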
: > /var/log/app/big.log            # truncate to zero without removing file
# or equivalently:
truncate -s 0 /var/log/app/big.log

# Check for large journal files (systemd)
journalctl --disk-usage
journalctl --vacuum-size=500M       # keep only 500 MB of journal
journalctl --vacuum-time=7d         # keep only last 7 days

Service Debugging with systemd

systemctl status nginx               # status + recent journal lines
systemctl start  nginx               # start
systemctl stop   nginx               # stop
systemctl restart nginx              # stop + start
systemctl reload nginx               # run the unit's reload action (nginx re-reads its config, no downtime)
systemctl enable  nginx              # enable at boot
systemctl disable nginx              # disable at boot
systemctl is-active  nginx           # returns 0 if active
systemctl is-enabled nginx           # returns 0 if enabled

# List failed units
systemctl --failed

# Check if a service keeps restarting
systemctl status nginx | grep -E "Active:|restart"

When a service fails to start, always check journalctl -u service-name -n 50 --no-pager before anything else. The error is almost always in the last few lines. systemctl status truncates long messages — journalctl gives the full output.

Log Analysis Patterns

# Count errors per minute (assumes lines start with "YYYY-MM-DD HH:MM:SS")
grep "ERROR" /var/log/app/app.log \
    | awk '{print $1, $2}' \
    | cut -d: -f1-2 \
    | sort | uniq -c | sort -rn

# Top 20 client IPs by request count in the nginx access log
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

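# Count slow requests (> 1 s) in nginx; assumes $request_time is the last field
# of your log_format (it is not part of the default "combined" format)
awk '$NF > 1.0' /var/log/nginx/access.log | wc -l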
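
A count of slow requests says little without a sense of the distribution. Under the same assumption (request time in seconds as the last field), a percentile can be computed with sort and awk. This is a sketch using the nearest-rank method; `request_time_percentile` is a hypothetical name and the percentile must be a whole number.

```shell
#!/usr/bin/env bash
request_time_percentile() {
    # Usage: request_time_percentile <percentile> < access.log
    # Prints the nearest-rank percentile of the last field on each line.
    awk '{ print $NF }' | LC_ALL=C sort -n | awk -v p="${1:-95}" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p/100 * NR), nearest-rank method
            if (rank < 1) rank = 1
            print v[rank]
        }'
}

# Example:
# request_time_percentile 95 < /var/log/nginx/access.log
```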

# Watch log file and alert on pattern
tail -F /var/log/app/app.log | grep --line-buffered "CRITICAL" | while read -r line; do
    echo "ALERT: ${line}" | mail -s "Critical error" ops@example.com
done

OOM Killer Investigation

# Was the OOM killer invoked recently?
dmesg -T | grep -i "oom\|killed process\|out of memory"

# In the journal
journalctl -k --since "24 hours ago" | grep -i "oom\|killed"

# The OOM killer log lines look like this (exact wording varies by kernel version):
# "Out of memory: Kill process <pid> (<name>) score <N> or sacrifice child"
# "Killed process <pid> (<name>) total-vm:<X>kB, anon-rss:<Y>kB"
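
When the OOM killer has fired several times, a per-process summary shows whether one service is the repeat victim. The sketch below parses only the "Killed process" line format shown above; `oom_victims` is a hypothetical name, and process names containing spaces are not handled.

```shell
#!/usr/bin/env bash
oom_victims() {
    # Usage: dmesg -T | oom_victims
    # Prints "<kill count> <process name> <largest anon-rss in MB>", most-killed first.
    sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2/p' \
        | awk '{ kills[$1]++; if ($2 > peak[$1]) peak[$1] = $2 }
               END { for (name in kills) printf "%d %s %d\n", kills[name], name, peak[name] / 1024 }' \
        | sort -k1,1nr -k2
}

# Example:
# journalctl -k --since "24 hours ago" | oom_victims
```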

Setting oom_score_adj to -1000 for a process (OOMScoreAdjust=-1000 in a systemd unit) exempts it from the OOM killer. Use this only for truly critical processes (database, monitoring agent). Protecting the wrong process means something else gets killed instead, and if nothing killable is left the kernel panics.

Network Connectivity Debugging

Verify local interface and address

ip -br addr show      # quick overview of all interfaces and their IPs
ip link show          # are all expected interfaces UP?

Test default gateway reachability

GATEWAY=$(ip route show default | awk '/default/{print $3; exit}')
echo "Default gateway: ${GATEWAY}"
ping -c 4 "${GATEWAY}"

Test external IP connectivity (bypassing DNS)

ping -c 4 8.8.8.8       # Google DNS by IP — tests raw routing
ping -c 4 1.1.1.1        # Cloudflare DNS

# If gateway pings but 8.8.8.8 doesn't → routing or upstream issue

Test DNS resolution

dig +short google.com
# If this fails but 8.8.8.8 pings → DNS issue
# Try alternate resolver:
dig @8.8.8.8 +short google.com
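
The first three checks form a simple decision chain, which can be written down as a function. This is a sketch only: `diagnose_network` and `probe` are hypothetical names, and the message for an unreachable gateway is my own reading of the layering (note that some gateways do not answer ICMP at all).

```shell
#!/usr/bin/env bash
diagnose_network() {
    # Usage: diagnose_network <gateway_ok> <external_ip_ok> <dns_ok>   (each "yes" or "no")
    if [ "$1" != yes ]; then
        echo "local network: check interface, link and addressing"
    elif [ "$2" != yes ]; then
        echo "routing or upstream issue"
    elif [ "$3" != yes ]; then
        echo "DNS issue"
    else
        echo "network path and DNS look healthy: test the application port next"
    fi
}

# Turn any command into a yes/no answer based on its exit status.
probe() { "$@" >/dev/null 2>&1 && echo yes || echo no; }

# Example wiring (commented out so that sourcing this file sends no traffic):
# gw=$(ip route show default | awk '/default/ {print $3; exit}')
# diagnose_network "$(probe ping -c 2 -W 2 "$gw")" \
#                  "$(probe ping -c 2 -W 2 8.8.8.8)" \
#                  "$(probe getent hosts google.com)"
```

The example uses `getent hosts` rather than dig for the DNS probe because its exit status reflects whether the name resolved through the system resolver.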

Test application port specifically

# Is the remote port open?
nc -zv remote-host 443
telnet remote-host 443

# Test HTTP response (not just TCP)
curl -sv --max-time 10 https://remote-host/health

# Check local firewall isn't blocking outbound
iptables -L OUTPUT -n -v
nft list ruleset                     # on hosts that use nftables directly

Capture traffic if still unclear

# Capture all traffic on port 5432 (PostgreSQL)
tcpdump -i eth0 -n -w /tmp/capture.pcap port 5432

# Quick human-readable capture (no write to file)
tcpdump -i eth0 -n -A port 80 -c 50

# Filter by host
tcpdump -i any host 10.0.0.5 -n

# Analyse the pcap offline
tcpdump -r /tmp/capture.pcap -n | head -50

Quick-Reference Incident Checklist

Service is down — first 5 minutes

# 1. What is the service status?
systemctl status app.service

# 2. What does the log say?
journalctl -u app.service -n 100 --no-pager

# 3. Is the port open?
ss -tlnp | grep '<port>'

# 4. Any OOM kills?
dmesg -T | grep -i "oom\|killed" | tail -10

# 5. Disk full?
df -h

# 6. Out of inodes?
df -i

# 7. Memory pressure?
free -m

# 8. System load?
uptime; vmstat 1 3

Application is slow — first 5 minutes

# 1. Load average vs core count
uptime; nproc

# 2. Is it CPU or I/O wait?
vmstat 1 5    # high wa% = I/O bound; high us% = CPU bound

# 3. Identify the culprit process
ps aux --sort=-%cpu | head -10
ps aux --sort=-%mem | head -10

# 4. Check I/O
iostat -xz 1 3
iotop -b -n 3 -o    # top I/O consumers (requires iotop)

# 5. Network latency?
mtr --report --report-cycles 10 8.8.8.8
ss -Htn state established | wc -l     # connection count (ss prints the state as ESTAB, so grepping for ESTABLISHED finds nothing)

# 6. Database connections?
# (PostgreSQL)
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
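
Step 2 of this checklist can be automated by averaging the CPU columns of vmstat. The sketch below looks the columns up by header name, skips the first sample (which vmstat reports as an average since boot), and applies cut-offs of 20% I/O wait and 70% CPU; the function name `classify_vmstat` and both thresholds are assumptions chosen for illustration, not recommended values.

```shell
#!/usr/bin/env bash
classify_vmstat() {
    # Usage: vmstat 1 5 | classify_vmstat
    # Averages us+sy and wa over all samples except the first and prints a verdict.
    awk -v io_limit=20 -v cpu_limit=70 '
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header row
        !("us" in col) || $1 !~ /^[0-9]+$/ { next }                  # banner / noise
        ++seen == 1 { next }                                          # since-boot sample
        { cpu += $col["us"] + $col["sy"]; iowait += $col["wa"]; n++ }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            cpu /= n; iowait /= n
            if (iowait >= io_limit)   verdict = "I/O bound"
            else if (cpu >= cpu_limit) verdict = "CPU bound"
            else                       verdict = "neither CPU nor I/O wait dominates"
            printf "cpu=%.0f%% iowait=%.0f%% -> %s\n", cpu, iowait, verdict
        }'
}

# Example:
# vmstat 1 5 | classify_vmstat
```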

Related Pages

Linux Essentials

Core commands used throughout these troubleshooting workflows.

Bash Scripting

Automate repetitive diagnostic and remediation steps.

Networking

Deep-dive on network diagnostics, firewall, and DNS.

DevOps Overview

Monitoring with Prometheus/Grafana and the ELK stack.
