Linux & Networking Troubleshooting Scenarios

Real interview troubleshooting questions test your systematic approach. Let's practice with scenarios you'll actually face.

Scenario 1: High Load, Low CPU

Interviewer: "Load average is 100 but CPU utilization is only 10%. What's going on?"

Your approach:

# Step 1: Confirm the symptoms
uptime                    # Check load average
top -b -n1 | head -20     # CPU utilization

# Step 2: Check for I/O wait
vmstat 1 5                # 'wa' = I/O wait, 'b' = blocked tasks, 'si'/'so' = swap in/out
iostat -x 1 5             # Per-device %util and await (sysstat package)

# Step 3: Identify the culprit
iotop                     # Which processes are doing I/O
lsof +D /mount/point      # Files being accessed on slow storage

# Step 4: Check disk health
dmesg | grep -i error     # Disk errors in kernel log
smartctl -a /dev/sda      # SMART disk health

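Root causes: failing disk, hung or slow NFS mount, database I/O contention, swap thrashing. Each of these leaves processes blocked on I/O, which raises load average without using CPU.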
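Linux counts tasks in uninterruptible sleep (state D) toward load average, so listing them is a quick way to see what is actually blocked. The following is a minimal sketch, assuming the procps version of `ps`; the function names `dstate_only` and `show_blocked` are illustrative and not part of any standard tool.

```shell
# dstate_only: keep only rows whose first column (process state) starts with D.
# Reads ps-style rows on stdin, so it works on live or saved output.
# The ps header row ("STAT ...") does not start with D and is dropped too.
dstate_only() {
  awk '$1 ~ /^D/'
}

# show_blocked: list blocked tasks with the kernel function they are waiting in.
# wchan may print '-' on kernels that hide it from unprivileged users.
show_blocked() {
  ps -eo stat,pid,wchan:32,comm | dstate_only
}
```

Running `show_blocked` a few times, a second or so apart, helps separate tasks that pass through D briefly from ones that stay stuck there; the stuck ones are worth tracing back to a disk or mount.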

Scenario 2: SSH Connection Timeout

Interviewer: "You can't SSH to a server. Walk me through debugging."

Your approach:

# Step 1: Basic connectivity
ping -c 4 server.example.com     # ICMP reachable? (ICMP is often filtered, so a failure is not conclusive)
traceroute server.example.com    # Where does it fail?

# Step 2: DNS resolution
dig server.example.com           # Does DNS resolve?
dig @8.8.8.8 server.example.com  # Try alternate DNS

# Step 3: Port connectivity
nc -zv server.example.com 22     # Is port 22 open?
telnet server.example.com 22     # Can we connect?

# Step 4: Check from another host
# If above fails, try from different network
# Rules out client-side issues

# Step 5: If you have console access
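systemctl status sshd            # Is SSH running? (the unit is named 'ssh' on Debian/Ubuntu)
ss -tlnp | grep ':22 '           # Listening on port 22? (':22 ' avoids matching 2222, 8022, PIDs)
journalctl -u sshd               # SSH logs
iptables -L -n                   # Firewall blocking? (on nftables hosts: nft list ruleset)
cat /etc/hosts.deny              # TCP wrappers (only applies if sshd was built with libwrap)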

Common causes: Firewall rules, SSH service down, network ACLs, security groups (cloud)
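How the connection fails narrows the list of causes: a timeout usually means packets are being dropped (firewall, security group, routing), while an immediate refusal means the host answered but nothing is listening. Here is a rough sketch of that distinction, assuming bash (for its `/dev/tcp` pseudo-path) and coreutils `timeout`; `probe_port` and `explain_rc` are hypothetical helper names.

```shell
# explain_rc: map the probe's exit status to a likely meaning.
explain_rc() {
  case "$1" in
    0)   echo "open: port accepts connections, look at sshd auth and config" ;;
    124) echo "timeout: packets dropped, suspect firewall, security group, or routing" ;;
    *)   echo "refused: connection rejected, host unreachable, or name did not resolve" ;;
  esac
}

# probe_port HOST [PORT] [SECONDS]: attempt a TCP connect with a deadline.
# coreutils timeout exits with 124 when the deadline expires.
probe_port() {
  local host=$1 port=${2:-22} secs=${3:-5}
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
  explain_rc "$?"
}
```

For example, `probe_port server.example.com 22 5` prints one of the three lines. The catch-all branch lumps several failures together on purpose; the `dig` and `ping` steps above help tell them apart.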

Scenario 3: Application Running Slow

Interviewer: "Users report the web app is slow. Where do you start?"

Your structured approach (USE Method):

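Resource	Utilization	Saturation	Errors
CPU	`top`, `mpstat`	Load average, run queue (`vmstat` `r`)	`dmesg`
Memory	`free -h`	Swap activity (`vmstat` `si`/`so`)	OOM killer messages in logs
Disk	`iostat -x` (`%util`)	I/O wait, queue length	`smartctl`, `dmesg`
Network	`sar -n DEV`	Drops (`ip -s link`)	Interface errors (`ip -s link`)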

# Quick health check
top -b -n1 | head -20
free -h
iostat -x 1 3
ss -s

# Application-specific
curl -w "@curl-format.txt" -o /dev/null -s http://localhost/
# Measures DNS, connect, TTFB, total time

# Check application logs
tail -f /var/log/app/error.log
journalctl -u myapp -f

# Database connection issues?
mysql -e "SHOW PROCESSLIST"
# or
psql -c "SELECT * FROM pg_stat_activity"
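The `curl -w "@curl-format.txt"` command above reads its output template from a file the lesson does not show. One possible version of that file is sketched below; the labels on the left are arbitrary, while the `%{...}` names are standard curl write-out variables.

```shell
# Create the write-out template used by: curl -w "@curl-format.txt" ...
# The quoted 'EOF' keeps %{...} and \n literal so that curl, not the shell,
# interprets them.
cat > curl-format.txt <<'EOF'
   dns_lookup: %{time_namelookup}s\n
  tcp_connect: %{time_connect}s\n
tls_handshake: %{time_appconnect}s\n
         ttfb: %{time_starttransfer}s\n
        total: %{time_total}s\n
EOF
```

Each value is measured from the start of the request, so the gap between `tcp_connect` (or `tls_handshake` for HTTPS) and `ttfb` approximates the time the server spent producing the response.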

Scenario 4: Disk Space Emergency

Interviewer: "Disk is at 99%. Production is down. What do you do?"

Your approach (fast!):

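# Step 1: Confirm which filesystem is full, then find its largest directories
df -h                     # Space per filesystem (df -i if inodes are exhausted)
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -10

# Step 2: Find large files
find /var -xdev -type f -size +100M -exec ls -lh {} + 2>/dev/null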

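# Step 3: Check for deleted files still open
# (their space is not released until the owning process closes them or restarts)
lsof +L1 | head -20

# Step 4: Safe quick wins
# Truncate (not delete) large logs so the writing process keeps a valid file handle
truncate -s 0 /var/log/large-log-file.log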

# Clear package cache
# Debian/Ubuntu
apt-get clean
# RHEL/CentOS (dnf clean all on newer releases)
yum clean all

# Remove old kernels (careful!)
# Check the running kernel first and keep it installed
uname -r

# Step 5: Long-term
# Set up log rotation
# Add monitoring alerts at 80%
# Consider LVM for flexibility
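The 80% alert mentioned under Step 5 can start as a small check run from cron. This is a minimal sketch, assuming POSIX `df -P` output; `over_threshold` is a hypothetical helper name, and mount points containing spaces are not handled.

```shell
# over_threshold LIMIT: read `df -P` output on stdin and print the mount point
# and use% of every filesystem at or above LIMIT percent.
over_threshold() {
  awk -v limit="$1" '
    NR > 1 {                      # skip the header row
      use = $5
      sub(/%/, "", use)           # "99%" -> "99"
      if (use + 0 >= limit) print $6, $5
    }'
}
```

Typical use is `df -P | over_threshold 80`, with any output piped to mail or a chat webhook. Reading from stdin keeps the filter easy to test against saved `df` output.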

Scenario 5: Network Packet Loss

Interviewer: "Users complain of intermittent connectivity. How do you investigate?"

# Step 1: Measure packet loss
ping -c 100 target.com
# Look at packet loss percentage

# Step 2: Continuous monitoring
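mtr target.com
# Shows loss at each hop. Loss at a middle hop that does not carry through
# to the final hop is usually ICMP rate limiting on that router, not real loss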

# Step 3: Check interface errors
ip -s link show eth0
# Look at RX/TX errors, drops

# Step 4: Check for duplex mismatch
ethtool eth0
# Speed and duplex must match the switch port; a half/full mismatch causes errors and loss under load

# Step 5: Capture packets for analysis
tcpdump -i eth0 -w capture.pcap host target.com
# Analyze with Wireshark

# Step 6: Check for network saturation
sar -n DEV 1 10
# Look at rxkB/s, txkB/s vs interface capacity
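Intermittent loss is easier to correlate with other events when it is logged over time instead of read off a single ping run. The sketch below assumes the iputils or BSD `ping` summary line format ("N% packet loss"); `loss_pct` and `log_loss` are illustrative names.

```shell
# loss_pct: read ping output on stdin and print the number in front of
# "% packet loss" from the summary line.
loss_pct() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# log_loss TARGET: print one timestamped loss sample (20 probes, quiet mode).
# Run it from cron or a loop and append the output to a file.
log_loss() {
  printf '%s %s%%\n' "$(date '+%F %T')" "$(ping -c 20 -q "$1" | loss_pct)"
}
```

A loop such as `while true; do log_loss target.com >> loss.log; sleep 40; done` gives roughly one sample per minute, which can then be lined up against deploys, backups, or traffic peaks.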

The Troubleshooting Framework

Always use a systematic approach:

1. GATHER INFORMATION
   - What changed recently?
   - When did it start?
   - Who is affected?

2. FORM HYPOTHESIS
   - Based on symptoms, what's most likely?
   - Prioritize by probability

3. TEST HYPOTHESIS
   - Run diagnostic commands
   - Check logs
   - Verify assumptions

4. IMPLEMENT FIX
   - Start with reversible changes
   - Document what you changed

5. VERIFY AND MONITOR
   - Confirm issue resolved
   - Set up monitoring to catch recurrence

Pro tip: Always verbalize your thought process in interviews. Interviewers want to see HOW you think, not just the final answer.

You've mastered Linux and networking fundamentals. Next module: CI/CD and Infrastructure as Code, the tools that define modern DevOps.