Advanced Linux Troubleshooting in DevOps and Cloud: Practical Use Cases and Commands
akhil mittal
Posted on October 14, 2024
In a DevOps or Cloud role, Linux administrators are often responsible for ensuring that applications and infrastructure run smoothly, and debugging and troubleshooting are essential skills. Below, I’ll provide more detailed examples of Linux system debugging and troubleshooting from a DevOps and cloud operations perspective, focusing on real-world issues you might encounter.
Scenario 1: Application Running Slowly on an EC2 Instance
You’re responsible for maintaining a web application running on an EC2 instance in AWS, and users report that the application has become very slow.
Steps for Debugging and Troubleshooting:
1. Check System Resource Usage (CPU, Memory, Disk)
The first step is to check if the system is running out of resources such as CPU, memory, or disk I/O, which could cause slowdowns.
- Monitor CPU and Memory Usage: Use the top or htop command to see the most resource-consuming processes:
top
Example output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1234 www-data 20 0 149880 5140 2956 S 25.0 2.0 0:15.00 apache2
In this case, Apache (apache2) is using 25% of CPU, which could indicate heavy web traffic or inefficient code execution.
- Check Disk I/O: High disk I/O can be a bottleneck in web applications, especially if they read/write data frequently (such as with database operations).
Use iostat to check disk I/O:
sudo apt-get install sysstat
iostat -x 1 5
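Example output (abridged to the basic per-device columns; the -x flag also reports extended columns such as %util):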
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 9.10 204.00 819.20 1224 49152
High disk I/O may indicate heavy database or log file operations.
- Check Available Disk Space: If the disk is nearly full, it can cause the system to slow down.
df -h
Example output:
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 30G 29G 1G 97% /
In this example, the root partition is 97% full, which could cause performance issues. You may need to clean up logs or expand the disk.
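The resource checks in this step can also be scripted so they run unattended, for example from cron. The following is a minimal sketch rather than a definitive implementation: the function names busy_processes and full_filesystems and the thresholds in the usage comments are hypothetical, and the functions only filter text in the shape printed by ps -eo pid=,pcpu=,comm= and df -P.

```shell
#!/bin/sh
# Sketch only: function names and thresholds are illustrative assumptions.

# Print "pid cpu% command" for each process at or above a CPU threshold.
# Expects input in the shape produced by: ps -eo pid=,pcpu=,comm=
busy_processes() {
    awk -v t="$1" '$2 + 0 >= t + 0 {
        pid = $1; cpu = $2
        $1 = ""; $2 = ""          # drop pid and cpu, keep the command name
        sub(/^ +/, "")
        print pid, cpu, $0
    }'
}

# Print "mountpoint use%" for each filesystem at or above a usage threshold.
# Expects input in the shape produced by: df -P (header on the first line).
full_filesystems() {
    awk -v t="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= t + 0) print $6, use "%"
    }'
}

# Example usage on a live system:
#   ps -eo pid=,pcpu=,comm= | busy_processes 20
#   df -P | full_filesystems 90
```

Reading from standard input keeps both functions easy to try against saved output from an affected instance before pointing them at a live system.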
2. Check Application Logs for Errors
After checking system resources, the next step is to examine application logs to identify any errors or warnings.
- Check Web Server Logs (e.g., Apache or NGINX): If your web application is slow, logs may reveal issues like timeouts, connection errors, or misconfigurations.
For NGINX:
sudo tail -f /var/log/nginx/access.log /var/log/nginx/error.log
For Apache:
sudo tail -f /var/log/apache2/access.log /var/log/apache2/error.log
Example error in an Apache log:
[error] [client 192.168.1.100] script timed out before returning headers: index.php
This could indicate a problem with a slow PHP script or database queries.
- Check Application-Specific Logs: For web applications running on Node.js, Python, or Java, check the application-specific logs (e.g., /var/log/myapp/app.log).
Example:
sudo tail -f /var/log/myapp/app.log
Look for errors such as:
Error: Database connection timed out
This might indicate a problem with the database server (e.g., high load, unresponsive database).
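When a log contains many such errors, counting them per script shows where to look first. The sketch below assumes Apache error-log lines in the format shown earlier in this step; the function name timeouts_by_script is hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name is an illustrative assumption.

# Count "script timed out" errors per script, most frequent first.
# Expects Apache error-log lines on stdin.
timeouts_by_script() {
    awk -F'script timed out before returning headers: ' \
        'NF > 1 { count[$2]++ } END { for (s in count) print count[s], s }' |
        sort -rn
}

# Example usage (reading the log may require root):
#   timeouts_by_script < /var/log/apache2/error.log
```

A script that dominates this list is a good candidate for profiling or for a closer look at the database queries it runs.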
3. Troubleshoot Network and Load Balancer Issues
In cloud environments, the network plays a critical role in system performance. If the system is responding slowly, you may have a network bottleneck or load balancer misconfiguration.
- Ping and Network Latency: Check if there is network latency or packet loss between the application server and its dependencies (e.g., database, external services):
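ping -c 5 8.8.8.8
High latency or packet loss here points to a networking issue between the EC2 instance and the public internet. To test a specific dependency such as the database, ping its hostname or IP address instead, and keep in mind that security groups often block ICMP, so a failed ping alone does not prove the host is down.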
- Check Load Balancer Health: If your application is behind an Application Load Balancer (ALB), ensure that the health checks are passing, and traffic is being routed properly.
Check the ALB target group health:
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
Example output (shown in YAML form, as printed with --output yaml):
TargetHealthDescriptions:
- Target:
    Id: i-0123456789abcdef0
  TargetHealth:
    State: healthy
If an instance is marked as unhealthy, there could be issues with the application itself, causing failed health checks.
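Both network checks in this step produce output that is easy to reduce to a single value for alerting. The following is a rough sketch under stated assumptions: the function names packet_loss_pct and unhealthy_targets are hypothetical, the first expects the summary printed by ping -c, and the second expects tab-separated "target-id state" lines such as those the AWS CLI prints with --output text and the --query expression shown in the comment.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Extract the packet-loss percentage from the summary of `ping -c N host`.
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Print the IDs of targets whose state is anything other than "healthy".
# Expects "target-id<TAB>state" lines on stdin, for example from:
#   aws elbv2 describe-target-health --target-group-arn <target-group-arn> \
#     --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
#     --output text
unhealthy_targets() {
    awk 'NF >= 2 && $2 != "healthy" { print $1 }'
}

# Example usage:
#   ping -c 5 8.8.8.8 | packet_loss_pct
```

Targets that are draining or still in their initial state are reported too, which is usually what you want during an incident, since they are not serving traffic either.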
Scenario 2: High Memory Usage on an EC2 Instance
You notice that the memory usage on one of your EC2 instances is constantly high, leading to performance issues and Out of Memory (OOM) kills.
Steps for Debugging and Troubleshooting:
1. Check Memory Usage
- Check Overall Memory Usage: Use the free -h command to check total, used, and free memory:
free -h
Example output:
total used free shared buff/cache available
Mem: 7.8G 6.0G 500M 1.2G 1.3G 2.0G
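This shows that 6 GB of memory is used, with only 500 MB free. The available column (2.0 GB) is the better indicator of remaining headroom, because it also counts cache that the kernel can reclaim when applications need memory.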
- Check Which Processes Are Consuming Memory: Use top or htop to see which processes are consuming the most memory:
top -o %MEM
Look for processes consuming excessive memory. For example:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5678 mysql 20 0 205m 1.2g 500m S 12.5 60.0 2:12.35 mysqld
In this case, mysqld (MySQL) is using a large amount of memory (1.2 GB), which could cause memory pressure.
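The same two questions (how much memory is really available, and who is using it) can be answered non-interactively. This is a minimal sketch, assuming input in the format of /proc/meminfo and of ps -eo rss=,comm= (resident size in KiB); the function names mem_available_pct and rss_by_command are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Percentage of RAM still available, from /proc/meminfo-style input.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Total resident memory per command name in MiB, largest first.
# Expects input in the shape produced by: ps -eo rss=,comm=
rss_by_command() {
    awk '{ rss = $1; $1 = ""; sub(/^ +/, ""); sum[$0] += rss }
         END { for (c in sum) printf "%d %s\n", sum[c] / 1024, c }' |
        sort -rn
}

# Example usage on a live system:
#   mem_available_pct < /proc/meminfo
#   ps -eo rss=,comm= | rss_by_command | head -n 5
```

Summing by command name matters for servers such as Apache that run many worker processes: each worker may look small in top while the group as a whole dominates memory.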
2. Investigate Memory Leaks or Inefficient Resource Usage
- Check Logs for Memory-Related Errors: Review application logs for memory allocation errors, and check the kernel log (dmesg or journalctl -k) for Out of Memory (OOM) kills, which the kernel records there rather than in the application's own log.
Example OOM error:
Out of memory: Kill process 12345 (node) score 100 or sacrifice child
- Check for Zombie or Hanging Processes: Use the ps aux | grep defunct command to check for zombie processes:
ps aux | grep defunct
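Zombie processes might indicate that the parent process is not handling child processes correctly. Note that the grep process itself also shows up in this output, so a lone line containing "grep defunct" is not a zombie.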
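Both checks in this step can be made more precise by extracting only the fields that matter. The sketch below makes some assumptions: the function names oom_victims and zombies_with_parents are hypothetical, the OOM pattern covers the "Kill process" wording shown above as well as the "Killed process" wording used by newer kernels, and the zombie filter expects input in the shape of ps -eo pid=,ppid=,stat=,comm=.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Print "pid name" for each process terminated by the kernel OOM killer.
# Expects kernel log lines on stdin, e.g. from: dmesg  or  journalctl -k
oom_victims() {
    sed -nE 's/.*Out of memory: Kill(ed)? process ([0-9]+) \(([^)]*)\).*/\2 \3/p'
}

# Print "zombie-pid parent-pid command" for each zombie process.
# Expects input in the shape produced by: ps -eo pid=,ppid=,stat=,comm=
zombies_with_parents() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# Example usage on a live system (dmesg may require root):
#   dmesg | oom_victims
#   ps -eo pid=,ppid=,stat=,comm= | zombies_with_parents
```

Filtering on the process state column avoids matching the search command itself, and the parent PID identifies the process that is failing to reap its children.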
3. Add Swap Space (Temporary Fix)
If memory usage is consistently high, adding swap space can provide temporary relief by extending memory onto disk.
- Create a Swap File:
sudo fallocate -l 2G /swapfile
- Set Up Swap Space:
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
- Verify Swap Space:
free -h
The Swap row should now show the additional 2 GB, allowing the system to handle higher memory usage without running out of RAM. A swap file enabled this way does not survive a reboot unless it is also listed in /etc/fstab.
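To confirm from a script that the swap file is active, the kernel's own list in /proc/swaps can be totalled. This is a small sketch, assuming the standard /proc/swaps layout (a header line, then one line per swap area with the size in KiB in the third column); the function name swap_total_mib is hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name is an illustrative assumption.

# Total configured swap in MiB, from /proc/swaps-style input.
swap_total_mib() {
    awk 'NR > 1 { sum += $3 } END { printf "%d\n", sum / 1024 }'
}

# Example usage on a live system:
#   swap_total_mib < /proc/swaps
#
# To keep the swap file across reboots, an /etc/fstab entry of this
# form is commonly used:
#   /swapfile none swap sw 0 0
```

A result of 0 means no swap area is active, for example because swapon was never run after a reboot.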
Scenario 3: Database Connection Timeout
Your application frequently encounters database connection timeouts, especially during periods of high traffic.
Steps for Debugging and Troubleshooting:
1. Check Database Logs
- Check Database Server Logs: If using MySQL, for example, check the logs at /var/log/mysql/error.log:
sudo tail -f /var/log/mysql/error.log
Look for errors such as:
[ERROR] Too many connections
This indicates that the database has reached its maximum number of connections.
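Before changing any limits, it helps to see how close the server has actually come to its ceiling. The sketch below assumes tab-separated "name value" lines as printed by the mysql client in batch mode for the two SHOW statements in the comment; the function name connection_usage_pct is hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name is an illustrative assumption.

# Peak connection usage as a percentage of max_connections.
# Expects "name<TAB>value" lines on stdin, for example from:
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections';
#                   SHOW GLOBAL VARIABLES LIKE 'max_connections';"
connection_usage_pct() {
    awk '$1 == "Max_used_connections" { used = $2 }
         $1 == "max_connections"      { limit = $2 }
         END { if (limit > 0) printf "%d\n", used * 100 / limit }'
}
```

A value at or near 100 confirms that the limit has been reached since the server last started. A much lower value suggests the timeouts have another cause, such as slow queries or network problems.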
2. Increase Database Connection Limits
- Increase Max Connections: Edit the MySQL configuration file /etc/mysql/my.cnf and increase the max_connections parameter:
sudo nano /etc/mysql/my.cnf
Add the following line under the [mysqld] section:
max_connections = 500
- Restart MySQL: Restart the database service to apply the changes:
sudo systemctl restart mysql
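A setting placed under the wrong section header is silently ignored by the server, so it is worth checking the file before or after the restart. The following is a minimal sketch that assumes a simple my.cnf layout with "name = value" lines and no include directives; the function name mysqld_max_connections is hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name is an illustrative assumption.

# Print the max_connections value set in the [mysqld] section, if any.
# Expects my.cnf-style input on stdin.
mysqld_max_connections() {
    awk -F'=' '
        /^[ \t]*\[/ {                       # a section header line
            in_mysqld = ($0 ~ /^[ \t]*\[mysqld\]/)
            next
        }
        in_mysqld && $1 ~ /^[ \t]*max_connections[ \t]*$/ {
            gsub(/[ \t]/, "", $2)           # strip whitespace around the value
            print $2
        }'
}

# Example usage:
#   mysqld_max_connections < /etc/mysql/my.cnf
# To confirm the value the running server is using:
#   mysql -e "SHOW VARIABLES LIKE 'max_connections';"
```

Keep in mind that every connection consumes memory on the database host, so a higher limit should be sized against the RAM available rather than raised arbitrarily.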
Conclusion:
Linux administrators working in DevOps or cloud environments frequently face challenges related to system performance, application errors, resource bottlenecks, and network issues. Being able to debug and troubleshoot effectively involves checking system logs, monitoring resource usage, and adjusting configurations to optimize performance. The examples provided give practical steps and commands that can be applied in real-world scenarios to diagnose and resolve issues.
Let me know if you'd like more details on any specific scenario!