Advanced Linux Troubleshooting in DevOps and Cloud: Practical Use Cases and Commands

In a DevOps or cloud role, Linux administrators are often responsible for keeping applications and infrastructure running smoothly, so debugging and troubleshooting are essential skills. The scenarios below walk through Linux system debugging and troubleshooting from a DevOps and cloud operations perspective, focusing on real-world issues you might encounter.

Scenario 1: Application Running Slowly on an EC2 Instance

You’re responsible for maintaining a web application running on an EC2 instance in AWS, and users report that the application has become very slow.

Steps for Debugging and Troubleshooting:

1. Check System Resource Usage (CPU, Memory, Disk)

The first step is to check if the system is running out of resources such as CPU, memory, or disk I/O, which could cause slowdowns.

Monitor CPU and Memory Usage:
- Use the top or htop command to see the most resource-consuming processes:

   top

Example output:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1234 www-data  20   0  149880   5140   2956 S   25.0   2.0   0:15.00 apache2

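In this case, Apache (apache2) is using 25% of one CPU core (top reports %CPU per core by default). If many apache2 workers show similar or higher values, that could indicate heavy web traffic or inefficient code execution.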
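For a script or a snapshot to attach to a ticket, a non-interactive listing can be easier to work with than top. The sketch below is one way to rank processes by CPU. The function name top_by_cpu is hypothetical, and it assumes input in the column order produced by ps -eo pid,user,pcpu,pmem,comm.

```shell
# top_by_cpu N: read "PID USER %CPU %MEM COMMAND" rows on stdin (with a header
# line) and print the N rows with the highest %CPU, busiest first.
# Hypothetical helper for illustration; not part of any standard tool.
top_by_cpu() {
  # Drop the header, sort numerically on column 3 (%CPU) descending, keep N rows.
  sed '1d' | sort -k3,3nr | head -n "${1:-5}"
}

# Usage (assumes a procps-style ps):
#   ps -eo pid,user,pcpu,pmem,comm | top_by_cpu 5
```

Sorting the ps output yourself keeps the helper independent of ps sorting options, which differ between implementations.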

Check Disk I/O: High disk I/O can be a bottleneck in web applications, especially if they read/write data frequently (such as with database operations).

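Use iostat (part of the sysstat package; the install command below is for Debian/Ubuntu) to check disk I/O. The basic device report produces the columns shown in the example output; iostat -x adds extended statistics such as %util and request wait times:

   sudo apt-get install sysstat
   iostat -d 1 5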

Example output:

   Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
   xvda              9.10        204.00        819.20      1224      49152

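The first report shows averages since boot; the later ones cover each one-second interval. Sustained high read or write rates may indicate heavy database or log file operations.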
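When the extended report (iostat -x) is available, its %util column is a quick way to spot a saturated device. The following is a minimal sketch that filters that output. The function name busy_devices and the default 80% threshold are assumptions for illustration, and the column is located by its header name because the column layout differs between sysstat versions.

```shell
# busy_devices LIMIT: read `iostat -x` output on stdin and print each device
# whose %util is at or above LIMIT. Hypothetical helper for illustration.
busy_devices() {
  awk -v limit="${1:-80}" '
    # Header row of a device table: remember which column holds %util.
    /^Device/ { col = 0; for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    # A blank line ends the table, so stop treating rows as devices.
    NF == 0   { col = 0; next }
    # Device row: print name and utilisation when it crosses the limit.
    col && NF >= col && $col + 0 >= limit { print $1, $col }
  '
}

# Usage:
#   iostat -x 1 5 | busy_devices 80
```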

Check Available Disk Space: If the disk is nearly full, it can cause the system to slow down.

   df -h

Example output:

   Filesystem      Size  Used Avail Use% Mounted on
   /dev/xvda1       30G   29G  1G   97% /

In this example, the root partition is 97% full, which could cause performance issues. You may need to clean up logs or expand the disk.
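The same check can be scripted so that it flags only the filesystems that need attention. This is a sketch under a few assumptions: the function name full_filesystems and the 90% default are made up for illustration, and the input is expected in the single-line-per-filesystem format that df -P produces.

```shell
# full_filesystems LIMIT: read `df -P` output on stdin and print the mount
# point and usage of every filesystem at or above LIMIT percent.
# Hypothetical helper for illustration.
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)            # "97%" -> "97"
      if (use + 0 >= limit) print $6, $5
    }
  '
}

# Usage:
#   df -P | full_filesystems 90
#   sudo du -xk /var/log | sort -n | tail -n 10   # largest directories under /var/log
```

Mount points that contain spaces would be cut short by this field-based parsing, so treat it as a quick triage aid.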

2. Check Application Logs for Errors

After checking system resources, the next step is to examine application logs to identify any errors or warnings.

Check Web Server Logs (e.g., Apache or NGINX): If your web application is slow, logs may reveal issues like timeouts, connection errors, or misconfigurations.

For NGINX:

   sudo tail -f /var/log/nginx/access.log /var/log/nginx/error.log

For Apache:

   sudo tail -f /var/log/apache2/access.log /var/log/apache2/error.log

Example error in an Apache log:

   [error] [client 192.168.1.100] script timed out before returning headers: index.php

This could indicate a problem with a slow PHP script or database queries.
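Tailing shows individual errors; a count of server errors in the access log gives a rough sense of how widespread the problem is. The sketch below assumes the default "combined" access-log format used by Apache and NGINX, where the HTTP status is the ninth whitespace-separated field. The function name count_5xx is hypothetical.

```shell
# count_5xx: read access-log lines on stdin and print how many have a 5xx
# status. Assumes the "combined" log format (status code in field 9).
# Hypothetical helper for illustration.
count_5xx() {
  awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Usage:
#   sudo cat /var/log/apache2/access.log | count_5xx
```

If the server uses a custom log format, the field number needs to be adjusted to wherever the status code appears.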

Check Application-Specific Logs: For web applications running on Node.js, Python, or Java, check the application-specific logs (e.g., /var/log/myapp/app.log).

Example:

   sudo tail -f /var/log/myapp/app.log

Look for errors such as:

   Error: Database connection timed out

This might indicate a problem with the database server (e.g., high load, unresponsive database).

3. Troubleshoot Network and Load Balancer Issues

In cloud environments, the network plays a critical role in system performance. If the system is responding slowly, you may have a network bottleneck or load balancer misconfiguration.

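Ping and Network Latency: Check for network latency or packet loss between the application server and its dependencies (e.g., database, external services). Replace <dependency-host> with the host you want to test; pinging a public address such as 8.8.8.8 only confirms general internet connectivity:

   ping -c 5 <dependency-host>

High latency or packet loss might indicate a networking issue between the EC2 instance and that service. Keep in mind that AWS security groups drop ICMP unless a rule allows it, so a failed ping does not always mean the host is unreachable.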
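For monitoring scripts it helps to reduce the ping output to a single number. The following sketch extracts the packet-loss percentage from the summary line. The function name packet_loss is hypothetical, and the pattern assumes the iputils-style summary, for example "5 packets transmitted, 5 received, 0% packet loss, time 4005ms".

```shell
# packet_loss: read ping output on stdin and print the packet-loss percentage
# taken from its summary line. Hypothetical helper for illustration.
packet_loss() {
  sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Usage (the host is a placeholder):
#   ping -c 5 <dependency-host> | packet_loss
```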

Check Load Balancer Health: If your application is behind an Application Load Balancer (ALB), ensure that the health checks are passing, and traffic is being routed properly.

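Check the ALB target group health (the --output yaml option, available in AWS CLI v2, produces the format shown in the example output):

   aws elbv2 describe-target-health --target-group-arn <target-group-arn> --output yaml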

Example output:

   TargetHealthDescriptions:
   - Target:
       Id: i-0123456789abcdef0
     TargetHealth:
       State: healthy

If an instance is marked as unhealthy, there could be issues with the application itself, causing failed health checks.
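With many targets, it can be quicker to list only the ones that are not healthy, together with the reason the load balancer reports. The sketch below wraps the same CLI call with a --query filter (a JMESPath expression). The function name unhealthy_targets is hypothetical, and it assumes the AWS CLI is installed and has credentials configured.

```shell
# unhealthy_targets ARN: print the ID, state, and reason for every target in
# the given target group whose state is not "healthy".
# Hypothetical wrapper for illustration; requires a configured AWS CLI.
unhealthy_targets() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}

# Usage:
#   unhealthy_targets <target-group-arn>
```

An empty result means every registered target is currently passing its health checks.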

Scenario 2: High Memory Usage on an EC2 Instance

You notice that the memory usage on one of your EC2 instances is constantly high, leading to performance issues and Out of Memory (OOM) kills.

Steps for Debugging and Troubleshooting:

1. Check Memory Usage

Check Overall Memory Usage: Use the free -h command to check total, used, and free memory:

   free -h

Example output:

                 total        used        free      shared  buff/cache   available
   Mem:           7.8G        6.0G        500M        1.2G        1.3G        2.0G

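This shows that 6 GB of memory is used, with only 500 MB free. The available column (2.0 GB) is the better indicator of headroom, because it includes cache that the kernel can reclaim when applications need more memory.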
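The figures that free reports come from /proc/meminfo, so a script can compute the available share directly. This is a minimal sketch; the function name available_pct is hypothetical, and it relies on the MemTotal and MemAvailable fields that current Linux kernels expose.

```shell
# available_pct: read /proc/meminfo-style lines on stdin and print
# MemAvailable as a percentage of MemTotal, to one decimal place.
# Hypothetical helper for illustration.
available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total > 0) printf "%.1f\n", avail * 100 / total }
  '
}

# Usage:
#   available_pct < /proc/meminfo
```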

Check Which Processes Are Consuming Memory: Use top or htop to see which processes are consuming the most memory:

   top -o %MEM

Look for processes consuming excessive memory. For example:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   5678 mysql     20   0   205m    1.2g    500m  S  12.5  15.4   2:12.35 mysqld

In this case, mysqld (MySQL) has 1.2 GB of resident memory (RES), about 15% of the 7.8 GB total, which makes it the first process to investigate for memory pressure.

2. Investigate Memory Leaks or Inefficient Resource Usage

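Check Logs for Memory-Related Errors: Review application logs for memory allocation errors, and check the kernel log (dmesg, journalctl -k, or a file such as /var/log/syslog) for Out of Memory (OOM) kills, which are reported by the kernel rather than by the application.

Example OOM message from the kernel log: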

   Out of memory: Kill process 12345 (node) score 100 or sacrifice child
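Because the OOM killer belongs to the kernel, its messages land in the kernel log, which can be noisy. A small filter, sketched below, narrows it down. The function name oom_events is hypothetical, and the patterns are common phrases from OOM-killer messages, not an exhaustive list.

```shell
# oom_events: read kernel log lines on stdin and keep only those that look
# like OOM-killer activity. Hypothetical helper for illustration.
oom_events() {
  grep -E -i 'out of memory|oom-killer|killed process'
}

# Usage (either source works on most systemd-based distributions):
#   sudo dmesg | oom_events
#   sudo journalctl -k | oom_events
```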

Check for Zombie or Hanging Processes: List zombie (defunct) processes with ps. The bracket in the pattern stops grep from matching its own command line:

   ps aux | grep '[d]efunct'

Zombie processes hold almost no memory themselves, but they indicate that the parent process is not handling (reaping) its child processes correctly.
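Since a zombie can only be cleared by its parent, the useful piece of information is the parent PID. The sketch below pulls that out of a ps listing. The function name zombies is hypothetical, and it assumes input in the column order of ps -eo pid,ppid,stat,comm.

```shell
# zombies: read `ps -eo pid,ppid,stat,comm` output on stdin and print
# "PID PPID COMMAND" for every process whose state starts with Z (zombie).
# Hypothetical helper for illustration.
zombies() {
  awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Usage:
#   ps -eo pid,ppid,stat,comm | zombies
```

The second column of the result identifies the process to inspect or restart.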

3. Add Swap Space (Temporary Fix)

If memory usage is consistently high, adding swap space can provide temporary relief by extending memory onto disk.

Create a Swap File:

   sudo fallocate -l 2G /swapfile

Set Up Swap Space:

   sudo chmod 600 /swapfile
   sudo mkswap /swapfile
   sudo swapon /swapfile

Verify Swap Space:

   free -h

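If the Swap row now shows the additional 2 GB, the swap space is active. It lets the kernel move idle pages to disk under memory pressure, at the cost of much slower access than RAM, so treat it as a stopgap and not a substitute for fixing the underlying memory usage.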
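A swap file enabled with swapon does not survive a reboot unless /etc/fstab lists it. The sketch below adds the entry only if it is missing, so it is safe to run more than once. The function name ensure_swap_entry is hypothetical, and the fstab path is a parameter so the helper can be tried on a copy first.

```shell
# ensure_swap_entry FSTAB SWAPFILE: append a swap entry for SWAPFILE to FSTAB
# unless a line for that path is already present.
# Hypothetical helper for illustration; run it as root for /etc/fstab.
ensure_swap_entry() {
  fstab=$1
  swapfile=$2
  if ! grep -Eq "^${swapfile}[[:space:]]" "$fstab"; then
    printf '%s none swap sw 0 0\n' "$swapfile" >> "$fstab"
  fi
}

# Usage (from a root shell):
#   ensure_swap_entry /etc/fstab /swapfile
```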

Scenario 3: Database Connection Timeout

Your application frequently encounters database connection timeouts, especially during periods of high traffic.

Steps for Debugging and Troubleshooting:

1. Check Database Logs

Check Database Server Logs: If using MySQL, for example, check the logs at /var/log/mysql/error.log:

   sudo tail -f /var/log/mysql/error.log

Look for errors such as:

   [ERROR] Too many connections

This indicates that the database has reached its maximum number of connections.
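Before changing any limits, it is worth confirming how close the server has actually come to its ceiling. The sketch below compares the Max_used_connections status counter with the max_connections variable. The function name connection_usage is hypothetical, and the usage line assumes the mysql client can log in without prompting (for example through ~/.my.cnf).

```shell
# connection_usage: read tab-separated "name value" rows, as printed by
# `mysql -N`, and report peak connections against the configured maximum.
# Hypothetical helper for illustration.
connection_usage() {
  awk '
    $1 == "Max_used_connections" { used = $2 }
    $1 == "max_connections"      { max = $2 }
    END { if (max > 0) printf "%d/%d (%.0f%%)\n", used, max, used * 100 / max }
  '
}

# Usage:
#   mysql -N -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections'; SHOW GLOBAL VARIABLES LIKE 'max_connections';" | connection_usage
```

Each connection consumes server memory, so a peak near 100% is a signal to review application connection pooling as well as the limit itself.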

2. Increase Database Connection Limits

Increase Max Connections: Edit the MySQL configuration file /etc/mysql/my.cnf and increase the max_connections parameter:

   sudo nano /etc/mysql/my.cnf

Add the following line under the [mysqld] section:

   max_connections = 500

Restart MySQL: Restart the database service to apply the changes:

   sudo systemctl restart mysql

Conclusion:

Linux administrators working in DevOps or cloud environments frequently face challenges related to system performance, application errors, resource bottlenecks, and network issues. Being able to debug and troubleshoot effectively involves checking system logs, monitoring resource usage, and adjusting configurations to optimize performance. The examples provided give practical steps and commands that can be applied in real-world scenarios to diagnose and resolve issues.


akhil mittal
