10 Essential Linux Commands Every Site Reliability Engineer (SRE) Should Know
Oliver Bennet
Posted on October 29, 2024
Introduction
In the role of a Site Reliability Engineer (SRE), knowing the right Linux commands is key to maintaining, monitoring, and troubleshooting complex systems. These commands can make everyday tasks smoother and ensure reliable system performance. Here’s a look at ten must-know commands that every SRE should keep handy.
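1. uptime - System Uptime and Load
Description: Shows the current time, how long the system has been running, the number of logged-in users, and the load averages over the last 1, 5, and 15 minutes.
Why It’s Important: A quick way to gauge system load and spot a system under stress. Load averages are most meaningful relative to the number of CPU cores: a load of 4 is heavy on a 2-core machine but light on a 16-core one.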
Basic Usage: uptime
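To put the load average in context, it can help to divide it by the number of online CPU cores. The following is a minimal sketch, assuming a Linux host where /proc/loadavg is readable; load_per_core is a hypothetical helper name, not a standard tool.

```shell
#!/bin/sh
# load_per_core is a hypothetical helper: it divides a load average
# by a core count and prints the result with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On Linux, the first field of /proc/loadavg is the 1-minute load average.
if [ -r /proc/loadavg ]; then
    load1=$(cut -d ' ' -f 1 /proc/loadavg)
    cores=$(getconf _NPROCESSORS_ONLN)
    echo "1-minute load per core: $(load_per_core "$load1" "$cores")"
fi
```

Values that stay above 1.00 per core suggest that tasks are queuing for CPU or, on Linux, waiting in uninterruptible I/O.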
2. journalctl - System Logs Access
Description: Queries and displays logs collected by the systemd journal.
Why It’s Important: Essential for troubleshooting errors or investigating performance issues, particularly on systemd-based Linux systems.
Basic Usage: journalctl -u nginx.service
(Shows logs for a specific service)
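During an incident it is usually more useful to narrow the journal by time and priority than to page through everything. Here is a sketch, assuming a unit named nginx.service exists on the host; recent_errors is a hypothetical wrapper, while the flags it passes are standard journalctl options.

```shell
#!/bin/sh
# recent_errors is a hypothetical wrapper around journalctl.
#   -u          restrict output to one unit
#   -p err      only messages of priority "err" or more severe
#   --since     only entries newer than the given time
#   --no-pager  write straight to stdout so the output can be piped
recent_errors() {
    journalctl -u "$1" -p err --since "1 hour ago" --no-pager
}

# Example call (commented out so the file can be sourced safely):
# recent_errors nginx.service
```

To watch new entries as they arrive, journalctl -u nginx.service -f follows the log much like tail -f.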
3. free - Memory Usage
Description: Displays total, used, free, shared, buffer/cache, and available memory, along with swap usage.
Why It’s Important: Useful for spotting memory pressure and possible leaks, and for planning memory upgrades.
Basic Usage: free -h
(Displays in human-readable format)
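The "available" column is usually a better indicator of memory pressure than "free", since it estimates how much memory can be claimed without swapping. The sketch below assumes the procps-ng layout of free, where "available" is the seventh field of the Mem: line; available_pct is a hypothetical helper name.

```shell
#!/bin/sh
# available_pct is a hypothetical helper. It reads `free` output on stdin
# and prints available memory as a percentage of total.
# Assumes the procps-ng column order:
#   Mem: total used free shared buff/cache available
available_pct() {
    awk '/^Mem:/ { printf "%.1f\n", $7 / $2 * 100 }'
}

# Use -b (bytes) rather than -h so the fields are plain numbers.
if command -v free >/dev/null 2>&1; then
    free -b | available_pct
fi
```

On versions of free that lack the "available" column, the seventh field is missing and this sketch would not give a meaningful result.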
4. iostat - Input/Output Statistics
Description: Reports CPU and I/O statistics for devices and partitions.
Why It’s Important: Helps to identify I/O bottlenecks and assess disk performance.
Basic Usage: iostat -xz 1
(Shows extended statistics every second, omitting devices with no activity)
5. lsof - List Open Files
Description: Lists open files, including network sockets, and the processes that opened them.
Why It’s Important: Useful for identifying resource usage and finding potential file descriptor leaks.
Basic Usage: lsof -i :80
(Lists processes with sockets on port 80)
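When hunting for a file descriptor leak, a rough first step is to see which commands hold the most entries in lsof output. This is a sketch only: top_openers is a hypothetical helper, and the count is approximate because lsof also lists non-descriptor entries (such as the working directory and memory-mapped files) and truncates long command names.

```shell
#!/bin/sh
# top_openers is a hypothetical helper. It reads lsof output on stdin,
# skips the header line, counts rows per COMMAND (first column), and
# prints the N largest counts (default 5).
top_openers() {
    awk 'NR > 1 { count[$1]++ } END { for (c in count) print count[c], c }' |
        sort -rn |
        head -n "${1:-5}"
}

# Example (commented out; lsof may need root to see every process):
# lsof -n | top_openers 3
```

A count that keeps growing for one command across repeated runs is a hint worth following up with lsof -p and that process's PID.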
6. dstat - System Resource Statistics
Description: Combines multiple monitoring tools in one command to display CPU, disk, network, memory, and process stats.
Why It’s Important: Provides a real-time overview of system health, combining various metrics in one view.
Basic Usage: dstat -cdnm
(Displays CPU, disk, network, and memory usage)
7. curl - Test HTTP Services
Description: Transfers data from or to a server, often used for API and HTTP testing.
Why It’s Important: Allows you to troubleshoot and test endpoints and web services quickly.
Basic Usage: curl -I http://example.com
(Sends a HEAD request and prints only the response headers)
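For scripted health checks, curl's -w option can print just the status code and total request time. The following is a sketch: probe and classify_status are hypothetical helper names, and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# probe prints "<http_code> <time_total>" for a URL.
#   -s            silence progress output
#   -o /dev/null  discard the response body
#   --max-time 5  cap the whole request at five seconds
#   -w            format the summary line
probe() {
    curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}\n' "$1"
}

# classify_status treats 2xx and 3xx as healthy. curl reports 000
# when no HTTP response was received, which falls through to "fail".
classify_status() {
    case "$1" in
        2??|3??) echo ok ;;
        *)       echo fail ;;
    esac
}

# Example (commented out; requires network access):
# code=$(probe http://example.com | cut -d ' ' -f 1)
# classify_status "$code"
```

Whether a 3xx response should count as healthy depends on the service; treating redirects as "ok" is a design choice in this sketch.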
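8. ping and traceroute - Network Connectivity and Path Analysis
Description: ping checks whether a remote host is reachable and measures round-trip time, while traceroute shows the hops packets take to reach it.
Why It’s Important: Essential for diagnosing network issues and pinpointing where along the path a connection problem occurs.
Basic Usage: ping -c 4 google.com / traceroute google.com
(Sends four echo requests and stops; without -c, ping on Linux runs until interrupted)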
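In scripts, the packet loss figure from ping's summary line is often the only part that matters. A small sketch, assuming the common summary wording "N% packet loss"; packet_loss is a hypothetical helper name.

```shell
#!/bin/sh
# packet_loss is a hypothetical helper: it reads ping output on stdin and
# prints the packet loss percentage from the summary line.
packet_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example (commented out; requires network access):
# ping -c 4 google.com | packet_loss
```

Keep in mind that some hosts and firewalls drop ICMP, so 100% loss does not always mean the host is down.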
9. sar - System Activity Report
Description: Reports CPU, memory, network, and disk statistics, either live or from history collected by the sysstat package.
Why It’s Important: Helps identify patterns and potential issues over time, especially useful for trend analysis.
Basic Usage: sar -u 1 5
(Reports CPU usage in five samples taken one second apart)
10. systemctl - Manage Services
Description: Controls systemd services, allowing you to start, stop, restart, enable, and check the status of units.
Why It’s Important: Essential for service reliability, as it enables quick service restarts or status checks during incidents.
Basic Usage: systemctl restart nginx
(Restarts the Nginx service; typically requires root privileges)
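A restart is disruptive, so in scripts it is often worth checking the unit's state first. The sketch below assumes the caller has sufficient privileges to restart the unit; ensure_running is a hypothetical helper name.

```shell
#!/bin/sh
# ensure_running is a hypothetical helper: it restarts a unit only when
# `systemctl is-active` reports that it is not currently active.
ensure_running() {
    unit="$1"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is already active"
    else
        echo "$unit is not active, restarting"
        systemctl restart "$unit"
    fi
}

# Example (commented out; needs root or sudo on a systemd host):
# ensure_running nginx
```

After a restart, systemctl status nginx shows the unit state along with its most recent log lines.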
Conclusion
These commands are invaluable tools for SREs to monitor, manage, and troubleshoot systems effectively. Mastering them can enhance both the reliability and resilience of your infrastructure, ultimately leading to a smoother and more efficient production environment.