Advanced Data Processing with awk and sed
Amal Shehu
Posted on January 10, 2024
Processing large log files is a common yet challenging task in data management. The Unix tools `awk` and `sed` are indispensable for such work, providing powerful capabilities for pattern scanning, data extraction, and stream editing.
Introduction to awk and sed
awk: A powerful programming language designed for text processing. It's particularly suited for processing structured text data, like tables, and excels in pattern scanning and processing.
sed: Short for Stream Editor, `sed` is a utility that performs basic text transformations on an input stream (a file or input from a pipeline). It's primarily used for substituting text patterns.
Understanding the Tools
awk Syntax:
awk 'pattern {action}' file
- `pattern`: specifies the condition to be matched.
- `action`: the set of commands to execute when the pattern matches.
- `file`: the file to be processed.
sed Syntax:
sed [options] 'command' file
- `options`: modifiers that alter the behavior of `sed`.
- `command`: the `sed` command to execute (such as a substitution).
- `file`: the file to process.
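For a quick, made-up taste of both syntaxes (the input here is supplied inline with `printf`, and the names and numbers are invented):
printf 'alice 30\nbob 25\n' | awk '$2 > 26 {print $1}'   # prints: alice
printf 'hello world\n' | sed 's/world/there/'            # prints: hello there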
`awk` and `sed` are not just tools; they are powerful allies in text processing and data manipulation. Their integration into Unix/Linux environments and their scripting capabilities make them indispensable for anyone working with text and data. Whether you are a beginner or an experienced professional, mastering these tools can significantly enhance your productivity and broaden your capabilities in data handling.
Example 1: Extracting and Formatting Data from a Log File
Task: From a log file, `server.log`, extract records from a specific date, format them into CSV, and perform some clean-up tasks. Each line follows this format:
[Date] [Time] [IP Address] [URL]
Step 1: Inspecting the Log File
Sample `server.log` entries:
2024-01-10 09:15:23 192.168.1.1 /index.html
2024-01-10 09:17:45 192.168.1.2 /about.html
2024-01-11 10:15:24 192.168.1.3 /contact.html
...
Step 2: Filtering and Formatting with awk
To extract and format records from `2024-01-10`:
awk '$1 == "2024-01-10" {print $1","$2","$3","$4}' server.log > output.csv
Explanation:
- `$1 == "2024-01-10"`: filters lines where the first field (the date) is `2024-01-10`.
- `print $1","$2","$3","$4`: formats each selected line as CSV; the shell redirection `> output.csv` writes the result to `output.csv`.
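At this stage, `output.csv` contains the two matching records from the sample above:
2024-01-10,09:15:23,192.168.1.1,/index.html
2024-01-10,09:17:45,192.168.1.2,/about.html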
Step 3: Advanced Text Manipulation with sed
Suppose we want to clean the URLs by removing `http://www.example.com` and changing `/index.html` to `/home.html`:
sed -i 's|http://www\.example\.com||g' output.csv
sed -i 's|/index\.html|/home\.html|g' output.csv
Explanation:
- The first `sed` command removes the host prefix; using `|` as the delimiter avoids escaping every `/`, and the dots are escaped so they match literally.
- The second `sed` command replaces `/index.html` with `/home.html`.
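A portability note: the bare `-i` used above is GNU sed syntax. BSD/macOS sed requires an explicit (possibly empty) backup suffix, so the equivalent invocation there would be:
sed -i '' 's|/index\.html|/home\.html|g' output.csv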
Final Output
`output.csv` will now contain clean, well-formatted data:
2024-01-10,09:15:23,192.168.1.1,/home.html
2024-01-10,09:17:45,192.168.1.2,/about.html
...
Example 2: Extracting Specific Fields and Summarizing Data
Task: From a log file, `sales.log`, extract sales data for a particular product and summarize the total sales.
Log File Format:
Date, Product ID, Quantity, Price
awk Command:
awk -F, '$2 == "Product123" {total += $3 * $4} END {print "Total Sales for Product123: ", total}' sales.log
Explanation:
- `-F,`: sets the field separator to a comma.
- `$2 == "Product123"`: filters records for 'Product123'.
- `{total += $3 * $4}`: accumulates quantity times price into the total.
- `END {print ...}`: prints the final total.
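To make the arithmetic concrete, suppose `sales.log` contained these hypothetical lines (comma-separated, no spaces):
2024-01-10,Product123,5,19.99
2024-01-10,Product456,3,9.99
2024-01-11,Product123,2,19.99
The command would sum 5 × 19.99 and 2 × 19.99 and report a total of 139.93 for Product123.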
Example 3: Combining awk and sed for Data Cleanup and Transformation
Task: From `server.log`, extract data, reformat the dates, and remove query parameters from the URLs.
Server Log Format:
YYYY-MM-DD HH:MM:SS IP /path?query
Commands:
awk '$1 >= "2024-01-01" && $1 <= "2024-01-31" {print $1","$2","$3","$4}' server.log | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3-\2-\1/' | sed 's/[?].*//'
Explanation:
- `awk` filters and formats the January 2024 data.
- The first `sed` changes the date format to `DD-MM-YYYY`.
- The second `sed` removes query parameters; `[?]` matches a literal question mark unambiguously.
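As a worked example, a hypothetical input line such as
2024-01-15 10:20:30 192.168.1.5 /search?q=awk
would come out of the pipeline as:
15-01-2024,10:20:30,192.168.1.5,/search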
Example 4: Analyzing and Reformatting Multi-Line Records
Task: Process a multi-line log file where each record spans multiple lines, and reformat the output.
Multi-Line Log Format:
Start: [Date]
Info: [Details]
End: [Date]
---
awk Command:
awk '/Start/ {start = $2} /Info/ {info = $2} /End/ {print "Record from " start " to " $2 ": " info}' multi-line.log
Explanation:
- Pattern matching carries state across lines: `start` and `info` are saved as each record's lines arrive, and the `End` line triggers the summary.
- Note that `info = $2` captures only the first word of the details; to keep the whole rest of the line, use something like `substr($0, index($0, $2))`.
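Given a hypothetical record like
Start: 2024-01-10
Info: backup-completed
End: 2024-01-11
---
the command prints: Record from 2024-01-10 to 2024-01-11: backup-completed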
Example 5: Sed for Advanced In-Place Editing
Task: Modify `config.txt` by replacing specific configuration values.
Commands:
sed -i '/max_connections/c\max_connections=500' config.txt
sed -i '/timeout/s/30/60/' config.txt
Explanation:
- The first command uses `c\` to replace the entire matching line with `max_connections=500`.
- The second uses a substitution restricted to lines matching `timeout` to replace the value 30 with 60.
Example 6: awk for Column-Based Arithmetic Operations
Task: Calculate the average memory usage from a system log file, `system.log`.
System Log Format:
Timestamp, CPU Load, Memory Usage
awk Command:
awk -F, '{total += $3; count++} END {print "Average Memory Usage:", total/count}' system.log
Explanation:
- `{total += $3; count++}`: sums the memory-usage field and counts the lines.
- `END {print ...}`: divides the total by the count to report the average.
Example 7: Filtering and Counting Unique IP Addresses
Task: From a web server log file, `access.log`, extract and count unique IP addresses.
Log File Format:
IP Address - Date Time Request ...
awk Command:
awk '{print $1}' access.log | sort | uniq -c
Explanation:
- `awk '{print $1}'`: extracts the first field, typically the IP address.
- `sort`: sorts the IP addresses so that duplicates become adjacent (which `uniq` requires).
- `uniq -c`: counts the occurrences of each unique IP address.
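A common extension of this pipeline is to rank the results; for example, to show the five most frequent IPs:
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5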
Example 8: Processing Multi-Field Records and Summarizing
Task: In a file `inventory.csv`, calculate the total value of each product category.
CSV File Format:
Category,Product,Quantity,Unit Price
awk Command:
awk -F, '{total[$1] += $3 * $4} END {for (cat in total) print cat, total[cat]}' inventory.csv
Explanation:
- `-F,`: sets the field separator to a comma.
- `{total[$1] += $3 * $4}`: aggregates the total value per category in an associative array.
- `END {for (cat in total) ...}`: iterates over the categories to print their totals.
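With hypothetical input such as
Electronics,Phone,2,299.99
Electronics,Laptop,1,999.99
Grocery,Apples,10,0.50
the command prints Electronics 1599.97 and Grocery 5 (the for-in loop visits categories in no particular order).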
Example 9: sed for Line Deletion Based on Pattern
Task: Remove all comments from a configuration file, `config.txt`.
Configuration File Format:
# This is a comment
parameter=value
# Another comment
sed Command:
sed '/^#/d' config.txt
Explanation:
- `/^#/d`: deletes lines that start with `#` (comments).
Example 10: awk for Complex Pattern Matching and Extraction
Task: From a log file `security.log`, extract the timestamps of failed login attempts.
Log File Format:
[Timestamp] [Log Level] [Message]
awk Command:
awk '/Failed login/ {print $1}' security.log
Explanation:
- `/Failed login/`: filters lines containing 'Failed login'.
- `{print $1}`: prints the timestamp field of those lines.
Example 11: Extracting and Sorting Log Entries by Date
Task: Sort `application.log` entries by date in descending order.
Log File Format:
[YYYY-MM-DD] [Message]
sort Command:
sort -r application.log
Explanation:
- Because each line begins with an ISO `YYYY-MM-DD` date, a reverse lexicographic sort orders the entries newest first, with no field extraction needed (extracting only the first two fields would discard the rest of each message).
Example 12: Merging Two Files Line by Line
Task: Merge lines from `file1.txt` and `file2.txt` side by side.
awk Command:
awk 'NR==FNR{a[NR]=$0; next} {print a[FNR], $0}' file1.txt file2.txt
Explanation:
- `NR==FNR`: true only while reading the first file (the total record number equals the per-file record number).
- `a[NR]=$0; next`: stores each line of the first file.
- `{print a[FNR], $0}`: for the second file, prints the stored line alongside the current one.
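For instance, given a hypothetical `file1.txt` containing the lines red and green, and `file2.txt` containing apple and leaf, the output is:
red apple
green leaf
The standard `paste -d' ' file1.txt file2.txt` produces the same result with less typing.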
Example 13: Replacing Multiple Patterns with sed
Task: In `config.txt`, replace multiple outdated parameters with new ones.
sed Command:
sed -e 's/oldparam1/newparam1/' -e 's/oldparam2/newparam2/' config.txt
Explanation:
- Multiple `-e` options allow several substitutions to be applied in a single pass.
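The same pair of substitutions can also be written with semicolons in a single expression:
sed 's/oldparam1/newparam1/; s/oldparam2/newparam2/' config.txt
For longer lists of edits, `sed -f edits.sed config.txt` reads the commands from a file (`edits.sed` being a hypothetical script name).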
Example 14: awk for Column-Wise Summation
Task: Sum the values in the third column of `data.csv`.
CSV File Format:
Field1,Field2,Value
awk Command:
awk -F, '{sum += $3} END {print sum}' data.csv
Explanation:
- `-F,`: sets the comma as the field separator.
- `{sum += $3}`: sums up the third column.
- `END {print sum}`: outputs the total sum.
Example 15: Filtering and Redirecting Output with awk
Task: Extract error messages from `system.log` and save them to `errors.log`.
Log File Format:
[Timestamp] [Error Level] [Message]
awk Command:
awk '/ERROR/ {print > "errors.log"}' system.log
Explanation:
- `/ERROR/`: filters lines containing 'ERROR'.
- `{print > "errors.log"}`: redirects them to `errors.log` (note that this redirection happens inside awk, not in the shell).
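The same idea extends to routing different severities to different files in one pass; assuming the log also contains 'WARN' lines:
awk '/ERROR/ {print > "errors.log"} /WARN/ {print > "warnings.log"}' system.log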
Example 16: sed for Adding Text Based on a Pattern
Task: Add a warning message after every error line in `logs.txt`.
sed Command:
sed '/ERROR/a\Warning: Check the error above.' logs.txt
Explanation:
- `/ERROR/a\...`: appends the warning message after lines containing 'ERROR' (`a\` appends after a match, whereas `i\` would insert before it).
Example 17: awk for Conditional Field Modification
Task: In `report.csv`, double the value in the second column if the first column is 'A'.
CSV File Format:
Category,Value
awk Command:
awk -F, -v OFS=, '$1 == "A" {$2 *= 2} {print}' report.csv
Explanation:
- `$1 == "A" {$2 *= 2}`: doubles the second field when the first is 'A'.
- `-v OFS=,`: keeps the output comma-separated; without it, awk would rebuild modified lines with its default space separator.
Example 18: Combining awk and sed for Complex File Editing
Task: Extract lines containing 'DEBUG' from `dev.log`, and remove their timestamps.
Commands:
awk '/DEBUG/ {print}' dev.log | sed 's/^\[[0-9-: ]*\] //'
Explanation:
- `awk '/DEBUG/ {print}'`: filters lines with 'DEBUG'.
- `sed 's/^\[[0-9-: ]*\] //'`: removes a leading bracketed timestamp.
Example 19: sed for Inline Commenting in Code Files
Task: Comment out lines containing 'deprecate' in `code.py`.
sed Command:
sed '/deprecate/s/^/#/' code.py
Explanation:
- `/deprecate/s/^/#/`: prepends `#` to lines containing 'deprecate'.
Example 20: awk for Extracting and Reporting Unique Values
Task: Find and count unique user IDs in `users.log`.
Log File Format:
UserID,Action
awk Command:
awk -F, '{users[$1]++} END {for (user in users) print user, users[user]}' users.log
Explanation:
- `{users[$1]++}`: counts occurrences of each UserID in an associative array.
- `END {...}`: prints each UserID and its count.
Example 21: Parsing JSON-like Logs
Task: Extract values from a JSON-like log file, `json.log`, where each line is a JSON object.
Log File Format:
{"date": "2024-01-10", "event": "login", "user": "user1"}
awk Command:
awk -F'[,:}]' '{for (i=1; i<=NF; i++) if ($i ~ /"user"/) print $(i+1)}' json.log | tr -d ' "'
Explanation:
- `-F'[,:}]'`: sets commas, colons, and closing braces as field separators to break up the JSON-like structure.
- `if ($i ~ /"user"/)`: searches for the "user" key.
- `tr -d ' "'`: strips the surrounding spaces and quotes.
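This field-splitting approach is fragile against nesting and escaped quotes. If the lines are genuine JSON and `jq` is available, a dedicated parser is far more robust:
jq -r '.user' json.log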
Example 22: Extracting Data from Nested Log Structures
Task: Parse nested data structures in `complex.log`.
Log File Format:
[main:[subkey:value, ...], otherinfo:...]
awk Command:
awk -F'[,:[]' '{for (i=1; i<=NF; i++) if ($i ~ /subkey/) print $(i+1)}' complex.log
Explanation:
- `-F'[,:[]'`: custom field separators for the nested structure.
- `if ($i ~ /subkey/)`: identifies the desired nested key.
Example 23: sed for Conditional Replacement in Files
Task: In `settings.conf`, replace 'enabled' with 'disabled' only in lines containing 'featureX'.
sed Command:
sed '/featureX/s/enabled/disabled/' settings.conf
Explanation:
- `/featureX/s/enabled/disabled/`: the address `/featureX/` restricts the substitution to matching lines.
Example 24: awk for Generating Reports with Headers and Footers
Task: Generate a report from `data.csv` with a header and footer.
CSV File Format:
Name,Value
awk Command:
awk -F, 'BEGIN {print "Report Header"} {print $1, $2} END {print "Report Footer"}' data.csv
Explanation:
- `BEGIN {print "Report Header"}`: prints the header before any input is read.
- `END {print "Report Footer"}`: prints the footer after the last line is processed.
Example 25: sed for File Content Reversal
Task: Reverse the order of lines in `file.txt`.
sed Command:
sed '1!G;h;$!d' file.txt
Explanation:
- `1!G`: for every line but the first, appends the hold space to the pattern space.
- `h`: copies the accumulated pattern space back to the hold space.
- `$!d`: deletes everything except the last line, which by then holds the whole file in reverse.
- On GNU systems, `tac file.txt` is a simpler alternative.
Example 26: awk for Data Range Selection
Task: Select records from `records.csv` within a value range.
CSV File Format:
ID,Value
awk Command:
awk -F, '$2 >= 100 && $2 <= 200' records.csv
Explanation:
- `$2 >= 100 && $2 <= 200`: selects rows whose second column falls in the range; with no action block, awk prints matching lines by default.
Example 27: sed for Selective Line Number Printing
Task: Print a specific range of lines from `file.txt`, for example lines 10 to 20.
sed Command:
sed -n '10,20p' file.txt
Explanation:
- `-n` suppresses automatic printing; `10,20p` then prints only lines 10 through 20.
Example 28: awk for Field Rearrangement
Task: Rearrange the fields in `data.csv`.
CSV File Format:
Field1,Field2,Field3
awk Command:
awk -F, '{print $3, $1, $2}' data.csv
Explanation:
- `{print $3, $1, $2}`: prints the fields in the new order (third, first, second).
Example 29: sed for Inserting Lines After a Match
Task: In `script.sh`, insert a new command after each 'echo' command.
sed Command:
sed '/echo/a\new_command' script.sh
Explanation:
- `/echo/a\new_command`: appends 'new_command' after lines containing 'echo'.
Example 30: awk for Selective Field Capitalization
Task: Capitalize the first field in `names.csv`.
CSV File Format:
firstname,lastname
awk Command:
awk -F, '{print toupper($1), $2}' names.csv
Explanation:
- `toupper($1)`: converts the entire first field to uppercase.
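If the goal is only an initial capital rather than the whole field in uppercase, one sketch of a variant is:
awk -F, '{print toupper(substr($1,1,1)) substr($1,2), $2}' names.csv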
Example 31: Parsing Apache Log Files
Task: Extract and list unique visiting IPs from `access.log`.
awk Command:
awk '{print $1}' access.log | sort -u
Explanation:
- `$1`: extracts the first field, typically the IP address.
- `sort -u`: sorts the entries and discards duplicates.
Example 32: Text Wrapping in Paragraphs
Task: Wrap each paragraph in `text.txt` to 80 characters.
awk Command:
awk 'BEGIN{RS=""; FS="\n"} {gsub(/.{80}/,"&\n",$0); print}' text.txt
Explanation:
- `RS=""`: enables paragraph mode, so blank lines separate records.
- `gsub(/.{80}/,"&\n",$0)`: inserts a newline after every 80 characters.
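Because `gsub(/.{80}/, ...)` counts characters blindly, it can break words in the middle. When breaking at word boundaries matters, the standard tools `fold -s -w 80 text.txt` or `fmt -w 80 text.txt` are simpler choices.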
Example 33: Incrementing Numbers in File
Task: Increment all numbers by 1 in `data.txt`.
awk Command:
awk '{for (i=1; i<=NF; i++) if ($i ~ /^[0-9]+$/) $i++; print}' data.txt
Explanation:
- `sed` cannot do arithmetic: a substitution like `s/\([0-9]\+\)/\1+1/` would only append the literal text "+1" after each number. `awk` can: the loop increments every field that is purely numeric. Note that modified lines are rebuilt with single spaces, and numbers embedded in larger tokens are left alone.
Example 34: Extracting Email Addresses
Task: Extract email addresses from `contacts.txt`.
awk Command:
awk '{for(i=1;i<=NF;i++) if ($i ~ /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/) print $i}' contacts.txt
Explanation:
- The loop tests every field against a regular expression that matches common email-address forms and prints the fields that match.
Example 35: Converting CSV to Tab-Delimited
Task: Convert `data.csv` to tab-delimited format.
awk Command:
awk 'BEGIN{FS=","; OFS="\t"} {print $1,$2,$3}' data.csv
Explanation:
- `FS=","`: sets the input field separator to a comma.
- `OFS="\t"`: sets the output field separator to a tab.
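A variant that handles any number of columns uses the `$1=$1` idiom to force awk to rebuild the record with the new separator:
awk 'BEGIN{FS=","; OFS="\t"} {$1=$1; print}' data.csv
Neither version understands quoted fields containing commas; a CSV-aware tool is needed for those.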
Example 36: Removing Duplicate Lines
Task: Remove duplicate lines in `file.txt`.
awk Command:
awk '!seen[$0]++' file.txt
Explanation:
- `!seen[$0]++`: the array `seen` counts how many times each line has appeared; the expression is true (so the line is printed) only on a line's first occurrence, when `seen[$0]` is still 0 before the post-increment. Unlike `sort -u`, this preserves the original line order.
Example 37: Counting Word Frequency
Task: Count the frequency of each word in `text.txt`.
awk Command:
awk '{for(i=1;i<=NF;i++) words[$i]++} END{for(word in words) print word, words[word]}' text.txt
Explanation:
- The first block tallies every field into the `words` array; the `END` block prints each word with its count.
Example 38: In-place File Editing
Task: Replace 'oldtext' with 'newtext' in `file.txt`.
sed Command:
sed -i 's/oldtext/newtext/g' file.txt
Explanation:
- `-i` edits the file in place; the `g` flag replaces every occurrence on each line, not just the first.
Example 39: Summarizing Disk Usage
Task: Summarize disk usage from `df` output.
awk Command:
df | awk 'NR>1 {sum += $3} END {print "Total used:", sum}'
Explanation:
- `NR>1` skips the header line; the third column (used blocks) is then summed.
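Since `df` typically reports 1K blocks, a slightly more readable variant (assuming `df -k` output with used blocks in the third column) converts the sum to gigabytes:
df -k | awk 'NR>1 {sum += $3} END {printf "Total used: %.1f GB\n", sum/1048576}'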
Example 40: Sed for Line Numbering
Task: Add line numbers to `file.txt`.
sed Command:
sed = file.txt | sed 'N;s/\n/ /'
Explanation:
- `sed =` prints each line number on its own line; the second `sed` joins each number with the following line (`N` reads the next line, and the substitution replaces the embedded newline with a space).
Example 41: Joining Lines Based on Pattern
Task: Join lines ending with a backslash in `script.sh`.
sed Command:
sed ':a; /\\$/N; s/\\\n//; ta' script.sh
Explanation:
- `:a` defines a label; `/\\$/N` appends the next line whenever the current one ends with a backslash; the substitution removes the backslash-newline pair, and `ta` loops back to the label as long as substitutions keep succeeding.
Example 42: Multi-file Processing with awk
Task: Calculate the total of the third column across multiple CSV files.
awk Command:
awk 'BEGIN{FS=","} {total += $3} END{print total}' *.csv
Explanation:
- Sums the third column across every file matched by `*.csv` and prints one grand total.
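To report a separate total per file instead of one grand total, the built-in `FILENAME` variable can key the array:
awk -F, '{total[FILENAME] += $3} END {for (f in total) print f, total[f]}' *.csv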
Example 43: Filtering Specific Time Range
Task: Extract log entries from `server.log` within a specific hour.
awk Command:
awk '$2 >= "10:00:00" && $2 <= "11:00:00"' server.log
Explanation:
- String comparison on the `HH:MM:SS` field works here because fixed-width times sort lexicographically.
Example 44: Random Line Selection
Task: Select a random line from `quotes.txt`.
awk Command:
awk 'BEGIN{srand()} {lines[NR]=$0} END{print lines[int(rand()*NR)+1]}' quotes.txt
Explanation:
- All lines are stored in an array, and `rand()` picks one index at random in the `END` block; `srand()` seeds the generator so that runs differ.
Example 45: Sed for Case Conversion
Task: Convert all text to uppercase in `file.txt`.
sed Command:
sed 's/.*/\U&/' file.txt
Explanation:
- `\U` uppercases the replacement; note this is a GNU sed extension. A portable alternative is `tr '[:lower:]' '[:upper:]' < file.txt`.
Example 46: Extracting Specific Columns
Task: Extract the first and third columns from `data.csv`.
awk Command:
awk -F, '{print $1,$3}' data.csv
Explanation:
- Prints the first and third fields.
Example 47: In-line Comment Removal
Task: Remove C++ style comments (`//`) from `code.cpp`.
sed Command:
sed 's|//.*||' code.cpp
Explanation:
- Using `|` as the delimiter avoids escaping the slashes; everything from `//` to the end of the line is deleted. Note this would also strip `//` occurring inside string literals.
Example 48: Splitting a File into Multiple Parts
Task: Split `largefile.txt` into multiple files based on a pattern.
awk Command:
awk '/PATTERN/{n++}{print > ("outfile" n ".txt")}' largefile.txt
Explanation:
- `n` increments at each occurrence of 'PATTERN', so subsequent lines are written to a new `outfile<n>.txt`; the parentheses make the file-name concatenation after `>` unambiguous.
Example 49: Aggregating Network Traffic Data
Task: Summarize the total bytes transferred per IP address from `network.log`.
Log File Format:
IP,Bytes
awk Command:
awk -F, '{total[$1] += $2} END{for (ip in total) print ip, total[ip]}' network.log
Explanation:
- Sums bytes for each IP address.
Example 50: Sed for Conditional File Inclusion
Task: Include the contents of `extra.txt` in `main.txt` after a specific pattern.
sed Command:
sed '/PATTERN/r extra.txt' main.txt
Explanation:
- `r extra.txt`: reads and inserts the contents of `extra.txt` after each line matching 'PATTERN'.
Conclusion
`awk` and `sed` are two of the most potent text-processing utilities available in Unix-like environments. With their rich feature sets, they can handle a vast range of tasks, from simple text substitutions to complex data manipulations and analyses. Their usefulness extends across domains, making them indispensable for system administrators, data scientists, and developers alike. With a little practice, we can master these tools and significantly enhance our productivity and capabilities in data handling.