Advanced Data Processing with awk and sed

Amal Shehu

Posted on January 10, 2024

Processing large log files is a common yet challenging task in data management. The Unix tools awk and sed are indispensable for such tasks, providing powerful capabilities for pattern scanning, data extraction, and stream editing.

Introduction to awk and sed

awk: A powerful programming language designed for text processing. It's particularly suited for processing structured text data, like tables, and excels in pattern scanning and processing.

sed: Short for Stream Editor, sed is a utility that performs basic text transformations on an input stream (a file or input from a pipeline). It's primarily used for substituting text patterns.

Understanding the Tools

  • awk Syntax: awk 'pattern {action}' file

    • pattern: Specifies the condition to be matched.
    • action: The set of commands to execute when the pattern matches.
    • file: The file to be processed.
  • sed Syntax: sed [options] 'command' file

    • options: Modifiers that alter the behavior of sed.
    • command: The sed command to execute (like substitution).
    • file: The file to process.
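
To make the syntax concrete, here are two minimal invocations. The file name example.txt is only a placeholder for illustration:

awk '$1 == "ERROR" {print $2}' example.txt    # print the 2nd field of lines whose 1st field is ERROR
sed 's/foo/bar/g' example.txt                 # replace every "foo" with "bar" and print to stdout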

awk and sed are not just tools; they are powerful allies in text processing and data manipulation. Their integration into Unix/Linux environments and their scripting capabilities make them indispensable for anyone working with text and data. Whether you are a beginner or an experienced professional, mastering these tools can significantly enhance your productivity and broaden your capabilities in data handling.

Example 1: Extracting and Formatting Data from a Log File

Task: From a log file, server.log, extract records from a specific date, format them into CSV, and perform some clean-up tasks. Each line follows this format:

[Date] [Time] [IP Address] [URL]


Step 1: Inspecting the Log File

Sample server.log entries:

2024-01-10 09:15:23 192.168.1.1 /index.html
2024-01-10 09:17:45 192.168.1.2 /about.html
2024-01-11 10:15:24 192.168.1.3 /contact.html
...

Step 2: Filtering and Formatting with awk

To extract and format records from 2024-01-10:

awk '$1 == "2024-01-10" {print $1","$2","$3","$4}' server.log > output.csv

Explanation:

  • $1 == "2024-01-10": Filters lines where the first field (date) is 2024-01-10.
  • print $1","$2","$3","$4: Formats the selected line into CSV format and writes to output.csv.
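
The same filter can also be written with awk's output field separator instead of concatenating commas by hand; a small sketch of the equivalent command:

awk -v OFS=',' '$1 == "2024-01-10" {print $1, $2, $3, $4}' server.log > output.csv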

Step 3: Advanced Text Manipulation with sed

Suppose we want to clean the URLs by removing http://www.example.com and changing /index.html to /home.html:

sed -i 's|http://www\.example\.com||g' output.csv
sed -i 's|/index\.html|/home\.html|g' output.csv

Explanation:

  • The first sed command removes the specified URL part.
  • The second sed command replaces /index.html with /home.html.
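
If the intermediate file does not need to be edited in place, the extraction and clean-up can also be expressed as a single pipeline; a sketch of the combined command, under the same assumptions as above:

awk -v OFS=',' '$1 == "2024-01-10" {print $1, $2, $3, $4}' server.log | sed 's|http://www\.example\.com||g; s|/index\.html|/home\.html|g' > output.csv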

Final Output

output.csv will now contain clean, well-formatted data:

2024-01-10,09:15:23,192.168.1.1,/home.html
2024-01-10,09:17:45,192.168.1.2,/about.html
...

Example 2: Extracting Specific Fields and Summarizing Data

Task: From a log file, sales.log, extract sales data for a particular product, summarize total sales.

Log File Format:

Date, Product ID, Quantity, Price

awk Command:

awk -F, '$2 == "Product123" {total += $3 * $4} END {print "Total Sales for Product123: ", total}' sales.log

Explanation:

  • -F,: Sets field separator as comma.
  • $2 == "Product123": Filters records for 'Product123'.
  • {total += $3 * $4}: Calculates total sales.
  • END {print ...}: Prints the final total.
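
The product ID can also be passed in with -v, which makes the one-liner easy to reuse from a shell script; a sketch:

awk -F, -v prod="Product123" '$2 == prod {total += $3 * $4} END {print "Total Sales for " prod ":", total}' sales.log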

Example 3: Combining awk and sed for Data Cleanup and Transformation

Task: From server.log, extract data, reformat dates, and remove specific query parameters from URLs.

Server Log Format:

YYYY-MM-DD HH:MM:SS IP /path?query

Commands:

awk '$1 >= "2024-01-01" && $1 <= "2024-01-31" {print $1","$2","$3","$4}' server.log | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3-\2-\1/' | sed 's/[?].*//'

Explanation:

  1. awk filters and formats January 2024 data.
  2. First sed changes date format to DD-MM-YYYY.
  3. Second sed removes query parameters.

Example 4: Analyzing and Reformatting Multi-Line Records

Task: Process a multi-line log file where each record spans multiple lines, and reformat the output.

Multi-Line Log Format:

Start: [Date]
Info: [Details]
End: [Date]
---

awk Command:

awk '/Start/ {start = $2} /Info/ {info = $2} /End/ {print "Record from " start " to " $2 ": " info}' multi-line.log

Explanation:

  • Stores the start date and the details in variables as the Start and Info lines are read.
  • When the End line appears, prints a one-line summary for the record (a more robust variant is sketched below).
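
Because the default field splitting only captures the first word after each label, multi-word details would be truncated. A slightly more robust sketch strips the label with sub() instead, assuming the same Start/Info/End layout:

awk '/^Start:/ {start = $2} /^Info:/ {info = $0; sub(/^Info: */, "", info)} /^End:/ {print "Record from " start " to " $2 ": " info}' multi-line.log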

Example 5: Sed for Advanced In-Place Editing

Task: Modify config.txt by replacing specific configuration values.

Commands:

sed -i '/max_connections/c\max_connections=500' config.txt
sed -i '/timeout/s/30/60/' config.txt

Explanation:

  1. Changes max_connections to 500.
  2. Replaces timeout value 30 with 60.
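
Since -i rewrites the file directly, it is prudent to keep a backup copy. GNU sed accepts a suffix attached to -i (the exact -i syntax differs slightly on BSD/macOS sed); a sketch:

sed -i.bak '/max_connections/c\max_connections=500' config.txt   # original saved as config.txt.bak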

Example 6: awk for Column-Based Arithmetic Operations

Task: Calculate average memory usage from a system log file, system.log.

System Log Format:

Timestamp, CPU Load, Memory Usage

awk Command:

awk -F, '{total += $3; count++} END {print "Average Memory Usage:", total/count}' system.log

Explanation:

  • Summarizes memory usage and calculates average.

Example 7: Filtering and Counting Unique IP Addresses

Task: From a web server log file, access.log, extract and count unique IP addresses.

Log File Format:

IP Address - Date Time Request ...

awk Command:

awk '{print $1}' access.log | sort | uniq -c

Explanation:

  • awk '{print $1}': Extracts the first field, typically the IP address.
  • sort: Sorts the IP addresses.
  • uniq -c: Counts the occurrences of each unique IP address.
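
To rank the addresses by request count, append a numeric sort on the counts; a sketch showing the ten busiest IPs:

awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10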

Example 8: Processing Multi-Field Records and Summarizing

Task: In a file inventory.csv, calculate total value of each product category.

CSV File Format:

Category,Product,Quantity,Unit Price

awk Command:

awk -F, '{total[$1] += $3 * $4} END {for (cat in total) print cat, total[cat]}' inventory.csv

Explanation:

  • -F,: Sets field separator as comma.
  • {total[$1] += $3 * $4}: Aggregates total value per category.
  • END {for (cat in total) ...}: Iterates over categories to print total values.
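
For a tidier report, printf gives control over alignment and decimal places; a sketch of the same aggregation:

awk -F, '{total[$1] += $3 * $4} END {for (cat in total) printf "%-15s %10.2f\n", cat, total[cat]}' inventory.csv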

Example 9: sed for Line Deletion Based on Pattern

Task: Remove all comments from a configuration file, config.txt.

Configuration File Format:

# This is a comment
parameter=value
# Another comment

sed Command:

sed '/^#/d' config.txt

Explanation:

  • /^#/d: Deletes lines that start with # (comments).

Example 10: awk for Complex Pattern Matching and Extraction

Task: From a log file security.log, extract timestamps of failed login attempts.

Log File Format:

[Timestamp] [Log Level] [Message]

awk Command:

awk '/Failed login/ {print $1}' security.log

Explanation:

  • /Failed login/: Filters lines containing 'Failed login'.
  • {print $1}: Prints the timestamp field of those lines.

Example 11: Extracting and Sorting Log Entries by Date

Task: Sort application.log entries by date in descending order.

Log File Format:

[YYYY-MM-DD] [Message]

awk and sort Command:

awk '{print $1, $2}' application.log | sort -r

Explanation:

  • awk '{print $1, $2}': Extracts the date and the start of each entry.
  • sort -r: Sorts the results in reverse (descending) order.

Example 12: Merging Two Files Line by Line

Task: Merge lines from file1.txt and file2.txt.

awk Command:

awk 'NR==FNR{a[NR]=$0; next} {print a[FNR], $0}' file1.txt file2.txt

Explanation:

  • NR==FNR: True for the first file.
  • a[NR]=$0; next: Stores the lines of the first file.
  • {print a[FNR], $0}: Prints the corresponding lines from both files.

Example 13: Replacing Multiple Patterns with sed

Task: In config.txt, replace multiple outdated parameters with new ones.

sed Command:

sed -e 's/oldparam1/newparam1/' -e 's/oldparam2/newparam2/' config.txt

Explanation:

  • Multiple -e options allow for applying several substitution patterns.

Example 14: awk for Column-Wise Summation

Task: Sum values in the third column of data.csv.

CSV File Format:

Field1,Field2,Value

awk Command:

awk -F, '{sum += $3} END {print sum}' data.csv

Explanation:

  • -F,: Sets the comma as the field separator.
  • {sum += $3}: Sums up the third column.
  • END {print sum}: Outputs the total sum.

Example 15: Filtering and Redirecting Output with awk

Task: Extract error messages from system.log and save them to errors.log.

Log File Format:

[Timestamp] [Error Level] [Message]

awk Command:

awk '/ERROR/ {print > "errors.log"}' system.log

Explanation:

  • /ERROR/: Filters lines containing 'ERROR'.
  • {print > "errors.log"}: Redirects them to errors.log.

Example 16: sed for Adding Text Based on a Pattern

Task: Add a warning message after every error line in logs.txt.

sed Command:

sed '/ERROR/a\Warning: Check the error above.' logs.txt

Explanation:

  • /ERROR/a\...: Appends the warning message after each line containing 'ERROR'.

Example 17: awk for Conditional Field Modification

Task: In report.csv, double the value in the second column if the first column is 'A'.

CSV File Format:

Category,Value

awk Command:

awk -F, -v OFS=, '$1 == "A" {$2 *= 2} {print}' report.csv

Explanation:

  • $1 == "A" {$2 *= 2}: Doubles the second field if the first is 'A'.
  • -v OFS=,: Keeps the output comma-separated when awk rebuilds a modified record.

Example 18: Combining awk and sed for Complex File Editing

Task: Extract lines containing 'DEBUG' from dev.log, and remove timestamps.

Commands:

awk '/DEBUG/ {print}' dev.log | sed 's/^\[[0-9-: ]*\] //'

Explanation:

  • awk '/DEBUG/ {print}': Filters lines with 'DEBUG'.
  • sed 's/^\[[0-9-: ]*\] //': Removes the timestamp pattern.

Example 19: sed for Inline Commenting in Code Files

Task: Comment out lines containing 'deprecate' in code.py.

sed Command:

sed '/deprecate/s/^/#/' code.py

Explanation:

  • /deprecate/s/^/#/: Prepends # to lines containing 'deprecate'.

Example 20: awk for Extracting and Reporting Unique Values

Task: Find and count unique user IDs in users.log.

Log File Format:

UserID,Action

awk Command:

awk -F, '{users[$1]++} END {for (user in users) print user, users[user]}' users.log

Explanation:

  • {users[$1]++}: Counts occurrences of each UserID.
  • END {...}: Prints each UserID and its count.

Example 21: Parsing JSON-like Logs

Task: Extract values from a JSON-like log file, json.log, where each line is a JSON object.

Log File Format:

{"date": "2024-01-10", "event": "login", "user": "user1"}

awk Command:

awk -F'[,:}]' '{for (i=1; i<=NF; i++) if ($i ~ /"user"/) print $(i+1)}' json.log | tr -d ' "'

Explanation:

  • -F'[,:}]': Sets field separator to handle JSON structure.
  • if ($i ~ /"user"/): Searches for "user" field.
  • tr -d ' "': Removes extra characters.

Example 22: Extracting Data from Nested Log Structures

Task: Parse nested data structures in complex.log.

Log File Format:

[main:[subkey:value, ...], otherinfo:...]

awk Command:

awk -F'[,:[]' '{for (i=1; i<=NF; i++) if ($i ~ /subkey/) print $(i+1)}' complex.log

Explanation:

  • -F'[,:[]': Custom field separators for nested structure.
  • if ($i ~ /subkey/): Identifies desired nested key.

Example 23: sed for Conditional Replacement in Files

Task: In settings.conf, replace 'enabled' with 'disabled' only in lines containing 'featureX'.

sed Command:

sed '/featureX/s/enabled/disabled/' settings.conf

Explanation:

  • /featureX/s/enabled/disabled/: Conditional pattern replacement.

Example 24: awk for Generating Reports with Headers and Footers

Task: Generate a report from data.csv with a header and footer.

CSV File Format:

Name,Value

awk Command:

awk -F, 'BEGIN {print "Report Header"} {print $1, $2} END {print "Report Footer"}' data.csv

Explanation:

  • BEGIN {print "Report Header"}: Prints header.
  • END {print "Report Footer"}: Prints footer.

Example 25: sed for File Content Reversal

Task: Reverse the order of lines in file.txt.

sed Command:

sed '1!G;h;$!d' file.txt

Explanation:

  • 1!G appends the hold space (the lines seen so far, in reverse) after the current line, h saves that back to the hold space, and $!d suppresses output for every line except the last, so only the fully reversed buffer is printed.

Example 26: awk for Data Range Selection

Task: Select records from records.csv within a value range.

CSV File Format:

ID,Value

awk Command:

awk -F, '$2 >= 100 && $2 <= 200' records.csv

Explanation:

  • $2 >= 100 && $2 <= 200: Selects rows with second column in specified range.

Example 27: sed for Selective Line Number Printing

Task: Print specific line numbers from file.txt, for example, line 10 to 20.

sed Command:

sed -n '10,20p' file.txt

Explanation:

  • -n '10,20p': Prints lines 10 to 20.

Example 28: awk for Field Rearrangement

Task: Rearrange the fields in data.csv.

CSV File Format:

Field1,Field2,Field3

awk Command:

awk -F, '{print $3, $1, $2}' data.csv

Explanation:

  • {print $3, $1, $2}: Prints the fields in the new order. The output is space-separated by default; add -v OFS=, to keep comma-separated output.

Example 29: sed for Inserting Lines After a Match

Task: In script.sh, insert a new command after each 'echo' command.

sed Command:

sed '/echo/a\new_command' script.sh

Explanation:

  • /echo/a\new_command: Inserts 'new_command' after lines containing 'echo'.

Example 30: awk for Selective Field Capitalization

Task: Capitalize the first field in names.csv.

CSV File Format:

firstname,lastname

awk Command:

awk -F, '{print toupper($1), $2}' names.csv

Explanation:

  • toupper($1): Converts the entire first field to uppercase.

Example 31: Parsing Apache Log Files

Task: Extract and list unique visiting IPs from access.log.

awk Command:

awk '{print $1}' access.log | sort -u

Explanation:

  • $1 extracts the first field, typically the IP address.
  • sort -u sorts and lists unique entries.

Example 32: Text Wrapping in Paragraphs

Task: Wrap each paragraph in text.txt to 80 characters.

awk Command:

awk 'BEGIN{RS=""; FS="\n"} {gsub(/.{80}/,"&\n",$0); print}' text.txt

Explanation:

  • RS="" enables paragraph mode, treating blank lines as record separators.
  • gsub(/.{80}/,"&\n",$0) inserts a newline after every 80 characters, which can split words mid-way; a word-aware variant is sketched below.
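
Because a fixed 80-character split can break words in the middle, a word-aware variant builds each output line word by word; a sketch under the same paragraph-mode assumptions:

awk 'BEGIN {RS=""; FS="\n"}
{
    n = split($0, w, /[ \t\n]+/)                 # split the paragraph into words
    line = ""
    for (i = 1; i <= n; i++) {
        if (line == "")
            line = w[i]
        else if (length(line) + 1 + length(w[i]) <= 80)
            line = line " " w[i]                 # word still fits on the current line
        else {
            print line                           # flush the full line and start a new one
            line = w[i]
        }
    }
    print line
    print ""                                     # blank line between paragraphs
}' text.txt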

Example 33: Incrementing Numbers in File

Task: Increment all numbers by 1 in data.txt.

awk Command:

awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) $i = $i + 1} {print}' data.txt

Explanation:

  • sed performs no arithmetic, so a substitution such as s/\([0-9]\+\)/\1+1/ would only append the literal text "+1"; awk is the right tool here.
  • The loop visits every field, increments those that are purely numeric, and the rebuilt line is printed.

Example 34: Extracting Email Addresses

Task: Extract email addresses from contacts.txt.

awk Command:

awk '{for(i=1;i<=NF;i++) if ($i ~ /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/) print $i}' contacts.txt

Explanation:

  • Regular expression matches email addresses.

Example 35: Converting CSV to Tab-Delimited

Task: Convert data.csv to tab-delimited format.

awk Command:

awk 'BEGIN{FS=","; OFS="\t"} {print $1,$2,$3}' data.csv

Explanation:

  • FS="," sets the field separator to a comma.
  • OFS="\t" sets the output field separator to a tab.

Example 36: Removing Duplicate Lines

Task: Remove duplicate lines in file.txt.

awk Command:

awk '!seen[$0]++' file.txt

Explanation:

  • seen[$0]++ evaluates to 0 (false) the first time a line is seen, so !seen[$0]++ is true only for first occurrences; awk's default action prints those lines.

Example 37: Counting Word Frequency

Task: Count the frequency of each word in text.txt.

awk Command:

awk '{for(i=1;i<=NF;i++) words[$i]++} END{for(word in words) print word, words[word]}' text.txt

Explanation:

  • Counts and prints the frequency of each word.
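
To treat "Word", "word" and "word." as the same token, lower-case each word and strip punctuation before counting; a sketch:

awk '{for (i = 1; i <= NF; i++) {w = tolower($i); gsub(/[^a-z0-9]/, "", w); if (w != "") words[w]++}} END {for (word in words) print word, words[word]}' text.txt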

Example 38: In-place File Editing

Task: Replace 'oldtext' with 'newtext' in file.txt.

sed Command:

sed -i 's/oldtext/newtext/g' file.txt

Explanation:

  • In-place substitution of text.

Example 39: Summarizing Disk Usage

Task: Summarize disk usage from df output.

awk Command:

df | awk 'NR>1 {sum += $3} END {print "Total used:", sum}'

Explanation:

  • NR>1 skips the header line; the third column (space used, in 1K blocks by default with GNU df) is summed.

Example 40: Sed for Line Numbering

Task: Add line numbers to file.txt.

sed Command:

sed = file.txt | sed 'N;s/\n/ /'

Explanation:

  • sed = prints each line's number on a separate line; the second sed (N;s/\n/ /) joins every number with the line that follows it, separated by a space.
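
The same numbering can be done with awk alone, which also makes the format easy to adjust; a sketch:

awk '{printf "%6d  %s\n", NR, $0}' file.txt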

Example 41: Joining Lines Based on Pattern

Task: Join lines ending with a backslash in script.sh.

sed Command:

sed ':a; /\\$/N; s/\\\n//; ta' script.sh

Explanation:

  • :a sets a label; when a line ends with a backslash, N appends the next line, the substitution deletes the backslash and the embedded newline, and ta loops back to :a while joins keep succeeding.

Example 42: Multi-file Processing with awk

Task: Calculate the total of the third column across multiple CSV files.

awk Command:

awk 'BEGIN{FS=","} {total += $3} END{print total}' *.csv

Explanation:

  • Sums the third column of all CSV files.

Example 43: Filtering Specific Time Range

Task: Extract log entries from server.log within a specific hour.

awk Command:

awk '$2 >= "10:00:00" && $2 <= "11:00:00"' server.log

Explanation:

  • Filters entries based on time range.

Example 44: Random Line Selection

Task: Select a random line from quotes.txt.

awk Command:

awk 'BEGIN{srand()} {lines[NR]=$0} END{print lines[int(rand()*NR)+1]}' quotes.txt

Explanation:

  • Selects a random line from the file.

Example 45: Sed for Case Conversion

Task: Convert all text to uppercase in file.txt.

sed Command:

sed 's/.*/\U&/' file.txt

Explanation:

  • \U& is a GNU sed extension that converts the matched text (here, the entire line) to uppercase.

Example 46: Extracting Specific Columns

Task: Extract the first and third columns from data.csv.

awk Command:

awk -F, '{print $1,$3}' data.csv

Explanation:

  • Prints the first and third fields.

Example 47: In-line Comment Removal

Task: Remove C++ style comments (//) from code.cpp.

sed Command:

sed 's|//.*||' code.cpp

Explanation:

  • Removes in-line comments.

Example 48: Splitting a File into Multiple Parts

Task: Split largefile.txt into multiple files based on a pattern.

awk Command:

awk '/PATTERN/{n++} {print > ("outfile" n ".txt")}' largefile.txt

Explanation:

  • Increments n at each 'PATTERN' line and writes every line to the corresponding outfile; the parentheses around the file name expression keep the concatenation unambiguous across awk implementations.

Example 49: Aggregating Network Traffic Data

Task: Summarize total bytes transferred from network.log.

Log File Format:

IP,Bytes

awk Command:

awk -F, '{total[$1] += $2} END{for (ip in total) print ip, total[ip]}' network.log

Explanation:

  • Sums bytes for each IP address.
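
To list the heaviest talkers first, pipe the per-IP totals through a numeric sort on the second column; a sketch:

awk -F, '{total[$1] += $2} END{for (ip in total) print ip, total[ip]}' network.log | sort -k2 -rn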

Example 50: Sed for Conditional File Inclusion

Task: Include extra.txt contents in main.txt after a specific pattern.

sed Command:

sed '/PATTERN/r extra.txt' main.txt

Explanation:

  • Inserts extra.txt contents after 'PATTERN'.

Conclusion

awk and sed are two of the most potent text-processing utilities available in Unix-like environments. With their rich set of features, they can handle a vast range of tasks, from simple text substitutions to complex data manipulations and analyses. Their usefulness extends across various domains, making them indispensable for system administrators, data scientists, and developers alike. With a little practice, we can master these tools and significantly enhance our productivity and capabilities in data handling.
