Advanced Data Processing with awk and sed

Amal Shehu

Posted on January 10, 2024

Processing large log files is a common yet challenging task in data management. The Unix tools awk and sed are indispensable for such tasks, providing powerful capabilities for pattern scanning, data extraction, and stream editing.

Introduction to awk and sed

awk: A powerful programming language designed for text processing. It's particularly suited for processing structured text data, like tables, and excels in pattern scanning and processing.

sed: Short for Stream Editor, sed is a utility that performs basic text transformations on an input stream (a file or input from a pipeline). It's primarily used for substituting text patterns.

Understanding the Tools

  • awk Syntax: awk 'pattern {action}' file

    • pattern: Specifies the condition to be matched.
    • action: The set of commands to execute when the pattern matches.
    • file: The file to be processed.
  • sed Syntax: sed [options] 'command' file

    • options: Modifiers that alter the behavior of sed.
    • command: The sed command to execute (like substitution).
    • file: The file to process.
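
To make the syntax concrete, here are two minimal invocations. The file name example.txt is only a placeholder for illustration:

awk '$1 == "ERROR" {print $2}' example.txt    # print the 2nd field of lines whose 1st field is ERROR
sed 's/foo/bar/g' example.txt                 # replace every "foo" with "bar" and print to stdout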

awk and sed are not just tools; they are powerful allies in text processing and data manipulation. Their integration into Unix/Linux environments and their scripting capabilities make them indispensable for anyone working with text and data. Whether you are a beginner or an experienced professional, mastering these tools can significantly enhance your productivity and broaden your capabilities in data handling.

Example 1: Extracting and Formatting Data from a Log File

Task: From a log file, server.log, extract records from a specific date, format them into CSV, and perform some clean-up tasks. Each line follows this format:

[Date] [Time] [IP Address] [URL]


Step 1: Inspecting the Log File

Sample server.log entries:

2024-01-10 09:15:23 192.168.1.1 /index.html
2024-01-10 09:17:45 192.168.1.2 /about.html
2024-01-11 10:15:24 192.168.1.3 /contact.html
...

Step 2: Filtering and Formatting with awk

To extract and format records from 2024-01-10:

awk '$1 == "2024-01-10" {print $1","$2","$3","$4}' server.log > output.csv

Explanation:

  • $1 == "2024-01-10": Filters lines where the first field (date) is 2024-01-10.
  • print $1","$2","$3","$4: Formats the selected line into CSV format and writes to output.csv.
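
The same filter can also be written with awk's output field separator instead of concatenating commas by hand; a small sketch of the equivalent command:

awk -v OFS=',' '$1 == "2024-01-10" {print $1, $2, $3, $4}' server.log > output.csv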

Step 3: Advanced Text Manipulation with sed

Suppose we want to clean the URLs by removing http://www.example.com and changing /index.html to /home.html:

sed -i 's|http://www\.example\.com||g' output.csv
sed -i 's|/index\.html|/home\.html|g' output.csv

Explanation:

  • The first sed command removes the specified URL part.
  • The second sed command replaces /index.html with /home.html.
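
If the intermediate file does not need to be edited in place, the extraction and clean-up can also be expressed as a single pipeline; a sketch of the combined command, under the same assumptions as above:

awk -v OFS=',' '$1 == "2024-01-10" {print $1, $2, $3, $4}' server.log | sed 's|http://www\.example\.com||g; s|/index\.html|/home\.html|g' > output.csv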

Final Output

output.csv will now contain clean, well-formatted data:

2024-01-10,09:15:23,192.168.1.1,/home.html
2024-01-10,09:17:45,192.168.1.2,/about.html
...

Example 2: Extracting Specific Fields and Summarizing Data

Task: From a log file, sales.log, extract sales data for a particular product, summarize total sales.

Log File Format:

Date, Product ID, Quantity, Price

awk Command:

awk -F, '$2 == "Product123" {total += $3 * $4} END {print "Total Sales for Product123: ", total}' sales.log

Explanation:

  • -F,: Sets field separator as comma.
  • $2 == "Product123": Filters records for 'Product123'.
  • {total += $3 * $4}: Calculates total sales.
  • END {print ...}: Prints the final total.
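
The product ID can also be passed in with -v, which makes the one-liner easy to reuse from a shell script; a sketch:

awk -F, -v prod="Product123" '$2 == prod {total += $3 * $4} END {print "Total Sales for " prod ":", total}' sales.log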

Example 3: Combining awk and sed for Data Cleanup and Transformation

Task: From server.log, extract data, reformat dates, and remove specific query parameters from URLs.

Server Log Format:

YYYY-MM-DD HH:MM:SS IP /path?query

Commands:

awk '$1 >= "2024-01-01" && $1 <= "2024-01-31" {print $1","$2","$3","$4}' server.log | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3-\2-\1/' | sed 's/[?].*//'

Explanation:

  1. awk filters and formats January 2024 data.
  2. First sed changes date format to DD-MM-YYYY.
  3. Second sed removes query parameters.

Example 4: Analyzing and Reformatting Multi-Line Records

Task: Process a multi-line log file where each record spans multiple lines, and reformat the output.

Multi-Line Log Format:

Start: [Date]
Info: [Details]
End: [Date]
---

awk Command:

awk '/Start/ {start = $2} /Info/ {info = $2} /End/ {print "Record from " start " to " $2 ": " info}' multi-line.log

Explanation:

  • Stores the start date and the details in variables as the Start and Info lines are read.
  • When the End line appears, prints a one-line summary for the record (a more robust variant is sketched below).
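
Because the default field splitting only captures the first word after each label, multi-word details would be truncated. A slightly more robust sketch strips the label with sub() instead, assuming the same Start/Info/End layout:

awk '/^Start:/ {start = $2} /^Info:/ {info = $0; sub(/^Info: */, "", info)} /^End:/ {print "Record from " start " to " $2 ": " info}' multi-line.log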

Example 5: Sed for Advanced In-Place Editing

Task: Modify config.txt by replacing specific configuration values.

Commands:

sed -i '/max_connections/c\max_connections=500' config.txt
sed -i '/timeout/s/30/60/' config.txt

Explanation:

  1. Changes max_connections to 500.
  2. Replaces timeout value 30 with 60.
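
Since -i rewrites the file directly, it is prudent to keep a backup copy. GNU sed accepts a suffix attached to -i (the exact -i syntax differs slightly on BSD/macOS sed); a sketch:

sed -i.bak '/max_connections/c\max_connections=500' config.txt   # original saved as config.txt.bak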

Example 6: awk for Column-Based Arithmetic Operations

Task: Calculate average memory usage from a system log file, system.log.

System Log Format:

Timestamp, CPU Load, Memory Usage

awk Command:

awk -F, '{total += $3; count++} END {print "Average Memory Usage:", total/count}' system.log

Explanation:

  • Summarizes memory usage and calculates average.

Example 7: Filtering and Counting Unique IP Addresses

Task: From a web server log file, access.log, extract and count unique IP addresses.

Log File Format:

IP Address - Date Time Request ...

awk Command:

awk '{print $1}' access.log | sort | uniq -c

Explanation:

  • awk '{print $1}': Extracts the first field, typically the IP address.
  • sort: Sorts the IP addresses.
  • uniq -c: Counts the occurrences of each unique IP address.
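
To rank the addresses by request count, append a numeric sort on the counts; a sketch showing the ten busiest IPs:

awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10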

Example 8: Processing Multi-Field Records and Summarizing

Task: In a file inventory.csv, calculate total value of each product category.

CSV File Format:

Category,Product,Quantity,Unit Price

awk Command:

awk -F, '{total[$1] += $3 * $4} END {for (cat in total) print cat, total[cat]}' inventory.csv

Explanation:

  • -F,: Sets field separator as comma.
  • {total[$1] += $3 * $4}: Aggregates total value per category.
  • END {for (cat in total) ...}: Iterates over categories to print total values.
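
For a tidier report, printf gives control over alignment and decimal places; a sketch of the same aggregation:

awk -F, '{total[$1] += $3 * $4} END {for (cat in total) printf "%-15s %10.2f\n", cat, total[cat]}' inventory.csv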

Example 9: sed for Line Deletion Based on Pattern

Task: Remove all comments from a configuration file, config.txt.

Configuration File Format:

# This is a comment
parameter=value
# Another comment

sed Command:

sed '/^#/d' config.txt

Explanation:

  • /^#/d: Deletes lines that start with # (comments).

Example 10: awk for Complex Pattern Matching and Extraction

Task: From a log file security.log, extract timestamps of failed login attempts.

Log File Format:

[Timestamp] [Log Level] [Message]

awk Command:

awk '/Failed login/ {print $1}' security.log

Explanation:

  • /Failed login/: Filters lines containing 'Failed login'.
  • {print $1}: Prints the timestamp field of those lines.

Example 11: Extracting and Sorting Log Entries by Date

Task: Sort application.log entries by date in descending order.

Log File Format:

[YYYY-MM-DD] [Message]

awk and sort Command:

awk '{print $1, $2}' application.log | sort -r

Explanation:

  • awk '{print $1, $2}': Extracts the date and the start of each entry.
  • sort -r: Sorts the results in reverse (descending) order.

Example 12: Merging Two Files Line by Line

Task: Merge lines from file1.txt and file2.txt.

awk Command:

awk 'NR==FNR{a[NR]=$0; next} {print a[FNR], $0}' file1.txt file2.txt

Explanation:

  • NR==FNR: True for the first file.
  • a[NR]=$0; next: Stores the lines of the first file.
  • {print a[FNR], $0}: Prints the corresponding lines from both files.

Example 13: Replacing Multiple Patterns with sed

Task: In config.txt, replace multiple outdated parameters with new ones.

sed Command:

sed -e 's/oldparam1/newparam1/' -e 's/oldparam2/newparam2/' config.txt

Explanation:

  • Multiple -e options allow for applying several substitution patterns.

Example 14: awk for Column-Wise Summation

Task: Sum values in the third column of data.csv.

CSV File Format:

Field1,Field2,Value

awk Command:

awk -F, '{sum += $3} END {print sum}' data.csv

Explanation:

  • -F,: Sets the comma as the field separator.
  • {sum += $3}: Sums up the third column.
  • END {print sum}: Outputs the total sum.

Example 15: Filtering and Redirecting Output with awk

Task: Extract error messages from system.log and save them to errors.log.

Log File Format:

[Timestamp] [Error Level] [Message]

awk Command:

awk '/ERROR/ {print > "errors.log"}' system.log

Explanation:

  • /ERROR/: Filters lines containing 'ERROR'.
  • {print > "errors.log"}: Redirects them to errors.log.

Example 16: sed for Adding Text Based on a Pattern

Task: Add a warning message after every error line in logs.txt.

sed Command:

sed '/ERROR/a\Warning: Check the error above.' logs.txt

Explanation:

  • /ERROR/a\...: Appends the warning message after each line containing 'ERROR'.

Example 17: awk for Conditional Field Modification

Task: In report.csv, double the value in the second column if the first column is 'A'.

CSV File Format:

Category,Value

awk Command:

awk -F, -v OFS=, '$1 == "A" {$2 *= 2} {print}' report.csv

Explanation:

  • $1 == "A" {$2 *= 2}: Doubles the second field if the first is 'A'.
  • -v OFS=,: Keeps the output comma-separated when awk rebuilds a modified record.

Example 18: Combining awk and sed for Complex File Editing

Task: Extract lines containing 'DEBUG' from dev.log, and remove timestamps.

Commands:

awk '/DEBUG/ {print}' dev.log | sed 's/^\[[0-9-: ]*\] //'

Explanation:

  • awk '/DEBUG/ {print}': Filters lines with 'DEBUG'.
  • sed 's/^\[[0-9-: ]*\] //': Removes the timestamp pattern.

Example 19: sed for Inline Commenting in Code Files

Task: Comment out lines containing 'deprecate' in code.py.

sed Command:

sed '/deprecate/s/^/#/' code.py

Explanation:

  • /deprecate/s/^/#/: Prepends # to lines containing 'deprecate'.

Example 20: awk for Extracting and Reporting Unique Values

Task: Find and count unique user IDs in users.log.

Log File Format:

UserID,Action

awk Command:

awk -F, '{users[$1]++} END {for (user in users) print user, users[user]}' users.log

Explanation:

  • {users[$1]++}: Counts occurrences of each UserID.
  • END {...}: Prints each UserID and its count.

Example 21: Parsing JSON-like Logs

Task: Extract values from a JSON-like log file, json.log, where each line is a JSON object.

Log File Format:

{"date": "2024-01-10", "event": "login", "user": "user1"}

awk Command:

awk -F'[,:}]' '{for (i=1; i<=NF; i++) if ($i ~ /"user"/) print $(i+1)}' json.log | tr -d ' "'

Explanation:

  • -F'[,:}]': Sets field separator to handle JSON structure.
  • if ($i ~ /"user"/): Searches for "user" field.
  • tr -d ' "': Removes extra characters.

Example 22: Extracting Data from Nested Log Structures

Task: Parse nested data structures in complex.log.

Log File Format:

[main:[subkey:value, ...], otherinfo:...]

awk Command:

awk -F'[,:[]' '{for (i=1; i<=NF; i++) if ($i ~ /subkey/) print $(i+1)}' complex.log

Explanation:

  • -F'[,:[]': Custom field separators for nested structure.
  • if ($i ~ /subkey/): Identifies desired nested key.

Example 23: sed for Conditional Replacement in Files

Task: In settings.conf, replace 'enabled' with 'disabled' only in lines containing 'featureX'.

sed Command:

sed '/featureX/s/enabled/disabled/' settings.conf

Explanation:

  • /featureX/s/enabled/disabled/: Conditional pattern replacement.

Example 24: awk for Generating Reports with Headers and Footers

Task: Generate a report from data.csv with a header and footer.

CSV File Format:

Name,Value

awk Command:

awk -F, 'BEGIN {print "Report Header"} {print $1, $2} END {print "Report Footer"}' data.csv

Explanation:

  • BEGIN {print "Report Header"}: Prints header.
  • END {print "Report Footer"}: Prints footer.

Example 25: sed for File Content Reversal

Task: Reverse the order of lines in file.txt.

sed Command:

sed '1!G;h;$!d' file.txt

Explanation:

  • 1!G appends the hold space (the lines seen so far, in reverse) after the current line, h saves that back to the hold space, and $!d suppresses output for every line except the last, so only the fully reversed buffer is printed.

Example 26: awk for Data Range Selection

Task: Select records from records.csv within a value range.

CSV File Format:

ID,Value

awk Command:

awk -F, '$2 >= 100 && $2 <= 200' records.csv

Explanation:

  • $2 >= 100 && $2 <= 200: Selects rows with second column in specified range.

Example 27: sed for Selective Line Number Printing

Task: Print specific line numbers from file.txt, for example, line 10 to 20.

sed Command:

sed -n '10,20p' file.txt

Explanation:

  • -n '10,20p': Prints lines 10 to 20.

Example 28: awk for Field Rearrangement

Task: Rearrange the fields in data.csv.

CSV File Format:

Field1,Field2,Field3

awk Command:

awk -F, '{print $3, $1, $2}' data.csv

Explanation:

  • {print $3, $1, $2}: Prints the fields in the new order. The output is space-separated by default; add -v OFS=, to keep comma-separated output.

Example 29: sed for Inserting Lines After a Match

Task: In script.sh, insert a new command after each 'echo' command.

sed Command:

sed '/echo/a\new_command' script.sh

Explanation:

  • /echo/a\new_command: Inserts 'new_command' after lines containing 'echo'.

Example 30: awk for Selective Field Capitalization

Task: Capitalize the first field in names.csv.

CSV File Format:

firstname,lastname

awk Command:

awk -F, '{print toupper($1), $2}' names.csv

Explanation:

  • toupper($1): Converts the entire first field to uppercase.

Example 31: Parsing Apache Log Files

Task: Extract and list unique visiting IPs from access.log.

awk Command:

awk '{print $1}' access.log | sort -u

Explanation:

  • $1 extracts the first field, typically the IP address.
  • sort -u sorts and lists unique entries.

Example 32: Text Wrapping in Paragraphs

Task: Wrap each paragraph in text.txt to 80 characters.

awk Command:

awk 'BEGIN{RS=""; FS="\n"} {gsub(/.{80}/,"&\n",$0); print}' text.txt

Explanation:

  • RS="" enables paragraph mode, treating blank lines as record separators.
  • gsub(/.{80}/,"&\n",$0) inserts a newline after every 80 characters, which can split words mid-way; a word-aware variant is sketched below.
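
Because a fixed 80-character split can break words in the middle, a word-aware variant builds each output line word by word; a sketch under the same paragraph-mode assumptions:

awk 'BEGIN {RS=""; FS="\n"}
{
    n = split($0, w, /[ \t\n]+/)                 # split the paragraph into words
    line = ""
    for (i = 1; i <= n; i++) {
        if (line == "")
            line = w[i]
        else if (length(line) + 1 + length(w[i]) <= 80)
            line = line " " w[i]                 # word still fits on the current line
        else {
            print line                           # flush the full line and start a new one
            line = w[i]
        }
    }
    print line
    print ""                                     # blank line between paragraphs
}' text.txt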

Example 33: Incrementing Numbers in File

Task: Increment all numbers by 1 in data.txt.

awk Command:

awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) $i = $i + 1} {print}' data.txt

Explanation:

  • sed performs no arithmetic, so a substitution such as s/\([0-9]\+\)/\1+1/ would only append the literal text "+1"; awk is the right tool here.
  • The loop visits every field, increments those that are purely numeric, and the rebuilt line is printed.

Example 34: Extracting Email Addresses

Task: Extract email addresses from contacts.txt.

awk Command:

awk '{for(i=1;i<=NF;i++) if ($i ~ /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/) print $i}' contacts.txt

Explanation:

  • Regular expression matches email addresses.

Example 35: Converting CSV to Tab-Delimited

Task: Convert data.csv to tab-delimited format.

awk Command:

awk 'BEGIN{FS=","; OFS="\t"} {print $1,$2,$3}' data.csv

Explanation:

  • FS="," sets the field separator to a comma.
  • OFS="\t" sets the output field separator to a tab.

Example 36: Removing Duplicate Lines

Task: Remove duplicate lines in file.txt.

awk Command:

awk '!seen[$0]++' file.txt

Explanation:

  • seen[$0]++ evaluates to 0 (false) the first time a line is seen, so !seen[$0]++ is true only for first occurrences; awk's default action prints those lines.

Example 37: Counting Word Frequency

Task: Count the frequency of each word in text.txt.

awk Command:

awk '{for(i=1;i<=NF;i++) words[$i]++} END{for(word in words) print word, words[word]}' text.txt

Explanation:

  • Counts and prints the frequency of each word.
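
To treat "Word", "word" and "word." as the same token, lower-case each word and strip punctuation before counting; a sketch:

awk '{for (i = 1; i <= NF; i++) {w = tolower($i); gsub(/[^a-z0-9]/, "", w); if (w != "") words[w]++}} END {for (word in words) print word, words[word]}' text.txt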

Example 38: In-place File Editing

Task: Replace 'oldtext' with 'newtext' in file.txt.

sed Command:

sed -i 's/oldtext/newtext/g' file.txt

Explanation:

  • In-place substitution of text.

Example 39: Summarizing Disk Usage

Task: Summarize disk usage from df output.

awk Command:

df | awk 'NR>1 {sum += $3} END {print "Total used:", sum}'

Explanation:

  • NR>1 skips the header line; the third column (space used, in 1K blocks by default with GNU df) is summed.

Example 40: Sed for Line Numbering

Task: Add line numbers to file.txt.

sed Command:

sed = file.txt | sed 'N;s/\n/ /'

Explanation:

  • sed = prints each line's number on a separate line; the second sed (N;s/\n/ /) joins every number with the line that follows it, separated by a space.
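
The same numbering can be done with awk alone, which also makes the format easy to adjust; a sketch:

awk '{printf "%6d  %s\n", NR, $0}' file.txt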

Example 41: Joining Lines Based on Pattern

Task: Join lines ending with a backslash in script.sh.

sed Command:

sed ':a; /\\$/N; s/\\\n//; ta' script.sh

Explanation:

  • :a sets a label; when a line ends with a backslash, N appends the next line, the substitution deletes the backslash and the embedded newline, and ta loops back to :a while joins keep succeeding.

Example 42: Multi-file Processing with awk

Task: Calculate the total of the third column across multiple CSV files.

awk Command:

awk 'BEGIN{FS=","} {total += $3} END{print total}' *.csv

Explanation:

  • Sums the third column of all CSV files.

Example 43: Filtering Specific Time Range

Task: Extract log entries from server.log within a specific hour.

awk Command:

awk '$2 >= "10:00:00" && $2 <= "11:00:00"' server.log

Explanation:

  • Filters entries based on time range.

Example 44: Random Line Selection

Task: Select a random line from quotes.txt.

awk Command:

awk 'BEGIN{srand()} {lines[NR]=$0} END{print lines[int(rand()*NR)+1]}' quotes.txt

Explanation:

  • Selects a random line from the file.

Example 45: Sed for Case Conversion

Task: Convert all text to uppercase in file.txt.

sed Command:

sed 's/.*/\U&/' file.txt

Explanation:

  • \U& is a GNU sed extension that converts the matched text (here, the entire line) to uppercase.

Example 46: Extracting Specific Columns

Task: Extract the first and third columns from data.csv.

awk Command:

awk -F, '{print $1,$3}' data.csv

Explanation:

  • Prints the first and third fields.

Example 47: In-line Comment Removal

Task: Remove C++ style comments (//) from code.cpp.

sed Command:

sed 's|//.*||' code.cpp

Explanation:

  • Removes in-line comments.

Example 48: Splitting a File into Multiple Parts

Task: Split largefile.txt into multiple files based on a pattern.

awk Command:

awk '/PATTERN/{n++} {print > ("outfile" n ".txt")}' largefile.txt

Explanation:

  • Increments n at each 'PATTERN' line and writes every line to the corresponding outfile; the parentheses around the file name expression keep the concatenation unambiguous across awk implementations.

Example 49: Aggregating Network Traffic Data

Task: Summarize total bytes transferred from network.log.

Log File Format:

IP,Bytes

awk Command:

awk -F, '{total[$1] += $2} END{for (ip in total) print ip, total[ip]}' network.log

Explanation:

  • Sums bytes for each IP address.
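
To list the heaviest talkers first, pipe the per-IP totals through a numeric sort on the second column; a sketch:

awk -F, '{total[$1] += $2} END{for (ip in total) print ip, total[ip]}' network.log | sort -k2 -rn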

Example 50: Sed for Conditional File Inclusion

Task: Include extra.txt contents in main.txt after a specific pattern.

sed Command:

sed '/PATTERN/r extra.txt' main.txt

Explanation:

  • Inserts extra.txt contents after 'PATTERN'.

Conclusion

awk and sed are two of the most potent text-processing utilities available in Unix-like environments. With their rich set of features, they can handle a vast range of tasks, from simple text substitutions to complex data manipulations and analyses. Their usefulness extends across various domains, making them indispensable for system administrators, data scientists, and developers alike. With a little practice, we can master these tools and significantly enhance our productivity and capabilities in data handling.
