
Hunter Johnson

Posted on June 19, 2023

Hands-on AWK

This article was authored by Mehvish Poshni, a member of Educative's technical content team.

For many of us, our first exposure to programming is through a general-purpose language like C, Python, or Java. AWK, on the other hand, was designed with the very targeted goal of processing text-based data without having to write many lines of code. That's not to say that AWK is limited to this function alone; far from it. However, the effectiveness of AWK is largely due to the control it offers when writing quick one-liner command-line programs, as well as short scripts that serve an immediate need. Imagine having the prowess to manipulate system logs, configuration files, and spreadsheet data from the command line in just a few keystrokes. Another reason why it's worthwhile to learn AWK is that it comes pre-installed as the utility awk on Unix-like operating systems, and its inclusion in the Unix ecosystem makes it very convenient to use.

The letters A, W, and K in AWK stand for the last names of the individuals (Alfred Aho, Peter Weinberger, and Brian Kernighan) who designed the programming language in the late 1970s.


Note: This blog assumes a passing familiarity with a command-line shell (cat, echo, pipes, and redirection) and some prior programming exposure (concepts like comparison and logical operators, expressions, conditionals, and loops).

Input structure for an AWK program

An AWK program takes input either in the form of one or more text files, or as the standard input stream coming from the shell environment in which awk executes. The default behavior is that each line in the input stream is considered one record, and each record has fields (text) separated by one or more whitespace characters (spaces or tabs). This default behavior can be overridden easily.
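As a rough sketch of overriding the default, the -F option tells awk to split fields on a different separator. The colon-separated sample data below is hypothetical and is not the input file used in the rest of this blog.

```shell
# Hypothetical colon-separated records, piped in on standard input.
# -F ':' makes awk split each record on colons instead of whitespace.
result=$(printf 'alice:30\nbob:25\n' | awk -F ':' '{ print $2 }')
printf '%s\n' "$result"   # prints 30, then 25
```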

The examples in this blog use an input text file named inputfile. Each line holds a first name, last name, age, joining date, score, team, and a Yes/No field, in that order:

Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Candace Walsh 25 Apr-14-2023 43 Sales No
Grady Clements 40 Feb-15-2023 36 Sales Yes
Macaulay Roy 33 Jul-11-2022 63 Engineering Yes
Abraham Strickland 31 Aug-25-2022 93 Marketing Yes
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes

Note: The input file may not necessarily have the same number of fields on each line.

An unusual workflow

The manner in which an AWK program runs is unusual: behind the scenes, the code is executed once for each record in the input.


Whenever a record is read, there are special built-in variables $1, $2, $3 (and so on) that can be used for accessing the values in the first, second, third (and so on) fields of that record. The entire record can also be retrieved all at once using the built-in variable $0.
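Here is a small sketch of field access on a single record. The record is typed in with echo purely for illustration.

```shell
# One record with three whitespace-separated fields.
# $2 is the second field; $0 is the whole record.
result=$(echo "Megan Porter 29" | awk '{ print $2; print $0 }')
printf '%s\n' "$result"   # prints Porter, then Megan Porter 29
```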

An AWK program

A basic AWK program consists of one or more pattern-action pairs in the following general form.

pattern { action }
  • The pattern is an expression that evaluates to a value that's regarded as true or false.

Note: AWK does not have a boolean data type, but 0 and the empty string "" are regarded as false, and all other values as true.

  • The action consists of one or more statements. In case there are multiple statements within an action, they may be separated by either a semicolon character (;) or a newline.

When running an AWK program, each pattern is tested against every record of the input stream, one by one.
Whenever the pattern evaluates to true, the corresponding action is executed.
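To make this concrete, here is a minimal sketch of a program with two pattern-action pairs. Every record is tested against both patterns, so a single record can trigger more than one action. The numbers are made-up sample input.

```shell
# Three one-field records: 5, 15, and 25.
# 5 matches neither pattern, 15 matches the first, 25 matches both.
result=$(printf '5\n15\n25\n' | awk '
$1 > 10 { print $1 " is greater than 10" }
$1 > 20 { print $1 " is greater than 20" }')
printf '%s\n' "$result"
```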


The entire program is enclosed within single quotes, and can be run from the command line using the awk utility:

awk 'pattern { action }' inputfile

Here, inputfile is the input to the program. More than one file can also be passed as input.

awk 'pattern { action }' file1 file2

Since we are running the program using the awk utility in the shell, output can be stored in a file using the redirection operator >.

awk 'pattern { action }' inputfile > outputfile

In the same vein, the input can also be taken using the pipe operator.

cat inputfile | awk 'pattern { action }'

Examples

1- In the following one-liner, we print the records for which the age (in the third column) is less than 30.

awk '$3 < 30 { print $0 }' inputfile

Output:

Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Candace Walsh 25 Apr-14-2023 43 Sales No

2- See how none of the records get printed when 0 or the empty string "" is used as a pattern.

cat inputfile | awk '0 { print $0 }' 

awk '"" { print $0 }' inputfile

Output: (none). Neither command prints anything, because the pattern is false for every record.

3- The pattern in the following snippet is a non-empty string (from the second column in inputfile) which is considered true. So the action is executed for all records.

Observe, also, how we can concatenate different strings by placing them side by side.

awk '$2 { print $2 ", " $1 }' inputfile > outputfile

cat outputfile # To display the contents of the outputfile

Output:

Dominguez, Colton
Porter, Megan
Walsh, Candace
Clements, Grady
Roy, Macaulay
Strickland, Abraham
Higgins, Joelle

Patterns and actions are optional

It isn't necessary to specify both pattern and { action }. Just one of them suffices:

  • When pattern is not specified, the action is performed for all the records.
awk '{ print $1 "." $2 "@educative.io" }' inputfile

Output:

Colton.Dominguez@educative.io
Megan.Porter@educative.io
Candace.Walsh@educative.io
Grady.Clements@educative.io
Macaulay.Roy@educative.io
Abraham.Strickland@educative.io
Joelle.Higgins@educative.io
  • When { action } is not specified, the default action is to print all the matched records.
awk '$6 == "Marketing" || $7 == "No"' inputfile

Output:

Colton Dominguez 28 Feb-20-2021 33 Marketing No
Candace Walsh 25 Apr-14-2023 43 Sales No
Abraham Strickland 31 Aug-25-2022 93 Marketing Yes

The BEGIN and END patterns

There are other ways to specify a pattern than creating expressions using numbers, strings, arithmetic or logical operators.

  • The pattern BEGIN matches the start of the run, before any input is read. So, its associated action is executed once, before the first record is read. It makes sense to use it for tasks like initializing variables.
  • The pattern END matches the end of the input, and its associated action is executed once, after the last record has been processed.

Think about how to add the scores listed in the fifth column of the input file.

awk 'BEGIN { sum = 0 } 
{ sum += $5 } 
END { print "Sum of scores is " sum }' inputfile

Output:

Sum of scores is 438
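A related sketch: because the built-in variable NR still holds the number of records read when the END action runs, the same accumulate-then-report idea can produce an average. The three name and score pairs are invented sample data. An unset variable such as sum starts out as 0 in arithmetic, so this version skips the BEGIN block.

```shell
# Add up the second field of every record, then divide by the record count.
# The NR > 0 check avoids dividing by zero on empty input.
result=$(printf 'a 10\nb 20\nc 30\n' |
  awk '{ sum += $2 } END { if (NR > 0) print "Average is " sum / NR }')
printf '%s\n' "$result"   # prints Average is 20
```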

Regular expressions as patterns

Regular expressions (regex) are symbolic ways to represent a pattern, and specify what the matching text should look like.

The syntax of regular expressions used in AWK is known as the Extended Regular Expression (ERE).

This syntax is also used by many other languages and Unix-based utilities. So, it's super useful to know.

Using a regex


In AWK, when specifying a regex as a pattern, we can include it between two forward slashes. The simplest form of a regex is as a plain sequence of characters. For example, the pattern /Feb/ matches all records containing the text Feb.

awk '/Feb/' inputfile

Output:

Colton Dominguez 28 Feb-20-2021 33 Marketing No
Grady Clements 40 Feb-15-2023 36 Sales Yes

Instead of searching the entire record for a match against a regex, we can use the operator ~ to check if a regular expression matches a smaller portion of the given text. Similarly, the operator !~ is useful for checking if there is no match.

Usage: The regular expression must appear on the right of ~ or !~, and the text being searched must go on the left.

awk '$6 ~ /Sa/' inputfile # Sa present in the 2nd last column
echo " "
awk '$(6+1) !~ /Y/' inputfile #  Absence of Y in the last column 

Output:

Candace Walsh 25 Apr-14-2023 43 Sales No
Grady Clements 40 Feb-15-2023 36 Sales Yes

Colton Dominguez 28 Feb-20-2021 33 Marketing No
Candace Walsh 25 Apr-14-2023 43 Sales No

Regex metacharacters

A regular expression may include some special characters called metacharacters, so called because they are not matched with text in a literal sense. Instead, they are interpreted as rules for matching text. Here are some examples:

  • The metacharacters [ and ] match one of (possibly) many characters that appear enclosed within the brackets. For example, [AbC] means a single character: either A, b, or C. Expressions like these are called character classes.
  • A range of characters can also be represented as character classes. For example:
    • [0-9] means a single numeric character from 0 to 9.
    • [a-zA-Z] means a single alphabetical character in upper or lower case.
  • The metacharacters [^ ] specify a single character other than the ones appearing after the symbol ^ inside the character class. For example, [^bcd] means any character other than b, c, or d.
Note: A character class always matches a single character, so it cannot express a multi-digit numeric range by itself. For example, [20-25] means one character that is 2, anything from 0 to 2, or 5. To match the scores from 30 to 39, we match a 3 followed by one digit.

# score column contains a 3 followed by another digit (30 to 39)
awk '$5 ~ /3[0-9]/ { print $1 " scored in the 30 to 39 range" }' inputfile
echo " "
# 2022 or 2023 not present in the 4th column
awk '$4 ~ /202[^23]/ { print "Joining year of " $1 " is neither 2022 nor 2023" }' inputfile

Output:

Colton scored in the 30 to 39 range
Grady scored in the 30 to 39 range

Joining year of Colton is neither 2022 nor 2023
Joining year of Megan is neither 2022 nor 2023
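As one more sketch of a negated character class, the pattern below flags records whose second field contains at least one character that is not a digit. The three records are hypothetical sample data.

```shell
# [^0-9] matches any single character that is not a digit,
# so only the record whose second field is 1x is reported.
result=$(printf 'a 12\nb 1x\nc 7\n' |
  awk '$2 ~ /[^0-9]/ { print $1 " has a non-digit in field 2" }')
printf '%s\n' "$result"   # prints b has a non-digit in field 2
```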

The metacharacters $ and ^ (outside a character class) take on meaning relative to some other character X in the following way:

  • ^X means lines that start with X.
  • X$ means lines that end with X.
awk '/^J/' inputfile # Lines that start with J
echo " "
awk '/o$/' inputfile #Lines that end with o

Output:

Joelle Higgins 42 Sep-23-2022 89 Engineering Yes

Colton Dominguez 28 Feb-20-2021 33 Marketing No
Candace Walsh 25 Apr-14-2023 43 Sales No
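When a regex is applied to a single field with ~, the anchors refer to the start and end of that field rather than the whole line. The sketch below contrasts the two cases on two made-up records.

```shell
# /^S/ alone is tested against the whole record.
by_record=$(printf 'Walsh Sales\nSmith Marketing\n' | awk '/^S/')
# $2 ~ /^S/ is tested against the second field only.
by_field=$(printf 'Walsh Sales\nSmith Marketing\n' | awk '$2 ~ /^S/')
printf '%s\n' "$by_record"   # prints Smith Marketing
printf '%s\n' "$by_field"    # prints Walsh Sales
```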
  • The metacharacter . means any single character.
  • The metacharacter | means characters specified by the regex on its left or its right. For example, ab|[cd] matches either ab, c or d.
  • The metacharacters () are used for grouping. For example, ^M|P versus ^(M|P) mean two different things (lines beginning with M or lines containing P anywhere, versus lines beginning with either M or P).

Note: The GNU implementation of AWK, known as GAWK, also supports additional features including the use of metacharacters () for capturing portions of matched text for later use.

awk '/(M..a)/' inputfile # Matches substrings of Megan and Macaulay
echo " "
awk '/(D|P)o/' inputfile # Matches substrings of Dominguez and Porter

Output:

Megan Porter 29 Dec-03-2021 81 Engineering Yes
Macaulay Roy 33 Jul-11-2022 63 Engineering Yes

Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes

The metacharacters *, +, ?, and {n,m} are called quantifiers. They also take on meaning relative to the character (or group) that precedes them, say X:

  • The expression X* means zero or more occurrences of X.
  • The expression X+ means one or more occurrences of X.
  • The expression X? means zero or one occurrence of X.
  • The expression X{n,m} means at least n and at most m occurrences of X. (Not all awk implementations support this form, and it is not used in the examples below.)
awk '/i[g]+/' inputfile # Matches igg in Higgins (Joelle)
echo " "
awk '/oe?l/' inputfile #  Joelle and Colton

Output:

Joelle Higgins 42 Sep-23-2022 89 Engineering Yes

Colton Dominguez 28 Feb-20-2021 33 Marketing No
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes

Note: To match a metacharacter literally, we need to use the escape character \. For example, /\*/ matches the character *.
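Anchors, quantifiers, and escaping can be combined. As a small sketch on invented input, the regex below accepts only lines made of one or more digits, a literal dot, and one or more digits.

```shell
# \. matches a literal dot; an unescaped . would match any character.
# ^ and $ force the whole line to fit the pattern.
result=$(printf '3.14\n314\n3x14\n' | awk '/^[0-9]+\.[0-9]+$/')
printf '%s\n' "$result"   # prints 3.14
```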

Data structure: Associative array

An associative array is the only data structure supported by AWK. It essentially consists of index and value pairs, where the index can be used for retrieving the corresponding value.


An associative array is created simply through an assignment statement that maps a value to an index. The syntax looks like this:

arr["ind"] = "val"

We can also add more elements to an array using assignment statements like the one above.

awk '{ arr[$1] = $3 } 
END{ for (i in arr ) 
        { 
            print i " " arr[i]
        } 
    }' inputfile

Output:

Grady 40
Macaulay 33
Megan 29
Colton 28
Joelle 42
Candace 25
Abraham 31

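Notice how we loop over the array arr using a for(i in arr) style loop. In each round, i is set to the index of an element in arr (and not an element in arr). The order in which the indices are visited is not specified, which is why the output above does not follow the order of the input file.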

AWK also supports a C-style for loop (see exact syntax below), but it isn't suitable for traversing over an associative array because the keys of an associative array may not fall in the required range of numbers.

for(i = 1; i < 10; i++) 
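The C-style loop is a natural fit when the indices really are consecutive numbers, such as the field numbers 1 through NF. The sketch below, on a made-up record, prints the fields in reverse order.

```shell
# Walk the field numbers from NF down to 1.
# The conditional expression chooses a space between fields
# and a newline after the last one printed.
result=$(echo "a b c" |
  awk '{ for (i = NF; i >= 1; i--) printf "%s%s", $i, (i > 1 ? " " : "\n") }')
printf '%s\n' "$result"   # prints c b a
```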

Here's another example, where the number of individuals in each team is computed.

awk '!arr[$6] { arr[$6] = 0 }
{ arr[$6] += 1 }
END {
        for (i in arr)
        {
           print i " : " arr[i]
        }
    }' inputfile

Output:

Marketing : 2
Sales : 2
Engineering : 3
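The same idea extends from counting to summing. In the sketch below, which uses a hypothetical two-column input of team and score, the array element is not initialized first, because an unset element behaves as 0 in arithmetic. The result is piped through sort only to get a predictable order.

```shell
# total[team] accumulates the scores seen for each team.
result=$(printf 'Marketing 33\nEngineering 81\nSales 43\nSales 36\n' |
  awk '{ total[$1] += $2 } END { for (team in total) print team, total[team] }' |
  sort)
printf '%s\n' "$result"
# prints:
# Engineering 81
# Marketing 33
# Sales 79
```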

Built-in variables and functions

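Other than $0, and $1, $2, $3 etc., there are other built-in variables that are easy to remember and easy to use. A few common ones are NR (the number of records read so far), NF (the number of fields in the current record), FS and OFS (the input and output field separators), and RS and ORS (the input and output record separators).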

AWK supports many predefined mathematical functions (like log, sqrt, exp, sin) as well as functions for working with strings (such as substr, length, toupper).
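As a quick sketch of the string functions, the one-liner below capitalizes the first word of a made-up record and reports the length of the whole record.

```shell
# substr($1, 1, 1) is the first character of field 1,
# substr($1, 2) is the rest of it, and length($0) counts
# the characters in the entire record.
result=$(echo "hello world" |
  awk '{ print toupper(substr($1, 1, 1)) substr($1, 2), length($0) }')
printf '%s\n' "$result"   # prints Hello 11
```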

Let's see a few more examples before we call it a day.

Example 1: Overriding default values

We can use any string as the field separator in the output by changing the value of the variable OFS, which defaults to a single space. The default value can be overridden as shown below.

Also note how, in the following example, we print the record number for each row using the variable NR (for number of records).

awk '{ print NR, $1, $5 }' OFS=, inputfile

Output:

1,Colton,33
2,Megan,81
3,Candace,43
4,Grady,36
5,Macaulay,63
6,Abraham,93
7,Joelle,89
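The separators can also be set inside the program, in a BEGIN action. In this sketch on a made-up comma-separated record, assigning a field to itself makes awk rebuild $0 with the new OFS, so a plain print shows the converted record.

```shell
# FS controls how input is split; OFS controls how fields are rejoined.
# $1 = $1 changes nothing in the field but forces $0 to be rebuilt.
result=$(echo "a,b,c" |
  awk 'BEGIN { FS = ","; OFS = " | " } { $1 = $1; print }')
printf '%s\n' "$result"   # prints a | b | c
```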

Example 2: Accessing fields using rvalues

If a variable varname is assigned an integer k, then the syntax $varname can be used for accessing the field in the kth column.

For example, since NF stores the number of fields in the current record, we can access the last field in that row using the syntax $NF.

awk '{ print NR, $(NF-1), $NF }' inputfile 

Output:

1 Marketing No
2 Engineering Yes
3 Sales No
4 Sales Yes
5 Engineering Yes
6 Marketing Yes
7 Engineering Yes
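Since NF is recomputed for every record, $NF keeps working when lines have different numbers of fields. In this sketch on made-up ragged input, the pattern NF >= 2 skips records that are too short to have a second-to-last field.

```shell
# The three records have 3, 2, and 1 fields respectively.
# Only the first two satisfy NF >= 2.
result=$(printf 'a b c\nd e\nf\n' |
  awk 'NF >= 2 { print NF, $(NF-1), $NF }')
printf '%s\n' "$result"
# prints:
# 3 b c
# 2 d e
```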

Example 3: Formatting output

The C-style printf is used for showing the output formatted in a tabular form. The format specifier %-20s pads the string to a width of 20 characters and aligns it to the left.

awk 'BEGIN { printf "%-20s | %-5s\n", "Full Name", "Score" } 
{ printf "%-20s | %-5d\n",  $1 " " $2, $5 }' inputfile

Output:

Full Name            | Score
Colton Dominguez     | 33   
Megan Porter         | 81   
Candace Walsh        | 43   
Grady Clements       | 36   
Macaulay Roy         | 63   
Abraham Strickland   | 93   
Joelle Higgins       | 89   
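For comparison, here is a small sketch of a few other format specifiers. Without the minus sign, the value is right-aligned within its width, and %6.2f prints a number in a width of 6 with two decimal places. The values are arbitrary.

```shell
# %5s right-aligns, %-5s left-aligns, %6.2f rounds to two decimals.
# Everything runs in BEGIN, so no input file is needed.
result=$(awk 'BEGIN { printf "%5s|%-5s|%6.2f\n", "ab", "cd", 3.14159 }')
printf '%s\n' "$result"   # prints "   ab|cd   |  3.14"
```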

The next two examples use built-in functions.

Example 4: Splitting a string

The built-in function split(str, arr, ch) is used, which splits the string str around the character ch and stores the resulting substrings in the array arr. We use this function below to extract the month and year from each individual's joining date.

awk '{ split($4, arr, "-"); 
printf "%-10s | %-10s\n",  $1 , arr[1] " " arr[3] }' inputfile 

Output:

Colton     | Feb 2021  
Megan      | Dec 2021  
Candace    | Apr 2023  
Grady      | Feb 2023  
Macaulay   | Jul 2022  
Abraham    | Aug 2022  
Joelle     | Sep 2022  
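It may also help to know that split returns the number of pieces it produced, which gives a convenient index for the last element. A minimal sketch, using a date string in the same format as the input file:

```shell
# n receives the element count; parts[1] and parts[n] are
# the first and last pieces.
result=$(awk 'BEGIN { n = split("Jun-19-2023", parts, "-"); print n, parts[1], parts[n] }')
printf '%s\n' "$result"   # prints 3 Jun 2023
```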

Example 5: Find and replace

The function gsub(regex, subst, str) looks for all matches of the regular expression regex in the string str, and replaces each of them with the string subst. The g in gsub is for "global". There's also a related function sub (for replacing only the first occurrence).

awk '{ gsub(/[0-9]+/,"X",$0); print }' inputfile 

Output:

Colton Dominguez X Feb-X-X X Marketing No
Megan Porter X Dec-X-X X Engineering Yes
Candace Walsh X Apr-X-X X Sales No
Grady Clements X Feb-X-X X Sales Yes
Macaulay Roy X Jul-X-X X Engineering Yes
Abraham Strickland X Aug-X-X X Marketing Yes
Joelle Higgins X Sep-X-X X Engineering Yes
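To see the difference between the two functions, the sketch below applies sub to a copy of a made-up record and gsub to the record itself. When the third argument is omitted, both functions operate on $0, and both return the number of replacements made.

```shell
# first gets only its first hyphen replaced;
# $0 gets every hyphen replaced, and n records how many that was.
result=$(echo "a-b-c" |
  awk '{ first = $0; sub(/-/, "+", first); n = gsub(/-/, "+"); print first, $0, n }')
printf '%s\n' "$result"   # prints a+b-c a+b+c 2
```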

Example 6: Bigger programs

In AWK programs, we can use many constructs similar to the ones available in other languages, like if, else, while, and more (GAWK also supports switch). One can also define a function in an AWK program, and then call it from within an action.

awk 'BEGIN { max = -1; name = "" } 
{
    if (max < $5)
    { 
        max = $5
        name = $1 
    }
}  
END { print name ": " getMaxScore() } 

function getMaxScore() { return max }' inputfile

Output:

Abraham: 93
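
Functions can also take parameters. The sketch below defines a hypothetical factorial function with a while loop. The extra names i and product in the parameter list are a common AWK convention for declaring local variables, since any variable not listed there is global.

```shell
# factorial(n) multiplies the integers from 2 up to n.
# Callers pass only n; i and product act as locals.
result=$(awk '
function factorial(n,    i, product) {
    product = 1
    i = 2
    while (i <= n) {
        product *= i
        i++
    }
    return product
}
BEGIN { print factorial(5) }')
printf '%s\n' "$result"   # prints 120
```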

A final word

It's worth noting that AWK is a Turing complete language, which means that it can be utilized for implementing any algorithm. That being said, AWK is primarily useful for tasks like data filtration and manipulation.

This blog is far from being a complete tutorial, but we hope that it is effective in removing any entry level barriers for a faster and a happier learning experience.

Happy learning!
