Emulating regexp lookarounds in GNU sed

learnbyexample

Sundeep

Posted on November 3, 2020

Photo Credit: Henda Watani on Pexels

This Stack Overflow Q&A got me thinking about various ways to construct a solution in GNU sed when lookarounds are needed.

Note: Only single line processing (with newline as the line separator) is presented here. Equivalent lookaround syntax with grep -P or perl is also shown for comparison. Cases where multiple lines and/or ASCII NUL characters are present in the pattern space are left as an exercise.

Filtering

Here, you only need to decide whether the input line is matched or not. sed supports grouping commands inside {} so that they are executed only if a filtering condition is matched. The condition can be negated by adding a ! character. In this way, you can emulate chaining of multiple positive and/or negative lookaround conditions.

$ cat items.txt
1,2,3,4
apple=50 ;per kg
a,b,c,d
;foo xyz3

# filter lines containing a digit character followed by a ; character
# lookaround isn't needed here
# same as: grep '[0-9].*;' or grep -P '\d(?=.*;)'
$ sed -n '/[0-9].*;/p' items.txt
apple=50 ;per kg

# filter lines containing both digit and ; characters in any order
# same as: grep -P '^(?=.*;).*\d'
$ sed -n '/;/{ /[0-9]/p }' items.txt
apple=50 ;per kg
;foo xyz3

# filter lines containing both digit and ; characters
# but not if the line also contains character a
# same as: grep -P '^(?!.*a)(?=.*;).*\d'
$ sed -n '/a/!{ /;/{ /[0-9]/p } }' items.txt
;foo xyz3
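
The negation can also be applied to an inner condition. Here's a minimal sketch, assuming GNU sed. The function names items and digit_without_semicolon are made up for this illustration, and items just reproduces the contents of items.txt on stdout.

```shell
# Hypothetical helper: emits the same four lines as items.txt
items() {
    printf '%s\n' '1,2,3,4' 'apple=50 ;per kg' 'a,b,c,d' ';foo xyz3'
}

# Hypothetical helper: lines that contain a digit but no ; anywhere
# roughly the same as: grep -P '^(?!.*;).*\d'
digit_without_semicolon() {
    sed -n '/[0-9]/{ /;/!p }'
}

items | digit_without_semicolon
# prints: 1,2,3,4
```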

For some cases, chaining condition checks as in the previous examples is not enough. For example, filter a line if it contains par, as long as cart isn't present later in the line. Presence of cart earlier in the line shouldn't affect the outcome. In such cases, you can first modify the input line to add a newline character before every occurrence of cart and then construct a condition that depends on the newline character instead of cart. If a match is found, delete all the newline characters and then print the line.

$ s='par carted spare cart park city'

# same as: grep -P 'par(?!.*cart)'
$ echo "$s" | sed -n 's/cart/\n&/g; /par[^\n]*$/{ s/\n//g; p }'
par carted spare cart park city

Note: Newline is a safe character to choose for default line by line processing, as sed removes the trailing newline before placing the line in the pattern space. If you are processing a pattern space that contains newline characters (for example: -z option, N command, etc), then you can still use this trick as long as you know a character that is guaranteed to be absent from the input data.
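
As a rough sketch of that idea, the example below uses the N command to join lines in pairs, so the pattern space contains a newline and cannot use it as the marker. It assumes that the # character never occurs in the input. The function name is made up for this illustration.

```shell
# Assumption: '#' is guaranteed to be absent from the input data,
# so it takes over the role of the newline marker.
# N appends the next line, so each pair of lines is checked as one unit.
pairs_with_par_not_before_cart() {
    sed -n 'N; s/cart/#&/g; /par[^#]*$/{ s/#//g; p }'
}

printf '%s\n' 'cart par' 'park city' 'par' 'cart' | pairs_with_par_not_before_cart
# prints only the first pair:
# cart par
# park city
```

The second pair is skipped because cart occurs after the only par in that pair.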

Substitution

In the previous section, you saw how to modify the input line with a newline character to make it easier to construct a lookaround condition. This trick comes in handy for substitution as well. However, for search and replace cases, you also need to emulate the zero-width nature of lookarounds. To achieve this, you can use the t command to construct a loop that performs the substitution as long as a match is found. See my chapter on Control structures for more details about branching commands in GNU sed.

Here's an example of looping. The aim is to delete fin from the given input repeatedly, until no occurrence remains.

# manual repetition, assuming count is known
$ echo 'coffining' | sed 's/fin//'
cofing
$ echo 'coffining' | sed 's/fin//; s///'
cog

# :loop marks the 's' command with label 'loop'
# tloop will jump to label 'loop' as long as the substitution succeeds
$ echo 'coffining' | sed ':loop s/fin//; tloop'
cog
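
For contrast, here's a small sketch (assuming GNU sed) of why the g flag alone is not enough for this task. With g, scanning resumes after each match, so a fin that is formed only after an earlier deletion is never seen.

```shell
# g flag: one pass over the line, the newly formed 'fin' is missed
echo 'coffining' | sed 's/fin//g'
# prints: cofing

# t loop: the substitution is retried on the modified text
echo 'coffining' | sed ':loop s/fin//; tloop'
# prints: cog
```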

Negative lookarounds

Some cases can be solved by performing substitution only if a condition is first satisfied. Note that {} grouping is optional here.

# same as: perl -ne 'print if s/^(?!;).*?\K[ ,].*//'
$ sed -n '/^;/! s/[ ,].*//p' items.txt
1
apple=50
a

Change foo to [baz] only if it is not followed by a digit character. Note that foo at the end of string also satisfies this assertion. foofoo has two matches as the assertion is zero-width in nature, i.e. it doesn't consume characters. Here, the first step is inserting a newline character between foo and a digit character. Then change all foo to [baz] as long as it is at the end of string or if it isn't followed by a newline character. Once the loop ends, remove all the newline characters.

$ s='hey food! foo42 foot5 foofoo'

# same as: perl -pe 's/foo(?!\d)/[baz]/g'
$ echo "$s" | sed -E 's/(foo)([0-9])/\1\n\2/g;
                      :a s/foo([^\n]|$)/[baz]\1/; ta;
                      s/\n//g'
hey [baz]d! foo42 [baz]t5 [baz][baz]

Change foo to [baz] only if it is not preceded by an _ character. foo at the start of the string is matched as well.

$ s='foo _foo 42foofoo'

# same as: perl -pe 's/(?<!_)foo/[baz]/g'
$ echo "$s" | sed -E 's/(_)(foo)/\1\n\2/g;
                      :a s/(^|[^\n])foo/\1[baz]/; ta;
                      s/\n//g'
[baz] _foo 42[baz][baz]

Replace par with [xyz] as long as the s character is not present later in the input. This assumes that the assertion doesn't conflict with the search pattern. For example, s does not conflict with par, but r would, since the newline inserted after every r would also split par itself.

$ s='par spare part party'

# same as: perl -pe 's/par(?!.*s)/[xyz]/g'
$ echo "$s" | sed -E 's/s/&\n/g;
                      :a s/par([^\n]*)$/[xyz]\1/; ta;
                      s/\n//g'
par s[xyz]e [xyz]t [xyz]ty
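
To illustrate the conflict case, here's a sketch (assuming GNU sed) where the assertion character is r, i.e. the intended result is that of perl -pe 's/par(?!.*r)/[xyz]/g'. The second command is just one possible workaround, not a general recipe: the marker is placed before r and the search pattern is spelled out with the marker inside it. A loop isn't needed here because only the last par in a line can satisfy this assertion.

```shell
s='par spare part party'

# Naive attempt: the marker after every 'r' also follows every 'par',
# so the condition never matches and the line is left unchanged
echo "$s" | sed -E 's/r/&\n/g; :a s/par([^\n]*)$/[xyz]\1/; ta; s/\n//g'
# prints: par spare part party

# Possible workaround: marker before 'r', pattern includes the marker
echo "$s" | sed -E 's/r/\n&/g; s/pa\nr([^\n]*)$/[xyz]\1/; s/\n//g'
# prints: par spare part [xyz]ty
```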

Replace all empty fields with NA for csv input (assuming no embedded comma, newline characters, etc).

$ s=',1,,,two,3,,,'

# same as: perl -lpe 's/(?<![^,])(?![^,])/NA/g'
$ echo "$s" | sed -E ':a s/,,/,NA,/g; ta; s/^,/NA,/; s/,$/,NA/'
NA,1,NA,NA,two,3,NA,NA,NA

Replace the text from at to par, but only if go is not present between them.

$ s='fox,cat,dog,parrot,dot,park,go,spare'

# same as: perl -pe 's/at((?!go).)*par/[xyz]/'
$ echo "$s" | sed 's/go/\n&/g; s/at[^\n]*par/[xyz]/; s/\n//g'
fox,c[xyz]k,go,spare

Positive lookarounds

Surround fields with [] except first and last fields for csv input (assuming no embedded comma, newline characters, etc). With positive lookaround emulation, the modified string may continue to satisfy the matching condition, resulting in infinite looping. In this example, the fields themselves may contain [] characters, so you cannot use them to prevent infinite loop. The newline character trick comes in handy again.

$ s='1,t[w]o,[3],f[ou]r,5'

# same as: perl -pe 's/(?<=,)[^,]+(?=,)/[$&]/g'
$ echo "$s" | sed -E ':a s/,([^,\n]+),/,\n[\1],/g; ta; s/\n//g'
1,[t[w]o],[[3]],[f[ou]r],5
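
For comparison, here's a sketch (assuming GNU sed) of what happens if you skip the loop and rely on the g flag alone. The commas are consumed as part of each match, so a field that shares a comma with the previous match is skipped.

```shell
s='1,t[w]o,[3],f[ou]r,5'

# no loop and no marker: the [3] field is missed because its leading
# comma was already consumed by the match for the t[w]o field
echo "$s" | sed -E 's/,([^,]+),/,[\1],/g'
# prints: 1,[t[w]o],[3],[f[ou]r],5
```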

Add a space at word boundaries, but not at the start or end of the string. Also, don't add a space if one is already present. Here, a negated character class on the space character is enough to emulate the assertion.

$ s='total= num1+35*42/num2'

# same as: perl -lpe 's/(?<=[^ ])\b(?=[^ ])/ /g'
$ echo "$s" | sed -E ':a s/([^ ])\b([^ ])/\1 \2/; ta;'
total = num1 + 35 * 42 / num2

Replace par with [xyz] as long as part occurs as a whole word later in the line. Here, the nature of the modified string itself rules out an infinite loop.

$ s='par spare part party'

# same as: perl -pe 's/par(?=.*\bpart\b)/[xyz]/g'
$ echo "$s" | sed -E ':a s/par(.*\bpart\b)/[xyz]\1/; ta'
[xyz] s[xyz]e part party

Summary

Branching commands and some creative preprocessing of the input can be combined to emulate lookaround assertions in sed. Given that the Unix utility sed is Turing complete, it's perhaps not a big surprise. Now, please excuse me, I'll be busy reaping points on stackoverflow/unix.stackexchange for this edge case ;)
