Acronym exercise: splitting strings like a pro

This article was born during the #12in23 challenge on Exercism, which is pretty awesome, so please check it out. I had no prior experience with Clojure and decided to give it a try during Functional February.

We have a simple challenge to convert a phrase to its acronym.

Techies love their TLAs (Three-Letter Acronyms)!
The simplest example would be "Portable network graphics" -> PNG.

This looks trivial. In functional programming we try to push our data through a pipeline, transforming it step by step to get the desired result. Here we need to split the phrase into words, grab the first letter of each word, join those letters back together, and convert the result to upper case. Luckily, in Clojure we can express this idea quite elegantly (with clojure.string aliased as str):

(require '[clojure.string :as str])

(defn acronym
  [phrase]
  (->> (str/split phrase #" ")
       (map first)
       (str/join "")
       str/upper-case))
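To make the pipeline concrete, here is a small sketch that evaluates each step on its own for the example phrase. The names words, initials and joined are only illustrative and are not part of the solution.

```clojure
(require '[clojure.string :as str])

;; Step 1: split the phrase on single spaces.
(def words (str/split "Portable network graphics" #" "))
;; => ["Portable" "network" "graphics"]

;; Step 2: take the first character of every word.
(def initials (map first words))
;; => (\P \n \g)

;; Step 3: join the characters back into one string.
(def joined (str/join "" initials))
;; => "Png"

;; Step 4: upper-case the result.
(str/upper-case joined)
;; => "PNG"
```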

Good job folks, looks like we're done here! Unfortunately, some tests are failing, so I must have misunderstood the requirements a little. The failed assertions are:

❌ "Complementary metal-oxide semiconductor" -> CMOS (not CMS!)
❌ "HyperText Markup Language" -> HTML (not HML!)

Fixing the first test looks straightforward: the second parameter of the split function is a regular expression, so instead of using a space as the only separator we can generalize it a little and use #"[\s-]", meaning any single whitespace character or a hyphen.
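As a quick illustration of the difference, the following sketch splits the failing phrase with both patterns. The name phrase is just a local binding for the example.

```clojure
(require '[clojure.string :as str])

(def phrase "Complementary metal-oxide semiconductor")

;; Splitting on a space only keeps "metal-oxide" as one word.
(str/split phrase #" ")
;; => ["Complementary" "metal-oxide" "semiconductor"]

;; Splitting on whitespace or a hyphen separates "metal" and "oxide".
(str/split phrase #"[\s-]")
;; => ["Complementary" "metal" "oxide" "semiconductor"]
```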

The second test is a bit less trivial. It looks like we have to support splitting camel case without actually dropping any characters, e.g. split "HyperText" into ["Hyper" "Text"].

The feature we'll use to achieve that is called "positive lookahead".

Find expression A where expression B follows: A(?=B)

Instead of matching a particular character as a separator, we can actually match a specific position in the string. #"(?=[A-Z])" translates to "find a position in the string that has a capital letter right after it".
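A minimal sketch of the difference between a zero-width lookahead and an ordinary consuming pattern. It assumes Clojure on a JVM running Java 8 or later, where a zero-width match at the very start of the string does not produce a leading empty token.

```clojure
(require '[clojure.string :as str])

;; The lookahead matches the empty position before each capital letter,
;; so the split consumes no characters.
(str/split "HyperText" #"(?=[A-Z])")
;; => ["Hyper" "Text"]

;; For contrast, a consuming pattern removes the capital letters it matches.
(str/split "HyperText" #"[A-Z]")
;; => ["" "yper" "ext"]
```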

Combining the two, our solution now looks like this:

(defn acronym
  [phrase]
  (->> (str/split phrase #"[\s-]|(?=[A-Z])")
       (map first)
       (str/join "")
       str/upper-case))

Oh no, looks like we accidentally broke a test that worked fine before the change:

❌ "PHP: Hypertext Preprocessor" -> PHP (not PHPHP!)

That means we have a special case on our hands... If the phrase contains a recursive acronym like PHP, or more generally if any word in the phrase is already an acronym, we must use only its first letter.
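To see where the extra letters come from, here is a sketch that inspects the tokens produced by the combined pattern. It assumes Clojure on a JVM running Java 8 or later; tokens is just an illustrative name.

```clojure
(require '[clojure.string :as str])

;; Every capital letter in "PHP" starts a new token, and the zero-width
;; match right after each space leaves an empty token behind.
(def tokens (str/split "PHP: Hypertext Preprocessor" #"[\s-]|(?=[A-Z])"))
;; => ["P" "H" "P:" "" "Hypertext" "" "Preprocessor"]

;; (first "") is nil, and str/join renders nil as an empty string,
;; so the empty tokens silently disappear from the result.
(->> tokens (map first) (str/join "") str/upper-case)
;; => "PHPHP"
```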

The naive approach I first followed was just that: if a token is already an acronym, leave it alone; otherwise, apply our camel case split strategy:

(defn is-acronym
  [line]
  (= line (str/upper-case line)))

(defn acronym
  [phrase]
  (->> (str/split phrase #"[\s-]")
       (map #(if (is-acronym %) % (str/split % #"(?=[A-Z])")))
       flatten
       (map first)
       (str/join "")
       str/upper-case))

This code works and all tests are green ✅. However, it is considerably clunkier: now we split in two stages and also have to flatten the resulting structure, e.g.

(flatten '(["Hyper" "Text"] ["Markup"] ["Language"]))
;; => ("Hyper" "Text" "Markup" "Language")
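The nested structure that flatten receives comes from the conditional step. The sketch below pulls that step out into a helper named split-token (a hypothetical name used only for this illustration) to show what it returns for acronym and non-acronym tokens.

```clojure
(require '[clojure.string :as str])

(defn is-acronym
  [line]
  (= line (str/upper-case line)))

;; Hypothetical helper: the anonymous function from the solution, given a name.
(defn split-token
  [token]
  (if (is-acronym token)
    token
    (str/split token #"(?=[A-Z])")))

;; An all-caps token is kept as a plain string, the rest become vectors.
(map split-token ["PHP:" "Hypertext" "Preprocessor"])
;; => ("PHP:" ["Hypertext"] ["Preprocessor"])

(map split-token ["HyperText" "Markup" "Language"])
;; => (["Hyper" "Text"] ["Markup"] ["Language"])
```

The result mixes strings and vectors, which is why the extra flatten step is needed before taking the first letters.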

But of course there is a way to get rid of that pesky if and make our solution more generic: "positive lookbehind" to the rescue!

Find expression A where expression B precedes: (?<=B)A

Similar to what we had before, we want to "find a position in the string that has a capital letter right after it AND a lowercase letter right before it": #"(?<=[a-z])(?=[A-Z])". Works like a charm:

(defn acronym
  [phrase]
  (->> (str/split phrase #"[\s-]|(?<=[a-z])(?=[A-Z])")
       (map first)
       (str/join "")
       str/upper-case))
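As a quick sanity check, this sketch runs the final version against the phrases discussed in this post. These are only the cases mentioned here, not the full Exercism test suite.

```clojure
(require '[clojure.string :as str])

(defn acronym
  [phrase]
  (->> (str/split phrase #"[\s-]|(?<=[a-z])(?=[A-Z])")
       (map first)
       (str/join "")
       str/upper-case))

;; The lookbehind only allows a split between a lowercase and a capital letter.
(str/split "HyperText" #"(?<=[a-z])(?=[A-Z])") ;; => ["Hyper" "Text"]
(str/split "PHP:" #"(?<=[a-z])(?=[A-Z])")      ;; => ["PHP:"]

(acronym "Portable network graphics")               ;; => "PNG"
(acronym "Complementary metal-oxide semiconductor") ;; => "CMOS"
(acronym "HyperText Markup Language")               ;; => "HTML"
(acronym "PHP: Hypertext Preprocessor")             ;; => "PHP"
```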

That's all I have to say about the exercise itself; I don't think we should spend any more time improving this. Thank you for taking the time to read this post, I hope you found it useful.

I also want to thank @tasxatzial who mentored me through all of the iterations on Exercism and patiently pushed me to gradually improve my solution. We're in the middle of Mechanical March, high time to learn some Rust!
