Making a string shorter without losing its meaning

icanhazstring

Andreas Frömer

Posted on July 10, 2020

Making a string shorter without losing its meaning

Today I was confronted with a small but meaningful problem writing some piece of software. I reached some kind of threshold where the string does not fit, so I would have to shorten it.

  1. I could remove all characters of each word keeping only the first three
  2. Remove all vowels

But how do I make sure to keep the meaning of the string so you could still understand what's written 🤔

Small overview

To give you a small outline of what I was doing - I am importing huge amounts of data of unknown structure. So I am doing a scan over all data sets trying to figure out the structure, flatten it, and use it as MySQL column names.

Here is an example structure:

{
   "payload":{
      "product":{
         "details":{
            "aru":{
               "children":[
                  {
                     "aru":{
                        "products":[
                           {
                              "aru":{
                                 "street":"",
                                 "streetnumber":"",
                                 "productKey":"",
                                 "price":"",
                                 "tariffName":"",
                                 "insurancePolicyNumber":""
                              }
                           }
                        ]
                     }
                  }
               ]
            }
         }
      }
   }
}
Enter fullscreen mode Exit fullscreen mode

You might already see an upcoming problem here 😉
And don't ask about the structure - it is what it is.

The limitations

Using innodb the limitation for column names is 64 characters. At first glance, and with multiple scans, this was going to be close to maximum but still worked.

Hence comes the problem

Of course, the limitations were reached while going into production as there was new data coming which exceeded it.

I am now presented with a string of 69 characters of length. Knowing that there will be a string coming exceeding it even further there needed to be some kind of reduction in characters.

Possible solutions

Given the following strings here are some solutions I have considered.

payloadProductDetailsAruChildren3AruProducts0AruStreet
payloadProductDetailsAruChildren3AruProducts0AruStreetnumber
payloadProductDetailsAruChildren3AruProducts0AruProductKey           
payloadProductDetailsAruChildren3AruProducts0AruPrice                
payloadProductDetailsAruChildren3AruProducts0AruTariffName           
payloadProductDetailsAruChildren3AruProducts0AruInsurancePolicyNumber
Enter fullscreen mode Exit fullscreen mode

Only keep first 3 characters

Shortening the string by keeping the first 3 characters was the first and a very fast solution. Which ended in not even a solution but making it even worse:

payProDetAruChi3AruPro0AruStr
payProDetAruChi3AruPro0AruStr
payProDetAruChi3AruPro0AruProKey           
payProDetAruChi3AruPro0AruPri                
payProDetAruChi3AruPro0AruTarNam           
payProDetAruChi3AruPro0AruInsPolNum
Enter fullscreen mode Exit fullscreen mode

You see the problem, we ended up with two equal column names which would be a disaster since we would be losing data.

Remove all vowels

A colleague of mine came up with the idea of removing all lower case vowels.

pyldPrdctDtlsArChldrn3ArPrdcts0ArStrt
pyldPrdctDtlsArChldrn3ArPrdcts0ArStrtnmbr
pyldPrdctDtlsArChldrn3ArPrdcts0ArPrdctKy           
pyldPrdctDtlsArChldrn3ArPrdcts0ArPrc                
pyldPrdctDtlsArChldrn3ArPrdcts0ArTrffNm           
pyldPrdctDtlsArChldrn3ArPrdcts0ArInsrncPlcyNmbr
Enter fullscreen mode Exit fullscreen mode

Which looks better and we don't have a conflicting column name. But still, some column names seem hard to read. For example pyldPrdctDtlsArChldrn3ArPrdcts0ArPrc. Thinking of Price at the end is hard to nearly impossible.

Also, the product name Aru is shortened. That is not a big problem but still 😉

The Significance of Letter Position in Word Recognition

This is the title of a Ph.D. thesis written by Graham Rawlinson back in 1979.

In a short summary: Every word having the first and last character in place might miss or have shuffled characters in between and you might still understand the word.

With this in mind, I improved the "vowel solution" to:

Only remove lowercase vowels which are not followed by an uppercase character or beginning or end of the string.

The final solution

Using regex substitution I shortened the string so that it should be understandable for someone working in the same field.

(?<!^)[aeiou](?!([A-Z]|$))

The result is better understandable and still with no naming conflicts:

pyldPrdctDtlsAruChldrn3AruPrdcts0AruStrt
pyldPrdctDtlsAruChldrn3AruPrdcts0AruStrtnmbr
pyldPrdctDtlsAruChldrn3AruPrdcts0AruPrdctKy           
pyldPrdctDtlsAruChldrn3AruPrdcts0AruPrce                
pyldPrdctDtlsAruChldrn3AruPrdcts0AruTrffNme           
pyldPrdctDtlsAruChldrn3AruPrdcts0AruInsrncePlcyNmbr
Enter fullscreen mode Exit fullscreen mode

There is still the possibility to end up with identical consonants after removing the vowels. But this is considered a lower probability then simply removing all vowels.

One word could be:

Color => Clr
Clear => Clr
Enter fullscreen mode Exit fullscreen mode

If you liked the article leave me a comment/like. Do you have any other suggestions? Let me know!

You can follow me on twitter with @icanhazstring


Update: 2020-07-14
Changed the final solution regular expression to avoid removing lowercase vowels at the beginning of the string

💖 💪 🙅 🚩
icanhazstring
Andreas Frömer

Posted on July 10, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related