Andreas Frömer
Posted on July 10, 2020
Today I was confronted with a small but meaningful problem writing some piece of software. I reached some kind of threshold where the string does not fit, so I would have to shorten it.
- I could remove all characters of each word keeping only the first three
- Remove all vowels
But how do I make sure to keep the meaning of the string so you could still understand what's written 🤔
Small overview
To give you a small outline of what I was doing - I am importing huge amounts of data of unknown structure. So I am doing a scan over all data sets trying to figure out the structure, flatten it, and use it as MySQL column names.
Here is an example structure:
{
"payload":{
"product":{
"details":{
"aru":{
"children":[
{
"aru":{
"products":[
{
"aru":{
"street":"",
"streetnumber":"",
"productKey":"",
"price":"",
"tariffName":"",
"insurancePolicyNumber":""
}
}
]
}
}
]
}
}
}
}
}
You might already see an upcoming problem here 😉
And don't ask about the structure - it is what it is.
The limitations
Using innodb
the limitation for column names is 64 characters. At first glance, and with multiple scans, this was going to be close to maximum but still worked.
Hence comes the problem
Of course, the limitations were reached while going into production as there was new data coming which exceeded it.
I am now presented with a string of 69 characters of length. Knowing that there will be a string coming exceeding it even further there needed to be some kind of reduction in characters.
Possible solutions
Given the following strings here are some solutions I have considered.
payloadProductDetailsAruChildren3AruProducts0AruStreet
payloadProductDetailsAruChildren3AruProducts0AruStreetnumber
payloadProductDetailsAruChildren3AruProducts0AruProductKey
payloadProductDetailsAruChildren3AruProducts0AruPrice
payloadProductDetailsAruChildren3AruProducts0AruTariffName
payloadProductDetailsAruChildren3AruProducts0AruInsurancePolicyNumber
Only keep first 3 characters
Shortening the string by keeping the first 3 characters was the first and a very fast solution. Which ended in not even a solution but making it even worse:
payProDetAruChi3AruPro0AruStr
payProDetAruChi3AruPro0AruStr
payProDetAruChi3AruPro0AruProKey
payProDetAruChi3AruPro0AruPri
payProDetAruChi3AruPro0AruTarNam
payProDetAruChi3AruPro0AruInsPolNum
You see the problem, we ended up with two equal column names which would be a disaster since we would be losing data.
Remove all vowels
A colleague of mine came up with the idea of removing all lower case vowels.
pyldPrdctDtlsArChldrn3ArPrdcts0ArStrt
pyldPrdctDtlsArChldrn3ArPrdcts0ArStrtnmbr
pyldPrdctDtlsArChldrn3ArPrdcts0ArPrdctKy
pyldPrdctDtlsArChldrn3ArPrdcts0ArPrc
pyldPrdctDtlsArChldrn3ArPrdcts0ArTrffNm
pyldPrdctDtlsArChldrn3ArPrdcts0ArInsrncPlcyNmbr
Which looks better and we don't have a conflicting column name. But still, some column names seem hard to read. For example pyldPrdctDtlsArChldrn3ArPrdcts0ArPrc
. Thinking of Price
at the end is hard to nearly impossible.
Also, the product name Aru
is shortened. That is not a big problem but still 😉
The Significance of Letter Position in Word Recognition
This is the title of a Ph.D. thesis written by Graham Rawlinson back in 1979.
In a short summary: Every word having the first and last character in place might miss or have shuffled characters in between and you might still understand the word.
With this in mind, I improved the "vowel solution" to:
Only remove lowercase vowels which are not followed by an uppercase character or beginning or end of the string.
The final solution
Using regex substitution I shortened the string so that it should be understandable for someone working in the same field.
(?<!^)[aeiou](?!([A-Z]|$))
The result is better understandable and still with no naming conflicts:
pyldPrdctDtlsAruChldrn3AruPrdcts0AruStrt
pyldPrdctDtlsAruChldrn3AruPrdcts0AruStrtnmbr
pyldPrdctDtlsAruChldrn3AruPrdcts0AruPrdctKy
pyldPrdctDtlsAruChldrn3AruPrdcts0AruPrce
pyldPrdctDtlsAruChldrn3AruPrdcts0AruTrffNme
pyldPrdctDtlsAruChldrn3AruPrdcts0AruInsrncePlcyNmbr
There is still the possibility to end up with identical consonants after removing the vowels. But this is considered a lower probability then simply removing all vowels.
One word could be:
Color => Clr
Clear => Clr
If you liked the article leave me a comment/like. Do you have any other suggestions? Let me know!
You can follow me on twitter with @icanhazstring
Update: 2020-07-14
Changed the final solution regular expression to avoid removing lowercase vowels at the beginning of the string
Posted on July 10, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.