Custom (De)serialization with Scala (Scala DCP #003)

awwsmm

Andrew (he/him)

Posted on March 9, 2020

Custom (De)serialization with Scala (Scala DCP #003)

Daily Coding Problem is a website which will send you a programming challenge to your inbox every day. I want to show beginners how to solve some of these problems using Scala, so this will be an ongoing series of my solutions. Feel free to pick them apart in the comments!

Problem

Given the root to a binary tree, implement serialize(root), which serializes the tree into a string, and deserialize(s), which deserializes the string back into the tree.

For example, given the following Node class

class Node:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

The following test should pass:

node = Node('root', Node('left', Node('left.left')), Node('right'))
assert deserialize(serialize(node)).left.left.val == 'left.left'

Strategy

See my discussion of this problem (and my Java implementation) here!

Code

I feel like I'm cheating a bit for these first few Daily Coding Problems, because I've solved them previously using Java. Maybe we can improve on the Java solution with Scala, though. Here's the Node class I defined in Java, without the serialize and deserialize methods:

public class Node {

  private String val;
  private Node left;
  private Node right;

  public Node (String val, Node left, Node right) {
    this.val = val;
    this.left = left;
    this.right = right;
  }

  public Node (String val, Node left) {
    this(val, left, null);
  }

  public Node (String val) {
    this(val, null, null);
  }

  public Node left() { return this.left; }

  public Node right() { return this.right; }

  public String val() { return this.val; }

}
Enter fullscreen mode Exit fullscreen mode

Believe it or not, all of this can be replaced with a single line of Scala code:

case class Node (value: String, left: Node = null, right: Node = null)
Enter fullscreen mode Exit fullscreen mode

A case class in Scala is shorthand for a class which contains only immutable member variables. In the example above, value, left, and right will be assignable only when a Node object is created, and cannot be changed once that object is instantiated. case classes also define a companion object of the same name which can be used for (among other things) making object instantiation a bit less verbose.

The above Scala code also uses default arguments to define left and right as null when they're not specified by the user. So a Node object can be created with zero, one, or two child Nodes, without needing to define additional constructors.

Finally, case classes provide getters (but not setters, because of the aforementioned immutability) for all of the member variables defined in the argument list. They will have the same names as the arguments themselves.

So that single line of Scala code above allows us to do all of the following:

scala> val nA = Node("nA")
nA: Node = Node(nA,null,null)

scala> val nB = Node("nB", nA)
nB: Node = Node(nB,Node(nA,null,null),null)

scala> val nC = Node("nC", nB, nA)
nC: Node = Node(nC,Node(nB,Node(nA,null,null),null),Node(nA,null,null))

scala> nA.value
res4: String = nA

scala> nB.left
res5: Node = Node(nA,null,null)

scala> nC.right
res6: Node = Node(nA,null,null)
Enter fullscreen mode Exit fullscreen mode

If all of the above benefits of case classes weren't enough, they also define a default toString method which prints the internal data in a nice, human-readable format, as seen above. In the Java example, we needed to write that code, as well. To get the equivalent functionality of this single line of Scala code, we needed roughly fifty lines of Java. That's quite the reduction in complexity.

Next, I'd like to take a few artistic liberties. The problem asks us to define serialize and deserialize methods using (presumably) Python. Serializing an object really just means defining a toString method. In Python, the default toString equivalent is about as ugly as the default Java toString:

Python 2.7.16 (default, Dec 13 2019, 18:00:32) 
[GCC 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.32.4) (-macos10.15-objc-s on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> class Node:
...     def __init__(self, val, left=None, right=None):
...         self.val = val
...         self.left = left
...         self.right = right
... 
>>> node = Node('root', Node('left', Node('left.left')), Node('right'))
>>> print(node)
<__main__.Node instance at 0x10d312ab8>
Enter fullscreen mode Exit fullscreen mode

When we print the Node, it is serialized into a string representation using an automatically defined method. In Python, this method is called __str__(). Although __str__() can be redefined directly, the problem asks us to define a serialize method instead. Rather than reinventing the wheel, I would prefer to just use our default toString provided by our Node case class in Scala. The spirit of the problem is simply to be able to convert these objects to their string representations, then back into objects.

However, it's at this point that I'd like to draw your attention to something which may end up bring a problem during our deserialization process. Notice how our Scala Nodes are serialized:

scala> Node("heyo")
res16: Node = Node(heyo,null,null)
Enter fullscreen mode Exit fullscreen mode

The Node itself is denoted by the literal string Node(, followed by comma-separated arguments, then closed with a ). A straightforward deserialization might involve using these three things as sentinels:

  1. the string Node( begins the definition of a Node
  2. commas , separate the value from the left and right Nodes, and
  3. the character ) ends the definition of the Node

You might be able to see how this could be abused. First, imagine someone simply doesn't want their Node to have a value, they might define:

scala> Node("null",null,null)
res19: Node = Node(null,null,null)

scala> Node(null,null,null)
res20: Node = Node(null,null,null)
Enter fullscreen mode Exit fullscreen mode

Note how both of these Nodes have the same string representation, even though they are really very different things. We can therefore require that the value of a Node is not equal to the string "null" by modifying the case class definition:

case class Node (value: String, left: Node = null, right: Node = null) {
  require(value != "null") }
Enter fullscreen mode Exit fullscreen mode

Now the compiler will throw an Exception when our value is a string masquerading as a null:

scala> Node("null",null,null)
java.lang.IllegalArgumentException: requirement failed
  at scala.Predef$.require(Predef.scala:268)
  ... 29 elided
Enter fullscreen mode Exit fullscreen mode

Alternatively, we could override the default toString method to have it surround the value with quotes. Something like...

scala> Node("null",null,null)
res42: Node = Node("null",null,null)

scala> Node(null,null,null)
res43: Node = Node(null,null,null)
Enter fullscreen mode Exit fullscreen mode

This, I think, is a better solution. In addition to solving this null issue, it also allows users to put commas , within labels, which otherwise could result in confusing serialized outputs like:

scala> Node("null,null,null",null,null)
res42: Node = Node(null,null,null,null,null)
Enter fullscreen mode Exit fullscreen mode

...this would be a bit more difficult to parse. If you don't think that's too bad, imagine a more pathological example:

scala> Node("null,Node(null,Node(null,null,null),null)",null,null)
res42: Node = Node(null,Node(null,Node(null,null,null),null),null,null)
Enter fullscreen mode Exit fullscreen mode

So maybe it is best that we redefine the toString method, at least to surround the value in quotes. Below, we override the default toString method, and make use of triple quotes to avoid having to escape the literal " characters within the string:

scala> case class Node (value: String, left: Node = null, right: Node = null) {
     |   override def toString = s"""Node("$value",$left,$right)""" }
defined class Node

scala> Node("null,Node(null,Node(null,null,null),null)",null,null)
res27: Node = Node("null,Node(null,Node(null,null,null),null)",null,null)
Enter fullscreen mode Exit fullscreen mode

Remember that we're defining toString and fromString rather than serialize and deserialize. This is to avoid having the methods toString and serialize do nearly (but not exactly) the same thing.

...now we can clearly tell where the value begins and ends, by looking for the " characters (or the syntax highlighting for strings). We can then sanitize user inputs by replacing any " characters which appear within the value with their escaped equivalents:

scala> case class Node (value: String, left: Node = null, right: Node = null) {
     |   override def toString = s"""Node("${value.replaceAll("\"", "\\\\\"")}",$left,$right)""" }
defined class Node

scala> Node("""Dwayne "The Rock" Johnson""",null,null)
res28: Node = Node("Dwayne \"The Rock\" Johnson",null,null)
Enter fullscreen mode Exit fullscreen mode

Beautiful! Finally, we can parse the serialized output using this clever recursive pattern matching technique. The gist of this method is that we define a regular expression which matches terminal Nodes (leaves) and another regular expression which matches arbitrary Nodes. We try to match on the terminal regex and if that doesn't work, we try to match on the arbitrary one.

For our serialization technique, a terminal Node should match the pattern:

scala> val terminal = """Node\("((?:[^\"]|\")*)",null,null\)""".r
terminal: scala.util.matching.Regex = Node\("((?:[^\"]|\")*)",null,null\)

scala> val rock = Node("""Dwayne "The Rock" Johnson""",null,null)
rock: Node = Node("Dwayne \"The Rock\" Johnson",null,null)

scala> rock.toString match { case terminal(value) => println(s"$value") }
Dwayne \"The Rock\" Johnson

scala> val list = Node("My list is List(1,2,3)",null,null)
list: Node = Node("My list is List(1,2,3)",null,null)

scala> list.toString match { case terminal(value) => println(s"$value") }
My list is List(1,2,3)
Enter fullscreen mode Exit fullscreen mode

The .r at the end of the string indicates that it's a regular expression.

...but any arbitrary Node should match the pattern:

scala> val arbitrary = """Node\("((?:[^\"]|\\")*)",(null|Node\(.*\)),(null|Node\(.*\))\)""".r
arbitrary: scala.util.matching.Regex = Node\("((?:[^\"]|\\")*)",(null|Node\(.*\)),(null|Node\(.*\))\)

scala> val nodes = Node("null",rock,list)
nodes: Node = Node("null",Node("Dwayne \"The Rock\" Johnson",null,null),Node("My list is List(1,2,3)",null,null))

scala> nodes.toString match { case arbitrary(value, left, right) => println(s"$value -- $left -- $right") }
null -- Node("Dwayne \"The Rock\" Johnson",null,null) -- Node("My list is List(1,2,3)",null,null)

scala> rock.toString match { case arbitrary(value, left, right) => println(s"$value -- $left -- $right") }
Dwayne \"The Rock\" Johnson -- null -- null

scala> list.toString match { case arbitrary(value, left, right) => println(s"$value -- $left -- $right") }
My list is List(1,2,3) -- null -- null
Enter fullscreen mode Exit fullscreen mode

The only remaining issue we have is when value is null, in which case toString throws a NullPointerException:

scala> val noll = Node(null,null,null)
java.lang.NullPointerException
  at Node.toString(<console>:14)
  at scala.runtime.ScalaRunTime$.inner$1(ScalaRunTime.scala:254)
  at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:259)
  at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:267)
  at .$print$lzycompute(<console>:9)
  at .$print(<console>:6)
  at $print(<console>)
...
Enter fullscreen mode Exit fullscreen mode

...we can avoid this by disallowing null values in our constructor:

scala> case class Node (value: String, left: Node = null, right: Node = null) {
     |   require (value != null)
     |   override def toString = s"""Node("${value.replaceAll("\"", "\\\\\"")}",$left,$right)"""
     | }
defined class Node

scala> val noll = Node(null,null,null)
java.lang.IllegalArgumentException: requirement failed
  at scala.Predef$.require(Predef.scala:268)
  ... 29 elided
Enter fullscreen mode Exit fullscreen mode

Note that empty value strings are still allowed, though

scala> val noll = Node("",null,null)
noll: Node = Node("",null,null)
Enter fullscreen mode Exit fullscreen mode

Finally, using this SO answer as a template, we can deserialize any arbitrary Node:

scala> def fromString(serialized: String): Node = {
     |   val arbitrary = """Node\("((?:[^\"]|\\")*)",(null|Node\(.*\)),(null|Node\(.*\))\)""".r
     |   serialized match {
     |     case arbitrary(value,left,right) => Node(
     |       value.replaceAll("\\\\\"", "\""),
     |       if (left == "null") null else fromString(left),
     |       if (right == "null") null else fromString(right))
     |   }
     | }
fromString: (serialized: String)Node

scala> println(rock.toString); println(fromString(rock.toString).toString)
Node("Dwayne \"The Rock\" Johnson",null,null)
Node("Dwayne \"The Rock\" Johnson",null,null)

scala> println(list.toString); println(fromString(list.toString).toString)
Node("My list is List(1,2,3)",null,null)
Node("My list is List(1,2,3)",null,null)

scala> println(nodes.toString); println(fromString(nodes.toString).toString)
Node("null",Node("Dwayne \"The Rock\" Johnson",null,null),Node("My list is List(1,2,3)",null,null))
Node("null",Node("Dwayne \"The Rock\" Johnson",null,null),Node("My list is List(1,2,3)",null,null))
Enter fullscreen mode Exit fullscreen mode

And the final test...

scala> val node = Node("root",Node("left",Node("left.left")),Node("right"))
node: Node = Node("root",Node("left",Node("left.left",null,null),null),Node("right",null,null))

scala> fromString(node.toString).left.left.value == "left.left"
res34: Boolean = true
Enter fullscreen mode Exit fullscreen mode

Nice!

Discussion

This problem is, in essence, an exercise in string manipulation. We want to make sure that our nodes are serialized in such a way that there are no degenerate cases (where two distinct nodes are mapped to the same serialized form), and we want to make sure that the user can't throw a wrench in the deserialization process by defining a value with hard-to-parse characters like " and ,. Since these are our delimiters, the user could very easily make deserialization much more difficult for us by including them.

Since this is essentially a string manipulation problem, and since I'm pretty experienced with regex, I decided to go in that direction. There may be a better or clearer way of solving this problem in Scala without regex, but my entire solution is only 10 lines of code:

case class Node (value: String, left: Node = null, right: Node = null) {
  require (value != null)
  override def toString = s"""Node("${value.replaceAll("\"", "\\\\\"")}",$left,$right)"""
}

def fromString(serialized: String): Node = {
  val arbitrary = """Node\("((?:[^\"]|\\")*)",(null|Node\(.*\)),(null|Node\(.*\))\)""".r
  serialized match {
    case arbitrary(value,left,right) => Node(
      value.replaceAll("\\\\\"", "\""),
      if (left == "null") null else fromString(left),
      if (right == "null") null else fromString(right))
  }
}
Enter fullscreen mode Exit fullscreen mode

I think my solution is clear, simple, and probably about as performant as any string manipulation problem can be, but if you have a better (or different) solution, I'd love to see it!


All the code for my Daily Coding Problems solutions is available at github.com/awwsmm/daily.

Suggestions? Let me know in the comments.

If you enjoyed this post, please consider supporting my work by buying me a coffee!

💖 💪 🙅 🚩
awwsmm
Andrew (he/him)

Posted on March 9, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related