Proteins as programs - to mock a matchingbird

palm86

Danie Palm

Posted on July 28, 2021

Proteins as programs - to mock a matchingbird

In this instalment of Elixir of Life we devise a mechanism by which chemical affinity and binding can be realized in our artificial life system. As a refresher, we are trying to build an artificial life system using an artificial chemistry that is based on programs that interact with one another. These programs are written in a restricted version of the stack-based Joy programming language in which we permit only functions and quotations. This restricted version of Joy is implemented in Elixir, and as a side-effect we explore several features of Elixir along the way.

One of the Elixir features that we encounter in this post is the idea of using mocks to fulfil explicit contracts when testing code using the mox library. Coincidentally, Joy has strong ties to combinatory logic, its applicative cousin. Combinatory logic was invented by Moses Schönfinkel (German for something like "pretty finch"), elaborated on by Haskell Curry, and popularized by Raymond Smullyan in the delightful little book To Mock a Mockingbird.

In this book Smullyan likens function application in combinatory logic to birds (some of which are finches) calling out the name of a bird upon hearing the name of a bird. One of the popular birds in the book is the Mockingbird, which when hearing the bird name x will make the same call that x would have made hearing its own name. When a mocking bird hears its own name, it results in one of the most elegant self-reproducing "programs" or quines in existence.

If functions are birds, then this post will introduce a couple of matchingbirds. That kind of explains the title.

Chemical recognition, affinity, and binding

The enzymatic complexes that initiate or catalyse the processes of transcription and translation often rely on recognizing patterns within the substrate sequences on which they operate. Examples of such patterns are the TATA box sequence, a stretch of DNA that indicates where transcription should begin, and the Shine-Dalgarno sequence, which lies upstream of the start codon AUG in bacterial and archaeal mRNA. We could even consider the start and stop codons of the genetic code to fall into this category of "sequences with meanings that need to be recognized".

While we could implement functionality that allows enzymes to recognize such sequences
with a predicate function like equal (as we have done up to now), the strict matching that we
get in such an implementation precludes any mistakes and variations from creeping in. Moreover, in biochemistry, sequences often match one another in a complementary way. Consider, for instance, how purines (A and G) strongly prefer to pair with pyrimidines (C and T/U) and weakly, if at all, with themselves. At first glance, such strong complementarity might not be obvious elsewhere in biochemistry, but consider how enzymatic substrate binding sites are really complementary to their substrates. Also consider how a single binding site could bind multiple related substrates, perhaps with different affinities or strengths.

So we turn to something fuzzy instead.

The match predicate function

The match function is a predicate that returns true with a probability that is proportional to the degree to which a candidate C matches a given pattern P, taking complementary binding into account as appropriate. In other words:

C P match == B
Enter fullscreen mode Exit fullscreen mode

where B is a boolean value or, when using Church-encoding, [pop i] for true and [swap pop i] for false. In other words, match is more likely to return [pop i] for values of C that strongly match P. Let us consider a few concrete examples involving nucleotide bases. Note that the results are probabilistic, I'm providing the most probable answer:

# individual RNA bases
a u match == true
a g match == false

# sequences of bases
[] [] match == true
[] [a] match == false
[a u g] [u a c] match == true
Enter fullscreen mode Exit fullscreen mode

But how does match determine how well the candidate matches the pattern?

Recall the second post of this series in which we've implemented the Needleman-Wunsch global sequence alignment algorithm and hinted at using (complementary) sequence alignment scores as mechanisms by which binding affinity could be determined. In the meantime, I've extended that work to include the Smith-Waterman local sequence alignment algorithm, and modularized the code to allow pluggable implementations of the similarity score. These alignment algorithms are available in the belign library which I hope to open-source at some point.

This will allow us to determine a similarity or rather "match" score for any two molecules in our system.

Let's have a look at the implementation of match in our base language Elixir:

def match([pattern, candidate | tail] = _stack) do
  # Determine the match score between the pattern and the candidate
  score = Belign.Similarity.Global.similarity(pattern, candidate)
  # Calculate a binding probability from the score
  binding_probability = Float.pow(score * 1.0, 10)
  # Pick a random pivot
  pivot = Slyk.Random.random()

  # Consider it a match if the binding probability is larger than the pivot.
  # @truthy equals [pop i] and @falsy equals [swap pop i] as per a previous
  # post on Church-encoding of booleans
  if binding_probability > pivot do
    [@truthy | tail]
  else
    [@falsy | tail]
  end
end
Enter fullscreen mode Exit fullscreen mode

The comments explain the code fairly well. The heavy lifting is of course done in Belign.Similarity.Global.similarity/2. Without going into the details, you can imagine that function as representing a similarity matrix that has rows and columns for each molecule (function or quotation) in our artificial life system. The value at i,j in this matrix represents how well the candidate j matches the pattern i.

One aspect to highlight is that the similarity function can also deal with cases in which the pattern or candidate is not just a single function, but a list of functions (a quotation). This is where the Needleman-Wunsch algorithm comes in. When dealing with quotations, the similarity function delegates to
Belign.align_global(pattern, candidate, opts). One of the options that is passed as the final argument is
this: similarity: Belign.Similarity.Global which tells the Needleman-Wunsch algorithm to use Belign.Similarity.Global.similarity/2 as it's similarity score function. This recursive relationship between Belign.Similarity.Global.similarity/2 and Belign.align_global/3 means that we can determine the degree to which molecules, containing arbitrary levels of nesting, match one another.

Testing the match predicate function

We have not discussed testing in any of the posts so far. However, a solid suite of tests at every level of abstraction
is going to be indispensable as we go forward. For most of the functions we've come across so far, unit tests are trivial to implement. The pure nature of these functions means that a given input will always lead to a given output. So we just have to think of all the cases we can hit, and we're done.

Here is a first approach at testing the match function.

test "match/1 correctly matches complementary mRNA sequences" do
  assert ~J"[a u g] [u a c] match" == ~J"true"
  assert ~J"[a u u] [u a c] match" == ~J"false"
  assert ~J"[a u c] [u a c] match" == ~J"false"
  assert ~J"[a u a] [u a c] match" == ~J"false"
  # ... and many more
end
Enter fullscreen mode Exit fullscreen mode

However, the test only passes for a fraction of the times that we run it. This is because the match function is not a pure function. It has a random component to it. This is where mox comes in. Slyk.Random.random/0 delegates to an underlying implementation. Under normal operating
conditions, it delegates to a function that calculates a uniform random floating point number between 0 and 1. An implementation of which looks like this:

defmodule Slyk.Random.Uniform do
  @behaviour Slyk.Random.Behaviour

  @impl Slyk.Random.Behaviour
  @spec random :: float
  def random() do
    :rand.uniform()
  end
end
Enter fullscreen mode Exit fullscreen mode

When we test however, we want a controlled environment, which we can obtain by using a mock implementation of Slyk.Random. We add the following to our test/test_helpers.exs file:

Mox.defmock(Slyk.Random.Mock, for: Slyk.Random.Behaviour)
Enter fullscreen mode Exit fullscreen mode

And then modify the test like this:

test "match/1 correctly matches complementary mRNA sequences" do
  # Tell Slyk.Random to delegate to Slyk.Random.Mock
  Application.put_env(:slyk, Slyk.Random, implementation: Slyk.Random.Mock)

  # Provide a case-specific stub implementation of Slyk.Random.Mock
  Slyk.Random.Mock
  |> stub(:random, fn ->
    0.8
  end)

  # The score is always above 0.8 for the first case
  assert ~J"[a u g] [u a c] match" == ~J"true"
  # The score is always below 0.8 for the last three cases
  assert ~J"[a u u] [u a c] match" == ~J"false"
  assert ~J"[a u c] [u a c] match" == ~J"false"
  assert ~J"[a u a] [u a c] match" == ~J"false"
end
Enter fullscreen mode Exit fullscreen mode

A primordial soup, a slimy pool

We are now finally getting close to a point where we have all the basic components required to realize our artificial life system. One of the last hurdles that we need to overcome is to devise a mechanism by which molecules (programs) in our system can bind, react, and dissociate.

For illustration purposes, the ribosome has been used like this so far:

mrna ribosome # leaves an unfolded protein on the stack
Enter fullscreen mode Exit fullscreen mode

This of course doesn't tell us how the mRNA ended up on the stack in the first place. What we really want is for molecules, and specifically enzymes like the ribosome, to bind their substrates from a pool of possibilities according to an inherent affinity. And to explicitly release any products to the pool, rather than leaving them on the stack.

Enter getl, getg, and put. The former two are similar to Joy's built-in get function which waits for input from the standard input. But they are actually closer to Elixir's IO.gets/1 which takes a prompt as argument and then waits for input. Likewise, put is similar to Joy's put and Elixir's IO.puts/1 that write a string to the standard output.

But instead of writing and reading strings to and from standard IO, we implement getl and getg to "read" a candidate program that matches a pattern (prompt) from a pool of programs using, respectively, the local (Smith-Waterman) and global (Needleman-Wunsch) alignment algorithms and placing the program on the stack as a quotation. Similarly, put takes a program from the top of the stack and inserts it into a pool of programs.

The pool that I'm referring to here is similar enough to the slimy one that we've described in the very first post of this series that I will not discuss it again, except to say that in its latest incarnation it implements the Slyk.Pool behaviour so that we can easily test programs that rely on it using a mock pool implementation. More on this in upcoming posts.

Here are the Elixir implementations for getl (getg is similar) and put:

def getl([pattern | tail] when is_list(pattern)) do
    {:ok, match} = Slyk.Pool.getl(pattern)
    [match | tail]
  end
end

def put([head | tail] when is_list(head)) do
  Slyk.Pool.put(head)
  tail
end
Enter fullscreen mode Exit fullscreen mode

The default Slyk.Pool implementation is backed by a GenServer. This means that, for instance, Slyk.Pool.getl delegates to Slyk.Pool.Default.getl, which uses GenServer.call to make a synchronous request to a GenServer process, which when it deals with the incoming message, executes a callback that looks like this:

def handle_call({:getl, prompt}, _from, state) do
  # Get a random pivot value
  pivot = Slyk.Random.random()

  # Find the index of the first element that has a
  # binding probability that is larger than the random pivot
  state
  |> Enum.find_index(fn candidate ->
    score = Belign.Similarity.Local.similarity(prompt, candidate)

    binding_probability = Float.pow(score * 1.0, 10)
    binding_probability > pivot
  end)
  |> case do
    nil ->
      {:reply, {:error, :no_match}, state}

    index ->
      {winner, rest} = List.pop_at(state, index)
      {:reply, {:ok, winner}, rest}
  end
end
Enter fullscreen mode Exit fullscreen mode

The probability function is a tenth-degree polynomial. We might refine this later, but for now it maps 0 to 0, 1 to 1, and most values in between to something close to 0.

For completeness, here is the callback that handles put. Slyk.Pool.Default.put uses GenServer.cast, which makes it asynchronous. This is appropriate because we are not interested in a reply from the pool process after it inserts the element in the pool.

def handle_cast({:put, element}, state) do
  {:noreply, [element | state]}
end
Enter fullscreen mode Exit fullscreen mode

Testing the waters

There are various levels of abstraction at play here that require testing. We need to test Slyk.Pool.Default to ensure that Slyk.Pool.Default.put, Slyk.Pool.Default.getl and Slyk.Pool.Default.getg behave as expected. In these tests we care about the implementation details and so essentially wish to test the callback functions. The callback for put is easy to test because it is pure. The callback for getl, on the other hand again calls Slyk.Random.random() for which we will use a mock:

  describe "handle_call({:getl, [:a, :u, :g]}" do
    test "binds [:u, :a, :c], regardless of where it is located in the pool" do
      Slyk.Random.Mock
      |> stub(:random, fn ->
        0.8
      end)

      # winning candidate at end of pool
      pool = [[:random], [:molecules], [:u, :a, :c]]

      assert {:reply, {:ok, [:u, :a, :c]}, [[:random], [:molecules]]} =
               Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), pool)

      # winning candidate in middle of pool
      pool = [[:random], [:u, :a, :c], [:molecules]]

      assert {:reply, {:ok, [:u, :a, :c]}, [[:random], [:molecules]]} =
               Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), pool)

      verify!()
    end

    test "fails if [:u, :a, :c] or something similar is not present" do
      Slyk.Random.Mock
      |> stub(:random, fn ->
        0.8
      end)

      # the pool
      state = [[:random], [:molecules]]

      assert {:reply, {:error, :no_match}, [[:random], [:molecules]]} =
               Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), state)

      verify!()
    end
  end
Enter fullscreen mode Exit fullscreen mode

At a higher level, we should be testing our newly introduced Joy functions getl, getg, and put. Their functionality is, however, independent of the underlying pool implementation, so instead of mocking Slyk.Random, we can opt to mock Slyk.Pool instead. At yet a higher level, we need tests for entire programs (proteins) and how they interact with their environment. More on that in another post.

Conclusion

We are now approaching a chemical system in which molecules can be "activated" one at a time (or concurrently - effortlessly because of Elixir). Molecules are Joy programs and activation means execution. Joy programs are really function compositions (syntactically, whitespace represents the composition operator) that are made up of functions like pop and swap, and quotations which are lists of functions. A quotation is actually also just a function that places itself on the stack.

Using special getl/getg and put IO-like functions, molecules can "bind" or recruit other molecules from the mix, just like enzymes or proteins would bind their substrates. They can also "release" products back into the pool. It might turn out that only one of getl and getg is required. That will be a happy day, but for now getl will probably be more useful in binding polymers and getg will be more useful in binding oligomers/monomers.

In addition, we have explored preliminary implementations of a ribosome (something like a universal constructor) and a "chaperone" (something like a universal assembler). We have also defined a preliminary artificial genetic code. Armed with the notions of affinity, association, and dissociation developed in this post, we can now develop fully-fledged ribosomes, chaperones, and the multitude of other enzymes that we'll need (this is only the beginning). We can also then define the final version of an artificial genetic code, which we'll use to reverse-construct something like DNA, which will contain recipes for all the enzymes and rRNA in our system.

💖 💪 🙅 🚩
palm86
Danie Palm

Posted on July 28, 2021

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related