Proteins as programs - to mock a matchingbird
Danie Palm
Posted on July 28, 2021
In this instalment of Elixir of Life we devise a mechanism by which chemical affinity and binding can be realized in our artificial life system. As a refresher, we are trying to build an artificial life system using an artificial chemistry that is based on programs that interact with one another. These programs are written in a restricted version of the stack-based Joy programming language in which we permit only functions and quotations. This restricted version of Joy is implemented in Elixir, and as a side-effect we explore several features of Elixir along the way.
One of the Elixir features that we encounter in this post is the idea of using mocks to fulfil explicit contracts when testing code using the mox
library. Coincidentally, Joy has strong ties to combinatory logic, its applicative cousin. Combinatory logic was invented by Moses Schönfinkel (German for something like "pretty finch"), elaborated on by Haskell Curry, and popularized by Raymond Smullyan in the delightful little book To Mock a Mockingbird.
In this book Smullyan likens function application in combinatory logic to birds (some of which are finches) calling out the name of a bird upon hearing the name of a bird. One of the popular birds in the book is the Mockingbird, which when hearing the bird name x will make the same call that x would have made hearing its own name. When a mocking bird hears its own name, it results in one of the most elegant self-reproducing "programs" or quines in existence.
If functions are birds, then this post will introduce a couple of matchingbirds. That kind of explains the title.
Chemical recognition, affinity, and binding
The enzymatic complexes that initiate or catalyse the processes of transcription and translation often rely on recognizing patterns within the substrate sequences on which they operate. Examples of such patterns are the TATA box sequence, a stretch of DNA that indicates where transcription should begin, and the Shine-Dalgarno sequence, which lies upstream of the start codon AUG in bacterial and archaeal mRNA. We could even consider the start and stop codons of the genetic code to fall into this category of "sequences with meanings that need to be recognized".
While we could implement functionality that allows enzymes to recognize such sequences
with a predicate function like equal
(as we have done up to now), the strict matching that we
get in such an implementation precludes any mistakes and variations from creeping in. Moreover, in biochemistry, sequences often match one another in a complementary way. Consider, for instance, how purines (A and G) strongly prefer to pair with pyrimidines (C and T/U) and weakly, if at all, with themselves. At first glance, such strong complementarity might not be obvious elsewhere in biochemistry, but consider how enzymatic substrate binding sites are really complementary to their substrates. Also consider how a single binding site could bind multiple related substrates, perhaps with different affinities or strengths.
So we turn to something fuzzy instead.
The match
predicate function
The match
function is a predicate that returns true with a probability that is proportional to the degree to which a candidate C
matches a given pattern P
, taking complementary binding into account as appropriate. In other words:
C P match == B
where B
is a boolean value or, when using Church-encoding, [pop i]
for true and [swap pop i]
for false. In other words, match
is more likely to return [pop i]
for values of C
that strongly match P
. Let us consider a few concrete examples involving nucleotide bases. Note that the results are probabilistic, I'm providing the most probable answer:
# individual RNA bases
a u match == true
a g match == false
# sequences of bases
[] [] match == true
[] [a] match == false
[a u g] [u a c] match == true
But how does match
determine how well the candidate matches the pattern?
Recall the second post of this series in which we've implemented the Needleman-Wunsch global sequence alignment algorithm and hinted at using (complementary) sequence alignment scores as mechanisms by which binding affinity could be determined. In the meantime, I've extended that work to include the Smith-Waterman local sequence alignment algorithm, and modularized the code to allow pluggable implementations of the similarity score. These alignment algorithms are available in the belign
library which I hope to open-source at some point.
This will allow us to determine a similarity or rather "match" score for any two molecules in our system.
Let's have a look at the implementation of match
in our base language Elixir:
def match([pattern, candidate | tail] = _stack) do
# Determine the match score between the pattern and the candidate
score = Belign.Similarity.Global.similarity(pattern, candidate)
# Calculate a binding probability from the score
binding_probability = Float.pow(score * 1.0, 10)
# Pick a random pivot
pivot = Slyk.Random.random()
# Consider it a match if the binding probability is larger than the pivot.
# @truthy equals [pop i] and @falsy equals [swap pop i] as per a previous
# post on Church-encoding of booleans
if binding_probability > pivot do
[@truthy | tail]
else
[@falsy | tail]
end
end
The comments explain the code fairly well. The heavy lifting is of course done in Belign.Similarity.Global.similarity/2
. Without going into the details, you can imagine that function as representing a similarity matrix that has rows and columns for each molecule (function or quotation) in our artificial life system. The value at i,j
in this matrix represents how well the candidate j
matches the pattern i
.
One aspect to highlight is that the similarity function can also deal with cases in which the pattern or candidate is not just a single function, but a list of functions (a quotation). This is where the Needleman-Wunsch algorithm comes in. When dealing with quotations, the similarity function delegates to
Belign.align_global(pattern, candidate, opts)
. One of the options that is passed as the final argument is
this: similarity: Belign.Similarity.Global
which tells the Needleman-Wunsch algorithm to use Belign.Similarity.Global.similarity/2
as it's similarity score function. This recursive relationship between Belign.Similarity.Global.similarity/2
and Belign.align_global/3
means that we can determine the degree to which molecules, containing arbitrary levels of nesting, match one another.
Testing the match
predicate function
We have not discussed testing in any of the posts so far. However, a solid suite of tests at every level of abstraction
is going to be indispensable as we go forward. For most of the functions we've come across so far, unit tests are trivial to implement. The pure nature of these functions means that a given input will always lead to a given output. So we just have to think of all the cases we can hit, and we're done.
Here is a first approach at testing the match
function.
test "match/1 correctly matches complementary mRNA sequences" do
assert ~J"[a u g] [u a c] match" == ~J"true"
assert ~J"[a u u] [u a c] match" == ~J"false"
assert ~J"[a u c] [u a c] match" == ~J"false"
assert ~J"[a u a] [u a c] match" == ~J"false"
# ... and many more
end
However, the test only passes for a fraction of the times that we run it. This is because the match
function is not a pure function. It has a random component to it. This is where mox
comes in. Slyk.Random.random/0
delegates to an underlying implementation. Under normal operating
conditions, it delegates to a function that calculates a uniform random floating point number between 0 and 1. An implementation of which looks like this:
defmodule Slyk.Random.Uniform do
@behaviour Slyk.Random.Behaviour
@impl Slyk.Random.Behaviour
@spec random :: float
def random() do
:rand.uniform()
end
end
When we test however, we want a controlled environment, which we can obtain by using a mock implementation of Slyk.Random
. We add the following to our test/test_helpers.exs
file:
Mox.defmock(Slyk.Random.Mock, for: Slyk.Random.Behaviour)
And then modify the test like this:
test "match/1 correctly matches complementary mRNA sequences" do
# Tell Slyk.Random to delegate to Slyk.Random.Mock
Application.put_env(:slyk, Slyk.Random, implementation: Slyk.Random.Mock)
# Provide a case-specific stub implementation of Slyk.Random.Mock
Slyk.Random.Mock
|> stub(:random, fn ->
0.8
end)
# The score is always above 0.8 for the first case
assert ~J"[a u g] [u a c] match" == ~J"true"
# The score is always below 0.8 for the last three cases
assert ~J"[a u u] [u a c] match" == ~J"false"
assert ~J"[a u c] [u a c] match" == ~J"false"
assert ~J"[a u a] [u a c] match" == ~J"false"
end
A primordial soup, a slimy pool
We are now finally getting close to a point where we have all the basic components required to realize our artificial life system. One of the last hurdles that we need to overcome is to devise a mechanism by which molecules (programs) in our system can bind, react, and dissociate.
For illustration purposes, the ribosome has been used like this so far:
mrna ribosome # leaves an unfolded protein on the stack
This of course doesn't tell us how the mRNA ended up on the stack in the first place. What we really want is for molecules, and specifically enzymes like the ribosome, to bind their substrates from a pool of possibilities according to an inherent affinity. And to explicitly release any products to the pool, rather than leaving them on the stack.
Enter getl
, getg
, and put
. The former two are similar to Joy's built-in get
function which waits for input from the standard input. But they are actually closer to Elixir's IO.gets/1
which takes a prompt as argument and then waits for input. Likewise, put
is similar to Joy's put
and Elixir's IO.puts/1
that write a string to the standard output.
But instead of writing and reading strings to and from standard IO, we implement getl
and getg
to "read" a candidate program that matches a pattern (prompt) from a pool of programs using, respectively, the local (Smith-Waterman) and global (Needleman-Wunsch) alignment algorithms and placing the program on the stack as a quotation. Similarly, put
takes a program from the top of the stack and inserts it into a pool of programs.
The pool that I'm referring to here is similar enough to the slimy one that we've described in the very first post of this series that I will not discuss it again, except to say that in its latest incarnation it implements the Slyk.Pool
behaviour so that we can easily test programs that rely on it using a mock pool implementation. More on this in upcoming posts.
Here are the Elixir implementations for getl
(getg
is similar) and put
:
def getl([pattern | tail] when is_list(pattern)) do
{:ok, match} = Slyk.Pool.getl(pattern)
[match | tail]
end
end
def put([head | tail] when is_list(head)) do
Slyk.Pool.put(head)
tail
end
The default Slyk.Pool
implementation is backed by a GenServer
. This means that, for instance, Slyk.Pool.getl
delegates to Slyk.Pool.Default.getl
, which uses GenServer.call
to make a synchronous request to a GenServer
process, which when it deals with the incoming message, executes a callback that looks like this:
def handle_call({:getl, prompt}, _from, state) do
# Get a random pivot value
pivot = Slyk.Random.random()
# Find the index of the first element that has a
# binding probability that is larger than the random pivot
state
|> Enum.find_index(fn candidate ->
score = Belign.Similarity.Local.similarity(prompt, candidate)
binding_probability = Float.pow(score * 1.0, 10)
binding_probability > pivot
end)
|> case do
nil ->
{:reply, {:error, :no_match}, state}
index ->
{winner, rest} = List.pop_at(state, index)
{:reply, {:ok, winner}, rest}
end
end
The probability function is a tenth-degree polynomial. We might refine this later, but for now it maps 0 to 0, 1 to 1, and most values in between to something close to 0.
For completeness, here is the callback that handles put
. Slyk.Pool.Default.put
uses GenServer.cast
, which makes it asynchronous. This is appropriate because we are not interested in a reply from the pool process after it inserts the element in the pool.
def handle_cast({:put, element}, state) do
{:noreply, [element | state]}
end
Testing the waters
There are various levels of abstraction at play here that require testing. We need to test Slyk.Pool.Default
to ensure that Slyk.Pool.Default.put
, Slyk.Pool.Default.getl
and Slyk.Pool.Default.getg
behave as expected. In these tests we care about the implementation details and so essentially wish to test the callback functions. The callback for put
is easy to test because it is pure. The callback for getl
, on the other hand again calls Slyk.Random.random()
for which we will use a mock:
describe "handle_call({:getl, [:a, :u, :g]}" do
test "binds [:u, :a, :c], regardless of where it is located in the pool" do
Slyk.Random.Mock
|> stub(:random, fn ->
0.8
end)
# winning candidate at end of pool
pool = [[:random], [:molecules], [:u, :a, :c]]
assert {:reply, {:ok, [:u, :a, :c]}, [[:random], [:molecules]]} =
Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), pool)
# winning candidate in middle of pool
pool = [[:random], [:u, :a, :c], [:molecules]]
assert {:reply, {:ok, [:u, :a, :c]}, [[:random], [:molecules]]} =
Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), pool)
verify!()
end
test "fails if [:u, :a, :c] or something similar is not present" do
Slyk.Random.Mock
|> stub(:random, fn ->
0.8
end)
# the pool
state = [[:random], [:molecules]]
assert {:reply, {:error, :no_match}, [[:random], [:molecules]]} =
Slyk.Pool.Default.handle_call({:getl, [:a, :u, :g]}, make_ref(), state)
verify!()
end
end
At a higher level, we should be testing our newly introduced Joy functions getl
, getg
, and put
. Their functionality is, however, independent of the underlying pool implementation, so instead of mocking Slyk.Random
, we can opt to mock Slyk.Pool
instead. At yet a higher level, we need tests for entire programs (proteins) and how they interact with their environment. More on that in another post.
Conclusion
We are now approaching a chemical system in which molecules can be "activated" one at a time (or concurrently - effortlessly because of Elixir). Molecules are Joy programs and activation means execution. Joy programs are really function compositions (syntactically, whitespace represents the composition operator) that are made up of functions like pop
and swap
, and quotations which are lists of functions. A quotation is actually also just a function that places itself on the stack.
Using special getl
/getg
and put
IO-like functions, molecules can "bind" or recruit other molecules from the mix, just like enzymes or proteins would bind their substrates. They can also "release" products back into the pool. It might turn out that only one of getl
and getg
is required. That will be a happy day, but for now getl
will probably be more useful in binding polymers and getg
will be more useful in binding oligomers/monomers.
In addition, we have explored preliminary implementations of a ribosome (something like a universal constructor) and a "chaperone" (something like a universal assembler). We have also defined a preliminary artificial genetic code. Armed with the notions of affinity, association, and dissociation developed in this post, we can now develop fully-fledged ribosomes, chaperones, and the multitude of other enzymes that we'll need (this is only the beginning). We can also then define the final version of an artificial genetic code, which we'll use to reverse-construct something like DNA, which will contain recipes for all the enzymes and rRNA in our system.
Posted on July 28, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.