A typogenetics implementation in Elixir
Danie Palm
Posted on June 14, 2020
One of my favourite artificial life systems is called Typographical Genetics, or typogenetics in short. It was introduced by Douglas Hofstadter in his book Gödel, Escher, Bach. It is purposefully simplistic, but captures a number of principles found in "real" genetics remarkably well. In a sense, typogenetics was developed as a game for the amusement of the reader, and as such a crucial part of its functionality is left to the reader to play: the role of the ribosome.
In this post we take a look at typogenetics as a friendly introduction to the biological process of translation according to the rules of the genetic code. We then continue to implement typogenetics in Elixir.
Biology
There are two types of molecules or strings in typogenetics - typographical DNA and typographical enzymes. Just as real DNA is a strand made up of the nitrogen bases A, C, G and T, the DNA in typogenetics (which also plays the role of RNA) is made up of the letters "A", "C", "G", and "T". Similarly, just as real enzymes are chains of amino acids (each of which is usually represented as a three letter abbreviation), so enzymes in typogenetics are strings of three-letter "amino acids" like "cut", "cop", "swi".
Such a typographical enzyme is really a composition of the individual functions that correspond to each of the amino acids of the enzyme. Enzymes take DNA strands as input and never act on other enzymes.
Once bound to a DNA strand, an enzyme executes each of its amino acids or functions on the strand. In this way, an enzyme can introduce additional letters into the strand, delete letters, copy the strand, search for specific types of letters, or even switch to the complementary strand and continue its execution there. Here is the full list of typographical amino acids and their effect on DNA strands (the letters in parentheses represent a twist that we'll discuss later):
cut (s) - cut the DNA strand(s) to the right of the current position
del (s)- delete a base from the strand and move to the unit on the right
swi (r)- switch the enzyme to the other strand
mvr (s)- move the enzyme one DNA unit to the right
mvl (s)- move the enzyme one DNA unit to the left
cop (r)- enable copy mode
off (l)- disable copy mode
ina (s)- insert an A to the right of the current position
inc (r)- insert a C to the right of the current position
ing (r)- insert a G to the right of the current position
int (l)- insert a T to the right of the current position
rpy (r)- move to the nearest pyrimidine (C or T) to the right
rpu (l)- move to the nearest purine (A or G) to the right
lpy (l)- move to the nearest pyrimidine to the left
lpu (l)- move to the nearest purine to the left
Initially, copy mode is off. Any invocation of off
has no effect in this case. Likewise, cop
turns on copy mode only if it is not already on. The insert functions (in<x>
) insert the given letter in the active strand, but also insert the complement of the given letter in the complementary strand if copy mode is on. del
on the other hand always only acts on the active strand, whereas cut
again acts on both strands. Switching (swi
) terminates the enzyme execution if no base is present on the complementary strand at that position. Execution is also halted when moving or searching past the boundaries of the active strand, or when all the amino acids have been executed.
Like real enzymes, enzymes in typogenetics also fold up into a tertiary structure that ultimately depends on its primary amino acid sequence. And the tertiary structure in turn determines the binding preference of the enzyme, much like the tertiary structure of real enzymes forms binding sites that are specific to the enzyme's substrates. Each typogenetics amino acid is assigned a "kink" (straight, left, or right) that determines how the enzyme turns at every point. These kinks are listed in the table of amino acids above. Finally, the relative angles between the first and last amino acids (of which only four unique values exist: 0, 90, 180, and 270 degrees) are mapped onto the four bases in DNA, determining its affinity.
If the enzyme is always oriented so that the first segment points in an eastern direction, and if the last amino acid's kink is ignored, then the enzyme's affinity can effectively be determined by considering only the direction of the final segment of the enzyme:
First segment | Last segment | Affinity |
---|---|---|
East | East | A |
East | North | C |
East | South | G |
East | West | T |
Note that all of the inner amino acids (excluding only the first and last) determine the enzyme's affinity as a whole.
The typogenetics game starts off with a pool of DNA strands. First, a DNA strand is selected from the pool and translated according to the typogenetic code into an enzyme. The enzyme is then folded according to the kinks of its amino acids, its binding affinity is determined, it binds to a random DNA strand at a random position that matches its affinity, and starts to execute its amino acids one by one.
There are 15 amino acids, and hence we need pairs of DNA letters (called codons) to represent single amino acids. There are 16 possible pairs, leaving one pair that is not assigned to any amino acid. This extra unmatched codon (which is chosen to be "AA") will be used as punctuation, i.e. a stop codon. In real genetics, in contrast, codons are made up of 3 bases and there are 20 amino acids. This leaves a lot of room for redundancy. The table below summarizes the typogenetic code (rows represent the first letter and columns the second letter of the codon).
A | C | G | T | |
---|---|---|---|---|
A | cut | del | swi | |
C | mvr | mvl | cop | off |
G | ina | inc | ing | int |
T | rpy | rpu | lpy | lpu |
Let's consider a quick end-to-end example. We pick a random DNA strand 'CGGATACTAAACCGA' and translate it to an enzyme. First, we chunk the strand by subsequences of length 2 (the length of a codon):
CG GA TA CT AA AC CG A
According to the typogenetic code, this maps to the following amino acids:
cop ina rpy off
and
cut cop
Note how the AA
codon resulted in the initiation of a new enzyme and how the terminal unpaired A
was discarded. Now let's pick the first enzyme cop-ina-rpy-off
and fold it. The corresponding kinks are r s r l
. We ignore the first kink and start of with the direction east
, we go straight and hence the direction remains east
, we turn right and now face south
and finally we turn left to face east
again.
cop - ina
|
rpy - off
The folded enzyme has three segments: east, south, east
. The final segment is east
so that the enzyme only has a binding affinity for the letter A
.
Now consider another random DNA strand, CAGAGAGT
that will serve as input to the enzyme above. Suppose the enzyme binds to the second A
. It immediately switches on copy mode (cop
) and then inserts an A
(ina
):
T
CAGAAGAGT
Since copy mode is active, a complementary T
is inserted in the complementary strand. Next we move right until a pyrimidine is reached (rpy
). As the enzyme moves, it copies the active strand by inserting complementary bases in the inactive strand:
TCTC
CAGAAGAGT
When T
, a pyrimidine, is reached, copy mode is switched off (off
). The enzyme has now executed all its amino acids and so terminates and is discarded. This releases the input and complementary strands back into the DNA pool. Note however that the complementary strand is inserted in reverse: CAGAAGAGT
and CTCT
.
Code
The Elixir implementation of typogenetics has many similarities to the "pool of protein" prototypes that we have developed so far in this series. I will not discuss it in detail here. Interested readers are invited to look at the source code on GitHub:
palm86 / typogenetics
Hofstadter's typographical genetics in Elixir
One interesting aspect is perhaps the data structure that is used to represent DNA strands. I have toyed with the idea of implementing typogenetics on a few occasions, but was never quite happy with the strand representations that I came up with. Approaching the problem from a functional angle finally led me to something that I find elegant.
Strands are complicated because they consist of a main sequence of letters, to which a complementary letter could be bound at any position, forming a potentially discontinuous complementary sequence. Moreover, strands are directional - the conventional "forward" direction is in the so called 5' to 3' direction. But when two strands pair up, they do so in opposite directions. Take for instance the sequence ACGT
bound to its complementary sequence:
3'-TGCA-5'
5'-ACGT-3'
The final complication is that, apart from the letters in the strands, it is also necessary to keep track of the position at which the enzyme is bound to the strands.
+---+---+---+---+
| | | C | | <- complementary
+---------------+
| A | C | G | T | <- active
+---+---+---+---+
^
pointer
There are of course many ways to represent an enzyme-bound strand. One solution is to employ two arrays, one for the main strand and one for the complementary strand, and a pointer that keeps track of the index at which the enzyme is bound. This solution is not ideal because it requires resizing the arrays as the strands grow. And whenever the swi
(switch) operation is encountered, the arrays have to be reversed and the pointer needs to be recalculated.
+---+---+---+---+
| | | C | | <- complementary
| A | C | G | T | <- active
+---+---+---+---+
^
pointer
A slightly better solution is to use only a single array in which the elements are 2-tuples or pairs - the first element of the tuple corresponding to a base of the active strand and the second element corresponding to a base of the complementary strand. This solution requires only a single array to be resized or reversed, but it still requires a separate pointer that has to be recalculated on swi
.
+-----+ +-----+
| C, | | G,C |
+-----+ +-----+
| A, | | T, |
+-----+ +-----+
^ ^
left_stack right_stack
But we can still do better. Instead of using an array of pairs, two stacks can be used to elegantly get rid of the pointer. In other words, a pair of stacks of pairs. The top of the first stack represents the current position to which the enzyme is bound. The elements below it represent, in reverse order, all the bases up to the start. The second stack represents all the remaining base pairs in the usual order. The enzyme's position can be moved to the right by pushing the top of the right stack onto the left stack. Similarly, pushing the top of the left stack onto the right stack, moves the enzyme one step to the left. Switching strands is accomplished by swapping the left and right stacks and by swapping the base pairs in each stack.
In the stack-based solution, a bound strand is somewhat analogous to an open book. The stack of pages on the left hand side is sorted in descending order and the stack on the right is sorted in ascending order. The current position is implied by the number of pages on each stack. Let's arbitrarily suppose that the current page is the top page of the left hand stack. The current position can easily be changed by flipping pages from the right hand stack to the left hand stack, or the other way around. One could force the analogy and say that the "active" text is on one side of a page, and its complement or reverse shines through when viewing the page in the light. Pages can also be inserted in theory.
Let's consider an example representation in Elixir. Take for example the strand
A|A
GAT|TACA
The |
represents the current position. The active strand is at the bottom and the complementary strand is at the top. The current position is just left of the |
in the active strand. We can represent this in Elixir with a 2-tuple of lists of 2-tuples:
{left_stack, right_stack}
With left_stack
being:
[{:T, :A}, {:A, nil}, {:G, nil}]
Each entry of the list is a 2-tuple with the active strand base as first position, and its complement or nil
(if no complementary base is present) as the second. Note that the list is reversed in comparison the the string representation above.
Similarly, the right_stack
looks like this:
[{:T, :A}, {:A, nil}, {:C, nil}, {:A, nil}]
The only difference being that this time the order agrees with the string representation. Note that lists in Elixir are linked lists, which means that they behave pretty much like stacks.
You can clone the GitHub repo and run a simulation with:
$ iex -S mix
iex> Typogenetics.run()
This will generate 20 random strands of length 20 and start the execution. The output will look something like:
13:31:14.323 [info] Consumed: ATACGAGCCGTCTATGGCCG
13:31:14.325 [info] Produced: ATACCGCAGCCGTCTATGGCCG
13:31:14.722 [info] Consumed: TGTCTGCAGGTTTGACCACT
13:31:14.722 [info] Produced: TGTCTGCAGGTTTGCTTACCACT
13:31:14.722 [info] Produced: AAGCA
Alternatively, you can use the package as a dependency and wire up the subcomponents for all kinds of experiments.
To conclude, typogenetics was designed as a game to introduce readers of Gödel, Escher, Bach to real genetics. But Hofstadter also posed the question as to whether typogenetics would be capable of self-reproduction. I.e. would it be possible to seed the game with just the right set of strands so that all of the strands will eventually (if not directly) be reproduced. Self-reproduction in typogenetics has been the topic of a number of theses and blogs and the system does indeed lend itself to all kinds of self-reproducing schemes.
However, a critical missing piece in typogenetics is the ribosome, an enzyme that is able to translate DNA into amino acids according to the Typogenetic Code. As it stands, the rules of typogenetics doesn't allow for a ribosome that is a true internal typogenetics enzyme. Such an enzyme would have to be able to act on DNA and produce an amino acids sequence. Instead, an external ribosome is pulled in, the player of the game.
Hofstadter was very much aware of the central role played by ribosomes, and purposefully omitted it to keep typogenetics digestible. Nevertheless, it is tempting to contemplate the minimal set of changes that can be elegantly introduced to typogenetics that would allow for ribosomes as first class citizens.
In the artificial life system that we are developing in this series, there will be no fixed rules to the effect that enzymes can only act on and produce DNA. Any such rules should come from within the system and should in principle be able to evolve with the system. This will allow us to develop a ribosome as first-class citizen.
Posted on June 14, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.