Assignment 2

First Part Due on 3/2/17

Kuaffman automata
 

The aim of this assignment is to show the value of simple models in exploring aspects of complex problems in biology and also provide an example of how connection properties of a network connecting various nodes that constitute a system can influence the collective dynamics of the system.

As a concrete example, let us consider the transcription regulation network in cells that control the expression of genes resulting in a specific phenotype. The discovery that genes can switch each other on or off (by a gene expressing a protein that bind to the regulatory site of another gene, promoting or repressing its expression) and are thus interacting with each other via a network of mutual dependence, can be traced to the work of Francois Jacob and Jacques Monod on the lac operon in the bacterium Escherichia coli (E. coli) in the 1960s.

One of the simplest models of this system was proposed in 1969 by Stuart Kauffman. In this setting, we simplify the bipartite network of gene-protein associations (genes connect to proteins that they express and proteins connect to genes that they regulate) and reduce it to a simple network of genes - with one gene having a directed connection to another gene if the former expresses a protein that regulates the latter. Each gene is assumed to be in any one of two states, expressing (ON=1) or non-expressing (OFF=0) at any given time - hence the state of a gene is a Boolean variable (i.e., 0 or 1). As each of the gene-regulatory interactions for a species may not be known, and moreover, what exactly is the role of a regulatory protein on the expression of a gene may often be unclear, we can further simplify by looking (instead of the transcription network of a particular species) at a generic instance of such a network referred to as the NK Random Boolean network model (or Kauffman Automata).

Here we consider that there are N nodes (genes) each of which receives incoming connections from K nodes in the network (i.e., the gene is controlled in principle by K proteins expressed by the corresponding K genes). Once N and K is specified, we need to therefore construct a directed network of N nodes where the links are chosen to be placed between randomly chosen pairs of nodes such that the in-degree of each node is K. This defines the (directed) adjacency matrix of the system.

Once the adjacency matrix is constructed, we assign a ``logic function'' to each node which specifies what the output of the node will be for each of the 2^K possible input strings of 0s and 1s. Thus, out of the 2^(2^K) possible logic functions we randomly pick one and assign to a node, and do this in turn for all the N nodes. Finally, we pick an initial state for the network (i.e., an assignment of 0 or 1 for each of the N nodes) and start the time-evolution of the system. The system evolves in discrete time-steps by looking at the current state of the nodes and using the logic functions associated with each node, to determine the next state of the nodes.

Eventually, the system may settle down to an attractor that could be time-invariant ("fixed-point") or having a simple period (e.g., switches back and forth between two network states) when K is small. When K is large (the extreme case being K=N-1, such that every node regulates every other) we can get attractors with periods so long that for all practical purposes they appear to be aperiodic.

The first part of the assignment is to program such a network with N=10 and observe its evolution for different values of K - viz., K=1, 2, 3, 4 and 9. For each N,K, you can construct the adjacency matrix and the logic tables randomly, and then tabulate the output starting from any of the 2^N (=1024) possible initial states.

There are two ways you should do this.

First, you can determine the behavior of the network for a given value of N,K by taking a random sample of initial states (let's say 100) and calculating (i) the number of different attractors (i.e., how many distinct end-states can the network converge to starting from different initial conditions), (ii) the period of these attractors (i.e., whether they are fixed point, period-2 etc.) and (iii) an estimate of the size of the basin of attraction (i.e., how many different initial conditions converge to each attractor).
Once you have finished a given sample, choose another adjacency network (for the given N and K) and assignment of logic tables and then again carry out the exercise with a new random sample of initial states. In this way, you are sampling also different realizations of connection structure and input-output functions, so that your final result does not depend on a fluke that you happened to choose a very special adjacency matrix or set of logic functions. Finally average the properties you are interested in (i, ii and iii) over at least 100 different matrices and logic function assignments (and recall, that for each of these choices you have 100 different initial states - so that you are averaging over 10^4 random realizations in all) to get the mean values of these properties for the N,K you started out with.
Do you see any qualitative difference in the properties for K=1, K=2, K=3, K=4 and K=9 ?

Second, for a given adjacency matrix and logic table assignment, one can go through each of the 1024 possible initial states in turn and ask to what state they map to after one iteration of the system dynamics. Then one can construct a directed network of 1024 nodes, each node corresponding to a state of the Boolean network and connected by an arrow to the state that it will flow to after one iteration [note that a node can have a self-loop if it maps to itself after one iteration - i.e., if it is a fixed-point attractor]. One can then see each of the properties (i), (ii) and (iii) above in terms of network properties, viz., the number of attractors is the number of disconnected fragments of the network, the length of the period is the length of the cycle (by definition, each attractor will be a cycle - a fixed point being a cycle of length 0), and the size of the basin of attraction being the number of nodes in a connected fragment that are not part of a cycle.
Do this exercise for K=1, K=2, K=3, K=4 and K=9. Do you see any qualitative difference in the network structure for these different values of K ?

This part of the assignment is to be turned in on Feb 3.

The next task of the assignment is to check in a real transcription network (such as regulonDB for the E coli) what the typical value of K is. We can then ask whether the behavior we see in a real transcription network is only a result of the values of N and K or could there be other factors at work here.

To see how network connection structure may play a role in the dynamics, in the second part of the assignment we will explore how introducing modular organization (e.g., dividing a network of 20 nodes into 4 parts each comprising 5 nodes, with connections being more likely within nodes belonging to the same part) can affect the behavior of a network even when N and K is kept fixed. More details about it after you turn in the first part of the assignment.

You are free to research the web to find out more about the properties of this cellular automata (but not allowed to copy)...