Network Topology and Parameter Inference Challenge

Important Note

The data of the challenges cannot be used for publication without the explicit permission of the data producers and the DREAM organization. Feel free to contact us about this.  

 

 

Please go to bottom of the page for challenge solutions


Synopsis

Participants are asked to develop and/or apply optimization methods, including the selection of the most informative experiments, to accurately estimate parameters and predict outcomes of perturbations in Systems Biology models, given:

- In Model 1 the complete structure of the model (including expressions for the kinetic rate laws) for a gene regulatory network composed of 9 genes.Protein and mRNA are explicitly modeled.

- In Model 2 an incomplete structure of the model, with missing regulatory links, for a gene regulatory network composed of 11 genes. Here, participants will also have to find the missing links. Only proteins are explicitly modeled

Background

 Much of the recent focus on reverse engineering of gene networks has been on the identification of causal interaction topologies between genes (1, 2). Indeed that was the focus of several challenges in past editions of DREAM (3, 4). Such interactions are usually inferred from high-dimensional gene expression experiments involving different perturbations to the network of interest. Once network interaction topologies are characterized with some reasonable level of confidence, how can we decide between models having very similar topologies and how do we characterize the actual kinetics of these networks in a way that accurately reflects the causal relationships implied in the proposed topology?

In addressing these questions, two key areas that require attention are the tasks of estimating model parameters from data in the case of a known (assumed) model structure, and designing the most informative experiments for inference of parameters and correct network topology. These reverse engineering tasks provide the area of focus for the current challenge, namely the  Network TopologyandParameterInference Challenge.

 The desired outcome for the  Network TopologyandParameterInference Challenge  is the development, improvement and application of optimization methods (for model parameter estimation) and experimental design that are particularly well-suited for models of complex (highly-parameterized) systems and for which data are limited, often quite noisy and experiments need to be designed iteratively with the model building process. Despite the fact that these conditions are commonly encountered in biological systems modeling, the problem of parameter estimation and iterative experimental design remains one of the hardest challenges in systems biology (5-8).

The Challenge

In model 1, participating teams are provided the full regulatory interaction network topology for a gene network with 9 genes. Since mRNA and protein are explicitly modeled only for this model, there are 18 relevant variables involved for model 1.

In model 2, participating teams are provided with an incomplete regulatory interaction network topology for a gene network with 11 genes. As in model 2 only protein is explicitly modeled, there are only 11 relevant variables.

 

Description of gene network models 

The models provided for this challenge assume linear kinetics for mRNA degradation and protein synthesis (translation) and degradation. In most cases, mRNA synthesis (transcription) is modeled as a nonlinear (Hill-type) function of 1 or 2 regulatory inputs (i.e. transcription factors), each of which is either an activating or an inhibitory input. In the case where there is no regulatory input to a gene, a basal (constant) rate of transcription is assumed. The detailed functional forms of these kinetic laws are specified below.

Data Used for Parameter Estimation 

In addition to network topologies and mathematical models, a limited amount of time-course data is initially provided (start-up data). These data can be used in the early stages of parameter estimation. However, to reflect conditions that exist in actual scientific practice, teams also have the opportunity to ‘purchase’ additional experimental data sets from a menu of perturbations and experimental modalities of their choice.

The data sets to be used as input for the parameter estimation have been generated (by simulation) in reponse to different network perturbations that include gene deletion, siRNA-mediated knockdown and change of RBS (ribosomal binding site) activity. Refer to the section “Allotted Budget and Price List for Perturbation and Measurement Experiments” for a listing of available experiments and their costs. For each of these types of perturbation experiments, teams may purchase data for either mRNA (for all genes) or protein abundance (for 2 proteins of their choice). Prices vary for each of these, reflecting the relative ease or difficulty of acquiring this type of data in reality.

Parameters to be Estimated 

All mRNA degradation rate constants in model 1 have been fixed to a value of 1 and therefore will not be estimated by participating groups. The unit of time is normalized with the inverse of the mRNA degradation rate, and therefore time is non-dimensional, and measured in units of the mRNA half-life. In model 2 mRNA is NOT taken into consideration. For both models, the protein degradation value is unknown and must be estimated from data as part of the challenge. The values of all protein degradation rate constants in model 1 are identical, but this is NOT the case in model 2 and degradation constants vary for ALL proteins. All other model parameters will likewise be estimated from data chosen by the challenge participants. For each regulation process (activation and repression), there are two parameters to be estimated: Kd and h (to be discussed below). In model 1, for each protein production process, there are two parameters that would need to be estimated: the promoter strength and the ribosomal binding site strength. The transcription rate for a given gene is assumed to be proportional to promoter strength of the corresponding promoter and the translation rate is assumed to be proportional to the ribosomal binding site strength.

In model 2, for each protein production process only the promoter strength has to be estimated and the protein production rate for a given gene is assumed to be proportional to promoter strength of the corresponding promoter. Also for this model, the 3 missing links in the gene network topology will have to be indicated as well as their associated parameters (Kd and h).

 

Allotted Budget and Price List for Perturbation and Measurement Experiments 

The allotted budget for each team will be 10,000 credits per model. The cost breakdown for available perturbations and experiments is summarized in Tables 1 to 4. All time-course data will be provided in concentration units (nM) and time units are dimensionless.

Perturbation Experiments 

When requesting data, teams will specify the type of perturbation and the identity of the gene to be targeted. For model 1, three types of perturbation experiments are possible as summarized in Table 1. In all cases, only a single gene can be targeted at a time for perturbation.  The deletion of a gene results in a complete elimination of both mRNA and protein production for the targeted gene. In the case of siRNA-mediated knockdown, mRNA degradation is increased by 10 fold resulting in a decrease in both mRNA and protein concentrations. Finally, change of RBS activity results in modification of protein expression for the targeted gene without affecting mRNA concentration.

 Table 1. Type and cost of the perturbation experiments available for experimental designs for model 1

   Perturbation Experiment

Cost

   Gene deletion to eliminate both mRNA and protein production for a specified gene

800

    mRNA knockdown using siRNA to achieve a 10-fold increase in mRNA degradation

350

   Decrease of RBS activity by 10-fold to decrease translation rate

450

 For model 2, three types of perturbation experiments are possible as summarized in Table 2. In all cases, only a single gene can be targeted at a time for perturbation.  The deletion of a gene results in a complete elimination of protein production for the targeted gene. Finally, change of promoter activity, upregulation and downregulation results in modification of protein expression for the targeted gene.

 Table 2. Type and cost of the perturbation experiments available for experimental designs for model 2

   Perturbation Experiment

Cost

   Gene deletion to eliminate protein production for a specified gene

800

 Increase of promoter activity by 10 times to increase translation rate

350

  Decrease of promoter activity by 50% (i.e. half) to decrease translation rate

450

 

Computational Experiments 

In addition to specifying a perturbation type (including the target gene), teams will be asked to specify what is measured in response to the perturbation.

For model 1, time courses of all mRNA concentrations may be obtained by a microarray experiment at two temporal sampling resolutions. In the low resolution sampling rate, data was recorded every 4x time units. For the higher resolution, the sampling rate was twice as fast (i.e., every 2x time units).

For model 2, time courses of all protein concentrations may be obtained by a mass spectrometry experimentat two temporal sampling resolutions. In the low resolution sampling rate, data was recorded every 4x time units. For the higher resolution, the sampling rate was twice as fast (i.e., every 2x time units).

In both models, the highest-resolution of one point every x time units (the time unit x=0.5), may be obtained for protein concentrations time traces (as would be measured from a fluorescence microscope experiment), although this data type is limited to only 2 proteins at a time (to be specified by the group requesting the data) for a given experiment.

Table 3. Cost of measurement experiments. The time unit x = 0.5.

   Measurement Experiment

Output

Cost

Microarray (model 1)

Time-courses of all mRNAs

500 (low resolution: 1 data point every 4x time units)
1000 (higher resolution: 1 data point every 2x time units)

Mass Spectrometry array (model 2)

Time-courses of all proteins

500 (1 data point every 4x time units)
1000 (higher resolution: 1 data point every 2x time units)

Fluorescent protein fusion (only 2 proteins available at a time)

Time-courses of protein abundance for 2 chosen proteins

400 (highest resolution: 1 data point every x time units)

 

Determination of Kdand h 

There is also the option for both models to buy estimated values of the dissociation constant (Kd) and Hill coefficient (h) for a specified protein (transcription factor), as would be obtained from a gel-shift experiment. This is more expensive and can only be measured for a single transcription factor per experiment.

Table 4. Experiment for the direct determination of some of the model parameters and its cost

   Experiment

Output

Cost

   Gel-shift assay

Binding affinity (Kd) and Hill coefficient (h) of any one transcription factor at a time

1600

 

Here is the link for the purchase data system for model 1.

Here is the link for the purchase data system for model 2. (Note that p4 and p5 are controlled by two different promoter sites,perturbation experiments can be made in sites pro3 and pro4 for p4 and are named in the credit system respectively 4a and 4bsites pro5 and pro6 for p5 are namedin the credit systemrespectively 5a and 5b)

 

Models and Data

For model 1 full model topologies are provided in graphical form as shown in Fig. 1.

image

Figure 1A. Model 1 whose parameters and response to perturbations are requested as part of this challenge.

image

Figure 1B. key to the symbols and notations used in Fig. 1A and in the  models of this challenge. Model 2 does not contain RBS sites as mRNA is not considered.

 Variables are labeled according to their type. For example, variables in the model representing protein concentrations are labeled as p1, p2, ... p9.  The meaning of each symbol is detailed in the key to the symbols (Fig. 1B), which also lists the prefix that is used to name the model variables. Each line connecting a protein coding sequence to a protein represents a protein production process and is labeled with the prefix "pp". In model 1, protein production consists of two steps: transcription and translation. For simplicity, these two steps, shown in Fig. 2, are not explicitly shown in the diagram of Fig. 1A. The variable names for the mRNA resulting from the transcription of a coding sequence will contain the "v" prefix. For example, the mRNA corresponding to the v3 will be labeled v3_mrna.

image

Figure 2. Detailed representation of the protein production process.  

The same transcription factor can regulate the transcription of one or more genes (e.g., p6 regulates g1, g7 and g9 through repressor sites rs1b, and activator sites as7 and as9 respectively, see Fig. 1). In these situations, the transcriptional control of a transcription factor on different genes could be different if the corresponding parameters are different (e.d., rs3 has parameters r7 and as3 has parameters r5, see Fig 1). The translation rate can also be different if the ribosomal binding site strengths differ.

The transcription process is modeled using a single rate equation. The rate equation is expressed as a sum of the transcriptional activity ( as) of all the activators in that promoter region multiplied by the product of the transcriptional activity ( rs ) of all the repressor binding sites in the same region. For example the transcription rate of v4_mrna in Fig. 3 will be modeled as  pro4_strength * as4 * rs2 , where  

image

image

 Figure 3. Example of a case of regulation of the transcription of coding sequence g4. 

Also referring to the example of Figure 3, the rate of production of protein "p4" is given by linear rate equation rbs4_strength * v4_mrna.If there are two transcription factor binding sites that are activating, then a sum is used rather than a product. For example, in the diagram of Fig. 4, the transcription rate of v3_mrna is modeled as pro3_strength *(as1 + as3).

image

 Figure 4. An instance in which two activation sites regulate the transcription of a coding sequence.

For model 2, the transcription process is neglected assuming that the lifetime of the mRNA is much shorter than that of protein such that in the timescale of protein lifetime, the mRNA dynamics are already in equilibrium. The protein synthesis is modeled using a rate equation, which takes an identical form to that of the transcription process in Model 1 as shown for as4 and rs2, but for this model, rbs sites are absent (see Fig.5). In addition, we allow an identical gene to be placed under different promoters. In this case, the repression does not have to be expressed as a product form. For example, in Model 2, p5 is expressed under pro6 as well as under pro5. The protein synthesis rate of p5 will be the sum of the two promoter activities. We also allowed a case of post-translation regulation for p3 and let it be transformed into p4.

image

Figure 5. Model 2 whose parameters and completed network topology are requested as part of this challenge.

 

Models 

In addition to image files for the networks (in ‘.jpg’ format), participants can download different mathematical description for the models in four formats: SBML or Systems Biology Markup Language (extension ‘.sbml’), Matlab (extension ‘.m’), Copasi (extension ‘.cps’) (9) and Jarnac (extension ‘.jan’) (10). We compared different equation solvers and computed their relative errors. Copasi, Roadrunner and Matlab gave  similar results while SBMLSim and Jarnac have a small systematic bias not larger than 1%. The corresponding files are named as:

  • model<i>.<ext>

where <i> = 1,2 and <ext> = m, sbml, cps, jan, txt or png.

TinkerCell was used to design the networks and to generate the network images. It is available for free download here: http://www.tinkercell.com/downloads-2. Antimony scripts may be opened, edited and converted to SBML using an Antimony editor available for download here: http://antimony.sourceforge.net/main.html). Also available for free is the Systems Biology Workbench (SBW), which may be downloaded from the following location: (http://sourceforge.net/projects/jdesigner/files/Systems%20Biology%20Workbench/2.8.1/). Using a translator tool provided by SBW, SBML files may be converted to other formats (including C/C++, C#, Java and Mathematica). Simply open the SBML file in the translator tool, select the desired output format and save the translated model description. Note that SBML files can be loaded into any compatible tool such as COPASI, CellDesigner, SBML for Matlab, etc. SBML files can also be converted to human readable reports (latex files, pdf, etc.) using SBML2LATEX, a conversion tool available online at http://webservices.cs.uni-tuebingen.de/. The resulting files contain, among other things, the list of parameters, the kinetic rules and the basic ODEs.

Data

Initially all participants will receive a start-up data with the time course data with the dynamics of all the mRNA species for each model. The corresponding simulations initial conditions were set as follows: for model 1 at t=0, all the mRNA species are set to 0 and all the protein concentrations are set to 1, and for model 2  p1,p2,p4,p5 are set to 1 and proteins p3,p6,p7,p8,p9,p10, p11 are set to 0. The data will be contained in the file:

  • wildtype_lowresmRNA1.tab

 

  • wildtype_lowresPROT2.tab

 

respectively referring to model 1 and model 2.

A web interface permits easy access to data. Users need to specify the type of perturbation to be performed in their model, and the type of data they wish to acquire (RNA microarray and/or protein concentration). Credits will automatically be deducted from the participating team’s account according to the cost break-down summarized in the previous section (“Allotted Budget and Price List for Perturbation and Measurement Experiments”). The names of the resulting files will follow the nomenclature of Table 5 for model 1 and Table 6 for model 2. For example, if a team requests the perturbation “deletion of gene 3” and the data type “microarray” in high sampling rate, the system will offer the user to download the file “array_del_3_high.tab”.  The gel-shift experiments will provide the values for the parameters Kd and h for the indicated activation and repressor sites.

Table 5. File Names for the different perturbations and measurement types that can be acquired for model1 in this challenge

Filename extension for all files =  .tab

      Deletion of gene g    

Downregulation of gene g

Downregulation of protein p 

Wild Type

Microarray

array_mod_1

_del_pro_<res>

array_mod_1_dwn_v_i_mrna

_<res>

array_mod_1_dwn_rbs

_<res>

array_mod_1_wildtype

_<res>

Fluorescence of proteins j and k

pj_pk_mod_1

_del_pro

pj_pk_mod_1_dwn_v_i_mrna

pj_pk__mod_1_dwn_rbs

pj_pk_mod_1_wildtype

Gel-shift assay for activation site as_j    or repressor site rs_j

N/A

N/A

N/A

model_1_asj*

or

model_1_rsj*

 

*note that rs1 is different as it has two options rs1a and rs1b

Table 6. File Names for the different perturbations and measurement types that can be acquired for model 2 In this challenge

Filename extension for all files =  .tab

      Deletion of gene g    

Upregulation of gene g 

Downregulation of gene g 

Wild Type

Mass Spectrometry experiments

array_mod_2

_del_pro_<res>

array_mod_2_over_pro 

_<res>

array_mod_2_dwn_pro 

_<res>

array_mod_2_wildtype

 _<res>

Fluorescence of proteins j and k

pj_pk_mod_2

_del_pro

pj_pk__mod_2_over_pro

pj_pk__mod_2_dwn_pro

pj_pk_mod_2_wildtype

Gel-shift assay for activation site as_j    or repressor site rs_j

N/A

N/A

N/A

model_2_asj

or

model_2_rsj


 

Formats, Initial conditions and noise 

The microarray files for model 1 (named ‘array_*.tab’) are tab delimited files containing the time traces of the mRNA abundances for all the transcripts in a given model, at the times indicated in the first column. The mass spectrometry files for model 2 (named ‘array_*.tab’) are tab delimited files containing the time traces of the protein abundances for all the transcripts in a given model, at the times indicated in the first column.The protein-pair files (named ‘pj_pk_*.tab’) are tab delimited files containing the time traces of the proteins pj and pk abundances in a given model, at the times indicated in the first column. In both data modalities, the initial conditions for the time traces are: at t=0, all the mRNA species are set to 0 and all the protein concentrations are set to 1. The initial conditions are the same even in the case of perturbations to the system. For example, if gene 3 is deleted in model 1, its corresponding protein p3 will still have an initial condition of p3=1 at t=0.

In model 2  at t=0 proteins p1,p2,p4,p5 are set to 1 and proteins p3,p6,p7,p8,p9,p10, p11 are set to 0.

Also in both data modalities, a noisy measurement is simulated by adding some noise to the deterministic value of each variable. More precisely, if v is the simulated value, we report as measured the value

 vnoisy = max[0,v + 0.1*g1 + C*g2*v] ,

where g1 and g2 are Gaussian random variables with standard deviation of 1, and C= 0.2 for both the mRNA measurements and for the protein measurements. That is, for small v the stdev of the noise is close to 0.1, while for large value of v, the noisy v has a coefficient of variation C of 20% in both the measurements of proteins and mRNA. Note that if value after noise addition is smaller than zero, the value of vnoisy  is clipped at 0.

Submission of Predictions and Writeup

Participants are required to submit a writeup file indicating the methods used to solve the challenge, and 4 additional files corresponding to 2 files per model: In model 1, one for the parameters and the other for the predicted time courses. In model 2, one for the parameters and the other for the inferred gene network topology.

 

For model 1 participating teams should submit their predictions of the time course of the abundance of three proteins (different for each model) resulting from different perturbations, according to Table 7.  The time course should use the following initial conditions: at t=0, all the mRNA species are set to 0 and all the protein concentrations are set to 1. 

Table 7

Description of Simultaneous Perturbations for model1

Time-courses to predict

decrease r9_Kd by 10x (i.e. divide by 10)

increase rbs3 strength by 2x (i.e. multiply by 2)

increase rbs5 strength by 10x (i.e. multiply by 10)

 p3, p5 and p8

For  model 1, submit your time course predictions completing the provided template file

  • DREAM7_NeTParInf_TimeCourse_model_1_<TeamName>.txt

where <TeamName> is the name of the team with which you registered for the Challenge. This file is a \tab separated four column file, with a header in the first row indicating the nature of the information of each column. The first column lists the times at which we are requesting the predictions, while the other 3 columns contain the proteins for which we are requesting the abundance. The template file contains the word 'PREDICT' in all the rows for the second to fourth column. At submission, replace 'PREDICT' with your prediction, and save as a \tab separated '.txt' two column file keeping the header.

For each model, please submit your inferred parameters completing the provided template file

  • DREAM7_NeTParInf_Parameters_model_<i>_<TeamName>.txt

where <i> = 1,2 and <TeamName> is the name of the team with which you registered for the Challenge. This file is a \tab separated two column file, which lists all the parameters that you have to predict in the first column, and contains the word 'PREDICT' in the second column. At submission, replace 'PREDICT' with your prediction, and save as a \tab separated '.txt' two column file with no header. The parameters (Kd and h) for the 3 missing links in the gene network topology for model 2 will have to be completed in this file and will correspond to r9_Kd, r9_h, r10_Kd,r10_h, r12_Kd, r12_h. Finally the missing link r10 also adds an extra promoter that has to be inferred and will be named pi_synthesis_rate_2 where i indicates the regulated gene and takes any value from 1 to 11.

For  model 2, submit your predictions regarding r9,r10 and r12, the 3 missing network links completing the  provided template file and following the order (i.e first r9, then r10 and then r12). A single link can regulate up to TWO genes, and a gene can have no more than TWO regulators (activators and repressors) in total. If you were to submit for example in model 2, r11 link which corresponds to gene5 positively regulating both gene7 and  gene11, then you should write:

5

+

7

+

11

 

 

 

Also as in link r5, gene6 negatively regulates only gene2, we have:

6

-

2

+

0

 

 

 

 

  • DREAM7_NeTParInf_NetworkTopo_model_2_<TeamName>.txt

 

Write-up 

Finally we request that each participating team submits a short write-up (around two to three pages) explaining the methods used to arrive at their predictions. This write-up can contain pseudo-code describing the algorithm used, workflows for coming to the prediction of the isoforms, etc. Submit the write-up as the file

  • DREAM7_NeTParInf_Writeup_<TeamName>.ext

replacing <TeamName>  with the name of your team and the file extension (ext) with your choice of doc or docx. The submission of this writeup is mandatory for participation in this challenge.

Scoring Metrics

The Estimation of Model Parameters Challenge will be judged based on the accuracy of model parameter estimates and the accuracy of model predictions.

As there is no standard or obvious way of scoring in which it is optimal to spend some but not all the credits, scoring in this challenge will not take into account the amount of credits spent. Participants are encouraged to spend the whole budget.

The overall metric for evaluation of predictions will be done by combining scores computed independently for each model. The scoring metric will be based on the sum of the squares of the log ratios between inferred and known parameters and the sum of the squares of the differences between predicted and simulated protein abundance.

The precise definition of the scoring system for this challenge can be found in this document.

For suggestions about how to score or anything else about this challenge, please contact the organizers directly, or post a comment in the discussion forum.


Timeline

Deadline for submission is 5PM EST October 1st, 2012. Results will be announced no later than October 21st.


Credits

The challenge was conceived by Herbert Sauro, Pablo Meyer, Julio Saez Rodriguez and Gustavo Stolovitzky, the modelling was done by Deepak Chandran and Kyung Hyuk Kim, who also curated the challenge along with Thomas Cokelaer, Julio Saez Rodriguez, and Pablo Meyer. 

Notes

References

1. De Smet, R., and Marchal, K. 2010. Advantages and limitations of current network inference methods. Nat Rev Microbiol 8:717-729.

2. Marbach, D., Prill, R.J., Schaffter, T., Mattiussi, C., Floreano, D., and Stolovitzky, G. 2010. Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci U S A 107:6286-6291.

3. Stolovitzky, G., Monroe, D., and Califano, A. 2007. Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference. Ann N Y Acad Sci 1115:1-22.

4. Prill, R.J., Marbach, D., Saez-Rodriguez, J., Sorger, P.K., Alexopoulos, L.G., Xue, X., Clarke, N.D., Altan-Bonnet, G., and Stolovitzky, G. 2010. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 5:e9202.

5. Banga, J.R. 2008. Optimization in computational systems biology. BMC Syst Biol 2:47.

6. Sun, J., Garibaldi, J.M., and Hodgman, C. 2011. Parameter Estimation Using Meta-Heuristics in Systems Biology: A Comprehensive Review. IEEE/ACM Trans Comput Biol Bioinform.

7. Fernandez Slezak, D., Suarez, C., Cecchi, G.A., Marshall, G., and Stolovitzky, G. 2010. When the optimal is not the best: parameter estimation in complex biological models. PLoS One 5:e13283.

8. Kreutz, C., and Timmer, J. 2009. Systems biology: experimental design. FEBS J 276:923-942.

9. Chandran, D., Bergmann, F.T., and Sauro, H.M. 2009. TinkerCell: modular CAD tool for synthetic biology. J Biol Eng 3:19.

10. Smith, L.P., Bergmann, F.T., Chandran, D., and Sauro, H.M. 2009. Antimony: a modular model definition language. Bioinformatics 25:2452-2454.

Download

The link below will allow you to download a zipped file the wild type data for the initial process of parameter optimization, the model files and the submission template files. 

Download file (for login user):