How living systems have helped in the evolution of computational problem solving
Today, we are being deluged with an enormous amount of biological data. Web sites abound offering genomes, gene expression data, transcription factor data, phylogenetic data, and more. The Genomes OnLine Database currently lists 6,423 genomes. It also lists 249 metagenomes, which are “genomes” from whole microbial communities [1]. NCBI’s Gene Expression Omnibus (GEO) houses gene expression data and currently lists 9,053 sequencing platforms, 594,152 samples, 23,949 series, and 2,720 datasets [2, 3]. Transcription factor databases tend to be species specific and contain information about the proteins and their recognition sites on DNA; databases such as TRANSFAC [4] cover hundreds of transcription factors for each of the thousands of sequenced genomes. PhylomeDB contains gene phylogenies; it currently holds 17 phylomes, 416,093 trees, 165,850 alignments, 5,262,859 proteins, 717 species, and 1,053 genomes [5]. The Tree of Life web site contains more than 10,000 pages of information about biodiversity and evolutionary history [6].
This is a great deal of data! How can we make sense of it all? Clearly, valuable information and biological insight are buried in this deluge of data, if only we could tease them out. Ideally, we would turn the raw data into knowledge and understanding. For humans working alone, the challenge is too great; computers are essential. The real challenge lies in designing computer algorithms to analyze these data sets, and new algorithms must continually be developed to explore the data in meaningful ways.
One approach that is particularly intriguing for computational biology is to model computer algorithms on biological systems. There is a satisfying symmetry to this approach for the biologist. This blog entry will talk about neural networks, but future blogs will address evolutionary computation and ant colony optimization – other computational approaches based on biological models. These three approaches will give you a good feel for how the biology can help shape the computation.
Biological neural networks, as they exist in your brain, consist of interconnected neurons that can accomplish a task – such as recognizing a tree, your spouse, a problem, or a solution. Multiple input neurons can connect to a common target, each with a different “strength” of influence on that target neuron. Similarly, multiple target neurons might themselves connect to a common downstream neuron. These neuronal circuits help to generate the output, for example: “Oh, that’s a tree!” In reality, the connections are more complex, but this is the essence – neurons connecting to other neurons to form a network.
Artificial neural networks work on the same principle [7]. For example, let’s say we want to predict protein secondary structure from the primary amino acid sequence. Methods exist that do a decent job of this using more classic approaches [8, 9]. We might choose to have 9 input neurons (sensors, perceptrons) that recognize 9 adjacent amino acids in the sequence. These might feed into three target neurons (via 9 x 3 = 27 connections). The three target neurons might, in turn, feed into a single output neuron (three more connections) that predicts the likely structure for this stretch of 9 amino acids (or for the middle 3 or 5 amino acids). The value of the output neuron can specify whether the structure is alpha helix, beta sheet, or random coil (or some other structure). The 30 connection weights are what actually determine how the input data are analyzed and what output is produced. How are those weights determined?
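To make this concrete, here is a minimal sketch of such a 9-3-1 network in Python. The numeric amino acid encoding and the thresholds that map the single output value onto three structure classes are illustrative assumptions, not taken from any published predictor.

```python
# A minimal sketch of the 9-3-1 network described above (NumPy only).
import numpy as np

# Hypothetical numeric encoding of the 20 amino acids (a real predictor
# would use a richer encoding, e.g. one-hot or physicochemical scales).
AA_SCALE = {aa: i / 19.0 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode_window(window):
    """Turn 9 adjacent residues into the 9 input neuron values."""
    return np.array([AA_SCALE[aa] for aa in window])

def predict(window, w_hidden, w_out):
    """Forward pass: 9 inputs -> 3 target neurons -> 1 output neuron."""
    x = encode_window(window)
    hidden = sigmoid(w_hidden @ x)      # 3 x 9 weights: 27 connections
    y = sigmoid(w_out @ hidden)[0]      # 1 x 3 weights: 3 more connections
    # Map the single output value onto three classes (arbitrary cutoffs).
    if y < 0.33:
        return "coil"
    elif y < 0.66:
        return "sheet"
    return "helix"

rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(3, 9))      # randomly assigned, pre-training
w_out = rng.normal(size=(1, 3))
print(predict("MKVLAAGLL", w_hidden, w_out))
```

Untrained, of course, this network’s answers are no better than chance; the 30 weights only become meaningful through training.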
The weights are determined by training the neural network program on a set of proteins whose structures are known. At first, the weights are randomly assigned, and the network is evaluated by how well it predicts the structures of those known proteins. Next, the weights are adjusted and the network is run again. If the accuracy improves, the next round of adjustments continues in the same “direction” as the previous round. If the new network is worse than the previous one, the weights are reset and the process is rerun. Many details are involved in these steps. Decisions have to be made about how to measure the accuracy of the network, how to adjust the weights, whether connections can have zero weight, how many layers the network has, how many “neurons” each layer contains, whether signals can go backwards so that a layer can alter the values of a previous layer, and so on. These choices can influence the performance of the program dramatically. In any case, the connection weights of a network evolve (nice word!) towards a more accurate predictor, improving with each iteration until an optimum (often only a local one) is reached.
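The adjust-and-keep-if-better procedure just described is essentially a stochastic hill climb over the weights. Here is one way it might look as a sketch, reusing predict() from the code above and assuming a hypothetical training set of (9-residue window, known structure) pairs:

```python
def accuracy(training_set, w_hidden, w_out):
    """Fraction of known windows whose structure is predicted correctly."""
    hits = sum(predict(win, w_hidden, w_out) == label
               for win, label in training_set)
    return hits / len(training_set)

def train(training_set, n_rounds=5000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w_hidden = rng.normal(size=(3, 9))   # weights randomly assigned at first
    w_out = rng.normal(size=(1, 3))
    best = accuracy(training_set, w_hidden, w_out)
    dh = do = None                       # last successful "direction"
    for _ in range(n_rounds):
        if dh is None:                   # pick a new random direction
            dh = rng.normal(scale=step, size=w_hidden.shape)
            do = rng.normal(scale=step, size=w_out.shape)
        new = accuracy(training_set, w_hidden + dh, w_out + do)
        if new > best:                   # better: keep it, reuse direction
            w_hidden, w_out, best = w_hidden + dh, w_out + do, new
        else:                            # worse: reset and try another
            dh = do = None
    return w_hidden, w_out, best
```

This is only one of many possible training schemes; most modern networks are trained by gradient descent with backpropagation rather than by random perturbation, but the keep-what-improves logic is the same idea.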
Once the network is trained (that is, the connection weights no longer change significantly from one iteration to the next), it can be used to predict the structures of proteins whose structures are unknown, and those predictions can then be tested in the lab. Note that because the initial weights are chosen randomly, each time a neural network is trained on a set of data it may produce a different result. Thus, it is usually advisable to train the network multiple times to improve the odds of obtaining a well-trained network.
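Following that advice, a simple hedge against an unlucky random start is to train several networks from different seeds and keep the one that scores best:

```python
def train_with_restarts(training_set, n_restarts=10):
    """Train from several random starting points; keep the best network."""
    runs = [train(training_set, seed=s) for s in range(n_restarts)]
    return max(runs, key=lambda run: run[2])   # run[2] is the accuracy
```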
It is possible to generate a number of different trained neural networks that perform equally well despite having dramatically different connection weights. Therein lies a weakness of the approach. Even if a neural network were perfect (100% accurate), it might not be possible to learn what is actually being modeled just by looking at the weights and connections. Basically, the neural network is a black box that takes some input and generates an output. Many feel that a black box that works is better than no method at all. If the neural network is successful, that tells us the input data are sufficient to predict the output (even if we do not understand how the prediction works). With that understanding, it should be possible to design a new algorithm, with a precisely defined model, that can do what the neural network was able to do. That would count as progress.
Neural nets have been developed to address a number of different biological problems. Protein secondary structure prediction (as in our simple example) has indeed been tackled with neural networks [10, 11], and the results improve on the more traditional methods [8, 9]. GRAIL is a neural network developed to predict genes in genomic DNA sequences [12]. Other biological applications of artificial neural networks include modeling D1-like and D2-like dopamine receptors [13], designing bioactive proteins [14], predicting the antigenic activity of the hepatitis C virus NS3 protein [15], and inferring the rules of E. coli translational efficiency [16].
What artificial neural network solutions have in common is that they mimic a biological system to guide the computing. Principles that work in biological systems have been applied to computing systems with quite a bit of success. Survival of the fittest seems to work for computing as well as for biology. It is easy to imagine a positive feedback loop in which biologically inspired computation helps us understand the actual biology. This in turn will lead to more detailed and sophisticated biological models that can then be applied to redesigned and improved algorithms… It will be fun to see how this all evolves ;-).
References
1. Markowitz, V.M., N.N. Ivanova, E. Szeto, K. Palaniappan, K. Chu, D. Dalevi, I.M. Chen, Y. Grechkin, I. Dubchak, I. Anderson, A. Lykidis, K. Mavromatis, P. Hugenholtz, and N.C. Kyrpides, IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Research, 2008. 36(Database issue):D534-538 http://www.ncbi.nlm.nih.gov/pubmed/17932063.
2. Barrett, T., D.B. Troup, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, K.A. Marshall, K.H. Phillippy, P.M. Sherman, R.N. Muertter, M. Holko, O. Ayanbule, A. Yefanov, and A. Soboleva, NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Research, 2011. 39(Database issue):D1005-1010 http://www.ncbi.nlm.nih.gov/pubmed/21097893.
3. Sayers, E.W., T. Barrett, D.A. Benson, E. Bolton, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. DiCuccio, S. Federhen, M. Feolo, I.M. Fingerman, L.Y. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D.J. Lipman, Z. Lu, T.L. Madden, T. Madej, D.R. Maglott, A. Marchler-Bauer, V. Miller, I. Mizrachi, J. Ostell, A. Panchenko, L. Phan, K.D. Pruitt, G.D. Schuler, E. Sequeira, S.T. Sherry, M. Shumway, K. Sirotkin, D. Slotta, A. Souvorov, G. Starchenko, T.A. Tatusova, L. Wagner, Y. Wang, W.J. Wilbur, E. Yaschenko, and J. Ye, Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 2011. 39(Database issue):D38-51 http://www.ncbi.nlm.nih.gov/pubmed/21097890.
4. Matys, V., E. Fricke, R. Geffers, E. Gossling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A.E. Kel, O.V. Kel-Margoulis, D.U. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. Munch, I. Reuter, S. Rotert, H. Saxel, M. Scheer, S. Thiele, and E. Wingender, TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 2003. 31(1):374-378 http://www.ncbi.nlm.nih.gov/pubmed/12520026.
5. Huerta-Cepas, J., S. Capella-Gutierrez, L.P. Pryszcz, I. Denisov, D. Kormes, M. Marcet-Houben, and T. Gabaldon, PhylomeDB v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions. Nucleic Acids Research, 2011. 39(Database issue):D556-560 http://www.ncbi.nlm.nih.gov/pubmed/21075798.
6. Maddison, D.R., K.-S. Schulz, and W.P. Maddison, The Tree of Life Web Project. Zootaxa, 2007. 1668:19-40 http://www.mapress.com/zootaxa/2007f/zt01668p040.pdf.
7. Hopfield, J.J., Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 1982. 79(8):2554-2558 http://www.ncbi.nlm.nih.gov/pubmed/6953413.
8. Chou, P.Y. and G.D. Fasman, Prediction of the secondary structure of proteins from their amino acid sequence. Advances in Enzymology and Related Areas of Molecular Biology, 1978. 47:45-148 http://www.ncbi.nlm.nih.gov/pubmed/364941.
9. Garnier, J., D.J. Osguthorpe, and B. Robson, Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. Journal of Molecular Biology, 1978. 120(1):97-120 http://www.ncbi.nlm.nih.gov/pubmed/642007.
10. Guermeur, Y., C. Geourjon, P. Gallinari, and G. Deleage, Improved performance in protein secondary structure prediction by inhomogeneous score combination. Bioinformatics, 1999. 15(5):413-421 http://www.ncbi.nlm.nih.gov/pubmed/10366661.
11. Cai, Y.D., X.J. Liu, and K.C. Chou, Prediction of protein secondary structure content by artificial neural network. Journal of Computational Chemistry, 2003. 24(6):727-731 http://www.ncbi.nlm.nih.gov/pubmed/12666164.
12. Uberbacher, E.C., D. Hyatt, and M. Shah, GrailEXP and Genome Analysis Pipeline for genome annotation. Current Protocols in Bioinformatics, 2004. Chapter 4: Unit 4.9 http://www.ncbi.nlm.nih.gov/pubmed/18428726.
13. Karolidis, D.A., S. Agatonovic-Kustrin, and D.W. Morton, Artificial neural network (ANN) based modelling for D1 like and D2 like dopamine receptor affinity and selectivity. Medicinal Chemistry, 2010. 6(5):259-270 http://www.ncbi.nlm.nih.gov/pubmed/20977414.
14. Huang, R.B., Q.S. Du, Y.T. Wei, Z.W. Pang, H. Wei, and K.C. Chou, Physics and chemistry-driven artificial neural network for predicting bioactivity of peptides and proteins and their design. Journal of Theoretical Biology, 2009. 256(3):428-435 http://www.ncbi.nlm.nih.gov/pubmed/18835398.
15. Lara, J., R.M. Wohlhueter, Z. Dimitrova, and Y.E. Khudyakov, Artificial neural network for prediction of antigenic activity for a major conformational epitope in the hepatitis C virus NS3 protein. Bioinformatics, 2008. 24(17):1858-1864 http://www.ncbi.nlm.nih.gov/pubmed/18628290.
16. Mori, K., R. Saito, S. Kikuchi, and M. Tomita, Inferring rules of Escherichia coli translational efficiency using an artificial neural network. Biosystems, 2007. 90(2):414-420 http://www.ncbi.nlm.nih.gov/pubmed/17150301.