Although drug design is not usually considered molecular nanotechnology, this is a misconception that presumably started because the earliest nanotechnology work focused on systems analogous to macroscopic machines. Drugs are generally small molecules. It is known that some drugs fit precisely into receptor sites to block molecular processes in the body. This must be accomplished without fitting the receptor sites of the body’s healthy molecular machinery. Furthermore, drug molecules must survive in the body long enough to be effective. Early drug discovery was accomplished without understanding these mechanisms, but modern drug designers often consciously create molecules with atomic precision to bind well to receptor sites. This is atomically precise, three-dimensional control of biological devices; i.e., molecular nanotechnology.
One approach to drug design is to find molecules similar to good drugs that have fewer negative side effects. Ideally, a candidate replacement drug is sufficiently similar to have the same beneficial effect but is different enough to avoid the side effects. In any case, to use genetic graphs for similarity-based drug discovery we need a good similarity measure that can score any molecule. [Carhart 85] defined such a similarity measure, all-atom-pairs-shortest-path, and searched a large database for molecules similar to diazepam.
Note that the problem we are solving here involves constructing molecular graphs, rather than examining static graphs. Many classic graph theoretical problems, graph coloring, finding Hamilton circuits, graph isomorphism, and maximal subgraphs discovery, have been studied in the search literature. For example, see [Cheeseman 91]. All of these problems investigate one or more static graphs. Our problem is to find a graph representing a molecule with desirable properties. Thus, the characteristics of our search space is quite different from the classic problems in the literature. There is little or no reason to believe that crossover would be useful for these classic graph problems. It is possible to view the molecular design problem as a search of the space of all molecules. This space can be represented as a graph (see below). However, rather than determining some property of the graph representing the search space we simply look for one or more points in the space that possess desirable properties. For example, molecules that might make good pharmaceutical drugs.
Genetic programming [Koza 92] uses trees to represent individuals. This is particularly useful for representing computer programs. For example, a tree node representing assignment has two child-nodes, one representing a variable and the other representing a value. The crossover operator exchanges randomly selected sub-trees between two parent-trees. Trees may be viewed as graphs without cycles. Many molecules contain cycles, which chemists call rings. Therefore, any attempt to use genetic programming to design molecules must have a mechanism to evolve cycles. This is non-trivial when crossover can replace any sub-tree with some other random sub-tree. After much thought we were unable to devise a crossover-friendly tree representation of arbitrary cyclic graphs. Crossover-friendly means that any sub-tree is a potential crossover point without restriction. Figure 1 depicts crossover using strings (genetic algorithms), trees (genetic programming) and graphs (genetic graphs) using our crossover technique.
Figure 1: Comparison of crossover operators. The interface
between different font or line thickness indicates the crossover point.
A single crossover point is adequate to divide strings (genetic algorithms)
and trees (genetic programming) into two fragments. Graphs (genetic graphs)
containing cycles require more than one crossover point to divide the system
into two fragments. Furthermore, when two graph fragments with different
numbers of crossover points must be mated, it is possible to create new
edges to satisfy the excess crossover points on one fragment.
Genetic software techniques have been used for molecular design in the past. There is a patent covering genetic graphs for molecular design [Weininger 95]. The patent describes the straightforward and fairly obvious parts of mapping standard genetic algorithm techniques to molecular design and the non-obvious portions: the crossover algorithm and fitness functions. The crossover algorithm described in the patent uses two parameters: the digestion rate which breaks bonds, and the dominance rate which apparently controls how many parts of each parent appear in descendants. This algorithm may produce fragments rather than completely connected molecules. Our paper describes a crossover algorithm that always produces connected molecules and has no parameters. This crossover algorithm is the heart of our genetic method. Fitness functions are clearly non-obvious, but must usually be custom designed for each application. Our fitness function and those used in [Weininger 95] both use the Tanimoto index as a distance measure. [Weininger 95] describes a number of fitness functions. We have used all-pairs-shortest-path in most of our work. Daylight Chemical Information Systems, Inc., which holds the patent, reports using genetic techniques to discover lead compounds for pharmaceutical drug development and other commercial successes.
[Nachbar 98] used genetic programming to evolve molecules for drug design by sidestepping the crossover/cycles problem. Each tree node represented an atom with a bond to the parent-node atom and each child-node atom. Hydrogen atoms were explicitly represented and are always leaf nodes. Rings were represented by numbering certain atoms and allowing a reference to that number to be a leaf node. Crossover was constrained not to break or form rings. Ring evolution was enabled by specific ring opening and closing mutation operators.
In a personal communication, Astro Teller reported developing a graph crossover algorithm as part of his dissertation at Carnegie Mellon University. This technique was applied in Neural Programming, a system developed by Teller that combines neural nets and genetic programming. At the time this paper was written, the details of Teller's algorithm were not available in the literature.
Each individual in the initial population is generated by choosing a random number of atoms between half and twice the number of atoms in the target molecule. Atoms are randomly chosen from the elements present in the target molecule. Bonds are then added at random to construct a spanning tree; i.e., at this point all atoms are connected into a single molecule. Then a random number of additional bonds are added to create rings. The type of each bond is selected at random from the set of bond types present in the target molecule. The number of rings is chosen to be between half and twice the number of rings in the target molecule. The number of rings, by our definition, is always = bonds – atoms + 1. For this definition, single, double, and triple bonds are counted as one bond. In addition, an unambiguous definition of the number of rings [Corey 69] is used. For this definition:
For this work, tournament selection was used to choose parents in a steady state genetic system. Tournament selection means that each parent is chosen by comparing two randomly chosen individuals and taking the best. Steady state means that new individuals (children) replace poor individuals in the population rather than creating a new generation. The poor individuals are also chosen by tournament, but the worst individual is selected for replacement. By convention, after population-size individuals have been replaced, we say that one generation is complete. The implementation follows this procedure:
Figure 2: butane and benzene are ripped apart at random points. Then one fragment of butane and a fragment of benzene are mated. Note that benzene must be cut in two places. Also, during mating the benzene fragment has more than one cut bond. A random choice is made to connect this extra cut bond to a random atom in the butane fragment. Alternatively, the extra cut bond could have been discarded.
A somewhat more complete but significantly more confusing explanation follows. The terms vertex and edge are used for atom and bond respectively to indicate that the algorithm may be applied to any graph structure.
The computational resources required for genetic software to find a solution is a function of the size of the search space, among other factors. The space of all possible graphs is combinatorial and enormous. For molecular design this space can be radically reduced by enforcing valence limits for each atom. Thus, a carbon atom with one double and two single bonds will not be allowed to add another bond. Also, avoiding explicit representation of hydrogen atoms substantially reduces the size of the graph and therefore the search space.
The distance measure used is the Tanimoto index. This is
where a is the candidate’s bag and b is the target’s bag. Two elements are considered identical for the purpose of the intersection and union operations if the atoms have the same extended types and the distance between them is identical. Each duplicate in the bag is considered a separate element for the purpose of intersection and union operators. This measure always returns a number between 0 and 1. We prefer fitness functions that return lower numbers for fitter individuals, so we subtract the Tanimoto index from one.
Search spaces with many local minima are difficult to search because many algorithms tend to converge to local minima and miss the global minimum. The all-pairs-shortest-path fitness function has many local minima over the space of all molecules whenever the target molecule contains rings. Consider benzene, a six membered ring, as the target molecule. Any ring other than a six membered ring will be at a local minimum because a bond must be removed, lowering the fitness in most cases, before bonds and atoms can be added to generate benzene. Thus, any target molecule containing rings will have associated search space containing many local minima. Interestingly, a small modification in the definition of the molecular search space eliminates the local minima. Consider mutations that replace an edge with a vertex and two edges and replace a vertex and two edges with one edge. These mutations can change the size of rings. If these mutations can create neighbors in the space of all molecules, then incorrectly sized rings are not local minima. While we think these additional neighbor definitions are unreasonable, they do illustrate the difficulty of understanding the space of all molecules.
The targets for our initial study were butane, benzene, cubane, purine, diazepam, morphine and cholesterol. All targets except butane contain rings and thus generate a search space with local minima. The fitness function can not only find similar molecules, which is useful in drug design, but can also lead evolution to the exact molecule used as a target. This proves that the algorithm can reach particular kinds of molecules and the number of generations to find the target provides a quantitative measure of performance. Unfortunately, our implementation of all-pairs-shortest-path is O(n3) so finding larger molecules can be quite time consuming.
Since the algorithm is stochastic, twenty runs were conducted for each target molecule. The number of generations and population size were varied in an attempt to have enough successful runs (at least 11) to calculate the median time to find the target. Once the target was found the run stopped. Runs also stopped after a fixed, maximum number of generations. A few of the best individuals were saved to see if the software produced molecules similar to the target. These may be useful for drug design.
20 runs for each molecule | Population size | Median generations to find target | Minimum generations to find target | Number of runs that failed to find target |
Maximum generations
|
Benzene |
200
|
39.5
|
2
|
8
|
1000
|
Cubane |
100
|
46.5
|
13
|
0
|
1000
|
Purine |
100
|
245.0
|
19
|
4
|
1000
|
Molecule | Population size | Generation found | Fitness function |
Morphine |
1000
|
208
|
all-pairs-shortest-path |
Diazepam |
200
|
256
|
all-pairs-shortest-path |
Cholesterol minus the two methyl groups connected to the rings |
500
|
1765
|
all-pairs-shortest-path plus number of rings |
To see if the crossover operator was better than random search, we searched
for purine under three conditions: crossover alone, generating random molecules
using the same algorithm as for the initial population (random search),
and a 50-50 mix of crossover and random search. Twenty-one runs of
1000 generations on a population of 200 were conducted in each case.
case | number of runs that found purine | median generations to find purine |
random search |
0
|
N/A
|
crossover alone |
21
|
37
|
50-50 mix of crossover and random search |
21
|
48
|
Finding moderate size molecules has proven difficult with the available computer resources. This is probably because our fitness function was O(n3), but also due to problems with the cycle-stealing batch system used to run this program on idle workstations. Most genetic software uses mutation as well as crossover. Mutation helps systems make small changes. While crossover seems to be capable of generating molecules, the performance is sufficiently poor that mutation operators might help a great deal. Other than the usual operators to add and remove atoms or bonds, it may be helpful to have a mutation operator that makes a random ring aromatic (alternating double and single bonds for certain sized rings). Generating aromatic rings with crossover alone is probably difficult (note the problems generating benzene) so a special mutation operator may be helpful.
Examining populations generated by diazepam, cholesterol, and morphine we noted that the best molecules all had the same ring structure. This did not appear to be true of the population as a whole. Thus, it is possible for the population to get stuck in local minima. Occasionally the best molecule in a population would be radically improved by a particularly fortuitous crossover, but this can take a long time. We found that a minor modification to the fitness function eliminated this difficulty in the cholesterol run. The fitness function was an equal combination of all-pairs-shortest-path and a modified Tanimoto index on the number of rings in the target molecule versus the candidate. This fitness function appears to do a better job of finding interesting analogues to the target molecule by keeping the ring structure diverse. With all-pairs-shortest-path alone, populations seem to converge on a single poor ring structure, at least for the best molecules in the population. Adding the distance between the number of rings generates more fit molecules and more diverse ring structures, at least in preliminary results.
Performance analysis demonstrates what might be expected — our O(n3) fitness function took most of the CPU time. A faster fitness function would substantially speed calculations. Some runs with faster fitness functions have been made, but these simpler fitness functions do not drive evolution to find the target exactly so comparison with the above data is difficult. Genetic software lore suggests that the fitness function is exceptionally important [Kinnear 94]. Our results bear this out.
Fortunately, the algorithm is embarrassingly parallel since many runs
are required. Also, there is significant potential for parallelism within
runs since fitness function execution on each individual is completely
independent. Furthermore, the algorithm can be easily restarted and can
afford to lose some runs. Thus, genetic graphs is a good candidate for
cycle-scavenging batch systems such as Condor [Litzkow
88]. Large genetic graph production runs can therefore
use otherwise wasted workstation and PC cycles at facilities with large
numbers of these machines. Although some of the results presented here
were simply run on workstations, all current jobs are run under Condor.
"... deception is not the only factor determining how hard a problem will be for a GA (genetic algorithm), it is not even clear what the relation is between the degree of deception in a problem and the problem's difficulty for the GA. There are a number of factors that can make a problem easy or hard, including those we described in this article: the presence of multiple conflicting solutions or partial solutions, the degree of overlap, and the amount of information from lower-order building blocks (as opposed to whether or not this information is misleading). Other factors include sample error [Grefenstette 89], the number of local optima in the landscape [Schaeffer 89], and the relative differences in fitness between disjointed desirable schemas [Mitchell 92]. Most efforts at characterizing the ease or difficulty of problems for the GA have concentrated on deception, but these other factors have to be taken into account as well.The primary contribution of this paper is the creation of a crossover operator for graphs. This allows the use of genetic algorithm techniques using graph representations. As [De Jong 90] states:"At present, the GA community lacks a coherent theory of the effects on the GA of the various sources of facilitation or difficulty in a given problem; we also lack a complete list of such sources, as well as an understanding of how to define and measure them. Until such a theory is developed, the terms 'GA-hard' and 'GA-easy' should be defined in terms of the GA's performance on the given problem rather than in terms of the a priori structure of the problem." (formatting modified to fit the conventions of this paper)
"The key point in deciding whether or not to use genetic algorithms for a particular problem centers around the question: what is the space to be searched? If that space is well-understood and contains structure that can be exploited by special-purpose search techniques, the use of genetic algorithms is generally computationally less efficient. If the space to be searched is not so well understood, and relatively unstructured, and if an effective GA representation of that space can be developed, then GAs provide a surprisingly powerful search technique for large, complex spaces." (italics added by [Forrest 93])Chemists have known for over a century that graphs are the most natural representation of molecules. Furthermore, the space of molecules is not well understood or characterized. Therefore, it is reasonable to presume that searching the space of molecules using genetic graphs will be profitable in a number of domains. We hope that our crossover operator will make a contribution.
Many fitness functions of interest require molecular conformation; i.e., xyz coordinates for each atom. For example, designing a molecule to fit in a protein receptor to inhibit the activity of a disease organism, as the new and fairly effective AIDS drugs do, require molecular conformation. To design a fitness function evolving molecular fit, the very bizarre molecules often created by crossover must have their energy minimized quickly. Most minimizers available today will not work well with without a "reasonable" start point. We are searching the literature for algorithms to minimize very "bad" molecules. Searching for molecules when conformation is important may be expected to be more difficult than the search is described in this paper. Energy minimization takes considerable CPU time, multiple conformations for molecules increase the size of the search space, and a different crossover operator that takes physical space into consideration may be necessary.
It may be possible to search for molecules using hill climbing algorithms. However, the space of all molecules has no derivatives and no orthogonal dimensions. Therefore, from each molecule one must examine all the neighbors to find the steepest hill to climb. Since even moderately sized molecules will have a huge number of neighbors in the molecular space defined above, hill climbing algorithms are expected to have great difficulties. It may be possible, however, to use hill climbing algorithms that climb any reasonable hill rather than the steepest one. This amounts to evolution using mutation where only profitable mutations are allowed to survive. We hope to test this technique in the near future.
Circuit design is another field for which genetic graphs should, in principle, be well suited. Genetic algorithms (using variable length strings) [Lohn 98] and genetic programming [Koza 97] have been used to design analog circuits. In the genetic programming case, a tree language to generate analog circuits compatible with the SPICE (Simulation Program with Integrated Circuit Emphasis) [Quarles 94] simulator was constructed and a 64 node (80MHz per node) parallel supercomputer was used to design the circuits. The system designed a lowpass filter, a crossover filter, a four-way source identification circuit, a cube root circuit, a time-optimal controller circuit, a 100 dB amplifier, a temperature-sensing circuit, and a voltage reference source circuit. Thus, genetic programming can design graph-structured systems. However, we have found it extremely difficult to create a tree language that can generate any possible graph and support crossover cleanly. Therefore, it may be advantageous to directly evolve graphs rather than trees that represent cyclic graphs.
Finally, it might be interesting to explore the relationship between genetic graphs, applied to molecules, and combinatorial chemistry. Combinatorial chemistry is a new field undergoing rapid development where hundreds or thousands of molecules and simultaneously manipulated and examined. The recombination rules in genetic graphs are not constrained by physical reality, whereas reactions in combinatorial chemistry experiments clearly are. Combinatorial chemistry is also much faster than genetic graphs, even on massively parallel computers. Crossover and fitness function evaluation can take seconds or even minutes on a workstation, but thousands of simultaneous reactions can occur on a much smaller time scale.
[Carhart 85] Raymond Carhart, Dennis H. Smith, and R. Venkataraghavan, "Atom pairs as molecular features in structure-activity studies: definition and application," Journal of Chemical Information and Computer Science, 23, pages 64-73, 1985.
[Cheeseman 91] Peter Cheeseman, Bob Kanefsky, William M. Taylor, "Where the Really Hard Problems Are," Proceedings of the 12th International Conference on Artificial Intelligence, Darling Harbor, Sydney, Australia, 24-30 August 1991.
[Claerbout 73] J. F. Claerbout and F. Muir, ``Robust Modeling with Erratic Data,'' Geophysics, 38, pages 826-844, 1973.
[Corey 69] E. J. Corey and W. Todd Wipke, "Computer-Assisted Design of Complex Organic Syntheses," Science, volume 166, pages 178-192, 10 October 1969.
[De Jong 90] K. A. De Jong "Introduction to the Second Special Issue on Genetic Algorithms," Machine Learning, 5 (4), 1990.
[Forrest 93] Stephanie Forrest and Melanie Mitchell, "What Makes a Problem Hard for Genetic Algorithm? Some Anomalous Results in the Explanation," Machine Learning, 13, pages 285-319, 1993.
[Goldberg 89] David E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley Publishing Co., Inc., 1989.
[Grefenstette 89] J. J. Grefenstette and J. E. Baker, "How Genetic Algorithms Work: A Critical Look at Implicit Parallelism," Proceedings of the Third International Conference on Genetic Algorithms, J. D. Schaffer, Editor, San Mateo, CA, Morgan Kaufmann, 1989.
[Holland 75] John H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[Kinnear 94] Kenneth E. Kinnear, Jr., "A perspective on the work in this book," Advances in Genetic Programming, edited by Kenneth E. Kinnear, Jr., MIT Press, Cambridge, Massachusetts, pages 3-20, 1994.
[Koza 92] John R. Koza, Genetic Programming: on the Programming of Computers by Means of Natural Selection, MIT Press, Massachusetts, 1992.
[Koza 97] John R. Koza, Forrest H. Bennett III, David Andre, Martin A. Keane and Frank Dunlap, "Automated synthesis of analog electrical circuits by means of genetic programming," IEEE Transactions on Evolutionary Computation, volume 1, number 2, pages 109-128, July 1997.
[Litzkow 88] M. Litzkow, M. Livny, and M. W. Mutka, "Condor - a hunter of idle workstations,'' Proceedings of the 8th International Conference of Distributed Computing Systems, pp. 104-111, June1988. See http://www.cs.wisc.edu/condor/.
[Lohn 98] Jason D. Lohn and Silvano P. Colombano, "Automated analog circuit synthesis using a linear representation,'' Second International Conference on Evolvable Systems: From Biology to Hardware, Springer-Verlag, Sept.23-25, 1998. (to appear)
[Mitchell 92] M. Mitchell, S. Forrest, and J. H. Holland, "The Royal Road Genetic Algorithms: Fitness Landscapes and GA Performance," Proceedings of the First European Conference and Artificial Life, Cambridge, MA, MIT Press/Branford Books, 1992.
[Nachbar 98] Robert B. Nachbar, "Molecular evolution: a hierarchical representation for chemical topology and its automated manipulation," Proceedings of the Third Annual Genetic Programming Conference, University of Wisconsin, Madison, Wisconsin, 22-25 July 1998, pages 246-253.
[Quarles 94] T. Quarles, A. R. Newton, D. O. Pederson, and A. Sangiovanni-Vincentelli, SPICE 3 Version 3F5 User's Manual, Department of Electrical Engineering and Computer Science, University of California at Berkeley, CA, March 1994.
[Schaeffer 89] J. D. Schaffer, Editor, Proceedings of the Third International Conference on Genetic Algorithms, San Mateo, CA, Morgan Kaufmann, 1989.
[Weininger 95] David Weininger, "Method and apparatus for designing molecules with desired properties by evolving successive populations," U.S. patent US5434796, Daylight Chemical Information Systems, Inc. 1995.
[Xiao 97] Jiang Xiao, Zbigniew Michalewicz, Lixin Zhang, and Krzysztof Trojanowski, "Adaptive evolutionary planner/navigator or mobile robots," IEEE Transactions on Evolutionary Computation, volume 1, number 1, pages 18-28, April 1997.