Although drug design is not usually considered molecular nanotechnology, this is a misconception that presumably started because the earliest nanotechnology work focused on systems analogous to macroscopic machines. Drugs are small molecules that precisely fit into receptor sites to block molecular processes in the body. This must be accomplished without fitting the receptor sites of the body’s healthy molecular machinery. Furthermore, drug molecules must survive in the body long enough to be effective. Early drug discovery was accomplished without understanding these mechanisms, but modern drug design consciously creates molecules with atomic precision to bind well to receptor sites in disease organism proteins. This is atomically precise, three-dimensional control of biological devices; i.e., molecular nanotechnology.
One approach to drug design is to find molecules similar to good drugs that have negative side effects. Ideally, a candidate replacement drug is sufficiently similar to have the same beneficial effect but is different enough to avoid the side effects. In any case, to use genetic graphs for similarity-based drug discovery we need a good similarity measure that can score any molecule. [Carhart 85] defined such a similarity measure, all-atom-pairs-shortest-path, and searched a large database for molecules similar to diazepam.
Genetic programming [Koza 92] uses trees to represent individuals. This is particularly useful for representing computer programs. For example, a tree node representing assignment has two child-nodes, one representing a variable and the other representing a value. The crossover operator exchanges randomly selected sub-trees between two parent-trees. Trees may be viewed as graphs without cycles. Many molecules contain cycles, which chemists call rings. Therefore, any attempt to use genetic programming to design molecules must have a mechanism to evolve cycles. This is non-trivial when crossover can replace any sub-tree with some other random sub-tree. After much thought we were unable to devise a crossover-friendly tree representation of arbitrary cyclic graphs. Crossover-friendly means that any sub-tree is a potential crossover point without restriction. Figure 1 depicts crossover using strings (genetic algorithms), trees (genetic programming) and graphs (genetic graphs) using our crossover technique.
Figure 1: Comparison of crossover operators. The interface
to a different font or line thickness indicates the crossover point.
A single crossover point is adequate to divide strings (genetic algorithms)
and trees (genetic programming) into two fragments. Graphs (genetic graphs)
containing cycles require more than one crossover point to divide the system
into two fragments. Furthermore, when two graph fragments with different
numbers of crossover points must be mated, it is possible to create new
edges to satisfy the excess crossover points on one fragment.
Genetic software techniques have been used for molecular design in the past. There is a patent covering genetic graphs for molecular design [Weininger 95]. The patent describes the straightforward and fairly obvious parts of mapping standard genetic algorithm techniques to molecular design and the non-obvious portions: the crossover algorithm and fitness functions. The crossover algorithm described in the patent uses two parameters: the digestion rate which breaks bonds, and the dominance rate which apparently controls how many parts of each parent appears in descendants. This algorithm may produce fragments rather than completely connected molecules. Our paper describes a crossover algorithm that always produces connected molecules and has no parameters. This crossover algorithm is the heart of our genetic method. Fitness functions are clearly non-obvious, but must usually be custom designed for each application. Our fitness function and that in [Weininger 95] both use the Tanimoto index as a distance measure. [Weininger 95] describes a number of fitness functions. We have used all-pairs-shortest-path in most of our work. Daylight Chemical Information Systems, Inc., which holds the patent, reports using genetic techniques to discover lead compounds for pharmaceutical drug development and other commercial successes.
[Nachbar 98] used genetic programming to evolve molecules for drug design by sidestepping the crossover/cycles problem. Each tree node represented an atom with a bond to the parent-node atom and each child-node atom. Hydrogen atoms were explicitly represented and are always leaf nodes. Rings were represented by numbering certain atoms and allowing a reference to that number to be a leaf node. Crossover was constrained not to break or form rings. Ring evolution was enabled by specific ring opening and closing mutation operators.
In a personal communication, Astro Teller reported developing a graph crossover algorithm as part of his dissertation at Carnegie Mellon University. This technique was applied in Neural Programming, a system developed by Teller that combines neural nets and genetic programming. At the time this paper was written, the details of Teller's algorithm were not available in the published literature.
The initial population is generated by choosing a random number of atoms between half and twice the size of the target molecule. Atomic elements are randomly chosen from the elements present in the target molecule. Bonds are then added at random to construct a spanning tree; i.e., at this point all atoms are connected into a single molecule. Then a random number of additional bonds are added to create cycles. This number is chosen to be between half and twice the number of cycles in the target molecule. The number of cycles is always = bonds – atoms + 1. The type of each bond is selected at random from the set of bond types present in the target molecule.
For this work, tournament selection was used to choose parents in a steady state genetic system. Tournament selection means that each parent is chosen by comparing two randomly chosen individuals and taking the best. Steady state means that new individuals (children) replace poor individuals in the population rather than creating a new generation. The poor individuals are also chosen by tournament, but the worst individual is selected for replacement. By convention, after population-size individuals have been replaced, we say that one generation is complete. The implementation follows this procedure:
Figure 2: butane and benzene are ripped apart at random points. Then one fragment of butane and a fragment of benzene are mated. Note that benzene must be cut in two places. Also, during mating the benzene fragment has more than one cut bond. A random choice is made to connect this extra cut bond to a random atom in the butane fragment. Alternatively, the extra cut bond could have been discarded.
A somewhat more complete but significantly more confusing explanation follows. The terms vertex and edge are used for atom and bond respectively to indicate that the algorithm may be applied to any graph structure.
The computational resources required for genetic software to find a solution is a function of the size of the search space, among other factors. The space of all possible graphs is combinatorial and enormous. For molecular design this space can be radically reduced by enforcing valence limits for each atom. Thus, a carbon atom with one double and two single bonds will not be allowed to add another bond. Also, avoiding explicit representation of hydrogen atoms substantially reduces the size of the graph and therefore the search space.
The distance measure used is the Tanimoto index. This is
The targets for our initial study were butane, benzene, cubane, purine, diazepam, morphine and cholesterol. The fitness function can not only find similar molecules, which is useful in drug design, but can also lead evolution to the exact molecule used as a target. This proves that the algorithm can reach particular kinds of molecules and the number of generations to find the target provides a quantitative measure of performance. Unfortunately, all-pairs-shortest-path is an O(n3) algorithm so finding larger molecules can be quite time consuming.
benzene
(C6H6) a simple molecule with a ring.
cubane (C8H8)
a cage molecule.
purine (C5H4N4)
fused rings and heteroatoms.
diazepam (C16H13ClN2O)
used in [Carhart 85].
morphine (C17H19NO3)
Dr. Wipke's group has worked on morphine analog design for many years.
cholesterol
(C27H46O) a non-drug molecule.Since the algorithm is stochastic, twenty runs were conducted for each target molecule. The number of generations and population size was varied in an attempt to have enough successful runs (at least 11) to calculate the median time to find the target. Once the target was found the run stopped. Runs also stopped after a fixed, maximum number of generations. A few of the best individuals were saved to see if the software produced molecules similar to the target. These may be useful for drug design.
| 20 runs for each molecule | Population size | Median generations to find target | Minimum generations to find target | Number of runs that failed to find target |
Maximum generations
|
| Benzene |
200
|
39.5
|
2
|
8
|
1000
|
| Cubane |
100
|
46.5
|
13
|
0
|
1000
|
| Purine |
100
|
245
|
19
|
4
|
1000
|
| Molecule | Population size | Generation found | Fitness function |
| Morphine | 1000 | 208 | all-pairs-shortest-path |
| Diazepam | 200 | 256 | all-pairs-shortest-path |
| Cholesterol minus the two methyl groups connected to the rings | 500 | 1765 | all-pairs-shortest-path plus number of rings |
Finding moderate size molecules has proven difficult with the available computer resources. This is probably because the fitness function is O(n3), but also due to problems with the cycle-stealing batch system used to run this program on idle workstations. Most genetic software uses mutation as well as crossover. Mutation helps systems make small changes. While crossover seems to be capable of generating molecules, the performance is sufficiently poor that mutation operators might help a great deal. Other than the usual operators to add and remove atoms or bonds, it may be helpful to have a mutation operator that makes a random ring aromatic (alternating double and single bonds for certain sized rings). Generating aromatic rings with crossover alone is probably difficult, note the problems generating benzene, so a special mutation operator may be helpful.
The cholesterol run is interesting because a slightly different fitness function was used. Namely, an equal combination of all-pairs-shortest-path and a modified Tanimoto index on the number of rings in the target molecule versus the candidate. This fitness function appears to do a better job of finding interesting analogues to the target molecule. With all-pairs-shortest-path alone, populations seem to converge on a single poor ring structure, at least for the best molecules in the population. Adding the distance between the number of rings generates more fit molecules and more diverse ring structures, at least in preliminary results.
Performance analysis demonstrates what might be expected — the O(n3) fitness function takes most of the CPU time. A faster fitness function would substantially speed calculations. Some runs with faster fitness functions have been made, but these simpler fitness functions do not drive evolution to find the target exactly so comparison with the above data is difficult. Genetic software lore suggests that the fitness function is exceptionally important [Kinnear 94]. Our results bear this out.
Fortunately, the algorithm is embarrassingly parallel since many runs are required. Also, there is significant potential for parallelism within runs since fitness function execution on each individual is completely independent. Furthermore, the algorithm can be easily restarted and can afford to lose some runs. Thus, genetic graphs is a good candidate for cycle-scavenging batch systems such as Condor [Litzkow 88]. Large genetic graph production runs can therefor use otherwise wasted workstation and PC cycles at facilities with large numbers of these machines. Although some of the results presented here were simply run on workstations, all current jobs are run under Condor.
Many fitness functions of interest require molecular conformation; i.e., xyz coordinates for each atom. For example, designing a molecule to fit in a protein receptor to inhibit the activity of a disease organism. The new and fairly effective AIDS drugs are an example of this approach. To design a fitness function evolving molecular fit, the very bizarre molecules often created by crossover must be minimized quickly. Most minimizers available today will not work well with without a "reasonable" start point. We are searching the literature for algorithms to minimize very "bad" molecules.
Circuit design is another field for which genetic graphs should, in principle, be well suited. Genetic algorithms (using variable length strings) [Lohn 98] and genetic programming [Koza 97] have been used to design analog circuits. In the genetic programming case, a tree language to generate analog circuits compatible with the SPICE (Simulation Program with Integrated Circuit Emphasis) [Quarles 94] simulator was constructed and a 64 node (80MHz per node) parallel supercomputer was used to design the circuits. The system designed a lowpass filter, a crossover filter, a four-way source identification circuit, a cube root circuit, a time-optimal controller circuit, a 100 dB amplifier, a temperature-sensing circuit, and a voltage reference source circuit. Thus, genetic programming can design graph-structured systems. However, we have found it extremely difficult to create a tree language that can generate any possible graph and support crossover cleanly. Therefore, it may be advantageous to directly evolve graphs rather than evolve trees that generate graphs.
[Carhart 85] Raymond Carhart, Dennis H. Smith, and R. Venkataraghavan, "Atom pairs as molecular features in structure-activity studies: definition and application," Journal of Chemical Information and Computer Science, 23, pages 64-73, 1985.
[Holland 75] John H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[Kinnear 94] Kenneth E. Kinnear, Jr., "A perspective on the work in this book," Advances in Genetic Programming, edited by Kenneth E. Kinnear, Jr., MIT Press, Cambridge, Massachusetts, pages 3-20, 1994.
[Litzkow 88] M. Litzkow, M. Livny, and M. W. Mutka, "Condor - a hunter of idle workstations,'' Proceedings of the 8th International Conference of Distributed Computing Systems, pp. 104-111, June1988. See http://www.cs.wisc.edu/condor/.
[Lohn 98] Jason D. Lohn and Silvano P. Colombano, "Automated analog circuit synthesis using a linear representation,'' Second International Conference on Evolvable Systems: From Biology to Hardware, Springer-Verlag, Sept.23-25, 1998. (to appear)
[Nachbar 98] Robert B. Nachbar, "Molecular evolution: a hierarchical representation for chemical topology and its automated manipulation," Proceedings of the Third Annual Genetic Programming Conference, University of Wisconsin, Madison, Wisconsin, 22-25 July 1998, pages 246-253.
[Quarles 94] T. Quarles, A. R. Newton, D. O. Pederson, and A. Sangiovanni-Vincentelli, SPICE 3 Version 3F5 User's Manual, Department of Electrical Engineering and Computer Science, University of California at Berkeley, CA, March 1994.
[Koza 92] John R. Koza, Genetic Programming: on the Programming of Computers by Means of Natural Selection, MIT Press, Massachusetts, 1992.
[Koza 97] John R. Koza, Forrest H. Bennett III, David Andre, Martin A. Keane and Frank Dunlap, "Automated synthesis of analog electrical circuits by means of genetic programming," IEEE Transactions on Evolutionary Computation, volume 1, number 2, pages 109-128, July 1997.
[Weininger 95] David Weininger, "Method and apparatus for designing molecules with desired properties by evolving successive populations," U.S. patent US5434796, Daylight Chemical Information Systems, Inc. 1995.
[Xiao 97] Jiang Xiao, Zbigniew Michalewicz, Lixin Zhang, and Krzysztof Trojanowski, "Adaptive evolutionary planner/navigator or mobile robots," IEEE Transactions on Evolutionary Computation, volume 1, number 1, pages 18-28, April 1997.