Recent advances in simulated evolution and learning
We suspect this result is due to both the quality of the topology, which was highly engineered, and the size of the topology space, which makes it difficult for evolution to search. To compare these four approaches, we conducted 10 independent runs of each approach in each of the 10 training MDPs, i.e., 100 runs per approach.
The results, averaged over all runs for each method, are shown in the left side of Fig.

Figure caption: Average (left) and best (right) performance of the population champion over time on GHH (lower is better).
The results demonstrate that seeding the population with the baseline policy enables evolution to begin in a relatively fit region of the search space and can therefore significantly speed evolution. This is consistent with many earlier results confirming the efficacy of population seeding. The results also show that, when the baseline policy is used, the manually designed MLP performs substantially better than the SLP. This is not surprising, since the topology was carefully engineered for helicopter control.
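This kind of population seeding can be sketched as follows, assuming for illustration that a policy is a flat weight vector; `seed_population`, the noise level, and the population size are illustrative choices, not the exact settings used in these experiments:

```python
import random

def seed_population(baseline_weights, pop_size, noise_std=0.1, seed=0):
    """Build an initial population around a baseline policy's weight vector.

    The first individual is an exact copy of the baseline, so evolution
    starts from a known-fit point; the rest are Gaussian perturbations.
    """
    rng = random.Random(seed)
    population = [list(baseline_weights)]          # keep the baseline itself
    for _ in range(pop_size - 1):
        population.append([w + rng.gauss(0.0, noise_std)
                           for w in baseline_weights])
    return population

baseline = [0.5, -1.2, 0.0, 2.0]                   # toy weight vector
pop = seed_population(baseline, pop_size=50)
```

Keeping one unmutated copy of the baseline guarantees the population never starts worse than the seed policy itself.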
More surprising is the poor performance of the MLP when beginning from scratch, without population seeding. This strategy appears to perform the worst. However, a closer examination of the individual runs revealed that the vast majority achieve performance similar to the MLP using the baseline policy, albeit more slowly, as in the best runs shown in the right side of Fig. The remaining few runs converge prematurely, performing badly enough to greatly skew the average.
All the approaches described in this section evolve policies only for a single training MDP, with no attempt to generalize across MDPs with different wind settings.
To determine the robustness of the resulting policies, we compared their average performance across all 10 training MDPs to their performance on the particular MDP for which they were trained. Specifically, we selected the best single-layer and multi-layer policy evolved from the baseline policy for each MDP and tested each for 10 episodes in every training MDP. This comparison demonstrates that the MLP policies are far more robust, achieving much better average performance and lower variance across the training MDPs. In fact, no specialized multi-layer policy crashes the helicopter on any of the 10 MDPs.
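The cross-MDP evaluation just described can be sketched as follows; `robustness` and the `evaluate` callback are hypothetical names, and the toy usage below stands in for real helicopter episodes:

```python
import statistics

def robustness(policies, mdps, evaluate, episodes=10):
    """Average each policy's return over `episodes` episodes in every MDP.

    `evaluate(policy, mdp)` returns the reward of one episode; the result
    maps each policy name to (mean reward, spread across MDPs).
    """
    results = {}
    for name, policy in policies.items():
        per_mdp = [
            sum(evaluate(policy, mdp) for _ in range(episodes)) / episodes
            for mdp in mdps
        ]
        results[name] = (statistics.mean(per_mdp), statistics.pstdev(per_mdp))
    return results

# Toy check: policies and MDPs are plain numbers, reward = policy * mdp.
res = robustness({"slp": 1.0, "mlp": 2.0}, [1.0, 2.0, 3.0],
                 lambda policy, mdp: policy * mdp)
```

Reporting both the mean and the spread across MDPs captures exactly the two quantities the comparison above rests on: average performance and variance.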
By contrast, the single-layer policies frequently crash on MDPs other than those they trained on, with catastrophic effects on average reward. This result underscores the challenges of achieving high performance in the generalized version of the task, which we address in Sects. The results presented above demonstrate that neuroevolution can discover effective policies for helicopter hovering. However, doing so incurs high sample costs because it requires evaluating tens of thousands of policies through interactions with the environment.
Many of these policies yield poor reward or even crash the helicopter. Consequently, directly using evolution to find policies on-line is infeasible for the competition because participants are evaluated on the cumulative reward their agents accrue during learning. Even if methods for improving the on-line performance of neuroevolution [14, 89, 90] were used, such an approach would not be practical. One way to reduce sample costs is to use a model-based approach. If the agent can learn a model of the environment from flight data for each testing MDP early in the run, that model can simulate the fitness function required by neuroevolution.
Thus, once a model has been learned, evolution can proceed off-line without increasing sample costs. Doing so allows the agent to employ a good policy much earlier in the run, thus increasing its cumulative reward.
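The way a learned model substitutes for real episodes during fitness evaluation can be sketched as follows; `model_fitness` is a hypothetical name, and a `model` callable that returns the predicted next state and reward is an assumed interface, not the actual helicopter simulator:

```python
def model_fitness(policy, model, initial_state, horizon=100):
    """Estimate a policy's episodic return by rolling it out in a learned
    model instead of the real environment, so no extra samples are used."""
    state, total = initial_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = model(state, action)   # model predicts next state and reward
        total += reward
    return total

# Toy 1-D check: the policy pushes the state toward zero, reward is -|state|.
toy_policy = lambda s: -s
toy_model = lambda s, a: (s + 0.5 * a, -abs(s))
score = model_fitness(toy_policy, toy_model, 1.0, horizon=3)
```

Because every call here touches only the learned model, evolution can evaluate thousands of candidate policies without consuming a single real episode.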
Learning such a model can be viewed as a form of surrogate modeling, also known as fitness modeling [36], in which a surrogate for the fitness function is learned and used for fitness evaluations. Surrogate modeling is useful for smoothing fitness landscapes [67, 98] or when there is no explicit fitness function.
However, it is most commonly used to reduce computation costs [37, 59, 67, 68, 70] by finding a model that is faster than the original fitness function. The use of surrogate modeling in our setting is similar, but the goal is to reduce sample costs, not computational costs. Recall that, as described in Sect. , the agent does not have direct access to the environment's true dynamics. In practice, however, it is still possible to learn models that correctly predict how a given policy will perform in GHH. During the competition, the details of the helicopter environment were hidden. However, helicopter dynamics have been well studied.
The accelerations depend on the values of the g vector, which represents gravity (9.81 m/s²). The values of w are zero-mean Gaussian random variables that represent perturbations in acceleration due to noise. As described in Sect. , this model representation was not designed for the generalized version of the problem, so it does not explicitly consider the presence of wind.
Nonetheless, it can still produce accurate models if the amount of wind in the helicopter frame remains approximately constant. Since the goal of the hovering problem is to keep the helicopter as close to the target position as possible, this assumption holds in practice. Therefore, wind can be treated as a constant, and learning a complete model requires only estimating values for the weights C, D, and w.
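The linear structure being fit can be sketched as follows; `predict_acceleration` and the feature layout are illustrative (the competition model uses a specific feature set not reproduced here), and the per-component bias simply absorbs gravity and the near-constant wind:

```python
def predict_acceleration(C, D, bias, velocity, action):
    """Predict each acceleration component as a linear combination of the
    current velocities (weights C) and control inputs (weights D), plus a
    per-component bias absorbing gravity and the near-constant wind."""
    return [
        sum(c * v for c, v in zip(C_row, velocity)) +
        sum(d * a for d, a in zip(D_row, action)) +
        b
        for C_row, D_row, b in zip(C, D, bias)
    ]

# Toy 2-D check with hand-picked weights.
acc = predict_acceleration(
    C=[[1.0, 0.0], [0.0, 1.0]],
    D=[[0.5, 0.0], [0.0, 0.5]],
    bias=[0.0, -9.81],                 # e.g. gravity on the second axis
    velocity=[2.0, 0.0],
    action=[1.0, 1.0],
)
```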
We consider three different approaches to learning these weights. In the first approach, we use evolutionary computation to search for weights that minimize the error in the reward that the model predicts a given policy will accrue during one episode. This approach directly optimizes the model for its true purpose: to serve as an accurate fitness function when evolving helicopter policies.
To do so, we apply the same steady-state evolutionary method used to evolve policies (see Appendix A). Fitness is based on the error in total estimated reward per episode using a single policy trained on an MDP with no wind, which we call the generic policy. In the second approach, we instead minimize the error in the model's one-step predictions. Again we use the same steady-state evolutionary method and compute fitness using the generic policy.
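A sketch of a reward-error fitness function follows, under the simplifying assumption that the candidate model's reward predictions are scored along the logged trajectory rather than by a full rollout in the model; `reward_error_fitness` and the log layout are illustrative names:

```python
def reward_error_fitness(predict_reward, flight_log):
    """Score candidate model weights by how closely the model's predicted
    total reward for a logged episode matches the reward actually observed
    (lower is better).  `flight_log` holds (state, action, reward) tuples
    gathered with the generic policy."""
    actual = sum(r for _, _, r in flight_log)
    predicted = sum(predict_reward(s, a) for s, a, _ in flight_log)
    return abs(predicted - actual)

# Toy check: a model predicting -1.5 per step matches the logged total exactly.
log = [((0.0,), (0.0,), -1.0), ((1.0,), (0.0,), -2.0)]
err = reward_error_fitness(lambda s, a: -1.5, log)
```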
In the third approach, we still try to minimize error in one-step predictions but use linear regression in place of evolutionary computation. Linear regression computes the weight settings that minimize the squared error on one episode of data gathered with the generic policy. After preprocessing, evolution or linear regression is used to estimate C and D.
The noise parameters w are approximated using the average of the squared prediction errors of the learned model on the flight data. We evaluated each of these approaches by using them to learn models for each of the test MDPs that were released after the competition ended. Then we used the learned model to evolve policies in the manner described in Sect.
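Fitting the weights by least squares and estimating the noise parameters from the residuals might look like the following sketch; `fit_dynamics`, the synthetic data, and the feature layout (a constant column whose weight absorbs gravity and wind) are illustrative assumptions:

```python
import numpy as np

def fit_dynamics(X, Y):
    """Fit acceleration targets Y from features X by least squares, adding a
    constant column so one weight per output can absorb gravity and wind,
    then estimate the noise parameters as mean squared residuals."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    residuals = Y - X1 @ W
    noise_var = (residuals ** 2).mean(axis=0)       # per-output noise estimate
    return W, noise_var

# Synthetic, noiseless data: the fit should recover the weights exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_W = np.array([[1.0], [-2.0], [0.5]])
Y = X @ true_W + 3.0
W, noise = fit_dynamics(X, Y)
```

A single `lstsq` call replaces the thousands of model evaluations evolution would need, which is the computational advantage discussed below.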
The results also demonstrate that, when minimizing one-step error, linear regression is more effective than evolution. Furthermore, linear regression requires vastly less computation time. This difference is not surprising, since evolution requires a large number of evaluations to evolve a model. By contrast, linear regression requires only one sweep through the flight data to estimate model weights.

In this section, we describe the simple model-free approach for tackling GHH that won first place in the RL Competition. The robustness analysis presented in Sect. shows how difficult it is for any single fixed policy to perform well across MDPs with different wind settings. Thus, excelling in the competition requires learning on-line in order to adapt to each test MDP.
At the same time, a good agent must avoid crashing the helicopter and must minimize the time spent evaluating suboptimal policies during learning. Each competition test run lasts only 1,000 episodes but, as shown in Fig. , evolution needs far more episodes than that to find good policies. Even if evolution could find a good policy in 1,000 episodes, it would accrue large negative reward along the way. As mentioned in Sect. , a model-based approach could reduce these sample costs. However, at the time of the competition, we were unable to learn models accurate enough to serve as reliable fitness functions for evolution. Instead, we devised a simple, sample-efficient model-free approach. In advance of the competition, specialized policies for each of the 10 training MDPs were evolved using the procedure described in Sect.
At the start of each test MDP, each of these specialized policies was briefly evaluated on-line. Finally, whichever specialized policy performed the best was used for the remaining episodes of that test MDP. This strategy, depicted in Fig. , won first place in the competition.
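The on-line selection loop can be sketched as follows; `model_free_agent`, the trial length, and the `run_episode` callback (standing in for interaction with the current test MDP) are illustrative, not the exact competition settings:

```python
def model_free_agent(specialized_policies, run_episode, total_episodes, trials=1):
    """On a new test MDP, briefly try each pre-evolved specialized policy,
    then run whichever scored best for all remaining episodes.
    `run_episode(policy)` returns one episode's reward on the current MDP."""
    scores = [sum(run_episode(p) for _ in range(trials))
              for p in specialized_policies]
    best = specialized_policies[scores.index(max(scores))]
    total = sum(scores)
    remaining = total_episodes - trials * len(specialized_policies)
    for _ in range(remaining):
        total += run_episode(best)
    return best, total

# Toy check: policies are numbers and an episode's reward equals the policy.
best, total = model_free_agent([1.0, 3.0, 2.0], lambda p: p, total_episodes=10)
```

The sample cost of learning is fixed and tiny: one short trial per pre-evolved policy, after which no further exploration occurs.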
Figure caption: The model-free approach. The dashed box occurs off-line, while the solid boxes occur on-line, during actual test episodes.

Of the six entries that successfully completed test runs, only one other entry managed to avoid ever crashing the helicopter, though it still incurred substantially more negative reward. In fact, all the competitors accrued at least two orders of magnitude more negative reward than our model-free approach. Due to these large differences, the results are shown on a log scale. Since this scale obscures details about the performance of the model-free method, the same results are also reproduced on a linear scale in Fig.
Figure caption: At left, log-scale cumulative reward accrued by competitors in GHH during the RL Competition (lower is better). At right, linear-scale cumulative reward of only the model-free approach in the same competition, showing how the slope rises or falls suddenly as the test MDP changes every 1,000 episodes.

However, one entry matched the performance of the model-free approach for approximately the first third of testing.
This entry, submitted by a team from the Complutense University of Madrid, also uses neuroevolution [51]. However, it evolves only single-layer feed-forward perceptrons. Furthermore, all evolution occurs on-line, using the actual test episodes as fitness evaluations. To minimize the chance of evaluating unsafe policies, their approach begins with the baseline policy and restricts the crossover and mutation operators to allow only very small changes to the policy.
While their strategy makes on-line neuroevolution more feasible, three crashes still occurred during testing, relegating it to a fourth-place finish.

After the competition, we successfully implemented the model-learning algorithms described in Sect. Given some test MDP, one episode of flight data is gathered using the generic policy, which avoids crashing but may not achieve excellent reward on that MDP.
Next, a complete model of the test MDP is learned from the flight data via linear regression, the best-performing method. Then, neuroevolution is used to evolve a policy optimized for this MDP, using the model to compute the fitness function. Finally, the evolved policy controls the helicopter for all remaining episodes on that MDP.

Figure caption: The model-based approach. The dashed boxes occur off-line, while the solid boxes occur on-line, during actual test episodes.
We also tested an incremental model-based approach, depicted in Fig. Specifically, the incremental approach learns a new model at the end of each episode using all the flight data gathered on that MDP. Then, evolution is repeated using the latest model to find a new policy for the next episode. Once the performance of the policy in the MDP is at least as good as that predicted by the model, learning is halted and that policy is employed for the remaining episodes.
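The incremental loop can be sketched as follows; `incremental_model_based` and all of its callbacks (`learn_model`, `evolve_policy`, `run_episode`) are stand-in names, and the toy stubs at the end exist only to exercise the control flow, not to model a helicopter:

```python
def incremental_model_based(generic_policy, run_episode, learn_model,
                            evolve_policy, max_episodes):
    """Gather an episode, refit the model on all flight data so far, evolve a
    new policy, and halt learning once the policy's real return is at least
    what the model predicted for it.  `run_episode(policy)` returns
    (reward, flight_data) for one real episode."""
    reward, data = run_episode(generic_policy)
    rewards, flight_data = [reward], list(data)
    policy, predicted, learning = generic_policy, None, True
    for _ in range(max_episodes - 1):
        if learning:
            model = learn_model(flight_data)
            policy, predicted = evolve_policy(model)   # evolved off-line in the model
        reward, data = run_episode(policy)
        rewards.append(reward)
        flight_data.extend(data)
        if learning and reward >= predicted:
            learning = False                           # model's prediction achieved
    return rewards

# Toy stubs that only exercise the control flow, not real learning.
rewards = incremental_model_based(
    generic_policy=1.0,
    run_episode=lambda p: (p, [p]),       # reward equals the "policy" itself
    learn_model=lambda data: len(data),   # stand-in "model"
    evolve_policy=lambda m: (float(m), float(m)),
    max_episodes=4,
)
```

The stopping rule matches the text above: once the policy's actual return is at least the model's prediction, re-learning stops and the current policy runs for the remaining episodes.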