[toc]

Autonomous Helicopter Flight via Reinforcement Learning


# Abstract

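Helicopter flight is widely regarded as a challenging control problem. In this paper the authors describe a successful application of reinforcement learning to designing a controller for autonomous helicopter flight: they first fit a stochastic, nonlinear model of the helicopter's dynamics, then use that model to learn how to hover in place and how to fly a number of maneuvers taken from an RC helicopter competition.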

In this paper, we describe the successful application of reinforcement learning to designing a controller for autonomous helicopter flight.

CM (Comment): This post focuses on how the helicopter model is built and how the control policy is learned from that model.

# INTRODUCTION

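Why hovering is hard: when the main rotor spins clockwise, the helicopter body tends to spin anticlockwise. A tail rotor is added to counteract this rotation of the body, but it in turn causes sideways drift. As a result, a hovering helicopter is generally tilted slightly to the right.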

So, for a helicopter to hover in place, it must actually be tilted slightly to the right, so that the main rotor’s thrust is directed downwards and slightly to the left, to counteract this tendency to drift sideways.

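Readers without much background in helicopter control may want to first watch an introductory video on how helicopters fly and how they are controlled, and review how Newton's three laws apply to helicopter control. Otherwise the control learning discussed later is hard to follow.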

- After watching such a video, you will have seen that helicopters rely on ingenious solutions, yet their nonintuitive dynamics are what make them hard to control.

# Autonomous Helicopter

## Hardware

- Yamaha R-50 helicopter (3.6 m long, 20 kg payload)
- an Inertial Navigation System (INS)
- a differential GPS system (a resolution of 2cm)
- a digital compass
- an onboard navigation computer

- Finally, a Kalman filter combines these sensors and outputs digital state estimates at 50 Hz: position, orientation, velocity, and angular velocities.

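With that background it is easy to see that most helicopters are controlled through a 4-dimensional action space: tilting the plane in which the rotors rotate forwards/backwards or sideways (two controls), changing the pitch angle of the main rotor blades, and changing the pitch angle of the tail rotor blades.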

- Cyclic pitch: The longitudinal (front-back) and latitudinal (left-right) cyclic pitch controls. By tilting the plane in which the helicopter’s rotors rotate either forwards/backwards or sideways, these controls cause the helicopter to accelerate forward/backwards or sideways.

- Collective pitch: The (main rotor) collective pitch control. By varying the tilt angle of the rotor blades, the collective pitch control affects the main rotor’s thrust.

- Yaw motion: The tail rotor collective pitch control. Using a mechanism similar to the main rotor collective pitch control, this controls the tail rotor’s thrust.

In summary, the task in this paper is to use the Kalman filter's position estimates to pick good control actions periodically, at 50 Hz.

Using the position estimates given by the Kalman filter, our task is to pick good control actions every 50th of a second.

# Model identification

## Fit the helicopter’s dynamics model

Input data: the flight of a human pilot is recorded (12-dimensional helicopter state and 4-dimensional helicopter control inputs), and this flight data is used for model fitting.

We began by asking a human pilot to fly the helicopter for several minutes, and recorded the 12-dimensional helicopter state and 4-dimensional helicopter control inputs as it was flown.

Working in helicopter body coordinates reduces the number of parameters that have to be identified.

In body coordinates, the x, y, and z axes are forwards, sideways, and down relative to the current position of the helicopter.

Our model is identified in the body coordinates, which has four fewer variables than the spatial (world) coordinates.

### Locally weighted linear regression

s(t+1) = f(s(t), a(t), noise)

By applying locally-weighted regression with the state s(t) and action a(t) as inputs, and the one-step differences of each of the state variables in turn as the target output, this gives us a non-linear, stochastic model of the dynamics, allowing us to predict s(t+1) as a function of s(t) and a(t) plus noise.
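As a rough illustration of the idea, the following is a minimal NumPy sketch of locally weighted linear regression. The Gaussian kernel, the `bandwidth` value, the small `ridge` term, and the function name `lwr_predict` are all assumptions made for this example; the paper's exact kernel and settings are not reproduced here.

```python
import numpy as np

def lwr_predict(X, Y, x_query, bandwidth=1.0, ridge=1e-6):
    """Predict the target at x_query with locally weighted linear regression.

    X: (n, d) inputs, e.g. rows of [s(t), a(t)] stacked together.
    Y: (n, k) targets, e.g. one-step differences s(t+1) - s(t).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Gaussian kernel (assumed): samples near the query get larger weights.
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * bandwidth ** 2))

    # Append a constant column so the local linear fit has an intercept.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    WX = Xb * w[:, None]

    # Weighted normal equations; the tiny ridge term only guards against
    # a singular matrix when few samples lie near the query.
    A = Xb.T @ WX + ridge * np.eye(Xb.shape[1])
    B = WX.T @ Y
    theta = np.linalg.solve(A, B)

    return np.append(x_query, 1.0) @ theta
```

In the setting above, one would call this with the current `[s(t), a(t)]` as `x_query`, add the predicted difference to s(t), and then add noise (for instance estimated from the fit residuals) to obtain a sample of s(t+1).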

In addition, prior knowledge about helicopters is used to refine the model and reduce the number of parameters that have to be fitted.

Similar to the use of body coordinates to exploit symmetries, there is other prior knowledge that can be incorporated.

Similar reasoning allows us to conclude that certain other parameters should be 0, 1/50 (50 Hz) or g (gravity), and these were also **hard-coded** into the model.

Three unobserved variables are added to describe the model's latencies in responding to the controls.

Finally, we added three extra (unobserved) variables to model latencies in the responses to the controls.

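Plotting the mean-squared error over longer time scales makes the differences between candidate models easier to see and compare, and shows how accurately each model fits over a long horizon.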

Our main tool for choosing among the models was plots.

The plots show the mean-squared error between the estimated position and the true position at longer time scales.

For a model, the mean-squared error (as measured on test data) between the helicopter’s true position and the estimated position at a certain time in the future.
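A check of this kind might be sketched as follows. The function name `horizon_mse`, the `step_fn` interface, and the choice to measure error over the whole state vector (rather than position only) are assumptions made for illustration.

```python
import numpy as np

def horizon_mse(step_fn, states, actions, horizons):
    """Mean-squared error of open-loop model rollouts at several horizons.

    step_fn(s, a) -> predicted next state (hypothetical model interface).
    states:  logged test states, length T + 1.
    actions: logged test actions, length T.
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    errors = {}
    for h in horizons:
        sq_errors = []
        for t0 in range(len(actions) - h + 1):
            # Start from a logged state and roll the model forward h steps
            # using the logged actions, without looking at later states.
            s = states[t0]
            for k in range(h):
                s = step_fn(s, actions[t0 + k])
            sq_errors.append(np.sum((s - states[t0 + h]) ** 2))
        errors[h] = float(np.mean(sq_errors))
    return errors
```

Plotting `errors[h]` against `h` for each candidate model gives one curve per model, so a model that looks fine one step ahead but drifts over several seconds becomes easy to spot.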

The point of building a good model is to make the controller more reliable later, in the real environment.

We wanted to verify the fitted model carefully, so as to be reasonably confident that a controller tested successfully in simulation will also be safe in real life.

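There is a concern that unmodeled noise could exceed what the model predicts. To check this, plots are used to verify whether the helicopter's predicted output stays within the error threshold.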

One of our concerns was the possibility that unmodeled correlations in noise might mean the noise variance of the actual dynamics is much larger than predicted by the model.

# Reinforcement learning: The PEGASUS algorithm

## Description of PEGASUS

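- These quantities (the utilities of policies) are hard to compute exactly with a program, but the dynamics of the MDP can be simulated on a computer.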
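The core trick in PEGASUS is to draw the simulator's random numbers once and reuse them for every policy, so that the estimated utility becomes a deterministic function of the policy. The sketch below illustrates this on a generic simulator; `step_fn`, `reward_fn`, the Gaussian noise, and all default values are hypothetical choices for this example, not the paper's setup.

```python
import numpy as np

def make_pegasus_estimator(step_fn, reward_fn, s0,
                           n_scenarios=10, horizon=50, noise_dim=1, seed=0):
    """Build a deterministic utility estimator from a stochastic simulator.

    step_fn(s, a, eps) -> next state, where eps is the noise for that step.
    reward_fn(s, a)    -> scalar reward.
    """
    rng = np.random.default_rng(seed)
    # The noise is sampled once here and reused for every policy evaluated
    # later; this is what makes the estimate deterministic.
    noise = rng.standard_normal((n_scenarios, horizon, noise_dim))

    def utility(policy):
        total = 0.0
        for scenario in noise:
            s = np.array(s0, dtype=float)
            for eps in scenario:
                a = policy(s)
                total += reward_fn(s, a)
                s = step_fn(s, a, eps)
        return total / n_scenarios

    return utility
```

Because the same scenarios are used for every policy, comparing two policies compares them on identical noise, which removes sampling jitter from the search.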

# Learning to Hover

Attempts with other controller designs, such as μ-synthesis, are cited to illustrate the difficulty and subtlety of the problem.

These comments should not be taken as conclusive of the viability of any of these methods; rather, we take them to be indicative of the difficulty and subtlety involved in learning a helicopter controller.

Policy learning starts with hovering.

We began by learning a policy for hovering in place.

Neural network design for hover control

The pictures inside the circles indicate whether a node outputs the sum of its inputs, or the tanh of the sum of its inputs. Each edge with an arrow in the picture denotes a tunable parameter. The solid lines show the hovering policy class. The dashed lines show the extra weights added for trajectory following.

- Clearly, this requires both theoretical and practical knowledge of helicopter control.

Each of the edges in the figure represents a weight, and **the connections were chosen via simple reasoning about which control channel should be used to control which state variables**.

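The reward and penalty functions are designed around the specific problem and goal. This takes domain expertise and practical experience beyond computer science and reinforcement learning.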

- Quadratic reward function (stay near the target position and keep the velocity small)
This encourages the helicopter to hover near target position, while also keeping the velocity small and not making abrupt movements.

- Quadratic penalty function (encourage smooth control and small actions)
To encourage small actions and smooth control of the helicopter, we also used a quadratic penalty for actions.

Using the identified model, the PEGASUS algorithm is applied to define approximations $[\hat{U}(\pi)]$ to the utilities of policies.

Using the model identified, we can now apply PEGASUS to define approximations $[\hat{U}(\pi)]$ to the utilities of policies.
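The quadratic reward and action penalty described above, which are what get summed along simulated flights to form the utility estimate, could look roughly like the sketch below. The state layout (for example position, velocity, and heading stacked in one vector), the weights, and the name `hover_reward` are hypothetical; the paper's actual coefficients are not reproduced here.

```python
import numpy as np

def hover_reward(state, target, action, state_weights, action_weights):
    """Negative quadratic cost on state error and on control effort.

    state, target: vectors in the same (assumed) layout, with the target
                   velocities set to zero for hovering.
    state_weights, action_weights: non-negative per-component weights
                   (hypothetical values chosen by the designer).
    """
    err = np.asarray(state, dtype=float) - np.asarray(target, dtype=float)
    state_cost = np.sum(np.asarray(state_weights) * err ** 2)
    action_cost = np.sum(np.asarray(action_weights)
                         * np.asarray(action, dtype=float) ** 2)
    # Reward is highest (zero) exactly at the target with zero control input.
    return -float(state_cost + action_cost)
```

Raising a component's weight makes the learned policy care more about that error, so the relative weights encode the trade-off between holding position and avoiding abrupt control inputs.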

The policies are smoothly parameterized and the dynamics are continuous.

Since policies are smoothly parameterized in the weights, and the dynamics are themselves continuous in the actions, the estimates of utilities are also continuous in the weights.

To find the policy's weights, hillclimbing is applied to maximize $[\hat{U}(\pi)]$. Two variants are used: a gradient ascent algorithm and a random-walk algorithm.

We may thus apply standard hillclimbing algorithms to maximize $[\hat{U}(\pi)]$ in terms of the policy’s weights.
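A random-walk hillclimber of the kind mentioned here can be sketched in a few lines: propose a random perturbation of the weights and keep it only if the estimated utility increases. The step size, iteration count, Gaussian proposals, and the function name are assumptions for this example.

```python
import numpy as np

def random_walk_hillclimb(utility, w0, step=0.1, iters=200, seed=0):
    """Maximize utility(w) by accepting only improving random perturbations.

    utility: deterministic function of the weight vector (hypothetical
             stand-in for the estimated utility of a policy).
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    best = utility(w)
    for _ in range(iters):
        candidate = w + step * rng.standard_normal(w.shape)
        value = utility(candidate)
        # Move only when the candidate is strictly better.
        if value > best:
            w, best = candidate, value
    return w, best
```

This accept-if-better rule relies on the utility estimate being deterministic; with a freshly resampled noisy estimate, a lucky draw could be mistaken for a real improvement.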

The most expensive step in policy search is the repeated Monte Carlo evaluation needed to obtain $[\hat{U}(\pi)]$. The paper parallelizes the implementation to speed this evaluation up:

- Monte Carlo evaluations using different samples are run on different computers;
- the results are then aggregated to obtain $[\hat{U}(\pi)]$.

The most expensive step in policy search was the repeated Monte Carlo evaluation to obtain $[\hat{U}(\pi)]$. To speed this up, we parallelized our implementation, and Monte Carlo evaluations using different samples were run on different computers, and the results were then aggregated to obtain $[\hat{U}(\pi)]$.
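The split-then-aggregate pattern can be illustrated on a single machine. In the sketch below a thread pool stands in for the separate computers, and the toy 1-D system, the seeds, and the function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def evaluate_chunk(policy_gain, seeds, horizon=50):
    """Average return of a toy 1-D system over the scenarios in one chunk."""
    total = 0.0
    for seed in seeds:
        rng = np.random.default_rng(int(seed))  # one fixed scenario per seed
        s = 1.0
        for _ in range(horizon):
            a = -policy_gain * s
            total += -(s ** 2 + a ** 2)
            s = s + a + 0.1 * rng.standard_normal()
    return total / len(seeds)

def parallel_utility(policy_gain, seeds, n_workers=4):
    """Evaluate chunks of scenarios concurrently, then aggregate the results."""
    chunks = [c for c in np.array_split(np.asarray(list(seeds)), n_workers)
              if len(c) > 0]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda c: evaluate_chunk(policy_gain, c), chunks))
    # Weight each chunk by its size so the result equals the overall mean.
    return float(np.average(partial, weights=[len(c) for c in chunks]))
```

Since each scenario depends only on its own seed, the aggregated value matches a sequential evaluation over the same seeds, up to floating-point rounding.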

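How the controller is evaluated: the paper compares the learned policy against a human pilot to demonstrate the stability of the learned controller.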

We also compare the performance of our learned policy against that of our human pilot trained and licensed by Yamaha to fly the R-50 helicopter.

# Flying competition maneuvers

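The authors took on maneuvers from the annual RC helicopter competition organized by the AMA (the Academy of Model Aeronautics), attempting three highly demanding maneuvers, such as the pirouette (turning in place) shown in the figure below.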

We took the first three maneuvers from the most challenging, Class III, segment of their competition.

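So far the policy has only learned to hover. How can it fly along a trajectory? One approach the authors describe is to move the set point slowly along a sequence of points on the trajectory, with a control policy that is essentially the same as for hovering.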

How does one design a controller for flying trajectories? Given a controller for keeping a system’s state at a point, one standard way to make the system move through a particular trajectory is to slowly vary the target point (x*, y*, z*, w*) along a sequence of set points on that trajectory.

Another approach is to retrain the parameters so that the trajectory is followed accurately.

Retrain the policy’s parameters for accurate trajectory following.

trajectory following vs trajectory tracking

Path/trajectory following is all about following a predefined path without time as a constraint. Thus, if you are on the path and following it, at whatever speed, you have reached your goal. In contrast, trajectory tracking involves time as a constraint, meaning that you have to be at a certain point at a certain time.

The network model is refined based on the helicopter's underlying mechanics.

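Since we are now flying trajectories and not only hovering, we also augmented the policy class to take into account more of the coupling between the helicopter’s different sub-dynamics.

For example, when turning, the tail rotor control introduces drift, and climbing or descending affects rotation. This is covered in the introductory video mentioned at the start.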

Specifying the reward function for trajectory following.

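Staying close to the trajectory (penalize deviation): the position error is penalized, computed as the deviation between the helicopter's current position and its projection onto the idealized trajectory.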

“tracked” position: the “projection” of the helicopter’s position onto the path of the idealized, desired trajectory.

The learning algorithm pays a penalty that is quadratic between the actual position and the “tracked” position on the idealized trajectory.

Making progress: a potential function that increases along the trajectory is used, so the helicopter receives positive reward whenever it moves forward.

Since we are already tracking where along the desired trajectory the helicopter is, we chose a potential function that increases along the trajectory. Thus, whenever the helicopter makes forward progress along this trajectory, it receives positive reward.
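The two ingredients above, a quadratic penalty on the distance to the projected ("tracked") position and a reward for progress along the path, can be combined as in the sketch below. It simplifies the setup by projecting onto the whole polyline at every step and by using arc length as the potential; the waypoint representation, the weights, and the function names are assumptions for this example.

```python
import numpy as np

def project_onto_path(p, waypoints):
    """Return (closest point on the polyline, arc length up to that point).

    Assumes consecutive waypoints are distinct (no zero-length segments).
    """
    p = np.asarray(p, dtype=float)
    wp = np.asarray(waypoints, dtype=float)
    best_dist, best_point, best_arc = np.inf, wp[0], 0.0
    arc = 0.0
    for a, b in zip(wp[:-1], wp[1:]):
        seg = b - a
        seg_len = np.linalg.norm(seg)
        # Fraction along the segment of the closest point, clamped to [0, 1].
        t = np.clip(np.dot(p - a, seg) / seg_len ** 2, 0.0, 1.0)
        q = a + t * seg
        dist = np.linalg.norm(p - q)
        if dist < best_dist:
            best_dist, best_point, best_arc = dist, q, arc + t * seg_len
        arc += seg_len
    return best_point, float(best_arc)

def trajectory_reward(p_prev, p_next, waypoints,
                      dev_weight=1.0, progress_weight=1.0):
    """Progress along the path minus a quadratic deviation penalty."""
    _, arc_prev = project_onto_path(p_prev, waypoints)
    q_next, arc_next = project_onto_path(p_next, waypoints)
    deviation = np.sum((np.asarray(p_next, dtype=float) - q_next) ** 2)
    # Arc length plays the role of the potential: moving forward increases it.
    return float(progress_weight * (arc_next - arc_prev) - dev_weight * deviation)
```

Note that this reward depends only on where the helicopter is relative to the path, not on when it gets there, which matches the "following" rather than "tracking" formulation.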

Decoupling the definition (decouple our definition)

Finally, our modifications have decoupled our definition of the reward function from (x*, y*, z*, w*) and the evolution of (x*, y*, z*, w*) in time.

- (x*, y*, z*, w*) represents a desired hovering position and orientation.

We considered several alternatives, but the main one used ended up being a modification for flying trajectories that have both a vertical and a horizontal component (such as along the two upper edges of the triangle in III.1).

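Model-based control needs repeated manual adjustment based on real flight tests; it is not a one-off effort. For example, a delay is manually added in the z direction to avoid "bowed-out" or "bowed-in" trajectories.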

The z (vertical)-response of the helicopter is very fast. In contrast, the x and y responses are much slower.