Below, we first provide details on our biomechanical model. After discussing our general reinforcement learning approach, we focus on the individual components of our method, namely states, actions, scaling factors, rewards, and an adaptive target-selection mechanism. We also provide details on the implementation of our algorithm. Finally, we discuss the methods used for evaluation.
Biomechanical model of the human upper extremity
Our biomechanical model of the human upper extremity is based on the Upper Extremity Dynamic model3, which was originally implemented in OpenSim28. Kinematically, the model represents the human shoulder and arm using seven physical bodies and five “phantom” bodies that model the complex movement of the shoulder. This corresponds to three joints (shoulder, elbow, and wrist) with seven DOFs, plus five additional joints whose thirteen associated components are coupled to these DOFs by thirteen constraints. Each DOF has a constrained joint range (see Table 1), which limits the possible movements. In contrast to linked-segment models, the Upper Extremity Dynamic model represents both translational and rotational components of the movement within the shoulder, clavicle, and scapula, as well as within the wrist. It also uses physiological joint axis orientations instead of the perpendicular orientations of linked-segment models. The dynamics of the musculoskeletal model are represented by the mass and inertia matrix of each non-phantom body and the default negligible masses and inertias of all phantom bodies. The dynamic properties of the model were extracted from various previously published human and cadaveric studies. The active components of the Upper Extremity Dynamic Model consist of thirty-one Hill-type muscles as well as fourteen coordinate limit forces, which model the forces softly generated by the ligaments when a DOF approaches its joint range limit. Further details of this model are given in Saul et al.3
In order to make reinforcement learning feasible, we manually implement the Upper Extremity Dynamic Model in the fast MuJoCo physics simulation7. With respect to kinematics, the MuJoCo implementation is equivalent to the original OpenSim model and contains physiologically accurate degrees of freedom as well as the corresponding constraints. We assume the same physiological masses and inertial properties of the individual segments as in the OpenSim model. We do not implement muscles in the MuJoCo model, as this would significantly slow down the simulation and make reinforcement learning computationally infeasible due to the exponential growth of decision variables in the (discretized) action space when increasing the number of DOFs (the curse of dimensionality). In particular, computing dynamic actuator lengths (which significantly affect the forces produced by muscle activation patterns) has proven challenging in MuJoCo70. Instead, we implement simplified actuators representing the aggregated muscle actions at each individual DOF, which are controlled using the second-order dynamics introduced by van der Helm et al.71 with fixed excitation and activation time constants \(t_{e}=30\) ms and \(t_{a}=40\) ms, respectively. We discretize this continuous-time state-space system using the forward Euler method, which yields the following dynamics:
$$\begin{aligned} \begin{bmatrix} \sigma _{n+1}^{(q)} \\ \dot{\sigma }_{n+1}^{(q)} \end{bmatrix} = \begin{bmatrix} 1 & \Delta t \\ -\frac{\Delta t}{t_e t_a} & 1 - \Delta t \, \frac{t_e + t_a}{t_e t_a} \end{bmatrix} \begin{bmatrix} \sigma _{n}^{(q)} \\ \dot{\sigma }_{n}^{(q)} \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{\Delta t}{t_e t_a} \end{bmatrix} c_{n}^{(q)}, \end{aligned}$$
(4)
where \(c_{n}^{(q)}\) is the applied control and \(\sigma _{n}^{(q)}\) the resulting activation for each DOF \(q\in \mathcal {Q}\), with \(\mathcal {Q}\) denoting the set of all DOFs. The controls are updated every \(\Delta t = 10\) ms, at time steps \(n\in \{0, \dots , N-1\}\). To obtain more accurate results, we compute five sub-steps with a sampling time of 2 ms at each time step n (during which the control \(c_{n}^{(q)}\) is held constant) to arrive at time step \(n+1\).
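The following minimal sketch illustrates how this Euler-discretized activation dynamics can be stepped forward in code; the variable and function names are our own illustrative choices and are not taken from the authors' implementation.

```python
import numpy as np

# Minimal sketch of the Euler-discretized activation dynamics of Eq. (4).
T_E, T_A = 0.030, 0.040   # excitation / activation time constants [s]
DT = 0.002                # sub-step sampling time [s] (five sub-steps per 10 ms control step)

# Discrete-time state-transition matrix and input vector from Eq. (4).
A = np.array([[1.0, DT],
              [-DT / (T_E * T_A), 1.0 - DT * (T_E + T_A) / (T_E * T_A)]])
B = np.array([0.0, DT / (T_E * T_A)])

def activation_substep(sigma, sigma_dot, control):
    """Advance activations and their derivatives by one 2 ms sub-step.

    All arguments are arrays of shape (7,), one entry per DOF; the control
    is held constant over the five sub-steps of a 10 ms control step."""
    state = np.stack([sigma, sigma_dot])          # shape (2, 7)
    state = A @ state + np.outer(B, control)      # apply Eq. (4) per DOF
    return state[0], state[1]
```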
We assume both signal-dependent and constant noise in the control, i.e.,
$$\begin{aligned} c_{n}^{(q)} = (1 + \eta _{n}) a_{n}^{(q)} + \epsilon _{n}, \end{aligned}$$
(5)
where \(a_{n}=(a_{n}^{(q)})_{q\in \mathcal {Q}}\) denotes the action vector obtained from the learned policy, and \(\eta _{n}\) and \(\epsilon _{n}\) are Gaussian random variables with zero mean and standard deviations of 0.103 and 0.185, respectively, as described by van Beers et al.4 The torques, which are applied at each DOF independently, are obtained by multiplying the respective activation \(\sigma _{n}^{(q)}\) with a constant scaling factor \(g^{(q)}\), which represents the strength of the muscle groups at this DOF, i.e.,
$$\begin{aligned} \tau _{n}^{(q)} = g^{(q)} \sigma _{n}^{(q)}. \end{aligned}$$
(6)
We select the scaling factors, i.e., the maximum voluntary torques of the actuators given in Table 1, based on experimental data as described below. We currently do not model the soft joint range limits in MuJoCo, as the movements the model produces do not usually reach the joint limits.
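The control pipeline of Eqs. (5) and (6) can be sketched as follows; the scaling factors below are placeholders rather than the values from Table 1, and the function names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Eqs. (5) and (6): motor noise applied to the policy action
# and scaling of the resulting activation to a joint torque.
RNG = np.random.default_rng()
SIGNAL_NOISE_STD = 0.103     # std of the signal-dependent noise eta_n
CONSTANT_NOISE_STD = 0.185   # std of the constant noise epsilon_n
G = np.ones(7)               # per-DOF maximum voluntary torques g^(q) [Nm] (placeholder)

def noisy_control(action):
    """Perturb the 7-dimensional action with signal-dependent and constant noise (Eq. 5)."""
    eta = RNG.normal(0.0, SIGNAL_NOISE_STD)
    eps = RNG.normal(0.0, CONSTANT_NOISE_STD)
    return (1.0 + eta) * action + eps

def joint_torques(activation):
    """Scale per-DOF activations to torques (Eq. 6)."""
    return G * activation
```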
The biomechanical model used provides the following advantages over simple linked-segment models:
- Phantom bodies and joints allow for more realistic movements, including both translational and rotational components within an individual joint,
- Individual joint angle and torque limits are set for every DOF,
- Joint axes have physiological orientations instead of simply being perpendicular to the adjacent segments,
- The model includes physiological body segment masses and offers better options for scaling individual body parts, e.g., to particular individuals.
Reinforcement learning
We define the task of controlling the biomechanical model of the human upper extremity through motor control signals applied at the joints as a reinforcement learning problem, similar to recent work by Cheema et al.34 In this formulation, a policy \(\pi _\theta (a|s)\) models the conditional distribution over actions \(a \in \mathcal {A}\) (motor control signals applied at the individual DOFs) given the state \(s \in \mathcal {S}\) (the pose, velocities, distance to target, etc.). The subscript \(\theta\) denotes the parameters of the neural networks introduced below. At each time step \(n\in \{0, \dots , N-1\}\), we observe the current state \(s_n\) and sample a new action \(a_n\) from the current policy \(\pi _\theta\). The physical effects of that action, i.e., the application of these motor control signals, determine the new state \(s_{n+1}\), which we obtain from our biomechanical simulation. In our model, given \(s_{n}\) and \(a_{n}\), the next state \(s_{n+1}\) is not deterministic, since both signal-dependent and constant noise are included. Hence, we denote the probability of reaching some subsequent state \(s_{n+1}\) given \(s_{n}\) and \(a_{n}\) by \(p(s_{n+1}|s_{n}, a_{n})\), while \(p(s_0)\) denotes the probability of starting in \(s_0\). Given some policy \(\pi _{\theta }\) and a trajectory \(T=(s_0, a_0, \dots , a_{N-1}, s_N)\),
$$\begin{aligned} p_\theta (T) = p(s_0)\prod _{n=0}^{N-1}\pi _\theta (a_n|s_n)p(s_{n+1}|s_{n}, a_{n}) \end{aligned}$$
(7)
describes the probability of realizing that trajectory. Evaluating Eq. (7) for all possible trajectories \(T\in \mathcal {T}\) then yields the distribution over all possible trajectories, \(\varrho _\theta ^{\mathcal {T}}\), induced by the policy \(\pi _{\theta }\).
We compute a reward \(r_n\) at each time step n, which allows us to penalize the total time needed to reach a given target. The total return of a trajectory is given by the sum of the (discounted) rewards \(\sum _{n=0}^{N}\gamma ^n r_n\), where \({\gamma \in ]0,1]}\) is a discount factor that causes the learner to prefer earlier rewards to later ones. Incorporating the entropy term,
$$\begin{aligned} \mathcal {H}(\pi _\theta (\cdot \mid s))=\mathbb {E}_{a\sim \pi _\theta (\cdot \mid s)}[-\log (\pi _\theta (a\mid s))], \end{aligned}$$
(8)
yields the expected (soft) return
$$\begin{aligned} J(\theta ) = \mathbb {E}_{T\sim \varrho _\theta ^{\mathcal {T}}} \left[ \left( \sum _{n=0}^{N-1} \gamma ^{n} \left( r_n - \alpha \log (\pi _{\theta }(a_{n} \mid s_{n}))\right) \right) + \gamma ^{N}r_{N}\right] , \end{aligned}$$
(9)
which we want to maximize with respect to the parameters \(\theta\), i.e., the goal is to identify the optimal parameters \(\theta ^*\) that maximize \(J(\theta )\). Here, the temperature parameter \(\alpha > 0\) determines the importance of assigning the same probability to all actions that yield the same return (enforced by maximizing the entropy \(\mathcal {H}\)), i.e., of increasing the “stochasticity” of the policy \(\pi _{\theta }\), relative to maximizing the expected total return. It thus significantly affects the optimal policy, and finding an “appropriate” value is non-trivial and heavily depends on the magnitude of the rewards \(r_{n}\). For this reason, we adjust it automatically during training together with the parameters \(\theta\), using dual gradient descent as implemented in the soft actor-critic algorithm (see below)6.
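For concreteness, the soft return of a single sampled trajectory in Eq. (9) could be computed as in the following sketch; the temperature value used here is an arbitrary example, since in practice \(\alpha\) is tuned automatically as described above.

```python
# Minimal sketch of the soft return in Eq. (9) for one sampled trajectory.
# `rewards` and `log_probs` hold r_n and log pi_theta(a_n | s_n) for n = 0..N-1,
# `final_reward` is r_N; alpha = 0.2 is only an illustrative placeholder.
def soft_return(rewards, log_probs, final_reward, gamma=0.99, alpha=0.2):
    ret = sum(gamma ** n * (r - alpha * lp)
              for n, (r, lp) in enumerate(zip(rewards, log_probs)))
    return ret + gamma ** len(rewards) * final_reward
```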
It is important to note that the soft return in Eq. (9) differs from the objective function used in standard reinforcement learning. The MaxEnt RL formulation, which incorporates an additional entropy maximization term, provides several technical advantages. These include natural state-space exploration72,73, a smoother optimization landscape that eases convergence towards the global optimum74,75,76, and increased robustness to changes in the reward function77,78. In practice, many RL algorithms have gained increased stability from the additional entropy maximization79,80,81. Conceptually, MaxEnt RL can be considered equivalent to probabilistic matching, which has been used to explain human decision making82,83. Existing evidence indicates that human adults tend to apply probabilistic matching rather than pure maximization strategies82,84,85. However, these observations still lack a conclusive neuroscientific explanation80.
In order to approximate the optimal parameters \(\theta ^*\), we use a policy-gradient approach, which iteratively refines the parameters \(\theta\) in the direction of increasing rewards. Reinforcement learning methods that are based on fully sampled trajectories usually suffer from updates with high variance. To reduce this variance and thus accelerate the learning process, we choose an approach that includes two approximators: an actor network and a critic network. These work as follows. Given some state \(s\) as input, the actor network outputs the (standardized) mean and standard deviation of as many normal distributions as there are dimensions of the action space. The individual action components are then sampled from these distributions. To update the actor network weights, we must measure the “desirability” of some action a, given some state s, i.e., how much reward can be expected when starting in this state with this action and subsequently following the current policy. These values are approximated by the critic network.
The architecture of both networks is depicted in Fig. 6. For the sake of simpler notation, the parameter vector \(\theta\) contains the weights of both networks; however, these weights are not shared between the two. The two networks are coupled via the soft actor-critic (SAC) algorithm6, which has been used successfully in physics-based character motion86: As a policy-gradient method, it can easily be used with a continuous action space such as continuous motor signals – something that is not directly possible with value-function methods like DQN5. As an off-policy method that makes use of a replay buffer, it is quite sample-efficient. This is important, since running forward physics simulations in MuJoCo constitutes the major part of the training duration. Moreover, it has been shown that SAC outperforms other state-of-the-art algorithms such as PPO87 or TD388. Supporting the observations in Haarnoja et al.6, we also found our training process to be faster and more robust when using SAC rather than PPO. Moreover, SAC incorporates an automatic adaptation of the temperature \(\alpha\) using dual gradient descent, which eliminates the need for manual, task-dependent fine-tuning. In order to obtain an unbiased estimate of the optimal value function, we use Double Q-learning89 with a separate target critic network. The neural network parameters are optimized with the Adam optimizer90.
States, actions, and scaling factors
Using the MuJoCo implementation of the biomechanical model described above, the states \(s\in \mathcal {S}\subseteq \mathbb {R}^{48}\) in our RL approach include the following information:
- Joint angle for each DOF \(q \in \mathcal {Q}\) in radians (7 values),
- Joint velocity for each DOF \(q \in \mathcal {Q}\) in radians/s (7 values),
- Activations \(\sigma ^{(q)}\) and excitations \(\dot{\sigma }^{(q)}\) for each DOF \(q \in \mathcal {Q}\) (\(2\times 7\) values),
- Positions of the end-effector and target sphere (\(2\times 3\) values),
- Positional velocities of the end-effector and target sphere (\(2\times 3\) values),
- Positional acceleration of the end-effector (3 values),
- Difference vector between the end-effector attached to the index finger and the target, pointing towards the target (3 values),
- Projection of the end-effector velocity onto the direction towards the target (1 value),
- Radius of the target sphere (1 value).
We found that in our case, the target velocity (which always equals zero for the considered tasks), the end-effector acceleration, the difference vector, and the projection of the end-effector velocity can be omitted from the state space without reducing the quality of the resulting policy. However, we decided to keep these observations, as they did not considerably slow down training and might be beneficial for more complex tasks such as target tracking or via-point tasks.
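A minimal sketch of how the 48-dimensional state vector could be assembled from the quantities listed above is given below; the argument names are illustrative assumptions, not identifiers from the authors' code.

```python
import numpy as np

# Minimal sketch of assembling the 48-dimensional state vector.
def build_state(qpos, qvel, act, exc, ee_pos, ee_vel, ee_acc,
                target_pos, target_vel, target_radius):
    """qpos, qvel, act, exc have shape (7,); all positions, velocities, and
    the acceleration have shape (3,); target_radius is a scalar."""
    diff = target_pos - ee_pos                          # points towards the target
    dist = np.linalg.norm(diff)
    vel_towards_target = ee_vel @ diff / dist if dist > 0 else 0.0
    state = np.concatenate([
        qpos, qvel, act, exc,                           # 4 x 7 = 28 values
        ee_pos, target_pos,                             # 2 x 3 = 6 values
        ee_vel, target_vel,                             # 2 x 3 = 6 values
        ee_acc,                                         # 3 values
        diff,                                           # 3 values
        [vel_towards_target, target_radius],            # 2 values
    ])
    assert state.shape == (48,)
    return state
```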
Each component \(a^{(q)}\in \left[ -1,1\right]\) of the action vector \(a=(a^{(q)})_{q\in \mathcal {Q}}\in \mathcal {A}\) \(\subseteq \mathbb {R}^{7}\) is used to actuate the corresponding DOF \(q\in \mathcal {Q}\) by applying the torque \(\tau ^{(q)}\) resulting from Eqs. (4)–(6). Note that in addition to these actuated torques, further active forces (e.g., torques applied to parent joints) and passive forces (e.g., gravitational and contact forces) act on the joints at each time step.
We determined the maximum torque a human exerts at each DOF in this task experimentally, as follows. We implemented the Fitts’ Law task described above in a VR environment displayed via the HTC Vive Pro VR headset. We recorded the movements of a single participant performing the task, using the Phasespace X2E motion capture system with a full-body suit equipped with 14 optical markers. This study was granted ethical approval by the ethics committee of the University of Bayreuth and followed the ethical standards of the Helsinki Declaration. Written informed consent was obtained from the participant, who received monetary compensation for participating in the study. Using OpenSim, we scaled the Upper Extremity Dynamic Model to this particular person. We then used OpenSim to perform inverse dynamics to obtain the torque sequences that are most likely to produce the recorded marker trajectories. For each DOF \(q \in \mathcal {Q}\), we set the corresponding scaling factor \(g^{(q)}\) to the absolute maximum torque applied at this DOF during the experiment, omitting a small number of outliers from the set of torques, i.e., values deviating from the mean by more than 20 standard deviations. The resulting values are shown in Table 1.
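The outlier-robust extraction of a scaling factor from a torque trace could look as follows; the helper name and the synthetic example trace are our own illustrative assumptions, since the real traces come from the OpenSim inverse dynamics described above.

```python
import numpy as np

# Minimal sketch of deriving a scaling factor g^(q) from an inverse-dynamics
# torque trace, excluding extreme outliers as described in the text.
def scaling_factor(torque_trace, outlier_threshold=20.0):
    """Maximum absolute torque after removing values further than
    `outlier_threshold` standard deviations from the mean."""
    torque_trace = np.asarray(torque_trace, dtype=float)
    mean, std = torque_trace.mean(), torque_trace.std()
    inliers = torque_trace[np.abs(torque_trace - mean) <= outlier_threshold * std]
    return np.abs(inliers).max()

# Example with a synthetic torque trace (placeholder for an OpenSim result):
g_example = scaling_factor(np.random.normal(0.0, 10.0, size=10_000))
```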
Reward function and curriculum learning
The behavior of the policy is determined largely by the reward \(r_n\) that appears in Eq. (9). We designed the reward following Harris and Wolpert1, who argue that there is no rational explanation as to why the central nervous system (CNS) should explicitly try to minimize previously proposed metrics such as the change in torque applied at the joints12, or the acceleration (or jerk) of the end-effector8. They argue that it is not even clear whether the CNS is able to compute, store, and integrate these quantities while executing motions.
Instead, they argue that the CNS aims to minimize movement end-point variance given a fixed movement time, under the constraint of signal-dependent noise. Following Harris and Wolpert1, this is equivalent to minimizing movement time when the permissible end-point variance is given by the size of the target. This objective is simple and intuitively plausible, since achieving accurate aimed movements in minimal time is critical for the success of many movement tasks. Moreover, it has already been applied to linear dynamics2.
Therefore, the objective of our model is to minimize movement time while reaching a target of given width.
More precisely, our reward function consists only of a time reward, which penalizes every time step of an episode equally:
$$\begin{aligned} r_n = – 100 \Delta t. \end{aligned}$$
(10)
This term provides incentives to terminate the episode (which can only be achieved by reaching the target) as early as possible. Since we apply each control \(a_{n}\) for 10 ms, \(\Delta t\) amounts to 0.01 in our case, i.e., \(r_{n}=-1\) in each time step \(n\in \{0,\dots ,N\}\).
In our experience, it is possible to learn aimed movements despite the lack of gradient provided by the reward function, provided the following requirements are met: the initial posture needs to be sampled randomly, and the targets need to be large enough at the beginning of training to ensure that they are reached by exploration sufficiently often in early training steps to guide the reinforcement learner. However, creating a predetermined curriculum that gradually decreases the target width during training at an appropriate rate proved very difficult. In most cases, the task difficulty either increased too quickly, leading to unnatural movements that did not reach the target directly (and often not at all), or too slowly, resulting in a time-consuming training phase.
For this reason, we decided to use an adaptive curriculum, which adjusts the target width dynamically depending on the recent success rate. Specifically, we define a curriculum state, which is initialized with an initial target diameter of 60 cm. Every 10K update steps, the current policy is evaluated on 30 complete episodes, for which the target diameters are chosen depending on the current state of the curriculum. Based on the percentage of targets reached within the permitted 1.5 s (the success rate), the curriculum state is updated: if the success rate falls below \(70\%\), the target diameter is increased by 1 cm; if it exceeds \(90\%\), the diameter is decreased by 1 cm. To avoid target sizes that are larger than the initial width or too close to zero, we clip the resulting value to the interval \(\left[ 0.1~\text {cm},~60~\text {cm}\right]\).
At the beginning of each episode, the target diameter is set to the current curriculum state with probability \(1 - \varepsilon\), and sampled uniformly at random between 0.1 cm and 60 cm with probability \(\varepsilon =0.1\), which has proven to be a reasonable choice. This ensures in particular that all required target sizes occur throughout the training phase, and thus prevents forgetting how to solve “simpler” tasks (in the literature often referred to as catastrophic forgetting; see, e.g., McCloskey et al.91).
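The adaptive curriculum can be summarized by the following sketch; the class and method names are illustrative and not taken from the authors' code.

```python
import random

# Minimal sketch of the adaptive curriculum described above.
class AdaptiveCurriculum:
    MIN_DIAMETER, MAX_DIAMETER = 0.001, 0.60   # 0.1 cm to 60 cm, in metres
    EPSILON = 0.1                              # probability of a uniformly random target size

    def __init__(self):
        self.diameter = self.MAX_DIAMETER      # start with the largest target

    def update(self, success_rate):
        """Adjust the target diameter after each evaluation (every 10K update steps)."""
        if success_rate < 0.7:
            self.diameter += 0.01              # task too hard: enlarge the target
        elif success_rate > 0.9:
            self.diameter -= 0.01              # task too easy: shrink the target
        self.diameter = min(max(self.diameter, self.MIN_DIAMETER), self.MAX_DIAMETER)

    def sample_episode_diameter(self):
        """Target diameter for a new training episode."""
        if random.random() < self.EPSILON:
            return random.uniform(self.MIN_DIAMETER, self.MAX_DIAMETER)
        return self.diameter
```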
Implementation of the reinforcement learning algorithm
The actor and critic networks described in the Reinforcement Learning section consist of two fully connected layers with 256 neurons each, followed by the output layer, which either returns the means and standard deviations of the action distributions (actor network) or the state-action value (critic network). To improve the speed and stability of learning, we train two separate but identically structured critic networks and use the minimum of both outputs as the teaching signal for all networks (Double Q-Learning)6,89. In all networks, ReLU92 is used as the non-linearity for both hidden layers. The network architectures are depicted in Fig. 6.
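For illustration, the architectures could be expressed with plain Keras layers as in the sketch below; the actual training uses the TF-Agents network classes, so this is only a structural sketch under that assumption.

```python
import tensorflow as tf

# Minimal sketch of the actor and critic architectures (2 x 256 ReLU units).
STATE_DIM, ACTION_DIM = 48, 7

def make_actor():
    """Maps a state to means and log-standard deviations of 7 normal distributions."""
    state = tf.keras.Input(shape=(STATE_DIM,))
    x = tf.keras.layers.Dense(256, activation="relu")(state)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    mean = tf.keras.layers.Dense(ACTION_DIM)(x)
    log_std = tf.keras.layers.Dense(ACTION_DIM)(x)
    return tf.keras.Model(state, [mean, log_std])

def make_critic():
    """Maps a (state, action) pair to a scalar state-action value."""
    state = tf.keras.Input(shape=(STATE_DIM,))
    action = tf.keras.Input(shape=(ACTION_DIM,))
    x = tf.keras.layers.Concatenate()([state, action])
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    q_value = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model([state, action], q_value)
```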
The reinforcement learning methods of our implementation are based on the TF-Agents library93. The learning phase consists of two parts, which are repeated alternately: trajectory sampling and policy updating.
In the trajectory sampling part, the target position is sampled from the uniform distribution on a cuboid of 70 cm height, 40 cm width, and 30 cm depth, whose center is placed 50 cm in front of the human body and 10 cm to the right of the shoulder. The width of the target is controlled by the adaptive curriculum described above. The biomechanical model is initialized with a random posture, for which the joint angles are uniformly sampled from the convex hull of static postures that keep the end-effector in one of 12 targets placed along the vertices of the cuboid described above. The initial joint velocities are uniformly sampled from the interval \(\left[ -0.005~\text {radians/s},~0.005~\text {radians/s}\right]\).
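Target-position sampling could be sketched as follows; the coordinate convention (x: right, y: forward, z: up) and the shoulder position are assumptions made purely for illustration.

```python
import numpy as np

# Minimal sketch of sampling a target position within the training cuboid.
RNG = np.random.default_rng()
SHOULDER = np.array([0.0, 0.0, 0.0])                      # placeholder shoulder position [m]
CUBOID_CENTER = SHOULDER + np.array([0.10, 0.50, 0.0])    # 10 cm right, 50 cm in front
CUBOID_HALF_EXTENTS = np.array([0.20, 0.15, 0.35])        # 40 cm wide, 30 cm deep, 70 cm high

def sample_target_position():
    """Uniformly sample a target centre within the cuboid."""
    return CUBOID_CENTER + RNG.uniform(-CUBOID_HALF_EXTENTS, CUBOID_HALF_EXTENTS)
```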
In each step \(n\in \{0, \dots , N-1\}\), given the current state vector \(s_{n}\in \mathcal {S}\) (see description above), an action is sampled from the current policy \(\pi _{\theta }(\cdot \mid s_{n})\). Next, the MuJoCo simulation uses this action to actuate the model joints. It also updates the body posture, and returns both the reward \(r_n\) and the subsequent state vector \(s_{n+1}\). In our implementation, each episode in the learning process contains at most \(N=150\) such steps, with each step corresponding to 10 ms (allowing movements to be longer than one and a half seconds did not improve the training procedure significantly). If the target is reached earlier, i.e., the distance between the end-effector and the target center is smaller than the radius of the target sphere, and the end-effector remains inside the target for 100 ms, the current episode terminates and the next episode begins with a new target position and width. At the beginning of the training, 10K steps are taken and the corresponding transitions stored in a replay buffer, which has a capacity of 1M steps. During training, only one step is taken and stored per sampling phase.
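The termination criterion for an episode can be sketched as follows; the function and variable names are our own illustrative choices.

```python
import numpy as np

# Minimal sketch of the episode-termination criterion: the end-effector has to
# stay inside the target sphere for 100 ms (10 consecutive 10 ms control steps),
# and episodes are capped at N = 150 steps (1.5 s).
DWELL_STEPS = 10      # 100 ms at 10 ms per control step
MAX_STEPS = 150       # episode length cap (1.5 s)

def check_termination(ee_pos, target_pos, target_radius, steps_inside, step):
    """Return (done, updated steps_inside) for the current control step."""
    inside = np.linalg.norm(ee_pos - target_pos) < target_radius
    steps_inside = steps_inside + 1 if inside else 0
    done = steps_inside >= DWELL_STEPS or step >= MAX_STEPS - 1
    return done, steps_inside
```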
In the policy updating part, 256 previously sampled transitions \((s_{n}, a_{n}, r_{n}, s_{n+1})\) are randomly chosen from the replay buffer to update both the actor network and the critic network weights. We use a discount factor of \(\gamma = 0.99\) in the critic loss function of SAC. All other parameters are set to the default values of the TF-Agents SAC implementation93.
Both parts of our learning algorithm, the trajectory sampling and the policy update, are executed alternately until the curriculum state, i.e., the current suggested target diameter, falls below 1 cm. With our implementation, this was the case after 1.2M steps, corresponding to about four hours of training time. To evaluate a policy \(\pi _\theta\), we apply the action \(a_{n}^{*}\) with the highest probability under this policy for each time step (i.e., we use the corresponding greedy policy) and evaluate the resulting trajectory. Such an evaluation is done every 10K steps, for which 30 complete episodes are generated using this deterministic policy, and the resulting performance indicators are stored. After the training phase, \(\theta ^{*}\) is set to the latest parameter set \(\theta\), i.e., the final policy \(\pi _{\theta ^{*}}\) is chosen as the latest policy \(\pi _{\theta }\).
An overview of the complete training procedure is given in Fig. 7.
Evaluation
For an evaluation of the trajectories resulting from the learned policy for different target conditions, we designed a discrete Fitts’ Law type task. This task follows the ISO 9241-9 ergonomics standard and incorporates 13 equidistant targets arranged in a circle 50 cm in front of the body and placed 10 cm right of the right shoulder (Fig. 2). As soon as a target is reached and the end-effector remains inside for 100 ms, the next target is given to the learned policy. This also happens after 1.5 s, regardless of whether or not the episode was successful.
Based on the recommendations from Guiard et al.94, we determine different task difficulty conditions by sampling “form and scale”, i.e., the Index of Difficulty (ID) and the distance D between the target centers are sampled independently, instead of using a distance-width grid. We use the Shannon Formulation45 of Fitts’ Law [Eq. (1)] to compute the resulting distance between the initial and target point D, given the target width W and the ID:
$$\begin{aligned} \text {ID} = \log _{2}\left( \frac{D}{W} + 1\right) . \end{aligned}$$
(11)
The used combinations of distance, width, and ID can be found as Supplementary Table S1 online, and the resulting target setup is shown in Fig. 2a.
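Rearranging Eq. (11) gives the distance directly from a sampled ID and width, as in the following sketch:

```python
# Minimal sketch of deriving the target distance D from a sampled index of
# difficulty and target width via the Shannon formulation, Eq. (11):
# ID = log2(D / W + 1)  <=>  D = W * (2**ID - 1).
def target_distance(index_of_difficulty, width):
    return width * (2.0 ** index_of_difficulty - 1.0)

# Example: ID = 4 with a width of 35 cm / 15 (about 2.33 cm) yields D = 35 cm.
```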
The model executed 50 movements for each task condition and each direction, i.e., 6500 movements in total. All movements reached the target and remained inside it for 100 ms within the given maximum movement time of 1.5 s. Plots for all task conditions and movement directions, together with their underlying data, can be found in a public repository95.
In addition, an adaptive “moving target” mechanism is applied to generate elliptic movements from our learned policy. During training, the policy only learned to reach a given target as quickly and accurately as possible; it was never asked to follow a specific path accurately. For this reason, we make use of the following method.
Initially, we place the first target on the ellipse such that \(10\%\) of the complete curve needs to be covered clockwise within the first movement, starting at a fixed initial position (the leftmost point on the ellipse). In contrast to regular pointing tasks, the target already switches as soon as the movement (or rather the projection of the movement path onto the ellipse) covers more than half of this distance. The next target is then chosen so as to again create an incentive to cover the next \(10\%\) of the elliptic curve. Thus, roughly 20 via-points in total are subsequently placed along the ellipse. As shown in Fig. 3a, this indeed leads to fairly elliptic movements.
For our evaluation, we use an ellipse with horizontal and vertical diameters of 15 cm and 6 cm (similar to the ellipse used by Harris and Wolpert1), with its center placed 55 cm in front, 10 cm above, and 10 cm to the right of the shoulder. The task was performed for one minute, with end-effector position, velocity, and acceleration stored every 10 ms.
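The moving-target mechanism can be sketched as follows; expressing progress along the ellipse by the parameter angle rather than exact arc length, as well as the clockwise direction and leftmost starting point, are modelling assumptions for this illustration.

```python
import numpy as np

# Minimal sketch of the "moving target" mechanism on the ellipse. Progress is
# expressed as a normalised position s in [0, 1) along the curve.
A_HORIZ, B_VERT = 0.075, 0.03      # semi-axes of the 15 cm x 6 cm ellipse [m]
STEP = 0.10                        # each target lies 10% of the curve ahead

def point_on_ellipse(s):
    """2D point at normalised position s, starting at the leftmost point."""
    phi = np.pi - 2.0 * np.pi * s  # minus sign: clockwise traversal
    return np.array([A_HORIZ * np.cos(phi), B_VERT * np.sin(phi)])

def update_target(current_s, target_s):
    """Switch the target once more than half of the current 10% segment is
    covered; the new target again lies 10% ahead of the current position."""
    if (target_s - current_s) % 1.0 < STEP / 2.0:
        return (current_s + STEP) % 1.0
    return target_s
```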
Comprehensive data for all of these movements can also be found in a public repository95.
Synthesized reaching movement. A policy implemented as a neural network computes motor control signals of simplified muscles at the joints of a biomechanical upper extremity model from observations of the current state of the upper body. We use Deep Reinforcement Learning to learn a policy that reaches random targets in minimal time, given signal-dependent and constant motor noise.
Fitts’ Law type task. (a) The target setup in the discrete Fitts’ Law type task follows the ISO 9241-9 ergonomics standard. Different circles correspond to different IDs and distances between targets. (b) Visualization of our biomechanical model performing aimed movements. Note that for each time step, only the current target (position and radius) is given to the learned policy. (c) The movements generated by our learned policy conform to Fitts’ Law. Here, movement time is plotted against ID for all distances and IDs in the considered ISO task (6500 movements in total).
Elliptic via-point task. Elliptic movements generated by our learned policy conform to the \(\frac{2}{3}\) Power Law. (a) End-effector positions projected onto the 2D space (blue dots), where targets were subsequently placed along an ellipse of 15 cm width and 6 cm height (red curve). (b) Log-log regression of velocity against radius of curvature for end-effector positions sampled with 100 Hz when tracing the ellipse for 60 s.
End-effector trajectories (ID 4). 3D path, projected position, velocity, acceleration, phase-space, and Hooke plots of 50 aimed movements (between targets 7 and 8 shown in Fig. 2a) with ID 4 and a target distance of 35 cm.
End-effector trajectories (ID 2). 3D path, projected position, velocity, acceleration, phase-space, and Hooke plots of 50 aimed movements (between targets 7 and 8 shown in Fig. 2a) with ID 2 and a target distance of 35 cm.
Neural network architectures. (a) The actor network takes a state s as input and returns the policy \(\pi _{\theta }\) in terms of the means and standard deviations of the seven normal distributions from which the components of the action vector are drawn. (b) The critic network takes both the state s and the action vector a as input and returns the estimated state-action value. Two critic networks are trained simultaneously to improve the speed and stability of learning (Double Q-Learning). Detailed information about the input state components is given in the Methods section.
Reinforcement learning procedure. Before training, the networks are initialized with random weights \(\theta\), and 10K transitions are generated using the resulting initial policy. These are stored in the replay buffer (blue dashed arrows). During training (red dotted box), trajectory sampling and policy update steps are executed alternately in each step. The targets used in the trajectory sampling part are generated by the curriculum learner, which is updated every 10K steps based on an evaluation of the most recent (greedy) policy. As soon as the target width suggested by the curriculum learner falls below 1 cm, the training phase is completed and the final policy \(\pi _{\theta ^*}\) is returned (teal dash-dotted arrow).