SRG-DQN: README

Variance Reduction for Deep Q-Learning using Stochastic Recursive Gradient

Dependencies

The following dependencies are required:

gym==0.17.1
matplotlib==3.0.3
numpy==1.18.3
tensorflow==1.15.0

These dependencies can be installed via pip, optionally inside a virtualenv:

pip install -r requirements.txt

Usage

Each of our three tasks lives in its own folder; for example, the MountainCar task corresponds to the SRG-DQN-mountaincar folder. In the MountainCar task, the main body of the DQN is integrated into the main script dqn_main.py; in the other two tasks, it is separated into RL_brain.py.

MountainCar

In the MountainCar task, you can run the model with the following command:

python3 dqn_main.py

In the 'main' section of dqn_main.py, you can choose to run the model in step or episode mode, and you can choose either SVRG (SVR-DQN) or SARAH (SRG-DQN) as the variance-reduction optimizer.
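For intuition, the SARAH-style recursive gradient estimator that SRG-DQN applies to the Q-network gradients can be sketched on a toy least-squares problem. Everything below (data, step size, iteration counts) is illustrative and not taken from the repository:

```python
import numpy as np

# Toy least-squares problem: minimize (1/n) * sum_i 0.5 * (x_i . w - y_i)^2
rng = np.random.default_rng(0)
n, d = 64, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad_i(w, i):
    # Gradient of the i-th term 0.5 * (x_i . w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    # Average gradient over all n samples
    return X.T @ (X @ w - y) / n

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

eta = 0.05
w_prev = np.zeros(d)
v = full_grad(w_prev)            # epoch start: full gradient
w = w_prev - eta * v
for _ in range(200):             # inner loop: recursive updates
    i = rng.integers(n)
    # SARAH recursion: correct the previous estimate with one fresh sample
    v = grad_i(w, i) - grad_i(w_prev, i) + v
    w_prev, w = w, w - eta * v

print(loss(np.zeros(d)), loss(w))  # loss should drop substantially
```

Unlike a plain stochastic gradient, the recursion reuses the previous estimate `v`, so its variance shrinks as the iterates stabilize.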

In addition, you can use these commands to run the fixed anchor point and anchor distance experiments:

python3 dqn_svrg_fixedData.py
python3 distance_anchor_point.py
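For context, the anchor point in these experiments is the snapshot at which SVRG recomputes a full gradient; the fixed-anchor variant keeps this snapshot unchanged. A minimal sketch of the SVRG estimator on a toy least-squares problem (data and hyperparameters are illustrative, not the repository's):

```python
import numpy as np

# Toy least-squares problem, same form as the DQN's regression loss in spirit
rng = np.random.default_rng(1)
n, d = 64, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return X.T @ (X @ w - y) / n

eta = 0.05
w = np.zeros(d)
for epoch in range(5):
    w_anchor = w.copy()        # anchor (snapshot) point
    mu = full_grad(w_anchor)   # full gradient, recomputed at the anchor
    for _ in range(50):
        i = rng.integers(n)
        # SVRG estimator: stochastic gradient corrected toward the anchor
        v = grad_i(w, i) - grad_i(w_anchor, i) + mu
        w -= eta * v

final_loss = 0.5 * np.mean((X @ w - y) ** 2)
```

The "anchor distance" quantity studied in the experiments corresponds to how far the current iterate `w` drifts from `w_anchor`, which governs the variance of the corrected gradient.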

Pendulum

In the Pendulum task, you can run the model with the following command:

python3 run_Pendulum.py

You can choose to use SVRG (SVR-DQN) or SARAH (SRG-DQN) as the variance optimizer in the optimizer property of the DQN object in run_Pendulum.py.

CartPole

In the CartPole task, you can run the model with the following command:

python3 run_CartPole.py

You can choose to use SVRG (SVR-DQN) or SARAH (SRG-DQN) as the variance optimizer in the optimizer property of the DQN object in run_CartPole.py.
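As a purely hypothetical illustration of that switch (the real class name and constructor arguments live in RL_brain.py and may differ), selecting the optimizer might look like:

```python
# Hypothetical sketch only: `DeepQNetwork` and its parameters are
# illustrative stand-ins, not the repository's actual API.
class DeepQNetwork:
    def __init__(self, n_actions, n_features, optimizer="SARAH"):
        if optimizer not in ("SVRG", "SARAH"):
            raise ValueError(
                "optimizer must be 'SVRG' (SVR-DQN) or 'SARAH' (SRG-DQN)")
        self.optimizer = optimizer

# SARAH selects the SRG-DQN variant; pass "SVRG" for SVR-DQN instead
dqn = DeepQNetwork(n_actions=2, n_features=4, optimizer="SARAH")
```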

Hyperparameters

You can find the details of the experimental settings in Sup-SRG-DQN-NIPS.pdf.

Results

We measured the average per-step reward in the MountainCar and Pendulum tasks and the average per-episode reward in the CartPole task. To ensure reliability, we repeated each experiment multiple times under the same experimental parameters. Because of the ε-greedy exploration strategy, some results may not be exactly reproducible. The following table lists some experimental results for SRG-DQN:

Task         Step/Episode length   Final Avg-Reward
MountainCar  100,000 steps          0.280
Pendulum     20,000 steps          -0.328
CartPole     800 episodes          61.707