Evaluation stage of Tensorforce PPO not performing as expected
I have built a custom Reinforcement Learning environment and use PPO to train my agent. The following is a snippet of the training loop.
```python
import time

from tensorforce import Agent, Environment

# networkEnvironment is my custom environment class
environment = Environment.create(environment=networkEnvironment, max_episode_timesteps=7)
# rlAgent = Agent.create(agent='ppo', environment=environment, batch_size=64, learning_rate=1e-4,
#                        update_frequency=64,
#                        saver=dict(directory='model-checkpoint', frequency=3, max_checkpoints=1000))
rlAgent = Agent.load(directory='model-checkpoint', format='checkpoint', environment=environment)

for _ in range(3000):
    states = environment.reset()
    terminal = False
    start_time = time.time()
    while not terminal:
        actions = rlAgent.act(states=states)
        states, terminal, reward = environment.execute(actions)
        rlAgent.observe(terminal=terminal, reward=reward)
```
This loop trains as expected and converges on a near-optimal policy. However, when I run an evaluation loop, the agent does not appear to use the learned policy and performs terribly, selecting the same single action at every timestep. The evaluation loop is below:
```python
for _ in range(10):
    states = environment.reset()
    internals = rlAgent.initial_internals()
    terminal = False
    while not terminal:
        actions, internals = rlAgent.act(
            states=states, internals=internals, independent=True, deterministic=True
        )
        states, terminal, reward = environment.execute(actions)
```
Some important context: the agent is stopping a multi-stage attack in the environment, so the optimal policy is usually a sequence of different actions rather than a single repeated one. The agent learns this during training (it finds and repeats a good sequence of actions). During evaluation, however, the agent only ever repeats a single action at each timestep until the episode completes, and the chosen action is usually poor and not part of the 'good' sequence found during training.
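To illustrate what I mean, here is a minimal, self-contained sketch (plain Python, not Tensorforce; the action probabilities are made up for illustration) of how greedy argmax selection over a fixed action distribution repeats one action every step, whereas sampling, as happens during training, produces a varied sequence:

```python
import random

# Hypothetical per-timestep action probabilities from a trained policy.
# With sampling (as during training) the agent can follow a varied sequence;
# with deterministic argmax selection the same action wins every step
# whenever one action's probability stays highest.
policy = [0.5, 0.3, 0.2]  # action 0 is always the argmax

random.seed(0)
sampled = [random.choices(range(3), weights=policy)[0] for _ in range(7)]
greedy = [max(range(3), key=lambda a: policy[a]) for _ in range(7)]

print("sampled:", sampled)  # varied actions across the episode
print("greedy: ", greedy)   # the same action repeated every timestep
```

This matches the behaviour I see: with `deterministic=True` the evaluation run degenerates into one repeated action, as if the distribution it is taking the argmax of is not the one learned during training.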
Is there something I have missed in setting up the evaluation?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
