RL + optimization: how to do it better?

I am learning about how to optimize using reinforcement learning. I have chosen the problem of maximum matching in a bipartite graph as I can easily compute the true optimum.

Recall that a matching in a graph is a subset of the edges where no two edges are incident on the same node/vertex. The goal is to find the largest such subset.
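The matching property is easy to verify directly; a minimal sketch (the edge pairs are made up for illustration):

```python
def is_matching(edges):
    """True if no two edges share an endpoint."""
    seen = set()
    for u, v in edges:
        if u in seen or v in seen:
            return False
        seen.update((u, v))
    return True

print(is_matching([(0, 3), (1, 4)]))  # → True
print(is_matching([(0, 3), (3, 4)]))  # → False: both edges touch node 3
```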

I show my full code below but first let me explain parts of it.

num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

This makes a random bipartite graph with 1000 nodes in each of the two sets of nodes. It then prints out the size of the true maximum matching.

In the code below, self.agent_pos is an array representing the current matching. Its length equals the number of edges in the graph, with a 1 at index i if edge i is in the matching and a 0 otherwise. self.matching is the set of edges in the growing matching, and self.matching_nodes is the set of their endpoint nodes, used to check whether a given edge can be added.
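To make the encoding concrete, here is a toy example (the edge list is hypothetical, standing in for g.es):

```python
# Hypothetical edge list standing in for g.es; edge i joins edges[i][0] and edges[i][1]
edges = [(0, 3), (1, 4), (0, 4), (2, 5)]

# agent_pos has one entry per edge: 1 = edge in the matching, 0 = not
agent_pos = [1, 0, 0, 1]

matching = {i for i, bit in enumerate(agent_pos) if bit == 1}  # edge ids
matching_nodes = {v for i in matching for v in edges[i]}       # their endpoints

print(matching)        # → {0, 3}
print(matching_nodes)  # → {0, 2, 3, 5}
```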

import igraph as ig
from tqdm import tqdm
import numpy as np
import gym
from gym import spaces

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

class MaxMatchEnv(gym.Env):
    metadata = {'render.modes': ['console']}
    def __init__(self, array_length=10):
        super(MaxMatchEnv, self).__init__()
        # Size of the 1D-grid
        self.array_length = array_length
        self.agent_pos = [0]*array_length
        self.action_space = spaces.Discrete(array_length)
        self.observation_space = spaces.Box(low=0, high=1, shape=(array_length,), dtype=np.uint8)
        self.matching = set()  # set of edges
        self.matching_nodes = set() # set of node ids (ints)
        self.matching_size = len([v for v in g_matching.matching if v < num_variables and v != -1])
        self.best_found = 0
        self.time = 0
        
    def reset(self):
        # Reset to the empty matching
        self.time = 0
        self.agent_pos = [0]*self.array_length
        self.matching = set()
        self.matching_nodes = set()
        return np.array(self.agent_pos)
    
        
    def step(self, action):
        self.time += 1 
        reward = 0
        edge = g.es[action]
        if edge.source not in self.matching_nodes and edge.target not in self.matching_nodes:
            self.matching.add(edge)
            self.matching_nodes.add(edge.source)
            self.matching_nodes.add(edge.target)
            self.agent_pos[action] = 1
            if sum(self.agent_pos) > self.best_found:
                self.best_found = sum(self.agent_pos)
                print("New max", self.best_found)
            reward = 1
        elif self.agent_pos[action] == 1:
            #print("Removing edge", action)
            self.matching_nodes.remove(edge.source)
            self.matching_nodes.remove(edge.target)
            self.matching.remove(edge)
            self.agent_pos[action] = 0
            reward = -1
        done = sum(self.agent_pos) == self.matching_size
        info = {}
        return np.array(self.agent_pos), reward, done, info

    def render(self, mode='console'):
        print(sum(self.agent_pos))

    def close(self):
        pass


if __name__ == '__main__':
 
    num_variables = 1000
    g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
    g_matching = g.maximum_bipartite_matching()
    print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

    env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=12)

    model = PPO('MlpPolicy', env, verbose=1).learn(10000000)

There are a number of problems with this, but the main one is that it doesn't optimize well. This code finds a matching of just over 550 and then stops improving, whereas the true optimum is over 900 (printed out by the code at the start).

The main question is:

How can this be done better so that it gets to a better matching?

A subsidiary question is: how can I print the best matching found so far? My attempt to track the best score with self.best_found does not work, as it seems to be reset regularly.

Changes that haven't helped

  • Changing PPO for DQN makes only a marginal difference.
  • I tried changing the code so that done is True after 1000 steps.

The change is as follows:

if self.time == 1000:
    done = True
else:
    done = False

After replacing print("New max", self.best_found) with print(max(env.get_attr("best_found"))), this change to done shows no advantage at all.



Solution 1:[1]

To print the max for each environment you can use the get_attr method from Stable Baselines. More info in the official docs.

For example the lines below will print the max for each of the 12 environments and then the maximum across all environments.

print(env.get_attr("best_found"))
print(max(env.get_attr("best_found")))

Regarding why it does not converge: it could be due to a poorly chosen reward, although your reward choice looks sensible. I added a debug print in your code to see whether any step led to done = True, but it seems the environment never reaches that state. I think that for the model to learn, it helps to have multiple action sequences leading to a state with done = True, so that the model gets to experience the end of an episode. I did not study the problem in your code in detail, but maybe this information helps you debug the issue.

Compare the problem with environments like CartPole: there, episodes end with done = True, which helps the model learn a better policy. In your case you could limit the number of actions per episode instead of running the same episode forever. That could help the model avoid getting stuck in a local optimum, as each new episode gives it the opportunity to "try again".
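gym ships a TimeLimit wrapper for exactly this; a minimal pure-Python sketch of the same idea (DummyEnv is just for illustration):

```python
class EpisodeLimit:
    """Minimal sketch of gym.wrappers.TimeLimit: force done after max_steps."""
    def __init__(self, env, max_steps=1000):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            done = True
        return obs, reward, done, info


class DummyEnv:
    """Toy environment that never sets done on its own."""
    def reset(self):
        return 0

    def step(self, action):
        return 0, 0.0, False, {}


env = EpisodeLimit(DummyEnv(), max_steps=3)
env.reset()
dones = [env.step(0)[2] for _ in range(3)]
print(dones)  # → [False, False, True]
```

With gym installed you could instead wrap the real environment directly, e.g. gym.wrappers.TimeLimit(MaxMatchEnv(...), max_episode_steps=1000).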

Solution 2:[2]

If I'm reading this right, you aren't performing any hyperparameter tuning. How do you know your learning rate, policy update rate, or any other variable parameters are a good fit for the problem at hand?

It looks like stable baselines has this functionality built-in: https://stable-baselines.readthedocs.io/en/master/guide/rl_zoo.html?highlight=tune#hyperparameter-optimization
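Outside the zoo, a quick random search over hyperparameters is easy to roll by hand. A sketch, where train_and_score is a hypothetical stand-in (in practice it would train PPO with the given kwargs and return the best matching size reached):

```python
import random

def train_and_score(learning_rate, ent_coef):
    # Hypothetical stand-in: replace with a real PPO training run that
    # returns the best matching size found with these hyperparameters.
    return -abs(learning_rate - 3e-4) - abs(ent_coef - 0.01)  # toy surface

random.seed(0)
trials = [
    {"learning_rate": 10 ** random.uniform(-5, -3),  # sample on a log scale
     "ent_coef": 10 ** random.uniform(-3, -1)}
    for _ in range(20)
]
best = max(trials, key=lambda p: train_and_score(**p))
print(best)
```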

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: (author not listed)
[2] Solution 2: Mandias