Modeling a reinforcement learning environment with Ray

I have been playing around with the idea of using reinforcement learning on a particular problem in which I am optimizing a raw-materials purchasing strategy for a commodity. I have created a simple gym environment to show a simplified version of what I'd like to accomplish. The goal is to take in multiple items (in this case 2) and optimize a purchasing strategy for each item so that the sum of days on hand across all items is minimized without running out of either item.

from gym import Env
from gym.spaces import Discrete, Box, Tuple
import numpy as np

#define our variable starting points


#array of the start quantity for 2 separate items
start_qty = np.array([10000, 200])
#create the number of simulation weeks
sim_weeks = 1
#set a starting safety stock level------IGNORE FOR NOW
#safety_stock = 4003

#create simple demand profile for each item
#demand = np.array([301, 1549, 3315, 0, 1549, 0, 0, 1549, 1549, 1549])
demand = np.array([1800, 45])

#create minimum order and max order quantities for each item
min_ord = np.array([26400, 250])
max_ord = np.array([100000, 100000])
prev_30_usage = np.array([1600, 28])


#in the numpy arrays, index 0 holds the first item's info
#and index 1 holds the second item's info
class ResinEnv(Env):
    def __init__(self):
        self.action_space = Tuple([Discrete(2), Discrete(2)])
        self.observation_space = Box(low= np.array([-10000000]), high = np.array([10000000]))
        #set the start qty
        self.state = np.array([10000, 200])
        #self.start = start_qty
        #set the purchase length
        self.purchase_length = sim_weeks
        self.min_order = min_ord
    def step(self, action):
        self.purchase_length -= 1
        #apply action 
        self.state[0] -= demand[0]
        self.state[1] -= demand[1]
        #see if we need to buy
        #each action component is 0 or 1- round the action to the nearest whole number
        action = np.around(action, decimals=0)
        
        
        #self.state +=action*self.min_order
        
        np.add(self.state, action * self.min_order, out=self.state, casting="unsafe")
        #self.state += (action*100) + 26400
        #calculate the days on hand from this
        days = self.state/prev_30_usage/7
        
        
        #item_reward1 = action[0]
        #item_reward2 = action[1]
        #calculate reward: right now reward is negative of days_on_hand
        
        #GOING TO NEED TO CHANGE THIS REWARD AT SOME POINT MOVING FORWARD AS IT
        #NEEDS TO TREAT HIGH VOLUME ITEMS AND LOW VOLUME ITEMS THE SAME- THIS IS BIASED AGAINST LOW VOLUME
        if self.state[0] < 0:
            item_reward1 = -10000
        else:
            item_reward1 = days[0]
        if self.state[1]< 0:
            item_reward2 = -10000
        else:
            item_reward2 = days[1]
        
        reward = item_reward1 + item_reward2
        #check if we are out of weeks
        if self.purchase_length<=0:
            done = True
        else:
            done = False
        #reduce the weeks left to purchase by 1 week
        #done = True   
        #set placeholder for info
        info = {}
            
        #return step information
        return self.state, reward, done, info
    def render(self):
        pass
    def reset(self):
        self.state = np.array([10000, 200])
        self.purchase_length = sim_weeks
        self.demand = demand
        self.action_space = Tuple([Discrete(2), Discrete(2)])
        self.min_order= min_ord
        return self.state #, self.purchase_length, self.demand, self.action_space, self.min_order

The environment seems to be functioning just fine as seen with this code:

episodes = 100
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        #env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{} Action:{}'.format(episode, score, action))

I have attempted to run this through various modeling approaches with no luck, and have discovered Ray but can't seem to get that to work either. Could someone walk me through the process of modeling this in Ray, or help identify any issues with the environment itself that would cause Ray to not work? Any help is greatly appreciated, as I am new to RL and completely stumped.



Solution 1:[1]

I'm new to RL and was searching for some code when I found yours.

It seems you only need to define the env. I added this line and it worked:

....
episodes = 100
env = ResinEnv()
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
....

Hope it is useful.
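
Beyond that fix, here is a minimal sketch of how the environment class could then be handed to Ray RLlib, assuming a Ray 1.x-style release where ray.rllib.agents.ppo.PPOTrainer is available (newer releases moved to ray.rllib.algorithms); none of this code appears in the original post. Note that RLlib also validates observations against observation_space, so the Box declared in __init__ would need the same shape and dtype as the two-element state returned by reset() and step().

import ray
from ray.rllib.agents import ppo

ray.init()

# RLlib constructs the environment itself, so the class (not an instance) is passed in
trainer = ppo.PPOTrainer(
    env=ResinEnv,
    config={
        "framework": "torch",  # or "tf"
        "num_workers": 1,
    },
)

# run a few training iterations and print the mean episode reward
for i in range(10):
    result = trainer.train()
    print(i, result["episode_reward_mean"])

Passing the class rather than an instance lets RLlib create a fresh copy of the environment on each rollout worker.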

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
