Modeling a reinforcement learning environment with Ray
I have been playing around with the idea of using reinforcement learning on a particular problem: optimizing a raw-materials purchasing strategy for a particular commodity. I have created a simple gym environment to show a simplified version of what I'd like to accomplish. The goal is to take in multiple items (in this case, 2) and optimize a purchasing strategy for each item so that the total days on hand across all items is minimized without running out of either item.
from gym import Env
from gym.spaces import Discrete, Box, Tuple
import numpy as np

#define our variable starting points
#array of the start quantity for 2 separate items
start_qty = np.array([10000, 200])
#create the number of simulation weeks
sim_weeks = 1
#set a starting safety stock level------IGNORE FOR NOW
#safety_stock = 4003
#create a simple demand profile for each item
#demand = np.array([301, 1549, 3315, 0, 1549, 0, 0, 1549, 1549, 1549])
demand = np.array([1800, 45])
#create minimum and maximum order quantities for each item
min_ord = np.array([26400, 250])
max_ord = np.array([100000, 100000])
prev_30_usage = np.array([1600, 28])
#in each of these numpy arrays, index 0 holds the first item's info
#and index 1 holds the second item's info
class ResinEnv(Env):
    def __init__(self):
        self.action_space = Tuple([Discrete(2), Discrete(2)])
        self.observation_space = Box(low=np.array([-10000000]), high=np.array([10000000]))
        #set the start qty
        self.state = np.array([10000, 200])
        #self.start = start_qty
        #set the purchase length
        self.purchase_length = sim_weeks
        self.min_order = min_ord

    def step(self, action):
        #reduce the weeks left to purchase by 1 week
        self.purchase_length -= 1
        #apply the weekly demand to each item
        self.state[0] -= demand[0]
        self.state[1] -= demand[1]
        #see if we need to buy
        #each action component is 0 or 1- round it to the nearest whole number
        action = np.around(action, decimals=0)
        #self.state += action*self.min_order
        np.add(self.state, action*self.min_order, out=self.state, casting="unsafe")
        #self.state += (action*100) + 26400
        #calculate the days on hand from this
        days = self.state/prev_30_usage/7
        #calculate reward: right now the reward is the days on hand, with a
        #large penalty for running out of either item
        #THE REWARD WILL NEED TO CHANGE AT SOME POINT, AS IT NEEDS TO TREAT
        #HIGH-VOLUME AND LOW-VOLUME ITEMS THE SAME- THIS IS BIASED AGAINST LOW VOLUME
        if self.state[0] < 0:
            item_reward1 = -10000
        else:
            item_reward1 = days[0]
        if self.state[1] < 0:
            item_reward2 = -10000
        else:
            item_reward2 = days[1]
        reward = item_reward1 + item_reward2
        #check if we are out of weeks
        if self.purchase_length <= 0:
            done = True
        else:
            done = False
        #set placeholder for info
        info = {}
        #return step information
        return self.state, reward, done, info

    def render(self):
        pass

    def reset(self):
        self.state = np.array([10000, 200])
        self.purchase_length = sim_weeks
        self.demand = demand
        self.action_space = Tuple([Discrete(2), Discrete(2)])
        self.min_order = min_ord
        return self.state
The environment seems to be functioning just fine, as this test loop shows:
episodes = 100
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
    while not done:
        #env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{} Action:{}'.format(episode, score, action))
I have attempted to run this through various modeling approaches with no luck. I then discovered Ray, but I can't seem to get that to work either. Could someone walk me through the process of modeling this in Ray, or help identify any issues with the environment itself that would keep Ray from working? Any help is greatly appreciated, as I am new to RL and completely stumped.
Solution 1:[1]
I'm new to RL and was searching for some code when I found yours. It seems you only need to define the env; once I added that line, it worked:
....
episodes = 100
env = ResinEnv()
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
....
I hope this is useful.
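As for the Ray side of the question, here is a minimal RLlib sketch, not a verified solution: the "resin_env" name and the choice of PPO are illustrative, and it is written against the older ray.rllib.agents API that matches the gym-style step() signature used above. One likely blocker in the environment itself: the declared observation_space is a Box of shape (1,), while reset() and step() return a two-element state. The random-sampling loop above never checks this, but RLlib does validate observations against the declared space, so the Box would probably need shape (2,).

import ray
from ray.tune.registry import register_env
from ray.rllib.agents.ppo import PPOTrainer

#RLlib constructs environment instances itself, so register a factory
#that takes an env_config dict and returns a fresh env
register_env("resin_env", lambda env_config: ResinEnv())

ray.init()
#PPO is just one example algorithm; any RLlib trainer accepts the registered name
trainer = PPOTrainer(env="resin_env")
for _ in range(10):
    result = trainer.train()
    print(result["episode_reward_mean"])

If RLlib rejects the environment on startup, widening the space in __init__ to something like Box(low=-10000000, high=10000000, shape=(2,), dtype=np.float64) would be the first thing to try.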
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
