Proper semantics yield better understanding; this section lays out the Lab's generalized structure and the relations among agents, bodies and environments.
First, a correction to the semantics of Unity ml-agents is needed. Since this environment module handles the interface with Unity ml-agents, the correction happens here.
The motivating problem: originally, a single instance of the environment contains the Academy, which houses multiple Brains, each of which can control multiple "Agents". The Brains can be controlled externally from Unity, e.g. via a DQN implementation in PyTorch. However, in the Lab we also call DQN an Agent (different from the Agent inside Unity). Each instance of DQN (Agent) controls a Unity Brain, which can then control multiple Agents (name clash) in Unity, e.g. robot arms. The multiple arms should instead be seen as a DQN Agent having many arms, or having one arm in multiple incarnations across space. Hence, we will call the Unity Brain's "Agents" "Bodies", consistent with SLM's need to have a body in the environment for embodiment.
Then, the proper semantics are as follows:
- Agent: a single class/instance of the SLM entity, e.g. DQN agent. This corresponds precisely to a single Brain in Unity Academy.
- Environment: a single class/instance of the Unity space, as usual.
- Body: a single incarnation of an Agent in the Environment. A single Agent (Brain) can have multiple bodies in parallel for batch training.
Note that parallel bodies (identical and non-interacting) of an agent in one environment are equivalent to an agent with a single body existing in multiple copies of the environment. This insight is crucial for the symmetry between Agent and Environment space, and it helps with the further generalization later.
The base case
- 1 agent, 1 environment, 1 body: This is the most straightforward case, directly runnable as a common session without any multiplicity resolution.
- 1 agent, 1 environment, multiple bodies: This is just the base case run in batch, where the agent does batch processing on input and output. Alternatively, the bodies could be distinct, such as having inverse rewards. This would be the adversarial case where a single agent self-plays.
- multiple agents, 1 environment, multiple bodies: The next extension is having multiple agents interacting in an environment. Each agent can possess 1 body or more, as per the cases above.
- 1 agent, multiple environments, multiple bodies: This is the more novel case. When an agent can have parallel incarnations, nothing restricts the bodies to be constructed identically or to be subject to the same environment. An agent can have multiple bodies in different environments. This can be used for simultaneous multi-task training. An example is to expose an agent's legs to ground for walking, wings to air for flying, and fins to water for swimming. The goal would be to do generalization or transfer learning on all 3 types of limbs across multiple environments. Then perhaps it would generalize to use legs and wings for swimming too.
Full generalization, multi-agent multi-environment case
- multiple agents, multiple environments, multiple bodies: This generalizes all the cases above and allows us to have a neat representation that corresponds to the Agent-Environment product space noted earlier.
The generalization gives us the 3D space of Agents x Environments x Bodies. We will call this product space AEB space. It will be the basis of our experiment design. In AEB space, we have the projections:
- AgentSpace, A: each value in this space is a class of agent
- EnvSpace, E: each value in this space is a class of environment
- BodySpace, B: each value in this space is a body of an agent in an environment (indexed by coordinates (a,e) in AE space)
In a general experiment with multiple bodies, with single or multiple agents and environments, each body instance can be marked with the 3D coordinate (a,e,b) in AEB space. Each body is also associated with body-specific data: observables, actions, rewards and done flags. We can call these the data spaces, i.e. observable space, action space, reward space, etc.
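To make the coordinates concrete, here is a minimal sketch of enumerating the (a,e,b) points of an experiment. The `enumerate_aeb` helper and the `body_counts` mapping are hypothetical names chosen for illustration; they are not part of the Lab's API.

```python
from itertools import product


def enumerate_aeb(num_agents, num_envs, body_counts):
    '''
    List every (a, e, b) coordinate in AEB space.
    body_counts maps an (a, e) pair to the number of bodies agent a has in env e;
    a missing pair means the agent has no body in that environment.
    '''
    coords = []
    for a, e in product(range(num_agents), range(num_envs)):
        for b in range(body_counts.get((a, e), 0)):
            coords.append((a, e, b))
    return coords


# Hypothetical setup: 2 agents, 2 environments.
# Agent 0 has 2 bodies in env 0 and 1 body in env 1; agent 1 has 1 body in env 0 only.
body_counts = {(0, 0): 2, (0, 1): 1, (1, 0): 1}
aeb_coords = enumerate_aeb(2, 2, body_counts)
print(aeb_coords)
# [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
```

Each data space (observables, actions, rewards, done flags) can then be thought of as a mapping from these coordinates to per-body values.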
Control loop generalization
When controlling a session of an experiment, execute the agent and environment logic as usual, but with the singletons replaced by AgentSpace and EnvSpace respectively. Internally, they shall produce the usual singleton data across all bodies at each point (a,e,b). When passing the data around, simply flatten the data on the corresponding axis and spread the data. E.g. when passing new states from EnvSpace to AgentSpace, group state(a,e,b) for each a value and pass state(e,b)_a to the corresponding agent a.
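The grouping step described above can be sketched as follows. This is only an illustration under the assumption that per-body data is stored in a dict keyed by (a,e,b); the `group_by_axis` helper is a hypothetical name, not the Lab's implementation.

```python
from collections import defaultdict


def group_by_axis(data_aeb, axis):
    '''
    Group data keyed by (a, e, b) along one axis.
    axis=0 groups by agent, giving {a: {(e, b): value}};
    axis=1 groups by environment, giving {e: {(a, b): value}}.
    '''
    grouped = defaultdict(dict)
    for coord, value in data_aeb.items():
        key = coord[axis]
        # the remaining two coordinates index the value within its group
        rest = tuple(c for i, c in enumerate(coord) if i != axis)
        grouped[key][rest] = value
    return dict(grouped)


# Hypothetical states produced by the environments, one per body.
state_aeb = {
    (0, 0, 0): 's000',
    (0, 0, 1): 's001',
    (0, 1, 0): 's010',
    (1, 0, 0): 's100',
}
state_by_agent = group_by_axis(state_aeb, axis=0)
print(state_by_agent[0])  # {(0, 0): 's000', (0, 1): 's001', (1, 0): 's010'}
print(state_by_agent[1])  # {(0, 0): 's100'}
```

Actions would flow the other way: grouping action(a,e,b) with `axis=1` gives each environment the actions of all bodies it hosts.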
Hence, the experiment session loop generalizes directly from:
    def run_episode(self):
        self.env.clock.tick('epi')
        reward, state, done = self.env.reset()
        self.agent.reset(state)
        while not done:
            self.env.clock.tick('t')
            action = self.agent.act(state)
            reward, state, done = self.env.step(action)
            self.agent.update(action, reward, state, done)
        self.agent.body.log_summary()
        self.save_if_ckpt(self.agent, self.env)

    def run(self):
        while self.env.clock.get('epi') <= self.env.max_episode:
            self.run_episode()
to direct substitutions for singletons with spaces:
    def run_all_episodes(self):
        '''
        Continually run all episodes, where each env can step and reset
        at its own clock_speed and timeline.
        Terminates when all envs are done.
        '''
        all_done = self.aeb_space.tick('epi')
        reward_space, state_space, done_space = self.env_space.reset()
        self.agent_space.reset(state_space)
        while not all_done:
            all_done = self.aeb_space.tick()
            action_space = self.agent_space.act(state_space)
            reward_space, state_space, done_space = self.env_space.step(action_space)
            self.agent_space.update(action_space, reward_space, state_space, done_space)
            self.save_if_ckpt(self.agent_space, self.env_space)

    def run(self):
        self.run_all_episodes()
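For readers who want to step through the generalized loop, below is a toy, self-contained sketch. `ToyEnvSpace` and `ToyAgentSpace` are hypothetical stand-ins written for this example, not the Lab's classes. As simplifying assumptions, each data space is a plain dict keyed by (a,e,b), and termination is checked with `all(done_space.values())` instead of the `aeb_space.tick()` clock used above.

```python
class ToyEnvSpace:
    '''Stand-in for EnvSpace: every body is done after max_t steps.'''

    def __init__(self, aeb_coords, max_t=3):
        self.aeb_coords = aeb_coords
        self.max_t = max_t
        self.t = 0

    def reset(self):
        self.t = 0
        reward_space = {c: 0.0 for c in self.aeb_coords}
        state_space = {c: 0 for c in self.aeb_coords}
        done_space = {c: False for c in self.aeb_coords}
        return reward_space, state_space, done_space

    def step(self, action_space):
        self.t += 1
        # toy dynamics: reward equals the action, state is the time step
        reward_space = {c: float(action_space[c]) for c in self.aeb_coords}
        state_space = {c: self.t for c in self.aeb_coords}
        done_space = {c: self.t >= self.max_t for c in self.aeb_coords}
        return reward_space, state_space, done_space


class ToyAgentSpace:
    '''Stand-in for AgentSpace: a fixed policy that tracks total reward per body.'''

    def __init__(self, aeb_coords):
        self.aeb_coords = aeb_coords
        self.total_reward = {c: 0.0 for c in aeb_coords}

    def reset(self, state_space):
        self.total_reward = {c: 0.0 for c in self.aeb_coords}

    def act(self, state_space):
        # trivial policy: every body always takes action 1
        return {c: 1 for c in self.aeb_coords}

    def update(self, action_space, reward_space, state_space, done_space):
        for c in self.aeb_coords:
            self.total_reward[c] += reward_space[c]


def run_all_episodes(agent_space, env_space):
    '''Same shape as the generalized loop, with dict-based data spaces.'''
    reward_space, state_space, done_space = env_space.reset()
    agent_space.reset(state_space)
    all_done = all(done_space.values())
    while not all_done:
        action_space = agent_space.act(state_space)
        reward_space, state_space, done_space = env_space.step(action_space)
        agent_space.update(action_space, reward_space, state_space, done_space)
        all_done = all(done_space.values())
    return agent_space.total_reward


aeb_coords = [(0, 0, 0), (0, 0, 1), (1, 0, 0)]
totals = run_all_episodes(ToyAgentSpace(aeb_coords), ToyEnvSpace(aeb_coords, max_t=3))
print(totals)
# {(0, 0, 0): 3.0, (0, 0, 1): 3.0, (1, 0, 0): 3.0}
```

The loop body is the same as in the singleton case; only the containers passed between the two spaces change.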