To teach an AI agent a new task, like how to open a kitchen cabinet, researchers often use reinforcement learning, a trial-and-error process in which the agent is rewarded for taking actions that bring it closer to the goal.
In many cases, a human expert must carefully design a reward function, which is an incentive mechanism that gives the agent motivation to explore. The expert must iteratively update that reward function as the agent explores and tries different actions. This can be time-consuming, inefficient, and difficult to scale up, especially when the task is complex and involves many steps.
Researchers from MIT, Harvard University, and the University of Washington have developed a new reinforcement learning approach that doesn’t rely on an expertly designed reward function. Instead, it leverages crowdsourced feedback, gathered from many nonexpert users, to guide the agent as it learns to reach its goal.
While some other methods also attempt to use nonexpert feedback, this new approach enables the AI agent to learn more quickly, despite the fact that data crowdsourced from users are often full of errors. Such noisy data can cause other methods to fail.
In addition, this new approach allows feedback to be gathered asynchronously, so nonexpert users around the world can contribute to teaching the agent.
“One of the most time-consuming and challenging parts in designing a robotic agent today is engineering the reward function. Today reward functions are designed by expert researchers, a paradigm that is not scalable if we want to teach our robots many different tasks. Our work proposes a way to scale robot learning by crowdsourcing the design of the reward function and by making it possible for nonexperts to provide useful feedback,” says Pulkit Agrawal, an assistant professor in the MIT Department of Electrical Engineering and Computer Science (EECS) who leads the Improbable AI Lab in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
In the future, this method could help a robot learn to perform specific tasks in a user’s home quickly, without the owner needing to show the robot physical examples of each task. The robot could explore on its own, with crowdsourced nonexpert feedback guiding its exploration.
“In our method, the reward function guides the agent to what it should explore, instead of telling it exactly what it should do to complete the task. So, even if the human supervision is somewhat inaccurate and noisy, the agent is still able to explore, which helps it learn much better,” explains lead author Marcel Torne ’23, a research assistant in the Improbable AI Lab.
Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; as well as others at the University of Washington and MIT. The research will be presented at the Conference on Neural Information Processing Systems next month.
Noisy feedback
One way to gather user feedback for reinforcement learning is to show a user two photos of states achieved by the agent, and then ask the user which state is closer to the goal. For instance, perhaps a robot’s goal is to open a kitchen cabinet. One image might show that the robot opened the cabinet, while the second might show that it opened the microwave. A user would pick the photo of the “better” state.
Some previous approaches try to use this crowdsourced, binary feedback to optimize a reward function that the agent would then use to learn the task. However, because nonexperts are likely to make mistakes, the reward function can become very noisy, so the agent might get stuck and never reach its goal.
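To make that concrete, here is a minimal, hypothetical sketch of the kind of preference-based reward learning described above: a small neural network reward model is fit to pairwise human judgments with a Bradley-Terry-style loss. The network architecture, dimensions, and training details are illustrative assumptions, not the authors’ implementation.

```python
# Illustrative sketch of learning a reward model from crowdsourced binary
# comparisons. Noisy answers from labelers directly distort the learned reward,
# which is the failure mode the article describes.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # One scalar reward per state.
        return self.net(state).squeeze(-1)

def preference_loss(model, state_a, state_b, prefer_a):
    """Binary cross-entropy on which of two states the labeler judged
    closer to the goal (Bradley-Terry style)."""
    logits = model(state_a) - model(state_b)   # higher => model prefers A
    return nn.functional.binary_cross_entropy_with_logits(
        logits, prefer_a.float()
    )

# Usage with placeholder data standing in for crowdsourced labels:
model = RewardModel(state_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
state_a, state_b = torch.randn(32, 8), torch.randn(32, 8)
prefer_a = torch.randint(0, 2, (32,))          # nonexpert answers (possibly wrong)
loss = preference_loss(model, state_a, state_b, prefer_a)
opt.zero_grad(); loss.backward(); opt.step()
```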
“Basically, the agent would take the reward function too seriously. It would try to match the reward function perfectly. So, instead of directly optimizing over the reward function, we just use it to tell the robot which areas it should be exploring,” Torne says.
He and his collaborators decoupled the process into two separate parts, each directed by its own algorithm. They call their new reinforcement learning method HuGE (Human Guided Exploration).
On one side, a goal selector algorithm is continually updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration. In a sense, the nonexpert users drop breadcrumbs that incrementally lead the agent toward its goal.
On the other side, the agent explores on its own, in a self-supervised manner guided by the goal selector. It collects images or videos of the actions it tries, which are then sent to humans and used to update the goal selector.
This narrows down the area for the agent to explore, leading it to more promising regions that are closer to its goal. But if there is no feedback, or if feedback takes a while to arrive, the agent will keep learning on its own, albeit more slowly. This allows feedback to be gathered infrequently and asynchronously.
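The structure of this decoupled loop might look roughly like the sketch below, in which a goal selector ranks visited states using whatever labeler answers have arrived, and the agent keeps exploring toward the best-ranked state even when no new feedback is available. The environment interface, the scoring scheme, and the queue of crowdsourced answers are all placeholder assumptions rather than the HuGE codebase.

```python
# Sketch of a decoupled loop: crowdsourced comparisons update a goal selector
# asynchronously, while the agent's self-supervised exploration never blocks
# waiting for human input.
import random

class GoalSelector:
    """Scores visited states from crowdsourced pairwise judgments."""
    def __init__(self):
        self.scores = {}                       # state -> rough "closer to goal" score

    def update(self, winner, loser):
        # winner/loser: states a labeler judged closer / farther from the goal.
        self.scores[winner] = self.scores.get(winner, 0) + 1
        self.scores[loser] = self.scores.get(loser, 0) - 1

    def pick_goal(self, visited_states):
        # With no feedback yet, fall back to an arbitrary exploration target.
        if not self.scores:
            return random.choice(visited_states)
        return max(visited_states, key=lambda s: self.scores.get(s, 0))

def exploration_loop(env, selector, feedback_queue, steps=1000):
    """env.step_toward is a stand-in for a goal-conditioned policy rollout;
    feedback_queue is a queue.Queue of (winner, loser) labeler answers."""
    visited = [env.reset()]
    for _ in range(steps):
        goal = selector.pick_goal(visited)     # breadcrumb toward promising regions
        state = env.step_toward(goal)          # self-supervised goal-reaching attempt
        visited.append(state)
        # Asynchronous, optional feedback: apply it only if labelers have answered.
        while not feedback_queue.empty():
            winner, loser = feedback_queue.get_nowait()
            selector.update(winner, loser)
    return visited
```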
“The exploration loop can keep going autonomously, because it is just going to explore and learn new things. And then when you get some better signal, it is going to explore in more concrete ways. You can just keep them turning at their own pace,” adds Torne.
And because the feedback only gently guides the agent’s behavior, the agent will eventually learn to complete the task even if users provide incorrect answers.
Faster learning
The researchers tested this method on a number of simulated and real-world tasks. In simulation, they used HuGE to effectively learn tasks with long sequences of actions, such as stacking blocks in a particular order or navigating a large maze.
In real-world tests, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. For these tests, they crowdsourced data from 109 nonexpert users in 13 different countries spanning three continents.
In both real-world and simulated experiments, HuGE helped agents learn to achieve the goal faster than other methods.
The researchers also found that data crowdsourced from nonexperts yielded better performance than synthetic data, which were produced and labeled by the researchers. For nonexpert users, labeling 30 images or videos took fewer than two minutes.
“This makes it very promising in terms of being able to scale up this method,” Torne adds.
In a related paper, which the researchers presented at the recent Conference on Robot Learning, they enhanced HuGE so an AI agent can learn to perform the task, and then autonomously reset the environment to continue learning. For instance, if the agent learns to open a cabinet, the method also guides the agent to close the cabinet.
“Now we can have it learn completely autonomously without needing human resets,” he says.
The researchers also emphasize that, in this and other learning approaches, it is critical to ensure that AI agents are aligned with human values.
In the future, they want to continue refining HuGE so the agent can learn from other forms of communication, such as natural language and physical interactions with the robot. They are also interested in applying this method to teach multiple agents at once.
This research is funded, in part, by the MIT-IBM Watson AI Lab.