How much failure is ideal?

Recent research suggests the sweet spot is when failure occurs 15 percent of the time, hence the Eighty Five Percent Rule.

We’re constantly told that it’s ‘good’ to fail and that failure is a prerequisite for success. But how much failure is a good thing, and at what point does it become counterproductive? It’s an extraordinarily difficult question to answer, in part because failure is so hard to quantify.

But recent research published in the journal Nature Communications suggests there is an answer to this question. In “The Eighty Five Percent Rule for Optimal Learning,” Robert Wilson and his collaborators advance the idea that you maximize your rate of learning when your error rate is 15 percent—or more specifically, 15.87 percent.

Wilson & Co. arrived at this number by conducting machine-learning experiments in which they taught computers simple tasks, then determined that the computers learned most effectively when the difficulty of the tasks led to an error rate of a little more than 15 percent.

In the following Failure Interview, Wilson—assistant professor of psychology and cognitive science at the University of Arizona and member of the Neuroscience of Reinforcement Learning and Decision Making (NRD) Lab—discusses the 85 Percent Rule, and where the research goes from here.

How did you get the idea to pursue the research presented in “The Eighty Five Percent Rule”?

The idea came out of a lab meeting where we were talking about mental effort and engagement in tasks, and Jonathan D. Cohen, who is a senior author on “The Eighty Five Percent Rule for Optimal Learning,” mentioned this feeling he had that people are most engaged in a task when they are [getting] about 85 percent correct. It was a eureka moment when I realized that that level of difficulty could be optimal in the sense that it’s maximizing the rate of learning for a particular kind of learning—gradient descent-type learning.

One way to think of gradient descent learning is as a more sophisticated form of trial-and-error learning. In trial-and-error learning you try something (e.g., change your neural network parameters), and if it improves your performance you stick with it; otherwise you go back.

In gradient descent learning, you can work out the math such that what you try (in terms of how you change the parameters of your neural network) will, on average, improve your performance over time. This kind of learning seems to describe the kind of slow learning that happens in people as we master a skill with practice (e.g., a tennis player perfecting their serve or a dermatologist learning how to discriminate between images of cancerous and non-cancerous moles).
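
To make the distinction concrete, here is a minimal sketch, not from the paper, contrasting the two update rules on a toy one-parameter problem (the loss function, step sizes, and trial counts are all illustrative assumptions):

```python
import random

def loss(w):
    # Toy objective: squared distance from the ideal parameter value.
    return (w - 3.0) ** 2

# Trial-and-error learning: try a random tweak, keep it only if it helps.
w = 0.0
for _ in range(1000):
    candidate = w + random.gauss(0.0, 0.1)
    if loss(candidate) < loss(w):
        w = candidate  # the tweak improved performance, so stick with it

# Gradient descent learning: step along the slope of the loss, which
# improves performance on average rather than by lucky accident.
v = 0.0
learning_rate = 0.05
for _ in range(1000):
    gradient = 2.0 * (v - 3.0)  # analytic derivative of the toy loss
    v -= learning_rate * gradient

print(f"trial and error: {w:.3f}, gradient descent: {v:.3f}")  # both near 3.0
```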

We’re interested in this problem because we’re interested in how people learn but also why people make mistakes. Learning is one of those things where making mistakes and having variability in your behavior can be beneficial.

How did you go about determining the optimal level of training difficulty?

We focused on this gradient descent learning algorithm and a very particular kind of task.

In terms of the task: you are presented with a stimulus and you have to categorize it as category A or category B. A simple example would be Google training one of their neural networks to recognize cats versus dogs. A human example might be a dermatologist learning to classify moles.
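
As a concrete illustration of that setup, this sketch (our own, with illustrative numbers) generates A/B trials in which a single separation parameter controls how hard the two categories are to tell apart, and hence how often a learner will err:

```python
import random

def make_trial(separation):
    """One A/B trial: the stimulus is the category mean plus unit noise.

    A smaller separation makes the two categories overlap more, which is
    the knob that controls how often a learner will answer incorrectly.
    """
    label = random.choice(["A", "B"])
    mean = separation if label == "A" else -separation
    stimulus = random.gauss(mean, 1.0)  # unit-variance perceptual noise
    return stimulus, label

easy_trials = [make_trial(3.0) for _ in range(5)]  # categories barely overlap
hard_trials = [make_trial(0.3) for _ in range(5)]  # categories overlap heavily
print(easy_trials[0], hard_trials[0])
```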

Again, it’s a kind of learning that in the human world is going to happen slowly, where you need a lot of experience to get good. A dermatologist learning to classify moles is a good example, because a medical school textbook will tell you ‘this is a suspicious mole, this is not.’ But it’s only once you get out in the field and are making these judgments many times, with many different examples, that you are going to get really good at it. It’s a slow, incremental learning process.

So if you have an equation for the learning algorithm and an equation for the task, you can solve for the level of difficulty at which the gradient descent algorithm learns fastest. The math works out to 85 percent correct under a certain set of assumptions.
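
The number itself is easy to check numerically. In the paper’s Gaussian-noise setting, the speed of gradient descent learning is proportional to z times the standard normal density at z, where z is the task difficulty scaled by the learner’s current skill; the sketch below (a back-of-the-envelope reconstruction, not the authors’ code) finds the peak and the error rate it implies:

```python
from math import erf, exp, pi, sqrt

def normal_cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def learning_speed(z):
    # Proportional to z times the standard normal density at z.
    return z * exp(-z * z / 2.0) / sqrt(2.0 * pi)

# Scan for the difficulty (in skill-scaled units) that learns fastest.
best_z = max((z / 1000.0 for z in range(1, 3000)), key=learning_speed)
print(f"fastest learning at z = {best_z:.3f}")          # ~1.0
print(f"error rate there = {normal_cdf(-best_z):.4f}")  # ~0.1587
```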

The 85 percent number is just a detail of the math, but the intuition is this: if something is too easy and you’re 100 percent correct, there is nothing much to learn. And if something is too hard, there is no information in the feedback you are getting, so you can’t learn from that either.

We’ve known about this in the education literature going back to the 1930s, thanks to people like Lev Vygotsky, who said that kids learn best when they are in this zone of proximal development, where things are just beyond what they can do: not so easy that it’s trivial, but not so far beyond them that they can’t do it.

What are the lessons for education? It seems the lesson is that you should be embracing challenge and risking failure as opposed to pursuing perfection.

I think that’s absolutely right. What we have said is that even for neural networks—where in computers you would think perfection is desirable—if they are to learn the best, and the fastest, they have to be trained in a situation where they are not perfect and are making a certain number of mistakes. The big lesson for education is that focusing on perfection, on getting 100 percent, is a bad thing from the perspective of learning.

What about the lesson for artificial intelligence?

We are excited about these applications as well. It’s always difficult with AI because the challenges and the resources available are different. Google might not care whether we can boost learning a little bit by correctly setting the difficulty of each cat or dog image when they can show millions of cat and dog images to a network. It almost doesn’t matter at that scale.

But there is interest in more complicated problems—finding the training regime that teaches a network the fastest. So there is some application there, but we are on the cognitive science side of neural networks, where we take a lot of the same math but apply it to people. That’s different from taking the math and applying it to engineering problems, since the constraints aren’t always the same. But it’s something we’re excited to look at.

After publishing “The Eighty Five Percent Rule,” where do you go from here?

The first thing—and probably the biggest thing—for us as cognitive scientists and psychologists is to see if we can test this experimentally. It’s difficult to test experimentally in part because there are a lot of other things that drive learning: intrinsic motivation, for example, or how interesting you find the task.

The results make us think that people are going to be using gradient descent learning for certain kinds of tasks. The question is: if we take one of those tasks and train people at 85 percent, do they actually learn better than if we train them at 50 percent or 65 percent or 95 percent? So that’s definitely the next stage for us.
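
That comparison can at least be previewed in simulation. The sketch below is our construction, not the study itself: a simple logistic learner is trained while the stimulus difficulty is adapted on every trial to hold a target accuracy, and we then ask how well it has picked up the task’s one informative dimension. All parameters are illustrative, and where the curve peaks will depend on the learner’s details, so treat it as a testbed rather than a verification:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def train(target_acc, n_trials=2000, lr=0.05, dim=20):
    """Train while adapting difficulty to hold target_acc; return how well
    the learner's weights align with the task's informative direction."""
    true_dir = np.zeros(dim)
    true_dir[0] = 1.0                  # only one input dimension matters
    w = rng.normal(size=dim)           # the learner starts out naive
    for _ in range(n_trials):
        cos = (w @ true_dir) / np.linalg.norm(w)
        cos = max(cos, 1e-3)           # guard against a misaligned start
        # Choose the category separation so the learner's expected accuracy
        # on this trial equals the target (unit-variance Gaussian noise).
        delta = norm.ppf(target_acc) / cos
        y = rng.choice([-1.0, 1.0])
        x = y * delta * true_dir + rng.normal(size=dim)
        # One gradient-descent step on the logistic loss.
        p_wrong = 1.0 / (1.0 + np.exp(np.clip(y * (w @ x), -30.0, 30.0)))
        w += lr * y * x * p_wrong
    return (w @ true_dir) / np.linalg.norm(w)  # 1.0 means perfectly learned

for acc in [0.55, 0.65, 0.75, 0.85, 0.95]:
    print(f"trained at {acc:.0%}: final alignment {train(acc):.3f}")
```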

But then there is expanding the theory further. What about more general tasks? What about tasks that aren’t necessarily classification but involve things like performing a motor action? That’s another type of task we believe is learned by gradient descent, so we’re expanding the theory in that direction as well.

The one big caveat is that this is all theoretical at this point, so as for direct application to education and human behavior, we don’t know; it has not been tested yet. But we are hopeful it will be.