Why So Many Clever Pilot Programs Flunk Out at Scale

Small experiments for solving social problems may seem to work, but at least half of them fall apart when they’re expanded to a larger constituency. Costs are the main explanation, although not the only one.

March 22, 2022


If I were a rich philanthropist with a penchant for education reform, I might put some of my money into a scheme to improve faltering schools in a big city. I could walk into any of dozens of such schools with a proposition: Any student reaching a specified level of performance — maybe a quarter of those in a given classroom — would receive a gift of $1,000.

I’m confident that this would work: The grades and test scores in that classroom would shoot up. I’m also confident that somebody else could try a different school and get similar results. In other words, this is a replicable experiment. It’s actually been tried, admittedly at a much more modest level, in a scattering of schools around the country.

The one thing it obviously isn’t is multipliable. Let’s say we’re working with the entire school system in New York City. That system has nearly a million kids enrolled in its institutions. Rewarding the top 25 percent of them with $1,000 just once could cost nearly $250 million. Doing it year after year would push the price tag past a billion dollars within five years. It would still be an appealing idea, but no school system could possibly afford it. Lowering the number of winners to 10 percent of the students would still cost close to $100 million each time. Reducing the gift size to $100 would make a huge difference, but with 25 percent of students winning it would still cost nearly $25 million a round, and hundreds of millions over a decade or so.
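For readers who want to check the arithmetic, here is a rough sketch in Python. It assumes enrollment of exactly one million (the column says only "nearly a million"), and the function names are made up for illustration. It prices a single round of rewards under each variation and counts how many rounds it takes to reach a billion dollars.

```python
def reward_cost(enrollment, winner_percent, gift_dollars):
    """Dollar cost of one round of rewards, using whole-number math."""
    winners = enrollment * winner_percent // 100
    return winners * gift_dollars


def rounds_to_reach(total_dollars, cost_per_round):
    """Smallest number of rounds whose combined cost reaches total_dollars."""
    return -(-total_dollars // cost_per_round)  # ceiling division


ENROLLMENT = 1_000_000  # assumption: "nearly a million" rounded up

# (share of students rewarded in percent, gift size in dollars)
scenarios = [(25, 1_000), (10, 1_000), (25, 100)]

for percent, gift in scenarios:
    per_round = reward_cost(ENROLLMENT, percent, gift)
    rounds = rounds_to_reach(1_000_000_000, per_round)
    print(f"{percent}% of students at ${gift:,}: "
          f"${per_round:,} per round, {rounds} rounds to reach $1 billion")
```

Under these assumptions a single round costs $250 million, $100 million and $25 million respectively. Whether the total lands in the hundreds of millions or the billions depends entirely on how many times the reward is repeated.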

I plead guilty to hitting you with an absurd scheme. Any sensible person will grasp that no big city could afford this. It just doesn’t multiply — or, to use the most common current expression, it wouldn’t “scale.”

I bring this up because governments and nonprofits all over the world try experiments that are equally vulnerable to the inexorable mathematics of scale. Often they don’t stop to do the math until they have wasted quite a bit of money.

Some years ago, for example, a charitable organization decided to bring polio vaccines to 50,000 people in an impoverished region of Zambia. The vaccines were delivered by hovercraft, and they achieved notable success when administered as a pilot program. What the promoters didn’t figure in was the amount of money each hovercraft would cost. Neither the nonprofit nor the government could come close to affording enough of the amphibious vehicles to reach 50,000 people. So the program was simply abandoned.

Another well-documented example involves tax evasion in the Dominican Republic. That nation harbors an unusually large number of tax scofflaws. The government sent notices to these people threatening them with jail time if they continued to resist making the payments. This actually brought the tax authorities an initial influx of money. But the threat was unenforceable at scale. The government couldn’t possibly find enough jail cells for all the people who continued to break the law. It had to settle for publicizing the names of the offenders. This wasn’t as effective, although it was at least affordable.

I HAVE DRAWN THE LAST TWO CASES from a provocative new book called The Voltage Effect by John A. List, an economist at the University of Chicago. List has produced what amounts to an encyclopedia of programs and strategies that succeed in small experiments but fall apart on a larger scale. He cites one study that concluded that 50 to 90 percent of all pilot programs fail to work at scale. “The belief that an idea is more scalable than it actually is,” List says, “will almost always result in overspending and sunk costs.”

Cost isn’t the only issue. While many if not most pilots fail for reasons of expense, there are some other powerful explanations for well-meaning failures at scale. Perhaps the most famous of these showed itself in the federal anti-drug campaign known as DARE, for Drug Abuse Resistance Education. This is the program that included First Lady Nancy Reagan’s famous instruction: “Just Say No.”

DARE worked gratifyingly well in the mid-1980s in a small pilot test with 1,777 schoolchildren in Honolulu. It became a staple of federal social policy. Over 24 years, 43 million kids around the United States participated in the DARE program, at a cost of $600 million a year.

Unfortunately, DARE simply didn’t work. It didn’t make a dent in drug abuse. In this case, it wasn’t a matter of cost. The problem was that the initial pilot experiment was flawed. The test sample was too small and the guinea pigs in Hawaii weren’t representative of children across the United States; the pilot needed to be tried in a diverse array of places, and it wasn’t. After nearly a quarter-century of failures, Congress finally pulled the plug. But the whole episode demonstrates that when you draw conclusions about a pilot program from a small sample, the sample needs to be representative or you are asking for failure.

List also points to an energy-saving experiment called Opower, which sought to reduce electricity consumption by asking utilities to send customers information about how much energy other local residents were using. “The initial results were stunning,” List tells us, “with big energy savings.” But it flopped when the promoters took it on the road to a much larger constituency. The reason was simple: The utilities that signed up for the first trial were ones already committed to conservation; they didn’t need much persuasion. But they were not typical of most utilities in the United States.

SOCIAL SCIENTISTS TALK ABOUT A SAMPLING PROBLEM they describe by using the acronym WEIRD. That stands for “western, educated, industrialized, rich and democratic.” In practice, this often comes down to sampling students at prestigious American universities. They are rarely representative of the larger populations that researchers are trying to benefit. What they are is easy to find and recruit.

Then there’s the problem of finding the right recruits to expand a program. A successful restaurant that wants to open new outlets needs to find chefs as good as the ones who made the first place a hit. Often they just aren’t available.

More consequentially, school-reform pilots sometimes yield impressive results because they rely on teachers of unusual ability. When the pilot moves to scale, it becomes dependent on less-talented teachers to implement it. The kind of talent that made a pilot successful doesn’t grow on trees, and the program’s effectiveness declines sharply. The moral is clear and needs to be respected: If we are to save education in America, we need to find ways to execute reform with people of ordinary ability. Otherwise it just won’t work.

California found that out in the 1990s when it mandated a statewide reduction in class sizes. This worked well enough when it recruited exceptional teachers to handle the extra classrooms. But the state soon found out that there just weren’t very many of these people. The scores failed to improve at scale. Each of us would like to live in a community where all the teachers are above average. There just aren’t any communities like that.

The most common result is a syndrome that scholars often describe as mission drift. This happened to the federal Head Start program, which proved highly successful in one of its later pilot variations but fell off in quality as it expanded. Taking in less-qualified teachers, and ministering to more dysfunctional families, it drifted into a much less ambitious, much less effective program.

STILL, WHEN WE BEGIN TO ASSESS the more conspicuous failures to scale, we are driven back to the concept of cost.

Virtually all major federal initiatives, including the moderately successful ones, have turned out to cost much more than the initial projections suggested. The Superfund environmental recovery campaign has cost more than $20 billion, and while by one recent count it has cleaned up about 300 sites, those are just a fraction of the sites the program’s creators were (implausibly) targeting. Today’s mostly private space-exploration ventures, which receive some federal assistance, are destined to come in far over budget.

It is hard to avoid similar concerns about the guaranteed-income experiments that are being launched in diverse areas of the country. According to a recent analysis by Bloomberg CityLab, there have been 20 guaranteed-income pilots initiated since 2018, with more than 5,000 individuals and families receiving between $300 and $1,000 a month. If all these experiments reach their conclusion, they will cost at least $35 million. What does this tell us about the scaling up of these programs on a national level? It tells us that they would be extremely difficult to pay for. Perhaps not impossible under the right circumstances, but difficult enough to warrant serious concerns.
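The scaling concern can be made concrete with a back-of-the-envelope sketch. The monthly payment range comes from the figures above. The one-million-recipient count is purely hypothetical, chosen only to show how the totals move, and the sketch assumes every recipient is paid for a full twelve months with no administrative overhead.

```python
def annual_cost_range(recipients, low_monthly=300, high_monthly=1_000):
    """Return (low, high) yearly payout in dollars for a number of recipients.

    Assumes twelve monthly payments per recipient and no overhead.
    """
    return recipients * low_monthly * 12, recipients * high_monthly * 12


PILOT_RECIPIENTS = 5_000             # "more than 5,000" in the pilots
HYPOTHETICAL_RECIPIENTS = 1_000_000  # illustrative assumption, not a projection

for label, count in [("pilots", PILOT_RECIPIENTS),
                     ("hypothetical expansion", HYPOTHETICAL_RECIPIENTS)]:
    low, high = annual_cost_range(count)
    print(f"{label}: {count:,} recipients cost "
          f"${low:,} to ${high:,} a year")
```

On these assumptions the pilots cost between $18 million and $60 million for each full year of payments, while a million recipients would cost between $3.6 billion and $12 billion a year. The cost grows in direct proportion to the number of people served.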

So what’s the ultimate takeaway from so many promising pilot programs that don’t work at scale? Not that we shouldn’t try them, but that we ought to take John List’s warnings and apply them in reverse: Make sure you start with a representative sampling; remember that the program may have to be executed by ordinary people; and don’t hedge about the ultimate cost in order to build a short-term constituency. Follow all those rules, and your good idea will have a chance to succeed at scale. Not a guarantee, but a chance.
