The Trouble With Animal Models

Why did human trials fail?

By Andrea Gawrylewski

On October 26, 2006, at the opening day of the Joint World Congress for Stroke in Cape Town, South Africa, disappointing news spread quickly among the attendees: The second Phase III clinical trial for NXY-059 had failed. The drug, a free-radical spin trap agent for ischemic stroke, had been eagerly anticipated as a successful neuroprotective agent for stroke patients. As the drug developer, AstraZeneca, issued a press release reporting the news, e-mails circulated quickly within the stroke research community, many with the subject line, "Have you heard the bad news?"

"We were optimistic that this would be the new stroke drug," says Marc Fisher, director of the stroke program at the University of Massachusetts Medical Center, who was at the conference in Cape Town. "We were...

To the dismay of clinicians and researchers of acute stroke, the compound showed limited efficacy in neuroprotection versus the placebo. Instead, NXY-059 joined the family of more than a dozen failed neuroprotective agents, including glutamate antagonists, calcium channel blockers, anti-inflammatory agents, GABA agonists, opioid antagonists, growth factors, and drugs of other mechanisms. All had reached Phase III clinical trials and failed miserably at doing what their animal model tests had suggested they would: stop the cascade of necrosis in the event of stroke, and protect the remaining viable brain cells.

That NXY-059 had fallen victim to the same fate was particularly disheartening to a stroke roundtable group that had, in 1999, directly addressed the disconnection between animal models for stroke and their counterpart human trials. The group had devised a set of guidelines whose aim was to standardize the path to stroke therapeutics. During its development, NXY-059 had been its poster child. "This drug was being hailed as the first one to follow the standards," says Sean Savitz, assistant professor of neurology at Harvard Medical School. "But it didn't do that."

"It was very disappointing to all of us to have it fail, and it totally failed," says Sid Gilman, director of the Michigan Alzheimer's Disease Research Center in the Department of Neurology at the University of Michigan, and one of the consultants for AstraZeneca on the Phase III clinical design.

If the outcome of the Stroke Acute Ischemic NXY Treatment (SAINT) trials was an anomaly, investigators might have just shrugged it off. But it's not: Nearly half of all molecular entities that come into development fail, according to Janet Woodcock, deputy commissioner of the Food and Drug Administration. "There's no doubt about the absence of an effect [of NXY-059], and that called into question the many other studies in stroke, and how good are the animal models?" says Gilman. "So many agents appeared to be effective in the animal model and failed in human trials."

Because of these failures, hundreds of millions of dollars, and a potential approach to stroke treatment, have disappeared down the drain. The failure of NXY-059 may have stalled the quest for a neuroprotective agent, at least for some time. "This trial has poisoned stroke studies," says Gilman. "I'm doubtful that investors will want to invest in clinical stroke trial[s] for a while." The fault, it appears, may rest in the slipshod use of animal models.

In 1998, Fisher flew from Boston to Germany to help a drug company, along with academics specializing in animal modeling, to examine two sets of clinical trial results for a new stroke treatment that had failed. They wanted to uncover where they had gone wrong. (Fisher declines to disclose the company and trials that were involved.)

Human brain scan of cerebral hemorrhaging four weeks after a stroke; the blue indicates internal bleeding. AstraZeneca also tested NXY-059 in the CHANT clinical trials to treat intracranial hemorrhage, but further development past the Phase II trial was halted after the failure of SAINT II.

On his return flight, it occurred to Fisher that the chaos the stroke research field had been facing for years might benefit from the kind of meeting he had just attended: industry and academia collaborating to develop standardized practices. The next year, Fisher convened the first Stroke Therapy Academic Industry Roundtable (STAIR) group that devised a set of recommendations for preclinical and clinical stroke drug development. On the preclinical side, some of the recommendations seemed obvious: The candidate drug should be evaluated in rodents and also higher animal species; blind testing should be performed; tests should be done in both sexes and in varying ages of animals; and all data, both positive and negative, should be published.

Approximately 26 million animals are used for research each year in the United States and European Union, according to estimates by the Research Defense Society in the United Kingdom. However, the number of animal procedures has been reduced by half over the past 30 years, likely due to stricter controls, improvements in animal welfare, and scientific advances.

Still, unlike in human clinical trials, no best-practice standards exist for animal testing. STAIR is the stroke research community's attempt at standardization. NXY-059 was the first neuroprotective agent to be developed under the auspices of the STAIR guidelines, though the implementation of the guidelines may have been just lip service. In particular, as Savitz wrote in an article published online in Experimental Neurology in May, the preclinical testing had several holes, including a lack of statistical robustness and flaws in the way the results were translated into clinical design.1

The main problems, Savitz writes, were randomization and bias. In the initial evaluation of NXY-059 in rat models of focal ischemia, reports didn't say whether researchers had been blinded with regard to drug administration, behavioral testing, and histologic analysis. The results from the rodent study were mixed, showing a range of reduction of cerebral infarction size over a variety of intervals. However positive the results might have been, Savitz notes, the clear lack of statistical robustness calls any result into question. A subsequent report on the effects of NXY-059 in a rabbit embolic model showed a 35% reduction in infarction after 48 hours, but it did not indicate whether statistical analysis, blind testing, physiologic measurements, blood flow monitoring, or behavior assessments had been done.

AstraZeneca maintains that the preclinical animal tests and the clinical phase of SAINT adhered to the STAIR guidelines: "The design of the SAINT trials was sound and well considered in light of the strong evidence for neuroprotection that existed across the models and species tested at the time," according to a statement sent to The Scientist in response to Savitz's paper. Gilman, also editor-in-chief of Experimental Neurology, says he is not aware of any official response being drafted or submitted by AstraZeneca.

"So many agents appeared to be effective in the animal model and failed in human trials."
- Sid Gilman

The statistical troubles that mired some of the NXY-059 preclinical trials are common in animal models. Surveys of papers based on animal models find errors in about half, according to Michael Festing, a recently retired laboratory animal scientist at the UK Medical Research Council and board member of the National Center for Three Rs (NC3Rs: replacement, refinement, and reduction), an organization that advocates using fewer animals in research and streamlining current animal tests. "Whether those are serious enough that the conclusions are invalid is debatable," Festing says.

Even the many successful cases of animal experimentation that led to effective treatments for high blood pressure, asthma, and transplant rejection, and to the polio, diphtheria, and whooping cough vaccines, were carried out without standardized testing methods.

"People don't report if studies are randomized," says Ian Roberts, professor of epidemiology at the London School of Hygiene and Tropical Medicine. How animals are selected, or whether assessments were blind, are rarely included in the methods and thus create a potential for bias. "Imagine a cage of 20 rats, and you've got a treatment for some," explains Roberts. "So you stick your hand in a cage, and pull out a rat. The rats that are the most vigorous are hardest to catch, so when you pull out 10 rats, they're the sluggish ones, the tired ones, they're not the same as the ones still in the cage, and they're the control. Immediately there's a difference between the two groups."

The NC3Rs, in cooperation with the National Institutes of Health, is surveying a group of 300 papers, half from the United Kingdom, half from the United States, for their statistical quality in mouse, rat, and primate model studies. Researchers hope that by fall they will have a report describing how well (or not) the studies were randomized and whether they used the correct statistical methods. In an initial pilot study of 12 papers conducted in 2001 for the Medical Research Council, Festing reported: In six of the papers the number of animals used wasn't clear; only two of the papers reported randomization; and only six of the papers specified the sex of the animals tested. (For more on how gender can influence results, see "Why Sex Matters".)

Illustrations by Joelle Bolt

Statistics aren't the only problem. Methodology is arbitrary, replication is lacking, and negative results are often omitted. A report in Academic Emergency Medicine by Vik Bebarta et al. in 2003 showed that animal experiments where randomization and blind testing are not reported are five times more likely to report positive results.2 In a December 2006 paper in the British Medical Journal, Pablo Perel et al. showed that in six clinical trials for conditions including neonatal respiratory distress syndrome, hemorrhage, and osteoporosis, only three of the trials had corresponding animal studies that agreed with clinical results.3 The authors attribute this discrepancy to poor methodology (i.e., bias in the animal models) and the failure of the models to mimic the human disease condition.

The difficulties associated with using animal models for human disease result from the metabolic, anatomic, and cellular differences between humans and other creatures, but the problems go even deeper than that.

When experimenting on animals, researchers often use incorrect statistical methods, adopt arbitrary methodologies, and fail to publish negative results.

One of the major criticisms of the NXY-059 testing was the lack of correlation between how the effects of the drug were monitored in animals versus in humans. In the rodent model, researchers induced an ischemic event, administered the drug at various time intervals, and measured the size of the infarction. During the clinical trials, however, the drug's effect was evaluated in stroke patients using behavioral indicators such as the modified Rankin scale and NIH stroke severity (NIHSS) scale. In the primate tests the behavior assessments were based on a food-reward system, showing that NXY-059 did not improve left arm weakness in the aftermath of a stroke. "Even if we accept that NXY-059 does improve arm weakness," writes Savitz, "how would such a finding translate to human acute stroke studies that use the modified Rankin scale and NIHSS scores as primary outcome measures?" Indeed, some consider the two phases of testing, from animal to human, completely out of whack, and that only by statistical fluke was SAINT I, the first clinical trial, deemed a success.

Some say that animal research is best when targeted at specific mechanisms of action. "Animals are better used for understanding disease mechanism and potential new treatments, rather than predicting what will happen in humans," says Simon Festing, executive director of RDS (and son of Michael Festing). RDS is a UK organization that promotes public understanding of animal research in medicine. "The 2001 Nobel Prize in medicine involved sea urchins and yeast, organisms that evolved apart from humans by millions of years," says the younger Festing. "And yet, they are ideal models for studying cell division, research that is being used in cancer therapeutics in humans now."

For specific models of human disease, Simon Festing adds, the farther away from the human species the animal studies get, the less predictive the model will be. For example, researchers studying some conditions, including Parkinson disease, have established a clear animal model. The primate model displays symptoms similar to human symptoms, whereas a mouse model may not be able to show the distinct tremor in the limbs. While this difference ultimately reflects fundamental anatomic variation among species, finding the best model is inherently difficult.

"The choice of animals is rather narrow," says Michael Festing. "There are 4,000 species of rodents, but we use only three or four of them. Then there's a shortage of anything that's not rodents, and in some cases we're restricted to dogs and cats, which are a problem from the ethical point of view, and primates, also a problem from the ethical point of view. So [choosing the right animal model is] sort of done by default: Eliminate the ones that are not suitable and choose from what's left."

Perhaps because of its abundance and short gestation, the mouse has become the flagship of animal testing, especially useful with genetic modifications, gene knockouts, and knockins. In 2003, NIH launched the Knockout Mouse Project (KOMP) and has awarded more than $50 million with the goal of creating a library of mouse embryonic stem cell lines, each with one gene knocked out.

Nonetheless, even genetically manipulated mice have their problems. The current knockout mouse model for amyotrophic lateral sclerosis (ALS) may be completely wrong, according to John Trojanowski at the University of Pennsylvania School of Medicine. He and colleagues recently showed that two versions of the disease, sporadic and hereditary, are biochemically distinct, and that a different mechanism controls the disease in each case.4 In hereditary ALS the disease is associated with mutations in the SOD-1 gene, whereas the sporadic cases are associated with the TDP-43 protein. Until now, research has focused primarily on SOD-1 knockout mice, with virtually no success in human trials. The new findings relating to the TDP-43 protein suggest that the SOD-1 knockout model for ALS could be wrong. "There was this nagging doubt" about the validity of the current models, Trojanowski says. "And there may be a whole new pathology characteristic, so we need models based on TDP-43."

A recent study at the Massachusetts Institute of Technology shows distinct differences between gene regulation in human and mouse liver, particularly in how the master regulatory proteins function.5 In a comparison of 4,000 genes in humans and mice, the researchers expected to see identical behavior, that is, the binding of transcription factors to the same sites in most pairs of homologous genes. However, they found that transcription factor binding sites differed between the species in 41% to 89% of the cases.

Many of the underlying limitations associated with mouse models involve the inherent nature of animal testing. The laboratory environment can have a significant effect on test results, as stress is a common factor in caged life. Jeffrey Mogil, a psychology researcher at McGill University in Quebec, demonstrated last year that laboratory mice feel "sympathy pains" for their fellow labmates. In other words, seeing another mouse in distress elevates the amount of distress the onlooker displays. The average researcher, when testing for toxicity effects in mice, for example, likely assumes that the mice are starting at a pain baseline, when in truth the surrounding environment is not benign and can significantly affect results, Mogil says.

Choosing the right animal model is "sort of done by default: Eliminate the ones that are not suitable and choose from what's left." -Michael Festing

In new research, Mogil's group is demonstrating that the very presence of a lab researcher can alter behavior in mice. "The surprising thing is that these effects are visual, not auditory or olfactory," he says. "It's a huge surprise. Most people think [mice] are mostly blind anyway. I'm being convinced that the visual world of the mouse is a lot richer than expected."

Although the failure of NXY-059 may be one insult too many for clinicians and patients eagerly awaiting a neuroprotective agent, some experts feel that this hurdle is far from being the final chapter. Whether they blame weak animal test standardization, poor clinical design, or inadequate statistical analysis, questions often return to NXY-059 itself as an indicator for the future of neuroprotection. "This drug is known to have antioxidant effects, but it was never shown what its mechanism was on the brain. Early studies were only hinting at possibilities," Savitz says.

In a field where much work is concentrating on nitrone-based spin trap agents, NXY-059 became the parent compound. But it's clear that it wasn't the answer. "The drug probably isn't a good drug to begin with," says Myron Ginsberg, professor of neurology and clinician at the University of Miami School of Medicine. Despite NXY-059's disappointing failure, other neuroprotective options are still in the pipeline. Ginsberg is in the early stages of working on albumin as a neuroprotective therapeutic, and researchers are also considering hypothermia as a way of preserving brain cells after ischemic stroke. "The fact that this drug failed," Ginsberg says, "doesn't say anything about the potential for neuroprotection [in the future]."

1. S. Savitz, "A critical appraisal of the NXY-059 neuroprotection studies for acute stroke: a need for more rigorous testing of neuroprotective agents in animal models of stroke," Exper Neurol, 205:20-25, May 2007.
2. V. Bebarta et al., "Emergency medicine research: Does use of randomization and blinding affect the results?" Acad Emerg Med, 10:684-687, 2003.
3. P. Perel et al., "Comparison of treatment effects between animal experiments and clinical trials: systematic review," Brit Med J, 334:197, Jan. 27, 2007.
4. I.R. Mackenzie et al., "Pathological TDP-43 distinguishes sporadic amyotrophic lateral sclerosis from amyotrophic lateral sclerosis with SOD1 mutations," Ann Neurol, 61:427-434, May 2007.
5. D.T. Odom et al., "Tissue-specific transcriptional regulation has diverged significantly between human and mouse," Nat Genet, 39:730-732, June 2007.
