Where is the toxicology for the twenty-first century?

Following strong demand, we are pleased to present Dr. Silvetti's piece in English. The original Italian version is available here.

Where is the “toxicology for the twenty-first century”?

by Massimo Silvetti

The opinion piece entitled “Toxicology for the twenty-first century”, written by Thomas Hartung (Hartung, 2009), is widely cited by antivivisectionist associations for its criticisms of animal testing (AT) in toxicology. Published in the “Horizons” section (not subject to peer review) of the journal Nature, it can be considered a milestone for AT detractors. The work challenges the opinion of the majority of the scientific community, which considers the use of AT in toxicology necessary and valid. The aim of our comment is to analyze this opinion piece in detail, above all with regard to its criticisms of AT. From here on, we will often refer to “Toxicology for the twenty-first century” simply as the “opinion piece”.

1. Article summary
The opinion piece opens by describing the European protocol REACH (Registration, Evaluation, Authorisation and Restriction of Chemicals), aimed at evaluating the safety of any chemical compound for human use. According to the author, the main weak point of this protocol is that it is largely based on AT. He suggests that using AT in toxicology (both in chemistry and in pharmacology) typically generates unreliable results, because of the differences in physiology between species (including humans). For this reason, in order to optimize safety, current protocols prescribe testing on several different species and monitoring a wide range of “endpoints”, i.e. clinical variables used to detect possible toxicities (e.g. renal or cardiac function). The author considers these procedures too strict and liable to generate a large number of false positives, i.e. they could classify many non-toxic compounds as toxic. Including these procedures in REACH would have serious consequences, as it would erroneously suggest the withdrawal of non-toxic chemicals that have been on the market for more than thirty years. The last part of the opinion piece proposes that increasing and refining the use of alternative techniques, e.g. toxicogenomics (in vitro) and computer simulations (in silico), could improve the reliability of toxicological tests, solving the problems described above.

2. Arguments against AT
“Toxicology for the twenty-first century” is mainly based on a negative evaluation of the reliability of AT in toxicology. This aspect is particularly relevant not only to the REACH protocol, but also to antivivisectionist associations asserting the scientific uselessness of AT. The author bases his criticisms on the results of five scientific works (references 3 to 7 in the original article). Here we compare the results as reported in “Toxicology for the twenty-first century” with those reported in the original references.
The first cited work is a report edited by the NIEHS (National Institute of Environmental Health Sciences) (NIEHS, 2006), in which a panel of experts evaluated some in vitro methods aimed at determining the starting doses for in vivo toxicological tests (especially for the Lethal Dose 50%, LD50: the dose of a compound needed to cause death in 50% of subjects). The opinion piece reports that the average correlation between LD50 in rats and the lethal blood concentration in humans is weak, with a value of 0.56. Actually, the correlation that follows from the original document is about 0.75, considerably greater than 0.56. More specifically, the NIEHS document reports the correlation between LD50 in rats and LC50 (Lethal Concentration 50%, i.e. the blood concentration of a compound that causes death in 50% of subjects) in humans. As the two variables do not measure the same phenomenon (e.g. the LC50 is independent of absorption), it would be difficult to judge whether a correlation of 0.56 is satisfactory or not. This consideration, however, is not the most relevant one. Indeed, the original reference states that the reported value of 0.56 is not the correlation coefficient but the coefficient of determination (R²), i.e. the square of the correlation coefficient (pp. ix and 21, original document). Therefore, the correlation coefficient is approximately 0.75 (the square root of 0.56), not 0.56. Taking into account the differences between LD50 and LC50, this value is decidedly high.
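The conversion between the two statistics is simple arithmetic. The following minimal Python sketch only restates the step described above; it assumes a simple linear relationship, for which the correlation coefficient is the square root of R², and it assumes a positive sign for the correlation.

```python
import math

# Coefficient of determination as cited from the NIEHS report.
r_squared = 0.56

# For a simple linear regression, |r| is the square root of R^2.
# The positive sign is an assumption made for this illustration.
r = math.sqrt(r_squared)

print(f"R^2 = {r_squared:.2f} -> r = {r:.2f}")
```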
The second cited work (Basketter et al., 2004) deals with the assessment of the skin irritation potential of 65 compounds, tested on both rabbits and humans. The opinion piece reports that 40% of the compounds were irritant for rabbits but not for humans: apparently, a disappointing result. To evaluate this result correctly, however, we need to consider one crucial aspect: the experimental protocol used for human subjects (4-h patch test) was specifically designed not to cause serious inflammatory reactions, while the animal protocol was optimized to maximize sensitivity (in order to rule out the possibility of underestimating the irritation potential for humans). If we analyze Table 2 of the original study (page 3), which reports the toxicity results for humans and rabbits, we learn that the sensitivity of the animal test is 97%, i.e. the probability of a false negative (dangerous for human health) is 3%. Hartung implicitly refers to specificity, which reflects the proportion of false positives (irritant for rabbits but not for humans). In summary, the in vivo dermatological tests on rabbits proved useful for protecting human health (because they are sensitive), although they can generate false positives. The latter are fully explained by the difference between the two protocols compared: intense exposure to chemicals for rabbits, mild exposure for humans.
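To make the distinction between sensitivity and specificity concrete, here is a minimal Python sketch. The counts are invented purely for illustration (a hypothetical set of 200 compounds) and are not taken from Basketter et al. (2004); the function names are our own.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly irritant compounds that the test flags as irritant."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of truly non-irritant compounds that the test clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, NOT data from the cited study.
tp, fn = 97, 3    # irritant in humans: caught / missed by the animal test
tn, fp = 60, 40   # non-irritant in humans: cleared / flagged by the animal test

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # low false-negative rate
print(f"specificity = {specificity(tn, fp):.2f}")  # many false positives
```

With these made-up numbers the test misses very few irritant compounds while still flagging many harmless ones, which is the pattern discussed above: a test tuned for safety can have high sensitivity and modest specificity at the same time.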
The third work (Gottmann et al., 2001) is cited to show the impossibility of generalizing toxicological results from one species to others. This work deals with the in vivo reproducibility of the assessment of carcinogenic potential for 121 compounds. It compares the results from the general scientific literature with those from the National Cancer Institute/National Toxicology Program (NCI/NTP). The authors reported a concordance of 57% (abstract and p. 511, original document). We do not discuss this specific result, although it is worth noting that the authors themselves stated that other studies had found a very high reproducibility of carcinogenic effects, between 76 and 93% (p. 513, original document). It is more relevant, instead, to point out that this work does not deal with the possibility of generalizing toxicological results between species, but rather between databases (NCI/NTP and general literature) within the same species (mice and rats, see Introduction, p. 509, original work). For this reason, using this citation to corroborate the hypothesis that AT results cannot be generalized must be considered inappropriate.
The fourth work (Schardein et al., 1985) is cited as further support against the possibility of generalizing data from one species to others. This work compares the results on teratogenicity (fetal toxicity) of several compounds in many different species, including humans. The opinion piece reports a concordance between species of 53 to 60%, concluding that the results obtained in one species can hardly be generalized to other species. First of all, this work too shows a very high sensitivity of animal tests: when a compound was safe in animals, it was very likely to be safe in humans as well (see Discussion, p. 65). At the same time, the authors reported a certain number of false positives (teratogenic for animals, but safe for humans). The authors explained these results, analogously to what we observed in the second surveyed study (Basketter et al., 2004), as due to the differences in dosage between human and nonhuman subjects. Indeed, in vivo teratogenicity tests typically involve massive administration of compounds, a condition that is seldom met in humans (Discussion, p. 65). We now turn to the between-species concordance. We were not able to find the range of between-species concordance values (53-60%) reported in the opinion piece. We therefore analyzed the tables reported in the original study directly. These tables show the teratogenicity results in several animal models for the following compound groups: a) compounds teratogenic for humans (Table 3, p. 60); b) compounds that, up to now, have not shown teratogenic potential for humans (Table 6, p. 63); c) compounds that are likely teratogenic for humans (Table 7, p. 64). Our analysis of the tables yields a between-species concordance of 56-75%*. This range is very conservative (i.e. it is very strict), because of the variance due to the many different experimental protocols compared in the analysis.
Finally, it is worth noting the age of this study (published in 1985), which limits how reliably its results can be generalized to current toxicological protocols.
The fifth and last article (Olson et al., 2000; unfortunately we did not find an open access version) is cited to support the hypothesis that toxicological tests in pharmacology are not predictive for humans. Based on this study, the opinion piece reports that only 43% of toxic effects in humans were predicted by tests on rodents, and that the percentage remained low (63%) even when the data from many species were analyzed together. The first thing to report is the citation error on the aggregated multispecies result: the value is 71% and not 63% (Abstract, original article). Nonetheless, this is not the most important observation. The key point is that this work, as specified by the authors themselves (Discussion, first paragraph), does not deal with the predictive value of AT for humans: “This study did not attempt to assess the predictability of preclinical experimental data to humans. What it evaluated was the concordance between adverse findings in clinical data with data which had been generated in experimental animals (preclinical toxicology).” (p. 65, Discussion). Indeed, this work reports the true positive concordance between toxic effects found during the preclinical phase (AT) of drug development and those found in the subsequent clinical phase (conducted on human subjects). To read these results properly, it must be considered that preclinical and clinical toxicological data are not independent. Only the compounds that passed preclinical tests (the safest and most effective) enter the clinical phase; therefore, toxicological data from the clinical phase are conditioned by preclinical results (i.e. they are obtained on compounds that proved safe in animals). This generates an overestimation of false negatives (i.e. safe in animals but toxic in humans), with a consequent reduction of the concordance between preclinical and clinical data. For this reason, the authors themselves considered the reported concordance satisfactory (Abstract).

3. Conclusions
The criticisms against AT expressed in “Toxicology for the twenty-first century” proved to be based on a questionable (in some cases simply wrong) use of the literature. It must be stressed that here we surveyed exclusively the literature specifically selected to support the criticisms against AT; we therefore analyzed this issue from a very conservative perspective. As all the considerations on the REACH protocol expressed in the opinion piece are based on these references, the validity of the whole work is compromised. “Toxicology for the twenty-first century” had a good resonance within academia and a wide resonance among the general public, as it is often cited by antivivisectionist associations as evidence of the uselessness of AT. The literature analysis conducted in our comment raises substantial doubts about the opinions on AT and the REACH protocol expressed in Hartung’s article.
To conclude, we take the opportunity to debunk the belief (spread across many websites) that associates the opinion piece we commented on here with the alleged statement that AT is “bad science”. Actually, AT is never criticized so harshly in “Toxicology for the twenty-first century”, and no similar statement appears in it. On the contrary, it clearly specifies that current alternatives to AT are not yet mature enough for a complete replacement: “Even if the use of alternatives to animal studies, such as cell-culture based testing, were feasible, such methods do not have fewer limitations, except for ethical ones.” (p. 210, last paragraph). Finally, the author wisely recommends a refinement of animal models in toxicology, rather than a complete replacement: “The solution to using fewer animals and making better predictions in the mid-term is to design integrated testing strategies.” (p. 210, second paragraph).


Basketter DA, York M, McFadden JP, Robinson MK, Determination of skin irritation potential in the human 4-h patch test. Contact Dermatitis, 51, 1–4, 2004.
Gottmann E, Kramer S, Pfahringer B, Helma C, Data quality in predictive toxicology: reproducibility of rodent carcinogenicity experiments. Environmental Health Perspectives, 109, 509-514, 2001.
Hartung T, Toxicology for the twenty-first century. Nature, 460, 208–212, 2009.
NIEHS (National Institute of Environmental Health Sciences), The Use of In Vitro Basal Cytotoxicity Test Methods For Estimating Starting Doses For Acute Oral Systemic Toxicity Testing. 2006.
Olson H et al., Concordance of the Toxicity of Pharmaceuticals in Humans and in Animals. Regulatory Toxicology and Pharmacology, 32, 56–67, 2000.
Schardein JL, Schwetz BA, Kenel MF, Species Sensitivities and Prediction of Teratogenic Potential. Environmental Health Perspectives, 61, 55–67, 1985.
* Algorithm for between species concordance computation:
For each table we computed the average concordance over all possible pairs of species. Example: Table 3 displays the teratogenicity results of 15 compounds (rows) in 10 species (columns). Therefore, we have 45 possible pairs of species (the binomial coefficient “10 choose 2”). We then computed the average concordance across pairs. When measuring the concordance within each pair, we excluded the rows with missing data (teratogenicity data missing for at least one of the two species). We excluded from the analysis the pairs with fewer than 3 data points (fewer than 3 rows).
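The procedure in this footnote can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the script actually used for the analysis: the function names, the data layout (one list of results per species, with None for missing data) and the toy table are all hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Optional

# True = teratogenic, False = not teratogenic, None = missing data.
Result = Optional[bool]


def pair_concordance(a: List[Result], b: List[Result],
                     min_points: int = 3) -> Optional[float]:
    """Fraction of compounds with the same result in both species.

    Rows missing in either species are dropped; pairs left with fewer
    than min_points rows are excluded (None is returned).
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(shared) < min_points:
        return None
    return sum(x == y for x, y in shared) / len(shared)


def mean_pairwise_concordance(table: Dict[str, List[Result]],
                              min_points: int = 3) -> Optional[float]:
    """Average concordance over all species pairs that have enough data."""
    values = []
    for s1, s2 in combinations(table, 2):
        c = pair_concordance(table[s1], table[s2], min_points)
        if c is not None:
            values.append(c)
    return sum(values) / len(values) if values else None


# Number of species pairs for 10 species, as in the Table 3 example.
print(math.comb(10, 2))

# Toy table with made-up values (species -> results per compound).
toy = {
    "species_a": [True, True, False, True],
    "species_b": [True, False, False, None],
    "species_c": [None, None, False, True],
}
print(mean_pairwise_concordance(toy))
```

In the toy table only the pair species_a/species_b has at least 3 shared rows, so the other two pairs are excluded and the average equals that single pair's concordance (2 matches out of 3 rows).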

Massimo Silvetti (Ph.D.) works as a post-doctoral research fellow at Ghent University (Belgium). He works in neuroscience; more specifically, he is experienced in computer simulations of neural circuits (computational neuroscience) and in neuroimaging (functional MRI). He has authored several articles published in international journals and co-authored two books on neural modeling and neuroimaging. Currently, Dr. Silvetti’s interests are focused on the brain reward system and on the neuropathogenesis of attention deficit hyperactivity disorder (ADHD).
