Introduction: The Design and Statistical Analysis of Animal Experiments. ILAR 43 (4): 191.
One of the main problems with research is poor statistical training and a general reluctance among biologists to do mathematics. Good experimental design can make statistical analysis more accessible and less intimidating. Animal experiments are done to discover something about the biology of the species, strain, or sex being studied and, indirectly, to infer something about humans or other target species. The use of animal models involves the following steps: 1) choosing a suitable model based on current knowledge of disease processes in the target species and in potential model organisms; 2) running experiments to show how the model responds to the applied treatments; and 3) assessing the relevance of the results for the target species. Ethical issues should be considered first when planning animal studies. One should weigh Russell and Burch's "3Rs": 1) replacement: whether the animal model can be replaced by a less sentient or nonsentient alternative; 2) refinement: refining the experiment to minimize pain and distress for each animal; and 3) reduction: reducing the number of animals to the minimum required to achieve the scientific objectives. All animal models are subject to biological variation arising from genetic and nongenetic sources and the interaction between them. Good experimental design aims to control this variation so that it does not obscure any treatment effect, and the statistical analysis is designed to extract all useful information and take any remaining variation into account. Likely sources of variation should be reviewed when studies are first set up. One of the most serious causes of interindividual variation is clinical or subclinical disease. Less well known are some of the effects of animal source, husbandry, housing (including environmental enrichment), diet, and social interaction among grouped animals.
It is important to survey the literature, clarify the research questions, select model systems, and review available skills, equipment, and staffing. One should also identify collaborators and involve a statistician if needed. Treatments and outcomes should be chosen, and a tentative design formulated, early on. Determining the appropriate number of animals is particularly important; statistical power and sample size calculations provide some guidance. Experiments involve numerous variables, including 1) "ancillary" variables (e.g., species, strain, sex, husbandry, model preparation, environmental aspects); 2) variables such as diurnal variation, observer differences, and measurement error; and 3) variables such as differences between animals. Many methods are available for dealing with these variables.
Most experiments use simple designs such as completely randomized, randomized block, and simple factorial designs. Advanced factorial design offers one method of assessing which, among a large number of controllable variables, are most important in determining the response and which can be ignored. Many combinations of factors can be tried without using excessive animal numbers. The use of animals in acute toxicity tests to determine a median lethal dose (LD50) is controversial because it may involve excessive suffering. Improved methods, such as the "up-and-down" method, have been developed that use fewer animals by relying on a sequential procedure. The "fixed dose procedure" and the "acute toxic class" methods have similar objectives of classifying chemical toxicity but they do not result in LD50 estimates.
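The idea of trying many factor combinations without excessive animal numbers can be sketched in a few lines. The example below enumerates a full factorial layout and then a half-fraction of a two-level design. The factor names and the choice of a half-fraction are illustrative assumptions, not details taken from the article.

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in product(*(factors[name] for name in names))]


def half_fraction(names):
    """Half-fraction of a two-level design with levels coded -1/+1.

    Keeps only the runs whose coded levels multiply to +1, which halves
    the number of treatment combinations while keeping each factor balanced.
    """
    runs = []
    for combo in product((-1, 1), repeat=len(names)):
        sign = 1
        for level in combo:
            sign *= level
        if sign == 1:
            runs.append(dict(zip(names, combo)))
    return runs


# Hypothetical factors, chosen only for illustration.
factors = {"strain": ["A", "B"], "sex": ["F", "M"], "diet": ["standard", "test"]}
design = full_factorial(factors)
fraction = half_fraction(["strain", "sex", "diet", "bedding"])
print(len(design), "full-factorial runs;", len(fraction), "half-fraction runs")
```

A half-fraction gives up the ability to separate some higher-order interactions, so a statistician should confirm that the confounding pattern is acceptable before animals are allocated.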
Design, analysis, and interpretation of biomedical experiments are best performed with the aid of a good statistical textbook, dedicated statistical software, and advice from a statistician.
1. Name some sources of nongenetic biological variation in animal models.
2. Name the 3Rs. Who gets credit for developing the 3Rs?
3. Name 4 different kinds of animal models.
4. What are some ways to deal with the numerous variables which can occur in any experiment?
1. Chance developmental effects, social hierarchy, and unequal exposure to environmental influences.
2. Replacement, Refinement, Reduction. Russell and Burch get credit for developing the 3Rs in 1959.
3. Spontaneous: Occurs naturally (e.g., by mutation) and is similar to a disease condition of man. Experimental: Induced. A disease or condition is produced by a variety of means to create an animal model that mimics a condition or disease in man (e.g., drug study, surgical model). Negative: A negative or "non" model. This animal is resistant to the specified disease or condition; its value lies in determining why it is resistant. Orphan: An animal disease or condition for which no comparable disease exists in man. May include diseases that show a dissimilar pathogenic mechanism or, conversely, human diseases for which there are no good animal models.
4. 1) Include variables in the experimental design by using randomized block or Latin square designs; 2) use these variables in the statistical analysis in screening for outliers; 3) use techniques such as covariance analysis to increase statistical power; or 4) simply record these variables so other people know exactly what has been done.
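As an illustration of the Latin square design mentioned in answer 4, the sketch below builds a square in which every treatment appears once in each row and once in each column (rows and columns might stand for, say, day and cage position). The function name and treatment labels are hypothetical. Shuffling the rows and columns of a cyclic square is a simple randomization; it does not sample uniformly from all possible Latin squares.

```python
import random


def latin_square(treatments, seed=None):
    """Return a randomized Latin square as a list of rows."""
    rng = random.Random(seed)
    n = len(treatments)
    # Cyclic square: row i is the treatment list rotated by i places.
    square = [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]
    rng.shuffle(square)          # permute the rows
    columns = list(range(n))
    rng.shuffle(columns)         # permute the columns
    return [[row[c] for c in columns] for row in square]


square = latin_square(["A", "B", "C", "D"], seed=1)
for row in square:
    print(" ".join(row))
```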
Control of Variability. ILAR 43 (4): 194.
Variability in experiments that feature live animals is attributable to one of three sources:
1. Variability introduced by the experimenter. This arises from a) how procedures are conducted and b) lack of precision in measurements; together, these often make experimenter-introduced variability the predominant source of variation in experimental results. Variability in the conduct of procedures is minimized by "ensuring performance in a uniform, precise, and robust manner": for example, using fresh syringes and needles for each animal, delivering single rather than multiple doses with each syringe, avoiding spillage, using sterile technique, handling animals carefully and deftly to minimize stress, standardizing surgical techniques, and measuring residual variability (blood loss volume, weight of excised tissue, precise anatomical location of interventions) so that it can be accounted for during statistical analysis. Although some instruments and techniques (e.g., bands in electrophoretic gels, optical density) yield imprecise measurements because of factors beyond the scientist's control, variability due to imprecise measurement is usually minimized by ensuring that appropriate measurement techniques are conducted correctly and by making multiple observations of the variables of interest. Observations are inherently imprecise; techniques to achieve consistency include cross-checking observer performance against consistent test standards, estimating interobserver variability, and minimizing it through training and performance comparisons. "Variance components" analysis can quantify within- and between-subject variability and makes it possible to estimate how many repeat observations are needed to attain a given level of precision.
Ethological investigators must choose which measurements of behavior are most appropriate. Traditional approaches sacrifice holistic interpretation in favor of collecting discrete behavioral components in order to gain confidence through statistical analysis, yet analytical outcomes for such small components of behavior may not adequately describe the behavior of interest. When several types of behavior are assessed, multivariate analysis can be used to determine which types, if any, are correlated. Behavior patterns may be broken down into quantifiable components (frequency, duration, intensity) that vary with time and in relation to each other, as well as to some trigger event, but binary assessment is difficult because of observer differences. Activity meters, while of no help with interpretation, yield robust numerical data for statistical analysis and are especially useful in free-ranging conditions. Alternatively, behaviorists often place subjects in situations (T or Y mazes, conditioning chambers, discriminant learning apparatuses, etc.) in which the behavior of interest can be reliably prompted and the animal's responses robustly recorded. Population variance is typically lower in whole-animal physiological and pharmacological studies because many of the variables inherent in behavioral and ethological observations are controlled: strain, sex, age, and weight are often independent of the study, anesthesia may isolate subjects from conscious contact with the environment, and radiotelemetry allows data collection from conscious but unrestrained and relatively unstressed animals. Collecting data over time allows reliability to be estimated, and repeated-measures designs can both reduce the number of animals used and make the resulting data more relevant to normal physiological conditions.
To cultivate reproducible experimental information and to provide assurance of the quality of work performed, pharmacological investigations are often conducted under good laboratory practice (GLP) standards. The GLP standard of experimentation, introduced by the FDA in the early 1970s to ensure the reliability of data submitted for registration of medicinal compounds in the United States, has led to greater precision in the conduct of experimental procedures and a considerable reduction in the variance of measurements collected in scientific studies.
2. Inherent variability between animals. This type of variability arises from differences in genetic makeup and from variables in the macro- and microenvironment that affect animals as they develop; the primary areas are the source of the animal, the animal care routine employed, exposure to different environments, and health status. Specific examples: genetic drift in the founder stock of different vendors; differences in rearing practices; obscure effects (such as intrauterine position) responsible for environmentally related sex ratio alterations in offspring; physiological changes caused by differences in temperature, humidity, light, sound, position of the cage in the rack or room, type of bedding, and individual vs. social housing; and subclinical disease. Variability of this type is controlled by using homogeneous populations (e.g., SPF animals from one source, housed identically, handled by the same caretaker), by randomizing the assignment of subjects to experimental groups, and/or by grouping and accounting for such sources of variability in the data analysis (analysis of covariance, or ANCOVA). If homogeneous populations are not used, then even if the animals are conditioned before experimental use, their source should be factored into the experimental design and appropriate controls run to account for differences in adaptation. KEY CONCEPT: Greater population variability means that more animals must be used to detect a given biological change (i.e., sensitivity); controlling this source of variability allows fewer subjects to be used (reduction).
3. Induced variability resulting from interaction between characteristics of the animal and its environment. Compounded variability arises from the interplay between inherent differences between animals and variables associated with the experimental protocol or with how the animals are kept and handled. It happens because animals respond physiologically, biochemically, and behaviorally to environmental stress or stressors; it is not easily measured and is the most difficult of the three sources of variability to predict and control. Examples of environmentally induced variability include the development of social hierarchies in response to environmental enrichment and/or group housing (with variation in the size and/or longevity of established hierarchies), group competition arising from the return of individuals removed for procedures, and the stress of individually housing "normally" social species. Some control is achieved by using stable (smaller) groups in which the hierarchy is fully established and by avoiding changes in group composition when possible. Examples of interaction with experimental routines include variation in response to different caretakers, variation resulting from the time of day at which procedures are conducted (circadian rhythms), variation associated with the timing of experimental work relative to husbandry procedures (cage change-outs, etc.), and position in the dosing sequence. Methods to control this type of variability include using randomized block designs for dosing, ensuring appropriate training for those conducting procedures, using centralized facilities and/or professional staff, acclimatizing animals, and using positive reinforcement (rewards) immediately after procedures. KEY CONCEPT: This type of variability is often synergistic and may result in a population variance that is substantially higher than would be predicted if the effects were simply additive.
As a result, the number of animals required to conduct a study of a given statistical power, and to look for a certain size of treatment effect, might be greatly increased.
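The "variance components" analysis mentioned under source 1 can be sketched for the simplest case: the same number of repeat measurements on each animal, analyzed as a one-way random-effects layout. The function names and the readings are hypothetical, and the estimator shown (method of moments from the ANOVA mean squares) is one common choice, not necessarily the one used in the article.

```python
from math import ceil
from statistics import mean


def variance_components(groups):
    """Estimate (between-animal, within-animal) variance.

    groups: one list of repeat measurements per animal, all the same length.
    """
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes balanced data")
    grand = mean(x for g in groups for x in g)
    means = [mean(g) for g in groups]
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    var_within = ms_within
    # A negative estimate is truncated at zero.
    var_between = max(0.0, (ms_between - ms_within) / n)
    return var_between, var_within


def repeats_needed(var_within, target_se):
    """Repeat observations per animal so the SE of its mean is <= target_se."""
    return ceil(var_within / target_se ** 2)


# Hypothetical duplicate readings on three animals.
readings = [[10, 12], [20, 22], [30, 32]]
var_between, var_within = variance_components(readings)
print(var_between, var_within, repeats_needed(var_within, target_se=0.5))
```

When the within-animal component is small relative to the between-animal component, as in these made-up readings, extra repeat observations add little and more animals are the only way to gain precision.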
1. Three sources of variability in experiments that feature live animals?
2. Which of the three sources of variability is the most predominant source of variation in experiments that feature live animals?
3. Which of the three sources of variability is the most difficult to predict and control?
4. T/F: It is often appropriate to regard different observers as independent variables and to include them in any analysis as a check on interobserver consistency.
1. a) variability introduced by the investigator,
b) inherent variability between animals, and
c) induced variability resulting from interaction between characteristics of the animal and its environment.
2. variability introduced by the investigator
3. induced variability resulting from interaction between characteristics of the animal and its environment
Practical Aspects of Experimental Design in Animal Research. ILAR 43 (4): 202.
Good science requires experimental designs and standard laboratory practices that produce scientifically valid and reproducible data. In an ideal world, once an idea for a research project has been conceived, the literature is reviewed and experts in the field are consulted to refine the problem statement and gather background information; null and alternative hypotheses addressing the problem statement are formulated; and then the specific design of the experiment is developed. Practical considerations affecting the experimental design include identifying the most appropriate animal model to address the experimental question, defining the necessary control groups, randomly assigning animals to treatment and control groups, determining the number of animals needed per group, evaluating the logistics of performing the animal experiments, determining the most appropriate statistical analyses, and identifying potential collaborators experienced in the area of study. Failure to address all aspects of experimental design may lead to false conclusions and waste both time and resources. The authors therefore recommend the following sequence when designing and conducting experimental studies:
1. Conduct a complete literature search and consult experts with relevant experience to become as familiar as possible with the topic BEFORE designing the experiment.
2. First generate a specific question and/or hypothesis, then design experiment(s) to address it.
3. Consult a biostatistician during the design phase of the experiment, not after performing it.
4. Use appropriate controls to ensure proper isolation and evaluation of the variable(s) of interest; more than one type of control is often required.
5. Use small pilots initially to generate preliminary data and work out techniques and procedures; follow with larger-scale experiments to generate statistical significance.
6. Review experimental results to modify the original question/hypothesis and procedures, generate new questions, and then repeat the process (steps 1-6) as necessary.
EXPERIMENTAL DESIGN: INITIAL STEPS
Literature Search- The primary goals of the literature search are to learn of pertinent studies and methods, identify appropriate animal models, and eliminate unnecessary duplication of research. The search should include current and past journal articles and textbooks available in journal databases/indexes (MEDLINE, TOXLINE, PUBMED, NCBI, AGRICOLA, etc.), information available through Internet sources, and consultation with experts in the field. The literature search is also an important component of IACUC protocol submission, for it is through the literature search that PIs provide evidence that their projects are not duplicative, that alternatives to the use of animals are not available, and that potentially painful procedures are justified.
Problem Statement, Objectives, Hypotheses- The problem statement, objectives, and hypotheses are fundamental components of experimental design that arise from using the Scientific Method, the four basic tenets of which are to observe and describe a scientific phenomenon, formulate a problem statement and hypothesis, use the hypothesis to predict outcomes for new observations, and then test the hypothesis. The problem statement should include the issue that will be addressed experimentally and its significance (e.g., potential application to human or animal health, improved understanding of biological processes). Objectives should be stated as a general description of the overall goals of the proposed experiment and the specific questions being addressed. Hypotheses should include two distinct and clearly defined outcomes (a null and an alternate hypothesis) for each proposed experiment. The null hypothesis states that there is no difference between experimental groups, while the alternate hypothesis states that a real difference between experimental groups exists. Acceptance or rejection of these hypotheses leads to the formulation of new questions and hypotheses and to ideas for improving the experimental design.
Identification of Animal Model- Identifying the most appropriate animal model to address the experimental question is likely the most critical step in designing animal experiments. Specific recommendations: (1) use the lowest animal species on the phylogenetic scale (i.e., replacement); (2) use animals with species- and/or strain-specific characteristics beneficial for the proposed study; (3) consider the costs of acquiring and maintaining the animal model during the experiment; (4) identify potential sources of the animal model through the literature search, networking with colleagues in the selected field of study, and/or contacting commercial vendors or government-supported repositories of animal models; and (5) consult with laboratory animal veterinarians before making a final determination of the animal model.
Identification of Potential Collaborators- Proposed procedures may require recruitment of collaborators (coinvestigators, consultants, technical support staff) with additional expertise to ensure proper sample acquisition and data validity. Beyond providing services that involve highly technical procedures or expensive equipment, institutional core facilities often prove to be a source of intramural collaboration.
DESIGN OF THE ANIMAL EXPERIMENT
Research Plan- The research plan is a description of the experimental manipulations required to address the problem statement, objectives, and hypotheses. It should specify the experimental variables to be manipulated, suitable test parameters, appropriate methods for sample acquisition and generation of test data, and the methods to be used for data analysis, including appropriate statistical tests. Other aspects to address may include the lifespan of the animal model (for chronic studies), the anticipated progression of disease in the animal model (to set appropriate sampling time points), personnel time and costs associated with performing the experiments, agent administration methods, identification of potential hazards and appropriate precautions to minimize risk, and development of SOPs. The practicality of the overall project and the time frame for data collection and evaluation are determined during this stage of the development process.
Experimental Unit- The experimental unit is the entity under study; it must be identified correctly in order to estimate the error variance, or standard error, used in the statistical analysis.
N Factor: Experimental Group Size- Consult a statistician to determine the experimental group size and total numbers needed to generate statistically significant results.
Controls- Control animals minimize the impact of extraneous variables (genetic, environmental, infectious agents) and/or account for the possible presence of unwanted variables. Positive controls (a change from normal is expected) provide a standard for comparing severity among experimental groups and serve as a quality control measure for the experimental method (by demonstrating that a response can be detected). Negative controls (no change from normal is expected) control for false positives that might result from unknown variables adversely affecting the animals in the experiment. A sham control mimics a procedure or treatment without the actual use of the procedure or test substance (e.g., a placebo). A vehicle control is used to determine whether the vehicle alone causes any effects in studies that use a vehicle (e.g., saline or mineral oil) to deliver the experimental compound. A comparative control is a positive control in which a known agent is compared directly with a new or unknown agent.
Randomization- Randomization helps ensure that underlying variables do not skew the data for any experimental group. Methods include randomly selecting subjects for sequential group assignment and using random number tables or computer-generated random numbers or sampling.
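Computer-generated randomization, as described under Randomization, can be sketched as follows. Both functions are hypothetical helpers written for this example: the first makes a completely randomized assignment, and the second randomizes within blocks (for instance, litters or cage racks). The animal IDs and group names are made up.

```python
import random


def randomize(animal_ids, groups, seed=None):
    """Completely randomized assignment with equal or near-equal group sizes."""
    rng = random.Random(seed)
    ids = list(animal_ids)
    rng.shuffle(ids)
    # Deal the shuffled animals out to the groups in turn.
    return {animal: groups[i % len(groups)] for i, animal in enumerate(ids)}


def randomize_in_blocks(blocks, groups, seed=None):
    """Randomize separately inside each block.

    blocks: dict mapping a block name to the list of animal IDs in that block.
    """
    rng = random.Random(seed)
    allocation = {}
    for ids in blocks.values():
        ids = list(ids)
        rng.shuffle(ids)
        for i, animal in enumerate(ids):
            allocation[animal] = groups[i % len(groups)]
    return allocation


groups = ["control", "low dose", "high dose"]
allocation = randomize(range(1, 13), groups, seed=42)
blocked = randomize_in_blocks(
    {"litter 1": ["a1", "a2", "a3", "a4"], "litter 2": ["b1", "b2", "b3", "b4"]},
    ["control", "treated"],
    seed=42,
)
print(allocation)
print(blocked)
```

Recording the seed alongside the protocol makes the allocation reproducible for later audit.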
EXPERIMENTAL DESIGN: FINAL CONSIDERATIONS
Experimental Protocol Approval- IACUC approval of an animal care and use protocol is required if the species used are covered under the Animal Welfare Act (regardless of funding source), if the research is supported by the National Institutes of Health and involves vertebrate species, or if the animal care program is accredited by the Association for Assessment and Accreditation of Laboratory Animal Care International (AAALAC). Approval must be obtained before animals are purchased or experiments begin, and some funding agencies require it before a grant proposal is submitted. Projects that use hazardous materials may require approval from other intramural oversight committees (e.g., a Biosafety Committee overseeing use of infectious agents or recombinant DNA; a Radiation Safety Committee for use of irradiation or radioisotopes).
Personnel- Animal welfare regulations and PHS policy mandate that personnel involved in a research project be appropriately qualified and/or trained in the methods they will perform for that project. Regardless of where the training occurs, the institution where the research will be performed is responsible for ensuring this training.
Pilot Studies- Pilot studies, often used with new procedures or to test new compounds, generate preliminary data and allow procedures and techniques to be perfected before large-scale experimentation. Supportive preliminary data may increase the probability of funding for a proposal and may give an IACUC grounds to support a full study.
Data Entry and Analysis- The researcher is ultimately responsible for collecting, entering, and analyzing the data correctly, and quality assurance procedures to identify data entry errors should be developed and incorporated into the experimental design before data analysis. Typical methods are to compare the raw (original) data for individual animals directly with the data entered into the computer, or with compiled data for the group as a whole (to identify "outliers," or data that deviate significantly from the rest of the members of a group).
Review- Scientific peer review should be pursued routinely throughout the experimental design process to ensure the generation of valid, reproducible, and publishable data. Layers of review that help detect flaws in the developing or final experimental design include scientific peers and review of the scientific literature, IACUC review, grant funding agencies, and, in the final stage, critical evaluation by peer-reviewed scientific journals.
1. According to the authors, what is likely the most critical step in designing animal experiments?
2. The literature search should provide evidence to address what three specific IACUC concerns?
3. What are the four basic steps of the Scientific Method?
4. Given the problem statement "Which diet causes more weight gain in rats: diet A or diet B?" What is the null hypothesis? The alternate hypothesis? Example of an untestable hypothesis?
5. Five types of controls? What type is a placebo?
6. Three experimental protocol situations for which an IACUC approved animal care and use protocol is required?
7. Who is responsible for ensuring that personnel involved in a research project are appropriately qualified and/or trained in the methods they will be performing for that project?
1. Identifying the most appropriate animal model to address the experimental question.
2. (1) the project is not duplicative (2) alternatives to the use of animals are not available (3) potentially painful procedures are justified.
3. (1) observe and describe a scientific phenomenon (2) formulate a problem statement and hypothesis (3) use the hypothesis to predict outcomes for new observations (4) test the hypothesis
4. null hypothesis: "rats on diet A will gain the same amount of weight as rats on diet B" alternate hypothesis: "rats on diet A will gain more weight than rats on diet B, or vice versa" nontestable hypothesis: "rats on diet A will look better than rats on diet B" - ("better" is not defined in the problem statement, thus the hypothesis is not testable)
5. positive, negative, sham, vehicle, and comparative; a placebo is a sham control
6. (1) if the species used is covered under the Animal Welfare Act (regardless of funding source) (2) if the research is supported by NIH and involves the use of vertebrate species (3) if the animal care program is AAALAC accredited
7. the institution where the research will be performed is responsible for ensuring the training
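The diet A versus diet B hypotheses in answer 4 can be illustrated with a small test of the null hypothesis. The notes do not prescribe a statistical test; the sketch below uses an exact two-sided permutation test only because it needs nothing beyond the standard library. The weight gains are invented for illustration and the function name is hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_test(a, b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(a) + list(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    splits = extreme = 0
    # Enumerate every way of relabelling len(a) of the animals as group A.
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / len(a) - (total - sum_a) / len(b))
        splits += 1
        if diff >= observed - 1e-12:   # tolerance for floating-point ties
            extreme += 1
    return extreme / splits


# Hypothetical weight gains in grams.
diet_a = [30, 32, 35, 31, 34]
diet_b = [20, 22, 25, 21, 24]
p_value = permutation_test(diet_a, diet_b)
print(f"p = {p_value:.4f}")
```

With these made-up values every rat on diet A gained more than every rat on diet B, so only 2 of the 252 possible relabellings are as extreme as the observed split, and the null hypothesis of equal weight gain would be rejected at the 0.05 level.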
Sample Size Determination. ILAR 43 (4): 207.
Scientists who use animals in research must justify the number of animals to be used. The committees that review these proposals that involve animal use in the research must review the justification to ensure the appropriateness of the number of animals to be used. This article is written for IACUC members, veterinarians and researchers who are asked to provide statistical calculations for the proposed number of animals to be used in their projects. The article discusses the statistical bases for estimating the number of animals (sample size) needed for several classes of hypotheses. Several types of experiments that an investigator might propose are described and the methods of computing sample size for situation where it is possible to do such a computation. Pilot and Exploratory experiments. It is not possible to compute a sample size for some types of experiments because prior information is lacking, or because the success of an experiment is highly variable (e.g. producing a new transgenic animal). Pilot experiments are designed to explore new research areas and to determine whether variables are measurable with sufficient precision to be studied under different experimental conditions. They can also be used to check the logistics of a proposed experiment. A pilot experiment can be performed to provide a rough idea of the standard deviation and the magnitude of an anticipated response. Such estimates can then be used to compute the sample size for further experiments. Exploratory experiments are performed to generate new hypothesis that can then be more formally tested. The usual aim of these experiments is to look for patterns of response, often using many dependent variables. Formal hypothesis testing and the generation of p values are relatively unimportant with this sort of experiment. 
Data collected from exploratory experiments can then be used in sample size calculations to compute the number of animals that will be needed to test hypothesis generated by the exploratory experiments. Experiments based on success or failure of a desired goal. It is difficult to estimate the number of animals required in experiments based on success or failure of a desired goal, because the chance of success of the experimental procedure has considerable variability. For example, the production of transgenic animals by gene insertion find variability in 1) the success of incorporation into the cell's genome, 2) the implantation of the transferred cell, 3) the random integration of the DNA into the genome, 4) the expression as a function of integration, 5) the mouse strain response to such manipulations, and 6) different gene's rate of incorporation into the genome. Large numbers of animals are typically needed for this type of experiment. However, knock-out and knock-in mice produced by homologous recombination results in much less variability in the results and fewer animals may have to be produced. The numbers of animals required for these types of experiments are usually estimated by experience instead of any formal statistical calculation. Experiments to test a formal hypothesis. Most animal experiments involve formal testing of hypothesis. It is possible to estimate animal numbers needed for these types of experiments if a few pieces of information are available. Generally there are three types of variables that a PI may measure: Dichotomous variable (rate of proportion of yes/no outcome, e.g. occurrence of a disease); Continuous variable (a continuous measure of a physiologic function like a concentration of a substance in body fluid, or blood pressure) and the time to occurrence of an event (e.g. appearance of a disease, clinical signs or death). 
Once a hypothesis has been reduced to one or two important questions, the sample size can be computed so that the experiment has a specified chance (probability) of detecting an effect with statistical significance. Generally speaking, the smaller the difference a PI wishes to detect, or the larger the population variability, the larger the sample size must be to detect significant differences. Assuming animals are to be randomly assigned to the various test groups and maintained in the same environment to avoid bias, only three or four factors must be known or estimated to calculate sample size:
1) The effect size, or the difference between 2 groups. The magnitude of the effect the PI wishes to detect must be stated quantitatively and is unique to a particular experiment.
2) The population standard deviation. It is usually available from a pilot study, data obtained from previous work, or the scientific literature. Again, it is unique to a particular experiment.
3) The desired power of the experiment to detect the hypothesized effect. It is usually, and arbitrarily, set at 0.8 or 0.9 (an 80-90% chance of finding statistical significance). Usually fixed by convention.
4) The significance level. The probability that a positive finding is due to chance alone is denoted as alpha (α) and is usually chosen to be 0.05 or 0.01 (no more than 5% or 1%). Usually fixed by convention.
The article then discusses, and walks through, simple methods of estimating the animal numbers needed for the various types of variables and experiments. In the authors' opinion, investigators tend to err on the side of using too few animals rather than too many, resulting in a study that has too little power to detect a meaningful or biologically significant result. The general thrust of the article is that although analysis of the final set of data may involve sophisticated statistical models, sample size calculations can usually be performed using much simpler methods. The aim of the calculation is to estimate the number of animals needed for a study. Generally that number is rounded up to yield an adequate number of animals for the study.
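The four factors listed above can be combined in a short calculation. The following is only a minimal sketch for comparing the means of two equal-sized groups with a continuous variable; it uses the common normal-approximation formula rather than the specific methods walked through in the article, and the function name and example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(effect, sd, power=0.8, alpha=0.05):
    """Approximate animals per group for a two-sided comparison of two means.

    Normal approximation (an assumption of this sketch):
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    """
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # two-sided significance level
    z_power = z.inv_cdf(power)             # desired power
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return ceil(n)                         # round up to whole animals


# Hypothetical example: detect a difference equal to one standard deviation.
print(animals_per_group(effect=10.0, sd=10.0))             # power 0.8
print(animals_per_group(effect=10.0, sd=10.0, power=0.9))  # power 0.9
```

Halving the effect size roughly quadruples the required group size, which illustrates why small differences or large variability demand more animals. A calculation based on the t distribution would give slightly larger numbers for small groups.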
1. When are pilot studies used?
2. When are exploratory studies used?
3. Why is it difficult to estimate numbers of animals needed in experiments based on success or failure of a desired goal?
4. What are the three types of variables that an investigator may measure?
5. What are the four factors that should be known to calculate sample size?
1. They can be used to determine whether variables are measurable with sufficient precision to be studied under different experimental conditions; to check the logistics of a proposed experiment; or to provide a rough idea of the standard deviation and the magnitude of an anticipated response. Such estimates can then be used to compute the sample size for further experiments.
2. They are used to look for patterns of response, often using many dependent variables. Data collected from exploratory experiments can then be used in sample size calculations to compute the number of animals needed to test the hypotheses generated by the exploratory experiments.
3. Because the chance of success of the experimental procedure is highly variable, so the estimate must be based on experience rather than any formal statistical evaluation.
4. A dichotomous variable (a rate or proportion of a yes/no outcome, e.g. occurrence of a disease); a continuous variable (a continuous measure of a physiologic function, such as the concentration of a substance in body fluid, or blood pressure); and the time to occurrence of an event (e.g. appearance of a disease, clinical signs, or death).
5. The effect size or the difference between 2 groups, the population standard deviation, the desired power of the experiment, and the significance level.
Role of Ancillary Variables in the Design, Analysis, and Interpretation of Animal Experiments. ILAR 43 (4): 214.
Take home message: Every experimental protocol should ensure that uncontrolled or ancillary variables are recorded and linked with the individual animals. All observed or readily available variables should be recorded. Statistically significant conclusions are strengthened when there is no difference in the ancillary variables.
Ancillary variables (as defined for this paper; they may be defined differently elsewhere): the array of variables that may be collected in addition to the defined or primary experimental response variable, but that may not be directly related to it. They are recorded but not used in the design, nor incorporated in the formal analysis.
Global variables: variables relevant for the comparison of different experiments (sex, strain, food supply).
There is a tendency to discard variables that are not explicitly incorporated into the experimental design, yet using them may improve analysis and interpretation, assess the randomization process, identify outliers, generate new hypotheses, and increase the generality of findings.
Use of Ancillary Variables in Experimental Design
Completely randomized design: ancillary variables can help to identify reasons for significant variability between groups. Extreme example: control and treatment groups are randomly assigned from a group of 6 mixed-sex mice, and by chance all males end up in the treatment group and all females in the control group.
Covariate analysis: this covers variables that are not controllable but may affect the response of interest. Example: animals of a defined age at treatment differ in weight; the experimental data could be used to calculate the actual effect of initial weight on the measured response.
Repeated measures: even when the gain in weight over the experimental period is easily summarized from daily body weights, the daily body weights themselves may provide relevant data.
Ancillary Variables in Exploratory Analysis
All issues around the procedure should be recorded (order of dosing, location of cage, how animals are identified, experimenter, weights, other noted observations).
Incorporating ancillary variables into the formal analysis is neither appropriate nor desirable, but they are useful in exploring where the variability between groups may originate.
Use of Ancillary Variables to Detect Outliers
Ancillary variables may provide a way of identifying some outliers and explaining their cause. Example: the response in organ weight is of interest, and one animal within a group has a relatively low organ weight. The ancillary variable of daily body weights shows that the animal in question lost weight instead of gaining weight over the experimental period. The animal can then be labeled an outlier and its data discarded. Example: RBC and HGB have a standard relationship, so that one could be considered ancillary to the other. If that relationship is different in 1 animal of a group, the data from that animal are considered an outlier.
Use of Ancillary Variables to Assess the Randomization Process
If the randomization process has been properly carried out and there are no additional nonrandom factors, the groups of animals should not differ significantly with respect to ancillary variables that are not affected by treatment. By statistical theory, about 5% of properly randomized assignments will still show a significant difference at p = 0.05 by chance alone, but such a finding should nevertheless prompt investigation.
Use of Ancillary Variables to Assess Consistency of Observation
Example: by identifying which staff member makes each observation, the consistency, accuracy, and precision of observational scoring systems can be analyzed.
Use of Ancillary Variables to Explain Differences Between Experiments
Example: significant differences in results were noted between laboratories using a standard protocol. All laboratories were using animals within the established age range, but when actual weights and ages were compared, differences were noted.
Every experimental protocol should ensure that uncontrolled or ancillary variables are recorded and linked with the individual animals. All observed or readily available variables should be recorded. Statistically significant conclusions are strengthened when there is no difference in the ancillary variables.
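The extreme example of the 6 mixed-sex mice can be quantified with a short calculation. This is only an illustrative sketch: it assumes 3 males and 3 females allocated to two groups of 3, a split the summary does not state, and the function name is hypothetical.

```python
from math import comb


def chance_all_males_treated(n_male=3, n_female=3, n_treated=3):
    """Probability that random allocation puts every male in the treated group.

    Assumes every possible treated group of size n_treated is equally likely.
    """
    if n_treated < n_male:
        return 0.0                                   # not enough treated places
    total = comb(n_male + n_female, n_treated)       # all possible treated groups
    favourable = comb(n_female, n_treated - n_male)  # groups holding every male
    return favourable / total


print(chance_all_males_treated())
```

Under these assumptions the complete confounding of sex with treatment happens in 1 of 20 allocations, which is why recording sex as an ancillary variable matters even when randomization was done correctly.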
Use of Factorial Designs to Optimize Animal Experiments and Reduce Animal Use. ILAR 43 (4): 223.
Some basic definitions:
Factorial Experimental Designs (FED): more than one independent variable is varied at a time, in a structured way.
Treatment effect: the difference between the means of control and treatment.
Factors: variables investigated for their influence on the treatment effect. They could be direct animal-related characteristics (sex, strain, age), environmental aspects (cage size), or protocol specific (dose level, timing, route).
One Variable at a Time (OVAT): vary one factor of interest in turn, keeping all other factors fixed.
The authors explore ways that studies can be designed using FEDs to determine which factors, and what levels of these factors, will maximize the difference between treated and control groups, so as to increase sensitivity or reduce animal numbers for these experiments.
The simplest FED is a 2 x 2 design (example: vehicle or treated, male or female):
            Male    Female
Vehicle      a        c
Treated      b        d
The estimate for the treatment effect is ((a+c)-(b+d))/2, and the estimate for the sex effect is ((a+b)-(c+d))/2. If the response to the treatment differs between the two sexes, there is said to be an interaction between the two factors, estimated as ((a-c)-(b-d))/2. Interactions are commonly seen in biomedical sciences as synergism, enhancement, or potentiation. These principles are easily expanded to more factors or levels (e.g. 4 dose groups in 2 sexes, a 4 x 2 design).
Advantages of FEDs over OVATs: each animal contributes to understanding the effects of all factors; the ways factors interact can be explored; resources are used more efficiently; and incremental changes across multiple studies are avoided.
Full factorial designs are appropriate for studies where a small number of factors are considered (example: a 4 x 2 study would have 8 groups to be studied). With larger numbers of factors (e.g. 7 factors at 2 levels each, 2^7 = 128 groups), fractional factorial designs become more appropriate, using only some fraction of all the possible groups. Fractional FEDs provide a balanced subset of the total possible groups while maximizing the information on the factors explored.
The authors recommend the use of expert statistical advice when considering the design. Any interaction between a factor in the experiment and the treatment may imply scope for improving the sensitivity of the experiment. Example: in a comparison of CD-1 and CBA mice in their WBC response to chloramphenicol, strain and treatment effect interact, with the CBA mice showing a greater effect. If the aim is to screen compounds that reduce WBC the same way chloramphenicol does, the use of CBA mice would allow smaller sample sizes or greater sensitivity to detect a change.
Implementation of FED
- Define objectives, chart the procedure, brainstorm all factors that could cause variation in the response of interest, prioritize the factors for inclusion in the first FED, determine the number of animals to be used (factors x levels x number per group), and consider practicality.
- Ensure all relevant data, observations, and deviations are recorded.
- Analyze the data and pay attention to robustness (the extent to which results are influenced by minor variations that cannot be adequately controlled).
FEDs can allow refinement of studies so that sample size can be reduced for equivalent quality of information. FEDs are best used on robust projects, because the factor analyses become inaccurate if the relatively small groups are influenced by uncontrolled variation. In addition, within-group variability can have a large impact on sample size estimates, and FEDs cannot easily study it because the sample size per group is small.
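The 2 x 2 estimates quoted in this summary can be written out directly. The sketch below uses the same labels (a = vehicle male, b = treated male, c = vehicle female, d = treated female) and the same sign conventions as the formulas above; the function name and the group means are hypothetical. It also enumerates the groups of a full factorial with seven two-level factors to show how quickly the number of groups grows.

```python
from itertools import product


def factorial_2x2(a, b, c, d):
    """Effect estimates for a 2 x 2 design from the four group means.

    a = vehicle male, b = treated male, c = vehicle female, d = treated female.
    """
    treatment = ((a + c) - (b + d)) / 2     # vehicle minus treated, averaged over sex
    sex = ((a + b) - (c + d)) / 2           # male minus female, averaged over treatment
    interaction = ((a - c) - (b - d)) / 2   # change in the sex difference with treatment
    return treatment, sex, interaction


# Hypothetical group means in which females respond more strongly to treatment.
print(factorial_2x2(a=10, b=6, c=10, d=2))

# Full factorial with 7 factors at 2 levels each: every combination is a group.
groups = list(product(("low", "high"), repeat=7))
print(len(groups))
```

A nonzero interaction term signals that the treatment effect depends on sex, which is the kind of finding that can be used to choose the more sensitive subgroup.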
Alternative Methods for the Median Lethal Dose (LD50) Test: The Up and Down Procedure for Acute Oral Toxicity. ILAR 43 (4): 233.
Bottom line: This article describes an improved up and down procedure for acute oral toxicity testing. It improves the performance of acute testing for applications that use the classic LD50 while achieving significant reductions in animal use. It combines sequential dosing with sophisticated computer-assisted computational methods during execution and calculation of tests. Its drawback is that it does not provide information about the dose-response curve. The up and down procedure is a staircase design, which is a form of sequential testing. The design permits trials to converge rapidly on the region of interest (i.e. the LD50).
Acute toxicity testing measures the adverse effects that occur within a short time after administration of a single dose of a chemical. The results of such tests serve as a basis for hazard classification and labeling of chemicals. They can provide information for comparison of toxicity and dose response among chemicals. Acute toxicity tests were originally designed to provide robust characterization of the dose-response curve by using several animals at each of three to five doses to measure tolerance to potentially lethal doses of chemicals in a test population. Animals were observed for 14 days, with observation of the onset, nature, severity, and reversibility of toxicity. Ideally, the test would include doses close to the LD13 and LD87 to obtain the best information. Traditionally these tests would use 50 or more animals per chemical.
In 1985, Bruce was the first to apply the up and down procedure (UDP) to acute oral toxicity. It had previously been used for munitions testing in 1948. Essentially, the UDP tests animals one at a time in a staircase fashion. The next dose level depends on the result of the previous animal. Tests showed that a poor choice of starting dose could lead to inefficiency. In 1999, the OECD (Organisation for Economic Co-operation and Development) called for a revision.
The revised test was evaluated in 2001 and has improved prediction of the point estimate of lethality and of confidence intervals for chemicals with wide variability of response, even when the approximate LD50 and dose-response slope are not known. The new test has a flexible stopping rule based on an index related to the statistical error. Testing stops when one of the following criteria is met:
1) three consecutive animals survive at the upper bound of dosing, where the upper bound of testing is the highest dose given to the animals;
2) five reversals occur in any six consecutive animals;
3) at least four animals have followed the first reversal, and two likelihood ratios, which compare the LD50 estimate with LD50 values above and below it, exceed a critical value of 2.5.
In the absence of initial information indicating the likely slope or LD50, the UDP guideline recommends a default starting dose of 175 mg/kg and the use of half-log units of dose, corresponding to a dose progression factor of 3.2. In addition, a software program has been developed by the EPA to assist the user in setting test doses, to determine when the stopping rules have been satisfied, and to calculate the LD50 and the confidence interval.
Confidence intervals provide information on the repeatability of the estimate. A class of methods called profile likelihood may be used to calculate confidence intervals consistent with the experimental data. Traditional LD50 tests use calculated confidence intervals. In the UDP, however, the algorithm used to compute the confidence interval is not exact but approximate. Depending on the results of the test, three different types of confidence interval estimates of the true LD50 can be used.
Hazard classification criteria for labeling of chemicals with an agreed-upon oral LD50 use cut points of 5, 50, 300, 2000, and 5000 mg/kg. The default UDP guidelines start at 175 mg/kg with a dose progression factor of 3.2. Using this system, if an agent is misclassified, it will most often be assigned to a more toxic category.
The limit test is performed to determine whether the LD50 is above or below a particular level. The new limit test has been designed to classify at either 2000 or 5000 mg/kg and uses only 5 animals. With this system, classification deteriorates when the LD50 approaches the limit dose.
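The staircase logic and the first two stopping rules can be sketched in a few lines. This is an illustrative simplification, not the guideline's algorithm or the EPA software: it simply multiplies or divides the dose by the progression factor, caps it at an assumed upper bound of 2000 mg/kg, and omits the likelihood-ratio rule. The function name and the example outcome sequences are hypothetical.

```python
def up_and_down(outcomes, start=175.0, factor=3.2, upper=2000.0):
    """Follow a dosing staircase for a sequence of outcomes (True = animal survived).

    Returns the doses given and the reason testing stopped. Only the first two
    stopping rules are checked; the likelihood-ratio rule is not implemented.
    """
    doses = []
    dose = start
    for survived in outcomes:
        doses.append(dose)
        results = outcomes[:len(doses)]
        # Rule 1: three consecutive survivors at the upper bound of dosing.
        if len(doses) >= 3 and all(results[-3:]) and all(d == upper for d in doses[-3:]):
            return doses, "three survivors at upper bound"
        # Rule 2: five reversals within six consecutive animals.
        if len(doses) >= 6:
            last_six = results[-6:]
            reversals = sum(x != y for x, y in zip(last_six, last_six[1:]))
            if reversals == 5:
                return doses, "five reversals in six animals"
        # Step up after survival (capped at the upper bound), down after death.
        dose = min(dose * factor, upper) if survived else dose / factor
    return doses, "no stopping rule met"


print(up_and_down([True] * 8))
print(up_and_down([True, False, True, False, True, False]))
```

With a chemical of low toxicity the staircase climbs to the upper bound and stops after six animals; with outcomes that alternate around one dose step it stops on the reversal rule, also after six animals.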
1) What does OECD stand for?
2) What is the Up and Down procedure designed to test?
1) Organisation for Economic Co-operation and Development
2) Designed to estimate LD50 with far fewer animals than the original LD50 test procedure
Guidelines for the Design and Statistical Analysis of Experiments Using Laboratory Animals. ILAR 43 (4): 244.
For economic and, even more, for ethical reasons, researchers using laboratory animals are obligated to design experiments well, to execute them efficiently, to analyze results precisely, to present outcomes clearly, and to interpret results correctly. Surveys of published papers reveal that many fall short of this ideal. The authors' aim is to help investigators perform research efficiently and humanely. In their opinion, animals should only be used if the scientific objectives are valid, there are no other alternatives, and the cost to the animals is not excessive. Here the "3Rs" of Russell and Burch (1959) play a key role. With respect to "reduction", animal numbers should be the minimum needed to achieve the scientific objectives of the study, recognizing that important biological effects may be missed if too few animals are used.
It is important to describe research in such a way that it is repeatable elsewhere. The article should state:
a) the objectives of the research and/or the hypotheses to be tested;
b) the reason for choosing the particular animal model;
c) the species, strain, source, and type of animal used;
d) the details of each separate experiment being reported, including the study design and the number of animals;
e) the statistical methods used for analysis.
Experiments, Surveys, and Animal Models:
A confirmatory experiment is a procedure for collecting scientific data on the response to an intervention in a systematic way to maximize the chance of answering a question correctly. It normally involves formal testing of one or more pre-specified hypotheses.
An exploratory experiment usually provides material for the generation of new hypotheses. It involves looking for patterns in the data with less or no emphasis on formal testing of hypotheses.
There is frequently some overlap between the two types of experiments. Experiments include experimental procedures that are under the control of the experimenter.
A survey, in contrast, is an observational study used to find associations between variables that the scientist cannot usually control. Any association may or may not be due to a causal relation.
Experiments should be well planned and include the statistical methods used to assess the results.
Pilot studies can be used to test the logistics of a proposed experiment, provide estimates of the means and standard deviations, and help with the power analysis.
Laboratory animals are nearly always used as models of humans or other species. A model is meant to be a mimic or surrogate and not necessarily identical to the subject (target) being modeled. It must have certain characteristics that resemble the target, but can be very different in other ways. The validity of an animal model as a predictor of human response depends on how closely the model resembles humans for the specific characters being investigated. The validity of any model must be considered on a case-by-case basis.
Models should be sensitive to the experimental treatments by responding well, with minimal variation among subjects treated alike. There is a need to control variation. Uncontrolled variation, whether caused by infection, genetics, environmental or age heterogeneity, reduces the power of an experiment.
The most common formal experimental designs are completely randomized, randomized block, and factorial designs. Latin square, crossover, repeated measures, split-plot, incomplete block, and sequential designs are also used.
In a completely randomized design, animals are randomly assigned to treatments. Its advantages are simplicity and tolerance of unequal numbers in each group; its disadvantage is that it cannot take account of heterogeneity of the experimental material or variation.
In randomized complete block designs the experiment is split into a number of "mini-experiments". Experiments with this design are often more powerful and precise, and take account of some natural structure of the experimental material, but their benefits depend on correct analysis, usually a two-way ANOVA without interaction.
Factorial experiments have more than one type of treatment or independent variable. These designs are extremely powerful at the cost of increased complexity in the statistical analysis.
Depending on the objectives of the study, a well-designed experiment avoids bias and is sufficiently powerful. To prevent mistakes in its execution it should not be too complicated.
Each experiment involves a number of experimental units, which should also be the unit of statistical analysis. The experimental unit may be an individual animal, the animals in a cage subjected to the same treatment, or an animal for a certain period of time when it is assigned to treatments X, Y, and Z sequentially. Split-plot experimental designs have more than one type of experimental unit, e.g. two animals in a cage receiving a defined diet are one experimental unit, and at the same time each mouse is an experimental unit by itself, since each one will have a different vitamin injected.
Randomization means that assignment to the treatment occurs in a way that each experimental unit has a known, often equal, probability of receiving it. The only way to achieve randomization is to use an objective procedure, such as a table of random numbers, or drawing of numbers out of a bowl. Statistical packages for computers are available, which will produce random numbers within a specified range. Randomization is essential to avoid bias.
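An objective allocation procedure of the kind described above can be sketched with a random number generator. This is a minimal illustration for a completely randomized design with (near-)equal group sizes; the function name, animal numbers, and treatment labels are hypothetical, and the seed is used only so that the allocation can be reproduced and documented.

```python
import random


def randomize(animal_ids, treatments, seed=None):
    """Randomly allocate animals to treatments in (near-)equal group sizes."""
    rng = random.Random(seed)            # seed only makes the allocation reproducible
    shuffled = list(animal_ids)
    rng.shuffle(shuffled)                # random order of the experimental units
    allocation = {treatment: [] for treatment in treatments}
    for position, animal in enumerate(shuffled):
        # Deal the shuffled animals out to the treatments in turn.
        allocation[treatments[position % len(treatments)]].append(animal)
    return allocation


allocation = randomize(range(1, 13), ["control", "low dose", "high dose"], seed=42)
for treatment, animals in allocation.items():
    print(treatment, sorted(animals))
```

If the cage rather than the animal is the experimental unit, cage numbers would be passed in place of animal numbers.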
Another valid measure in the experimental design to avoid bias is blinding where animals, samples, and treatments are coded. Hence, the person who assesses the results does not know the allocation of the animals. It is especially necessary if there is any subjective element in evaluating the results.
Confirmatory experiments normally have one or a few outcomes of interest, also known as dependent variables. Exploratory experiments often involve many outcomes. Quantitative data (e.g. measurements) are better than qualitative data (e.g. counts). In some studies, scores such as 0, +, ++, +++ are used. Converting scores to numerical values with means and standard deviations is inappropriate.
Experiments usually include controls. Negative controls may be untreated animals or those treated with a placebo. Surgical studies may involve sham-operated controls. Positive controls are sometimes used to ensure that the experimental protocols were actually capable of detecting an effect.
Besides treatment variables, there may be a number of random variables that are uncontrollable yet may need to be taken into account in designing an experiment and analyzing the results, e.g. circadian rhythms, different people making measurements, etc.
An experiment that is too small may miss biologically important effects, whereas an experiment that is too large wastes animals.
A power analysis is the most common way of determining sample size. The appropriate size depends on a mathematical relation between (1) effect size of interest, (2) standard deviation, (3) chosen significance level, (4) chosen power, (5) alternative hypothesis, and (6) sample size.
The formulae are complex. However, several statistical packages are available. A number of web sites also provide free power analysis, e.g. http://ebook.stat.ucla.edu/cgi-bin/engine.cgi.
When only two groups are to be compared, the effect size is the difference in means or proportions that the experiment should be able to detect. It is often convenient to express the effect size as "D", a unitless number in units of standard deviations, obtained by dividing the difference by the standard deviation.
The standard deviation will usually be the square root of the error mean square from an analysis of variance conducted on a previous experiment. When no previous study is available, a pilot study may be used.
The significance level is the chance of obtaining a false-positive result due to sampling error, also known as Type I error. It is usually set at 5%.
The power of an experiment is the chance that it will detect the specified effect size for the given significance level and standard deviation and be considered statistically significant. It usually ranges from 80 to 95%. Note that 1-power is the chance of a false-negative result, also known as Type II error.
The alternative hypothesis is usually that two means or proportions differ.
The sample size is usually what needs to be determined, so all of the other quantities mentioned above should be specified. However, the sample size can also be fixed, in which case the aim is to determine the power or the detectable effect size for the given sample size.
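When the sample size is fixed, the same relation can be turned around to estimate power. The following is a minimal sketch for a two-sided comparison of two group means, using a normal approximation rather than the exact calculation a statistical package would perform; the function name and example values are hypothetical, and d is the effect size in units of standard deviations.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect a standardized effect size d with two equal groups.

    Normal approximation (an assumption of this sketch); the negligible chance
    of significance in the wrong direction is ignored.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)             # two-sided critical value
    return z.cdf(d * sqrt(n_per_group / 2) - z_alpha)


for n in (8, 16, 32):
    print(n, round(approximate_power(d=1.0, n_per_group=n), 2))
```

The loop shows how the chance of a false-negative result (1 - power) shrinks as the number of animals per group increases for the same effect size.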
The general aim of statistical analysis is to extract all of the useful information present in the data in a way that it can be interpreted, taking account of biological variability and measurement error. Statistical analysis is particularly useful in preventing unjustified claims. However, it is also possible for an effect to be statistically significant but of little or no biological importance.
Raw data and data entered into statistical software should be studied for consistency and any obvious transcription errors. "Outliers" should not be discarded unless there is independent evidence that the observation is incorrect. Exclusion of any observations should be stated explicitly. A clear distinction must be made between missing data (e.g. an animal dying prematurely) and data with a value of zero. In all cases, the number and treatment groups of any animals that die should be noted in the published paper or report.
Quantitative data are often summarized in terms of the mean, "n" (the number of subjects), and the standard deviation as a measure of variation. The median, n, and the interquartile ranges (e.g. the 25th and 75th centiles) may be preferable for data that are clearly skewed. Quantitative data can be analyzed using "parametric" methods, such as the t-test for one or two groups or the ANOVA for several groups, or using nonparametric methods such as the Mann-Whitney test. Parametric tests are usually more versatile and powerful.
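The two kinds of summary described above can be computed with Python's standard library. This is a small sketch with made-up data; the function name is hypothetical, and the quartiles use the default method of statistics.quantiles, which may differ slightly from other software.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return n, mean, SD, median, and the 25th and 75th centiles."""
    q1, _, q3 = quantiles(values, n=4)   # quartile cut points
    return {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),             # sample standard deviation
        "median": median(values),
        "iqr": (q1, q3),
    }


# Hypothetical skewed data: one very large value pulls the mean upward.
summary = summarize([1, 2, 2, 3, 3, 4, 20])
print(summary)
```

In this example the mean is well above the median, which is the situation in which the median and interquartile range describe the data better.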
Where several observations can be made on an experimental unit, it may be important to find out whether precision could be increased more effectively by using more experimental units or more observations within each unit. Those observations are said to be "nested" within the experimental units, and several levels of nesting are possible. A nested ANOVA is usually used to estimate the "components of variance" associated with each level of nesting.
A transformation can be used to normalize data that are skewed or that otherwise depart from a normal distribution. Depending on the scale of the data, a logarithmic, logit, or square root transformation may be employed.
Student's t-test is used to compare two group means. When there are two or more groups, and particularly with a more complex experimental design, the ANOVA can be used initially. When the ANOVA results are significant, either post-hoc comparisons or orthogonal contrasts should be employed to study differences among individual means. Post-hoc comparison methods include Dunnett's test for comparing each mean with the control, and Tukey's test, Fisher's protected least-significant difference test, the Newman-Keuls test, and several others for comparing all means.
When there are several dependent variables, each can be analyzed separately. However, if the variables are correlated, the analyses will not be independent of one another. A multivariate statistical analysis such as principal components analysis could be considered in such cases.
Data on experimental subjects are sometimes collected serially. Appropriate summary measures, such as the mean of the observations, the slope of a regression line fitted to each individual, the time to reach a peak, or the area under the curve, depending on the type of observed response, offer a good alternative to a repeated measures ANOVA. The latter is better avoided, since its results can be difficult to interpret.
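Two of the summary measures mentioned above, the area under the curve and the slope of a line fitted to each individual, reduce each animal's serial data to a single number that can then be analyzed with ordinary methods. The sketch below uses the trapezoidal rule and a least-squares slope; the function names and the measurements are hypothetical.

```python
def area_under_curve(times, values):
    """Trapezoidal area under one animal's response curve."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (values[i] + values[i - 1]) / 2
    return area


def slope(times, values):
    """Least-squares slope of the response against time for one animal."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    sxx = sum((t - mean_t) ** 2 for t in times)
    return sxy / sxx


# Hypothetical weights gained by one animal on days 0 to 3.
days = [0, 1, 2, 3]
gain = [0.0, 2.0, 4.0, 6.0]
print(area_under_curve(days, gain), slope(days, gain))
```

Each animal contributes one area and one slope, so the animal remains the experimental unit in the subsequent comparison of groups.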
If the variation is not equal in each treatment group or the data are not approximately normal, and no transformation is available to correct this, a nonparametric test can be used to compare the equality of population means or medians. To compare two groups, the Wilcoxon rank sum test and the Mann-Whitney test are available; for comparison of more than two groups, the Friedman test is available.
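To show what a rank-based comparison of two groups actually counts, the Mann-Whitney U statistic can be computed directly from its definition. This sketch returns only the statistic; obtaining a p-value requires tables or a statistical package. The function name and the data are hypothetical.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y.

    Counts the pairs in which the x value exceeds the y value; ties count one half.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical scores for a control and a treated group.
control = [1.2, 2.5, 3.1, 4.0]
treated = [3.5, 4.8, 5.2, 6.9]
print(mann_whitney_u(treated, control), mann_whitney_u(control, treated))
```

The two U values always add up to the product of the group sizes, and a value near zero or near that product indicates that one group tends to have larger values than the other. Only the ordering of the observations is used, so no assumption about normality or equal variances is needed.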
The correlation coefficient, also known as the product-moment correlation or Pearson correlation, is used to assess the strength of the linear relation between two numerical variables A and B. The usual hypothesis test is that the correlation is zero. A rank correlation may be more appropriate when the relation is nonlinear, because a nonlinear relation can yield a low correlation coefficient even though the two variables are strongly associated.
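The difference between the product-moment correlation and a correlation of ranks can be seen in a small example with a strictly increasing but nonlinear relation. This is only a sketch: the ranking helper assumes there are no tied values, and the function names and data are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Product-moment correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def ranks(values):
    """Ranks 1..n; assumes no tied values (a simplification of this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def rank_correlation(a, b):
    """Correlation of ranks: the Pearson correlation applied to the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical response that doubles with each dose step.
dose = [1, 2, 3, 4, 5, 6]
response = [2 ** d for d in dose]
print(round(pearson(dose, response), 3), rank_correlation(dose, response))
```

Here the rank correlation is exactly 1 because the response always increases with dose, while the product-moment correlation is lower because the relation is not a straight line.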
Regression analysis can be used to quantify the relation between two continuous variables X and Y, where variation in X is presumed to cause variation in Y. The usual statistical test in regression analysis is of the null hypothesis that there is no linear relation between X and Y.
Categorical data consist of counts of the number of units with given attributes. When these attributes have no natural order (e.g. strain or breed of animals) they are described as "nominal". They are called "ordinal" when they have a natural order such as low, medium, and high levels or scores. When there are two categories, the data are called "binary". When categorical data are presented as proportions or percentages they should be accompanied by a confidence interval or standard error.
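A proportion and its accompanying standard error and confidence interval can be computed as follows. This is a minimal sketch using the simple normal approximation, which is an assumption of this example and becomes unreliable for small groups or proportions near 0 or 1; the function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(events, n, level=0.95):
    """Proportion, standard error, and normal-approximation confidence interval."""
    p = events / n
    se = sqrt(p * (1 - p) / n)                        # standard error of the proportion
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)     # about 1.96 for a 95% interval
    lower = max(0.0, p - z * se)                      # keep the interval within 0..1
    upper = min(1.0, p + z * se)
    return p, se, (lower, upper)


# Hypothetical result: 6 of 20 animals developed the lesion.
p, se, (lower, upper) = proportion_ci(6, 20)
print(round(p, 2), round(se, 3), (round(lower, 2), round(upper, 2)))
```

The width of the interval makes clear how little a percentage from a small group says on its own, which is the reason for reporting it alongside the proportion.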
Presentation of the Results
Valuable hints for the presentation of results are given, e.g. that lack of statistical significance should not be used to claim that an effect does not exist, since other factors such as a small sample size may be responsible. The article also provides important comments on the graphical presentation of data, which should be taken into consideration when publishing experimental results.
1. What should be stated in an article that would allow other researchers to repeat the experiment?
a) hypothesis and justification for choosing the particular animal model
b) species, strain, source, type, and number of animals used
c) details of each separate experiment, including study design
d) statistical methods utilized for analysis
e) a) and c)
f) all of the above
2. Match the following terms and definitions!
a) a procedure for collecting scientific data on the response to an intervention in a systematic way to maximize the chance of answering a question correctly
b) a procedure to test the logistics of a proposed experiment, provide estimates of the means and standard deviations, and help with the power analysis
c) an observational study used to find associations between variables that the scientist cannot usually control
d) a procedure that provides material for the generation of new hypotheses, involving looking for patterns in the data with less or no emphasis on formal testing.
i) exploratory experiment
ii) pilot study
iii) survey
iv) confirmatory experiment
3. What could cause uncontrolled variation in an animal model, which reduces the power of the experiment?
4. All of the listed below could be considered an experimental unit, except:
a) individual animals, single housed
b) individual animals, group housed in the same cage and subjected to the same treatment
c) a group of two animals, housed in the same cage and subjected to the same treatment
d) a group of three or more animals, housed in the same cage and subjected to the same treatment.
5. What does Type I error mean?
a) the chance of obtaining a false-positive result due to sampling error
b) the chance of obtaining a false-negative result
c) the difference in means or proportions expressed in "D", a unitless number
d) the square root of the error mean square from an analysis of variance.
6. Which statistical test is the most appropriate to use in the analysis of the type of data described below? Note that more than one test could match with the same situation.
a) Student's t-test
b) ANOVA (analysis of variance)
c) Wilcoxon rank sum test
d) Friedman test
e) Mann-Whitney test
i) two groups have to be compared with unequal variation in each treatment group and no approximate normality of the residuals, no transformation is available to correct the heterogeneity of variance and/or non-normality
ii) two or more data sets with equal variation in each group have to be compared, particularly useful with randomized block or more complex study design
iii) two or more groups have to be compared where the data sets of each group have no equal variation and no approximate normality of the residuals, furthermore no transformation is available to correct the heterogeneity of variance and/or non-normality
iv) a maximum of two data sets with equal variation in each treatment group and approximate normality of the residuals have to be compared.
1. f)
2. a) and iv
b) and ii
c) and iii
d) and i
3. Uncontrolled variation could be caused by infection, genetics, or environmental or age heterogeneity.
4. b) because animals within the same cage cannot be assigned to different treatment groups, so that they are not statistically independent. Hence, they have to be considered one experimental unit.
5. a) It is also called the significance level. It is usually set at 5%.
6. a) and iv
b) and ii
c) and i
d) and iii
e) and i