In 1925, Ronald Fisher mentioned the two-way ANOVA in his celebrated book, Statistical Methods for Research Workers. In 1934, Frank Yates published procedures for the unbalanced case. Since then, an extensive literature has been produced. The topic was reviewed in 1993 by Yasunori Fujikoshi. In 2005, Andrew Gelman proposed a different approach to ANOVA, viewing it as a multilevel model.
Data set
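Let us imagine a data set for which a dependent variable may be influenced by two factors which are potential sources of variation. The first factor has $I$ levels ($i \in \{1, \ldots, I\}$) and the second has $J$ levels ($j \in \{1, \ldots, J\}$). Each combination $(i,j)$ defines a treatment, for a total of $I \times J$ treatments. We represent the number of replicates for treatment $(i,j)$ by $n_{ij}$, and let $k$ be the index of the replicate in this treatment ($k \in \{1, \ldots, n_{ij}\}$). From these data, we can build a contingency table, where $n_{i+} = \sum_{j=1}^{J} n_{ij}$ and $n_{+j} = \sum_{i=1}^{I} n_{ij}$, and the total number of replicates is equal to $n = \sum_{i,j} n_{ij} = \sum_{i} n_{i+} = \sum_{j} n_{+j}$. The experimental design is balanced if each treatment has the same number of replicates, $K$. In such a case, the design is also said to be orthogonal, which allows the effects of both factors to be fully distinguished. We hence can write $\forall i,j \;\; n_{ij} = K$, and $\forall i,j \;\; n_{ij} = \frac{n_{i+} \cdot n_{+j}}{n}$.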
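As an illustration of the contingency table of replicate counts, the following minimal Python sketch computes the marginal totals and checks the two identities stated for a balanced design. The counts and the helper names `is_balanced` and `satisfies_orthogonality_identity` are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical replicate counts n_ij: I = 2 levels of the first factor (rows),
# J = 3 levels of the second factor (columns), K = 4 replicates per treatment.
counts = np.array([[4, 4, 4],
                   [4, 4, 4]])

row_totals = counts.sum(axis=1)  # n_{i+}: sum over j for each level i
col_totals = counts.sum(axis=0)  # n_{+j}: sum over i for each level j
n = counts.sum()                 # total number of replicates


def is_balanced(counts):
    """Return True if every treatment has the same number of replicates."""
    return bool(np.all(counts == counts.flat[0]))


def satisfies_orthogonality_identity(counts):
    """Return True if n_ij == n_{i+} * n_{+j} / n holds for every cell."""
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
    return bool(np.allclose(counts, expected))


print(row_totals, col_totals, n)
print(is_balanced(counts), satisfies_orthogonality_identity(counts))
```

With these counts, both checks succeed, since every cell holds $K = 4$ replicates.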
Model
Upon observing variation among all $n$ data points, for instance via a histogram, "probability may be used to describe such variation". Let us hence denote by $Y_{ijk}$ the random variable whose observed value $y_{ijk}$ is the $k$-th measure for treatment $(i,j)$. The two-way ANOVA models all these variables as varying independently and normally around a mean, $\mu_{ij}$, with a constant variance, $\sigma^2$:

$$Y_{ijk} \mid \mu_{ij}, \sigma^2 \;\overset{\text{i.i.d.}}{\sim}\; \mathcal{N}(\mu_{ij}, \sigma^2).$$

Specifically, the mean of the response variable is modeled as a linear combination of the explanatory variables:

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},$$

where $\mu$ is the grand mean, $\alpha_i$ is the additive main effect of level $i$ from the first factor, $\beta_j$ is the additive main effect of level $j$ from the second factor and $\gamma_{ij}$ is the non-additive interaction effect of treatment $(i,j)$ from both factors.

An equivalent way of describing the two-way ANOVA is to note that, besides the variation explained by the factors, there remains some statistical noise. This unexplained variation is handled by introducing one random variable per data point, $\epsilon_{ijk}$, called the error. These $n$ random variables are seen as deviations from the means, and are assumed to be independent and normally distributed:

$$Y_{ijk} = \mu_{ij} + \epsilon_{ijk} \quad \text{with} \quad \epsilon_{ijk} \;\overset{\text{i.i.d.}}{\sim}\; \mathcal{N}(0, \sigma^2).$$
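To make the model concrete, the following minimal Python sketch simulates data from it for a balanced design. All numerical values (grand mean, main effects, interaction effects, error standard deviation) are hypothetical and chosen only for illustration; the sketch builds the treatment means from the linear combination above and then adds independent normal errors.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical balanced design: I = 2 and J = 3 levels, K = 5 replicates.
I, J, K = 2, 3, 5

mu = 10.0                              # grand mean
alpha = np.array([-1.0, 1.0])          # main effects of the first factor
beta = np.array([-2.0, 0.0, 2.0])      # main effects of the second factor
gamma = np.array([[0.5, 0.0, -0.5],
                  [-0.5, 0.0, 0.5]])   # interaction effects, one per treatment
sigma = 1.5                            # common error standard deviation

# Treatment means: mu_ij = mu + alpha_i + beta_j + gamma_ij, shape (I, J).
cell_means = mu + alpha[:, None] + beta[None, :] + gamma

# Errors: independent draws from N(0, sigma^2), one per data point.
eps = rng.normal(loc=0.0, scale=sigma, size=(I, J, K))

# Observations: y_ijk = mu_ij + eps_ijk, shape (I, J, K).
y = cell_means[:, :, None] + eps

print(cell_means)
print(y.mean(axis=2))  # sample mean of each treatment
```

Under these assumptions, the sample mean of each treatment should lie close to the corresponding entry of `cell_means`, with the gap shrinking as `K` grows.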
Assumptions
Following Gelman and Hill, the assumptions of the ANOVA, and more generally the general linear model, are, in decreasing order of importance: