Analysis of Variance

Analysis of variance (ANOVA) is used to examine how one or several qualitative variables (factors) affect one dependent quantitative variable.

Single-Factor Analysis of Variance

Single-factor analysis of variance tests the hypothesis that the means of several general populations are equal. For example, it can be used to find out whether an input independent variable x affects an output dependent variable y. In this example the input variable x takes discrete values, while the output variable y is a continuous random variable whose probabilistic nature is caused by the additive noise e.

Single-factor analysis of variance is based on the following assumptions:

In each observation i, the noise e_i is normally distributed with zero mean and finite variance.
The variance of e_i is the same constant for every i.

Consider the procedure for calculating single-factor analysis of variance. Let x take k different values; in other words, the factor x has k levels. Let there be n observations of the output value y at each level. Then the results can be shown as a table where the columns are the levels of the factor x and the rows are the observations of y:

№ of observation	Levels of the input factor x
	1	2	…	j	…
1	y₁₁	y₁₂	…	y_1j	…
2	y₂₁	y₂₂	…	y_2j	…
…	…	…	…	…	…
i	y_i1	y_i2	…	y_ij	…
…	…	…	…	…	…
n	y_n1	y_n2	…	y_nj	…

If the levels of the factor x do not affect the mean of y, all the observations are a sample from the same general population (provided that the assumptions listed above are satisfied). Then the variance of the general population can be estimated in two independent ways: from the mean values of y at each level of x, or as the arithmetic mean of the variance estimates of y at each level of x. The first estimate is known as the variance estimate for the levels, S²_Lv; the second is the error variance estimate, S²_Err.
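
In the notation of the table above, the two estimates can be written as follows. This is a sketch that assumes the standard balanced-design definitions (n observations at each of the k levels), chosen to agree with the degrees of freedom k-1 and k*(n-1) used in this section:

```latex
S^2_{Lv} = \frac{n}{k-1} \sum_{j=1}^{k} \left( y_{\cdot j} - y_{\cdot\cdot} \right)^2
\qquad
S^2_{Err} = \frac{1}{k\,(n-1)} \sum_{j=1}^{k} \sum_{i=1}^{n} \left( y_{ij} - y_{\cdot j} \right)^2
```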

Where:

y_.j: mean of the observations at the j-th level.
y_..: common (grand) mean of all observations.

If the levels of the factor x do not affect the mean, the ratio F = S²_Lv/S²_Err follows the Fisher F-distribution. The shape of this distribution depends on the degrees of freedom of the estimates S²_Lv and S²_Err: ν₁=(k-1) for the numerator and ν₂=k*(n-1) for the denominator. For each specified significance level α there is a critical value F_crit that F exceeds with probability not greater than α when the levels of x do not affect the mean of y. Therefore, if the calculated value of the F statistic exceeds the corresponding F_crit, the data contradicts the hypothesis that the means of y are equal at all levels of x. If F<F_crit, the data does not contradict the hypothesis, and the levels of x are assumed not to affect the mean of y.
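
The single-factor procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the function name `one_way_anova`, the toy data, and the choice to take F_crit from a printed F table are all assumptions made for the example, and the design is assumed to be balanced (n observations at every level).

```python
from statistics import mean


def one_way_anova(levels):
    """Return (F, nu1, nu2) for a balanced single-factor design.

    levels: list of k lists, one per level of x, each holding
    n observations of y (hypothetical input layout).
    """
    k = len(levels)
    n = len(levels[0])
    if any(len(col) != n for col in levels):
        raise ValueError("balanced design expected: n observations per level")

    level_means = [mean(col) for col in levels]   # y_.j
    grand_mean = mean(level_means)                # y_.. (valid because n is equal)

    # Variance estimate from the level means, k - 1 degrees of freedom.
    s2_lv = n * sum((m - grand_mean) ** 2 for m in level_means) / (k - 1)

    # Pooled within-level variance, k * (n - 1) degrees of freedom.
    s2_err = sum(
        (y - m) ** 2 for col, m in zip(levels, level_means) for y in col
    ) / (k * (n - 1))

    return s2_lv / s2_err, k - 1, k * (n - 1)


# Hypothetical toy data: k = 3 levels, n = 3 observations each.
data = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, nu1, nu2 = one_way_anova(data)

# Assumed table lookup: F critical value for (2, 6) degrees of freedom, alpha = 0.05.
F_CRIT = 5.14
rejects_equal_means = f_stat > F_CRIT
```

For this toy data the level means are 2, 3 and 6, which gives S²_Lv = 13 and S²_Err = 1, so F is about 13 and exceeds the tabulated critical value; the hypothesis of equal means would be rejected.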

Two-Factor Analysis of Variance

Two-factor analysis of variance tests the hypothesis that the means of the controlled output parameter y are equal across the levels of two factors.

In this model the input variables x₁ and x₂ take discrete values, and the output variable y is a continuous random variable whose probabilistic nature is caused by the additive noise e.

Two-factor analysis of variance is based on the following assumptions:

In each observation i, the noise e_i is normally distributed with zero mean and finite variance.
The variance of e_i is the same constant for every i.

Consider the procedure for two-factor analysis of variance. Suppose x₁ takes k different values (the factor x₁ has k levels) and x₂ takes m different values (the factor x₂ has m levels). Let there be n observations of the output variable y at each combination of levels. Then the results can be displayed as the following table:

Levels of the input factor x₂	Levels of the input factor x₁
	1	2	…	j	…
1	y₁₁₁ … y_11n	y₁₂₁ … y_12n	…	y_1j1 … y_1jn	…
2	y₂₁₁ … y_21n	y₂₂₁ … y_22n	…	y_2j1 … y_2jn	…
…	…	…	…	…	…
i	y_i11 … y_i1n	y_i21 … y_i2n	…	y_ij1 … y_ijn	…
…	…	…	…	…	…
m	y_m11 … y_m1n	y_m21 … y_m2n	…	y_mj1 … y_mjn	…

If the levels of the factors x₁ and x₂ do not affect the mean of y, all observations are a sample from the same general population (provided that the assumptions listed above are satisfied). Then the variance of the general population can be estimated in the following independent ways: from the mean values of y at each level of the factor x₁ or x₂, or as the arithmetic mean of the variance estimates of y at each combination of levels of x₁ and x₂. As in single-factor analysis of variance, the first kind of estimate is called the variance estimate for the levels (S²_Lv1 and S²_Lv2 for the two factors), and the second is called the error variance estimate S²_Err.

For the first and the second factor:
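
In the notation of the table above, the level estimates can be written as follows. This is a sketch assuming the standard balanced-design definitions; the index l, running over the n observations in a cell, is a name introduced here for illustration:

```latex
S^2_{Lv1} = \frac{m\,n}{k-1} \sum_{j=1}^{k} \left( y_{\cdot j \cdot} - y_{\cdot\cdot\cdot} \right)^2
\qquad
S^2_{Lv2} = \frac{k\,n}{m-1} \sum_{i=1}^{m} \left( y_{i \cdot\cdot} - y_{\cdot\cdot\cdot} \right)^2
```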

Where:

y_.j.: mean over the j-th level of the first factor.
y_i..: mean over the i-th level of the second factor.
y_...: common (grand) mean of all observations.

The estimate of error variance is calculated as follows:
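
A form consistent with the ν_Err=m*k*(n-1) degrees of freedom used in this section is the pooled within-cell variance. It is given here as a sketch under the usual balanced-design assumption, with l as an assumed index over the n observations in a cell:

```latex
S^2_{Err} = \frac{1}{m\,k\,(n-1)} \sum_{i=1}^{m} \sum_{j=1}^{k} \sum_{l=1}^{n} \left( y_{ijl} - y_{ij\cdot} \right)^2
```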

Where:

y_ij.: mean value of y at the j-th level of the first factor and the i-th level of the second factor.

Two factors enable the use of one more variance estimate, that is, the interaction:
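
Assuming the standard balanced-design definition, with (m-1)*(k-1) degrees of freedom as stated in this section, the interaction estimate can be sketched as:

```latex
S^2_{Int} = \frac{n}{(m-1)(k-1)} \sum_{i=1}^{m} \sum_{j=1}^{k} \left( y_{ij\cdot} - y_{i\cdot\cdot} - y_{\cdot j\cdot} + y_{\cdot\cdot\cdot} \right)^2
```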

If the levels of the factors x₁ and x₂ do not affect the mean, the ratios F₁=S²_Lv1/S²_Err, F₂=S²_Lv2/S²_Err and F_Int=S²_Int/S²_Err follow the Fisher F-distribution. The shape of this distribution depends on the degrees of freedom of the estimates S²_Lv1, S²_Lv2, S²_Int and S²_Err: ν₁=(k-1), ν₂=(m-1) and ν_Int=(m-1)*(k-1) for the numerators, and ν_Err=m*k*(n-1) for the denominator. For any specified significance level α there is a critical value F_crit that F exceeds with probability not greater than α when the levels of the factors x₁, x₂ and their interaction x₁*x₂ have no influence. Therefore, if the calculated value of an F statistic exceeds the corresponding F_crit, the data contradicts the hypothesis that the means of y are equal for the corresponding effect: the levels of x₁, the levels of x₂, or their interaction x₁*x₂. If F<F_crit, the data does not contradict this hypothesis, and the levels are considered not to influence the mean of y.
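
The two-factor procedure can be sketched in the same way. This is an illustrative example under stated assumptions rather than the library's implementation: the function name `two_way_anova`, the nested-list input layout and the toy data are hypothetical, and the design is assumed to be balanced with n ≥ 2 observations in every cell.

```python
from statistics import mean


def two_way_anova(cells):
    """Return the F ratios and degrees of freedom for a balanced two-factor design.

    cells[i][j] is the list of n observations of y at level i of x2 (rows)
    and level j of x1 (columns); this layout is an assumption of the sketch.
    """
    m = len(cells)          # levels of x2
    k = len(cells[0])       # levels of x1
    n = len(cells[0][0])    # observations per cell
    if any(len(row) != k or any(len(c) != n for c in row) for row in cells):
        raise ValueError("balanced design expected: n observations per cell")

    cell = [[mean(c) for c in row] for row in cells]                 # y_ij.
    row_mean = [mean(r) for r in cell]                               # y_i..
    col_mean = [mean(cell[i][j] for i in range(m)) for j in range(k)]  # y_.j.
    grand = mean(row_mean)                                           # y_...

    s2_lv1 = m * n * sum((c - grand) ** 2 for c in col_mean) / (k - 1)
    s2_lv2 = k * n * sum((r - grand) ** 2 for r in row_mean) / (m - 1)
    s2_int = n * sum(
        (cell[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(m) for j in range(k)
    ) / ((m - 1) * (k - 1))
    s2_err = sum(
        (y - cell[i][j]) ** 2
        for i in range(m) for j in range(k) for y in cells[i][j]
    ) / (m * k * (n - 1))

    return {
        "F1": s2_lv1 / s2_err, "F2": s2_lv2 / s2_err, "F_int": s2_int / s2_err,
        "nu1": k - 1, "nu2": m - 1,
        "nu_int": (m - 1) * (k - 1), "nu_err": m * k * (n - 1),
    }


# Hypothetical toy data: m = 2, k = 2, n = 2.
toy = [[[1.0, 3.0], [3.0, 5.0]], [[5.0, 7.0], [11.0, 13.0]]]
result = two_way_anova(toy)
```

For this toy data the sketch gives F₁ = 16, F₂ = 36 and F_Int = 4, each with 1 and 4 degrees of freedom. Each ratio is then compared with the critical value for its own pair of degrees of freedom, taken from an F table or a statistics package.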