    Multivariate analysis of variance. Analysis of variance. Multivariate Designs: Multivariate ANOVA and Covariance Analysis

    ANOVA

    Coursework in the discipline "System Analysis"

    Author: student of group 99 ISE-2, V. V. Zhbanov

    Orenburg State University

    Faculty of Information Technologies

    Department of Applied Informatics

    Orenburg-2003

    Introduction

    Purpose of the work: to become acquainted with analysis of variance as a statistical method.

    Analysis of variance (from the Latin dispersio - dispersion) is a statistical method that allows one to analyze the influence of various factors on the variable under study. The method was developed by the biologist R. Fisher in 1925 and was initially used to evaluate experiments in crop production. Later, the general scientific significance of analysis of variance for experiments in psychology, pedagogy, medicine, etc. became clear.

    The purpose of analysis of variance is to test the significance of the difference between means by comparing variances. The variance of the measured trait is decomposed into independent terms, each of which characterizes the influence of a particular factor or of their interaction. Subsequent comparison of these terms makes it possible to assess the significance of each studied factor, as well as of their combination /1/.

    If the null hypothesis is true (about the equality of means in several groups of observations selected from the general population), the estimate of the variance associated with within-group variability should be close to the estimate of the between-group variance.

    When conducting market research, the question of comparability of results often arises. For example, when conducting surveys about the consumption of a product in different regions of the country, it is necessary to draw conclusions about how much the survey data differ from each other. It makes no sense to compare individual indicators, so the comparison and subsequent assessment are carried out on some averaged values and on the deviations from this averaged estimate. What is studied is the variation of the trait, and the variance can be taken as its measure. The variance σ² is a measure of variation, defined as the average of the squared deviations of the trait from its mean.
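    For illustration, a minimal sketch (with made-up numbers) of the variance as the average squared deviation:

        # Variance as the average of squared deviations from the mean
        # (population form), on hypothetical survey scores.
        x = [4.0, 7.0, 6.0, 3.0, 5.0]
        mean = sum(x) / len(x)                          # 5.0
        var = sum((v - mean) ** 2 for v in x) / len(x)  # sigma^2 = 2.0
        print(mean, var)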

    In practice, problems of a more general nature often arise - the problem of checking the significance of differences in the means of samples of several populations. For example, it is required to assess the impact of various raw materials on the quality of products, to solve the problem of the effect of the amount of fertilizers on the yield of agricultural products.

    Sometimes analysis of variance is used to establish the homogeneity of several populations (the variances of these populations are the same by assumption; if analysis of variance shows that the mathematical expectations are also the same, then in this sense the populations are homogeneous). Homogeneous populations can be combined into one, thereby obtaining more complete information about it and, consequently, more reliable conclusions /2/.

    1 Analysis of variance

    1.1 Basic concepts of ANOVA

    In the process of observing the object under study, the qualitative factors change arbitrarily or in a prescribed way. A specific realization of a factor (for example, a certain temperature regime, a chosen piece of equipment or material) is called a factor level, or a processing method. The ANOVA model with fixed factor levels is called model I, and the model with random factors is called model II. By varying a factor, one can investigate its influence on the magnitude of the response. At present, the general theory of analysis of variance has been developed for model I.

    Depending on the number of factors that determine the variation of the effective trait, analysis of variance is subdivided into univariate and multivariate.

    The main schemes for organizing the source data with two or more factors are:

    cross-classification, typical of model I, in which each level of one factor is combined, when planning the experiment, with each gradation of the other factor;

    hierarchical (nested) classification, characteristic of model II, in which each randomly chosen value of one factor corresponds to its own subset of values of the second factor.

    If the dependence of the response on qualitative and quantitative factors is investigated simultaneously, i.e. on factors of a mixed nature, then analysis of covariance is used /3/.

    Thus, these models differ from each other in the method of choosing the factor levels, which, obviously, primarily affects the possibility of generalizing the obtained experimental results. For analysis of variance in one-way experiments, the difference between these two models is not so significant, but in multivariate analysis of variance it can turn out to be very important.

    When conducting analysis of variance, the following statistical assumptions must hold: regardless of the factor level, the response values have a normal (Gaussian) distribution and the same variance. This equality of variances is called homogeneity. Thus, a change in the processing method affects only the position of the response random variable, which is characterized by its mean or median. Therefore, all observations of the response belong to a shift family of normal distributions.

    The ANOVA technique is said to be "robust". This statisticians' term means that the assumptions can be violated to some extent while the technique remains usable.

    When the law of distribution of the response values ​​is unknown, nonparametric (most often rank) methods of analysis are used.

    Analysis of variance is based on dividing the variance into parts or components. The variation caused by the factor underlying the grouping is characterized by the intergroup variance δ². It is a measure of the variation of the group means around the overall mean and is determined by the formula

    δ² = Σ_j n_j (x̄_j − x̄)² / Σ_j n_j, (j = 1, 2, ..., k),

    where k is the number of groups;

    n_j is the number of units in the j-th group;

    x̄_j is the group mean of the j-th group;

    x̄ is the overall mean of the whole population of units.

    The variation due to the influence of other factors is characterized in each group by the intragroup variance σ_j²:

    σ_j² = Σ_i (x_ij − x̄_j)² / n_j.

    Between the total variance σ0², the mean of the intragroup variances σ̄² = Σ_j n_j σ_j² / Σ_j n_j and the intergroup variance δ² there is the relationship

    σ0² = σ̄² + δ².

    The intragroup variance explains the influence of factors unaccounted for in the grouping, and the intergroup variance explains the influence of the grouping factor on the group means /2/.
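    A short numerical check of this decomposition on hypothetical grouped data (population-form variances; numpy assumed):

        import numpy as np

        # Three hypothetical groups of observations.
        groups = [np.array([2.0, 3.0, 4.0]),
                  np.array([5.0, 7.0, 6.0]),
                  np.array([9.0, 8.0, 10.0])]

        all_x = np.concatenate(groups)
        grand = all_x.mean()
        N = sum(len(g) for g in groups)

        total = ((all_x - grand) ** 2).mean()                          # sigma_0^2
        within = sum(((g - g.mean()) ** 2).sum() for g in groups) / N  # mean intragroup variance
        between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / N  # delta^2

        assert np.isclose(total, within + between)  # sigma_0^2 = mean within + between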

    1.2 One-way ANOVA

    The one-factor ANOVA model has the form:

    x_ij = μ + F_i + ε_ij, (1)

    where x_ij is the value of the variable under study obtained at the i-th level of the factor (i = 1, 2, ..., m) with the j-th serial number (j = 1, 2, ..., n);

    F_i is the effect due to the influence of the i-th level of the factor;

    ε_ij is the random component, or disturbance, caused by the influence of uncontrollable factors, i.e. the variation within an individual level.

    The main prerequisites of analysis of variance are:

    the mathematical expectation of the disturbance ε_ij is equal to zero for any i, i.e.

    M(ε_ij) = 0; (2)

    the disturbances ε_ij are mutually independent;

    the variance of the variable x_ij (or of the disturbance ε_ij) is constant for any i, j, i.e.

    D(ε_ij) = σ²; (3)

    the variable x_ij (or the disturbance ε_ij) has the normal distribution law N(0; σ²).
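    A sketch (with made-up parameters) of generating data that satisfies model (1) and prerequisites (2)-(3):

        import numpy as np

        rng = np.random.default_rng(0)
        mu = 10.0                        # overall mean
        F = np.array([-1.0, 0.0, 1.0])   # fixed level effects F_i (model I)
        m, n, sigma = len(F), 8, 0.5

        # x_ij = mu + F_i + eps_ij, with eps_ij ~ N(0, sigma^2) i.i.d.
        eps = rng.normal(0.0, sigma, size=(m, n))
        x = mu + F[:, None] + eps        # observation matrix: m levels x n replicates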

    The influence of factor levels can be either fixed or systematic (model I) or random (model II).

    Suppose, for example, it is necessary to find out whether there are significant differences between batches of products in terms of some quality indicator, i.e. to check the influence on quality of a single factor - the batch of products. If all batches of raw materials are included in the study, then the influence of the levels of such a factor is systematic (model I), and the conclusions obtained are applicable only to the individual batches involved in the study. If only a randomly selected part of the batches is included, then the influence of the factor is random (model II). In multifactorial complexes a mixed model III is possible, in which some factors have random levels while others are fixed.

    Let there be m batches of products. From each batch n1, n2, ..., nm products were selected, respectively (for simplicity it is assumed that n1 = n2 = ... = nm = n). The values of the quality indicator of these products form the observation matrix

    x_11 x_12 ... x_1n
    x_21 x_22 ... x_2n
    ..................
    x_m1 x_m2 ... x_mn

    = (x_ij), (i = 1, 2, ..., m; j = 1, 2, ..., n).

    It is necessary to check the significance of the influence of the product batches on product quality.

    If we assume that the elements of the rows of the observation matrix are numerical values of random variables X1, X2, ..., Xm expressing product quality and having a normal distribution law with mathematical expectations a1, a2, ..., am, respectively, and identical variances σ², then this task reduces to testing the null hypothesis H0: a1 = a2 = ... = am, which is what analysis of variance does.

    Averaging over any index is indicated by an asterisk (or a dot) in place of that index. Then the average quality indicator of the products of the i-th batch, or the group mean for the i-th level of the factor, takes the form

    x̄_i* = (1/n) Σ_j x_ij, (4)

    where x̄_i* is the average over the columns (over j);

    x_ij is an element of the observation matrix;

    n is the sample size.

    The overall mean is

    x̄_** = (1/(mn)) Σ_i Σ_j x_ij. (5)

    The sum of squared deviations of the observations x_ij from the overall mean x̄_** can be written as

    Σ_i Σ_j (x_ij − x̄_**)² = n Σ_i (x̄_i* − x̄_**)² + Σ_i Σ_j (x_ij − x̄_i*)² + 2 Σ_i Σ_j (x̄_i* − x̄_**)(x_ij − x̄_i*), (6)

    or

    Q = Q1 + Q2 + Q3.

    The last term is equal to zero, since the sum of deviations of the values of a variable from its mean is equal to zero, i.e.

    Q3 = 2 Σ_i (x̄_i* − x̄_**) Σ_j (x_ij − x̄_i*) = 0.

    The first term can be written as

    Q1 = n Σ_i (x̄_i* − x̄_**)². (7)

    The result is the identity

    Q = Q1 + Q2, (8)

    where Q = Σ_i Σ_j (x_ij − x̄_**)² is the total, or full, sum of squared deviations;

    Q1 = n Σ_i (x̄_i* − x̄_**)² is the sum of squared deviations of the group means from the overall mean, or the intergroup (factorial) sum of squares;

    Q2 = Σ_i Σ_j (x_ij − x̄_i*)² is the sum of squared deviations of the observations from the group means, or the intragroup (residual) sum of squares.

    Expansion (8) contains the main idea of analysis of variance. As applied to the problem under consideration, equality (8) shows that the total variation of the quality indicator, measured by the sum Q, consists of two components - Q1 and Q2 - which characterize the variability of this indicator between batches (Q1) and the variability within batches (Q2), the latter characterizing the variation, identical for all batches, that is due to unaccounted factors.

    In analysis of variance it is not the sums of squared deviations themselves that are analyzed, but the so-called mean squares, which are unbiased estimates of the corresponding variances and are obtained by dividing the sums of squared deviations by the corresponding numbers of degrees of freedom.

    The number of degrees of freedom is defined as the total number of observations minus the number of equations connecting them. Therefore, for the mean square s1², which is an unbiased estimate of the intergroup variance, the number of degrees of freedom is k1 = m − 1, since m group means, connected by one equation (5), are used in its calculation. And for the mean square s2², which is an unbiased estimate of the intragroup variance, the number of degrees of freedom is k2 = mn − m, since all mn observations, connected by the m equations (4), are used in its calculation.

    Thus:

    s1² = Q1/(m − 1), s2² = Q2/(mn − m).

    If we find the mathematical expectations of the mean squares, substituting into their formulas the expression for x_ij from model (1), we obtain, taking into account the properties of the mathematical expectation,

    M(s1²) = σ² + (n/(m − 1)) M[Σ_i (F_i − F_*)²], (9)

    and

    M(s2²) = σ². (10)

    For model I with fixed levels of the factor, the F_i (i = 1, 2, ..., m) are non-random values, therefore

    M(s1²) = n Σ_i (F_i − F_*)²/(m − 1) + σ².

    The hypothesis H0 takes the form F_i = F_* (i = 1, 2, ..., m), i.e. the influence of all factor levels is the same. If this hypothesis is true,

    M(s1²) = M(s2²) = σ².

    For the random model II, the term F_i in expression (1) is a random quantity. Denoting its variance by σ_F², we obtain from (9)

    M(s1²) = σ² + n σ_F², (11)

    and, as in model I,

    M(s2²) = σ².
    Table 1.1 presents the general layout of the quantities calculated in analysis of variance.

    Table 1.1 - Basic ANOVA table

    Variance component | Sum of squares | Degrees of freedom | Mean square | Expectation of the mean square
    Intergroup | Q1 | k1 = m − 1 | s1² = Q1/(m − 1) | σ² + n Σ_i (F_i − F_*)²/(m − 1) (model I) or σ² + n σ_F² (model II)
    Intragroup | Q2 | k2 = mn − m | s2² = Q2/(mn − m) | σ²
    Total | Q = Q1 + Q2 | mn − 1 | |

    For model II the hypothesis H0 takes the form σ_F² = 0. If this hypothesis is true, then

    M(s1²) = M(s2²) = σ².

    In the case of a one-factor complex, for both model I and model II the mean squares s1² and s2² are unbiased and independent estimates of the same variance σ².

    Consequently, testing the null hypothesis H0 reduces to testing the significance of the difference between the unbiased sample estimates s1² and s2² of the variance σ².

    The hypothesis H0 is rejected if the actually computed value of the statistic F = s1²/s2² is greater than the critical value F(α; k1, k2) determined at significance level α for the degrees of freedom k1 = m − 1 and k2 = mn − m, and is accepted if F < F(α; k1, k2).
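    A sketch of the whole testing procedure on a hypothetical observation matrix (scipy assumed; scipy.stats.f_oneway reproduces the same F):

        import numpy as np
        from scipy import stats

        x = np.array([[10.2, 10.5,  9.9, 10.4],   # batch 1
                      [10.8, 11.0, 10.9, 11.2],   # batch 2
                      [ 9.8,  9.6, 10.0,  9.7]])  # batch 3
        m, n = x.shape
        grand, group = x.mean(), x.mean(axis=1)

        Q1 = n * ((group - grand) ** 2).sum()      # intergroup sum of squares
        Q2 = ((x - group[:, None]) ** 2).sum()     # intragroup sum of squares
        s1, s2 = Q1 / (m - 1), Q2 / (m * n - m)    # mean squares
        F = s1 / s2

        F_crit = stats.f.ppf(0.95, m - 1, m * n - m)  # alpha = 0.05
        print(F, F_crit, F > F_crit)                  # reject H0 if F > F_crit
        print(stats.f_oneway(*x))                     # same F statistic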

    Fisher's F-distribution (for x > 0) has the following density function (for k1 = 1, 2, ...; k2 = 1, 2, ...):

    f(x) = [Γ((k1 + k2)/2) / (Γ(k1/2) Γ(k2/2))] · (k1/k2)^(k1/2) · x^(k1/2 − 1) · (1 + k1·x/k2)^(−(k1 + k2)/2),

    where k1 and k2 are the degrees of freedom;

    Γ is the gamma function.
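    The reconstructed density can be sanity-checked numerically against scipy.stats.f.pdf:

        import numpy as np
        from scipy import stats
        from scipy.special import gamma

        def f_pdf(x, k1, k2):
            # Fisher-Snedecor density as written above.
            c = gamma((k1 + k2) / 2) / (gamma(k1 / 2) * gamma(k2 / 2))
            return (c * (k1 / k2) ** (k1 / 2) * x ** (k1 / 2 - 1)
                    * (1 + k1 * x / k2) ** (-(k1 + k2) / 2))

        x = np.linspace(0.1, 5.0, 50)
        assert np.allclose(f_pdf(x, 4, 10), stats.f.pdf(x, 4, 10))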

    With regard to this problem, the refutation of the hypothesis H 0 means the presence of significant differences in the quality of products of different batches at the considered level of significance.

    To calculate the sums of squares Q1, Q2, Q it is often convenient to use the formulas:

    Q = Σ_i Σ_j x_ij² − T²/(mn), (12)

    Q1 = (1/n) Σ_i T_i² − T²/(mn), (13)

    Q2 = Q − Q1, (14)

    where T = Σ_i Σ_j x_ij is the grand total and T_i = Σ_j x_ij are the level (batch) totals; i.e. the means themselves, generally speaking, do not have to be found.
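    A quick check that the computational formulas (12)-(14) agree with the defining sums of squares (hypothetical matrix):

        import numpy as np

        x = np.array([[10.2, 10.5,  9.9, 10.4],
                      [10.8, 11.0, 10.9, 11.2],
                      [ 9.8,  9.6, 10.0,  9.7]])
        m, n = x.shape
        T, Ti = x.sum(), x.sum(axis=1)               # grand total and batch totals

        Q  = (x ** 2).sum() - T ** 2 / (m * n)       # formula (12)
        Q1 = (Ti ** 2).sum() / n - T ** 2 / (m * n)  # formula (13)
        Q2 = Q - Q1                                  # formula (14)

        grand, group = x.mean(), x.mean(axis=1)
        assert np.isclose(Q1, n * ((group - grand) ** 2).sum())
        assert np.isclose(Q2, ((x - group[:, None]) ** 2).sum())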

    Thus, the one-way ANOVA procedure consists in testing the hypothesis H0 that there is a single group of homogeneous experimental data against the alternative that there is more than one such group. Homogeneity refers to the similarity of means and variances in any subset of the data. The variances may be either known or unknown in advance. If there is reason to believe that the known or unknown variance of the measurements is the same over the entire data set, then the task of one-way analysis of variance reduces to studying the significance of the differences between the means of the data groups /1/.

    1.3 Multivariate analysis of variance

    It should be noted right away that there is no fundamental difference between multivariate and univariate analysis of variance. Multivariate analysis does not change the general logic of analysis of variance but only complicates it slightly, since, in addition to the effect on the dependent variable of each factor taken separately, their joint action must also be assessed. Thus, what multivariate analysis of variance brings to data analysis that is new concerns mainly the ability to assess interfactor interaction. Nevertheless, it is still possible to assess the influence of each factor separately. In this sense the procedure of multivariate analysis of variance (in its computer version) is undoubtedly more economical, since in a single run it solves two problems at once: it assesses the influence of each factor and of their interaction /3/.

    The general scheme of a two-factor experiment whose data are processed by analysis of variance is shown in Figure 1.1.

    Figure 1.1 - Scheme of a two-factor experiment

    Data subjected to multivariate analysis of variance are often labeled according to the number of factors and their levels.

    Suppose that in the problem under consideration the products of m different batches were manufactured on l different machines, and it is required to find out whether there are significant differences in product quality for each factor:

    A - the batch of products;

    B - the machine.

    The result is a transition to the problem of two-way analysis of variance.

    All data are presented in Table 1.2, in which the rows correspond to the levels A_i of factor A, the columns to the levels B_j of factor B, and the corresponding cells contain the values of the product quality indicator x_ijk (i = 1, 2, ..., m; j = 1, 2, ..., l; k = 1, 2, ..., n).

    Table 1.2 - Indicators of product quality

            B1                 B2                 ...  Bj                 ...  Bl
    A1      x_111, ..., x_11n  x_121, ..., x_12n  ...  x_1j1, ..., x_1jn  ...  x_1l1, ..., x_1ln
    A2      x_211, ..., x_21n  x_221, ..., x_22n  ...  x_2j1, ..., x_2jn  ...  x_2l1, ..., x_2ln
    ...
    Ai      x_i11, ..., x_i1n  x_i21, ..., x_i2n  ...  x_ij1, ..., x_ijn  ...  x_il1, ..., x_iln
    ...
    Am      x_m11, ..., x_m1n  x_m21, ..., x_m2n  ...  x_mj1, ..., x_mjn  ...  x_ml1, ..., x_mln

    The two-factor ANOVA model has the form:

    x ijk = μ + F i + G j + I ij + ε ijk, (15)

    where x ijk is the observation value in the cell ij with the number k;

    μ - overall average;

    F i - the effect due to the influence of the i-th level of factor A;

    G j - the effect due to the influence of the j-th level of factor B;

    I ij - the effect due to the interaction of the two factors, i.e. the deviation of the mean of the observations in cell ij from the sum of the first three terms in model (15);

    ε ijk is the disturbance caused by the variation of the variable within a separate cell.

    It is assumed that ε_ijk has the normal distribution law N(0; σ²) and that all the averaged effects F_*, G_*, I_i*, I_*j are equal to zero.

    Group means are found by the formulas:

    in the cell: x̄_ij* = (1/n) Σ_k x_ijk;

    by row: x̄_i** = (1/(ln)) Σ_j Σ_k x_ijk;

    by column: x̄_*j* = (1/(mn)) Σ_i Σ_k x_ijk;

    the overall mean: x̄_*** = (1/(mln)) Σ_i Σ_j Σ_k x_ijk.
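    A sketch computing these means for a hypothetical m x l x n data cube:

        import numpy as np

        rng = np.random.default_rng(1)
        m, l, n = 3, 4, 5                          # levels of A, levels of B, replicates
        x = rng.normal(50.0, 2.0, size=(m, l, n))  # hypothetical x_ijk

        cell = x.mean(axis=2)        # cell means
        row = x.mean(axis=(1, 2))    # means over the levels of factor A
        col = x.mean(axis=(0, 2))    # means over the levels of factor B
        overall = x.mean()           # overall mean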

    Table 1.3 shows the general layout of the quantities calculated in analysis of variance.

    Table 1.3 - Basic ANOVA table

    Variance component    | Sum of squares | Degrees of freedom | Mean square
    Intergroup (factor A) | Q1             | m − 1              | s_A² = Q1/(m − 1)
    Intergroup (factor B) | Q2             | l − 1              | s_B² = Q2/(l − 1)
    Interaction           | Q3             | (m − 1)(l − 1)     | s_AB² = Q3/((m − 1)(l − 1))
    Residual              | Q4             | ml(n − 1)          | s0² = Q4/(ml(n − 1))
    Total                 | Q              | mln − 1            |

    Testing the null hypotheses H_A, H_B, H_AB about the absence of influence of factors A and B and of their interaction AB on the considered variable is carried out by comparing the ratios s_A²/s0², s_B²/s0², s_AB²/s0² (for model I with fixed factor levels) or the ratios s_A²/s_AB², s_B²/s_AB² (for the random model II) with the corresponding tabular values of the Fisher-Snedecor F-criterion. For the mixed model III, hypotheses about factors with fixed levels are tested as in model II, and hypotheses about factors with random levels as in model I.

    If n = 1, i.e. with one observation per cell, then not all null hypotheses can be tested, since the component Q3 drops out of the total sum of squared deviations, and with it the corresponding mean square, because in this case there can be no question of the interaction of factors.

    From the point of view of computational technique, to find the sums of squares Q1, Q2, Q3, Q4, Q it is more expedient to use the formulas:

    Q = Σ_i Σ_j Σ_k x_ijk² − T²/(mln),

    Q1 = (1/(ln)) Σ_i T_i² − T²/(mln),

    Q2 = (1/(mn)) Σ_j T_j² − T²/(mln),

    Q4 = Σ_i Σ_j Σ_k x_ijk² − (1/n) Σ_i Σ_j T_ij²,

    Q3 = Q − Q1 − Q2 − Q4,

    where T = Σ_i Σ_j Σ_k x_ijk is the grand total, and T_i, T_j, T_ij are the totals over the i-th row, the j-th column and the ij-th cell, respectively.
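    A sketch computing Q1-Q4 by these formulas and forming the model I ratios (synthetic data, scipy assumed):

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        m, l, n = 3, 4, 5
        x = rng.normal(50.0, 2.0, size=(m, l, n))            # x_ijk

        T = x.sum()
        Q  = (x ** 2).sum() - T ** 2 / (m * l * n)
        Q1 = (x.sum(axis=(1, 2)) ** 2).sum() / (l * n) - T ** 2 / (m * l * n)  # factor A
        Q2 = (x.sum(axis=(0, 2)) ** 2).sum() / (m * n) - T ** 2 / (m * l * n)  # factor B
        Q4 = (x ** 2).sum() - (x.sum(axis=2) ** 2).sum() / n                   # residual
        Q3 = Q - Q1 - Q2 - Q4                                                  # interaction

        sA, sB = Q1 / (m - 1), Q2 / (l - 1)
        sAB, s0 = Q3 / ((m - 1) * (l - 1)), Q4 / (m * l * (n - 1))
        for s, k in ((sA, m - 1), (sB, l - 1), (sAB, (m - 1) * (l - 1))):
            print(s / s0, stats.f.ppf(0.95, k, m * l * (n - 1)))  # F vs critical value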

    Deviations from the basic assumptions of analysis of variance - the normality of the distribution of the variable under study and the equality of variances in the cells - do not significantly affect the results of analysis of variance (if they are not excessive) when the number of observations in the cells is equal, but can be very noticeable when it is unequal. Moreover, with an unequal number of observations in the cells the complexity of the ANOVA apparatus increases sharply. Therefore it is recommended to plan a design with an equal number of observations in the cells, and if missing data are found, to replace them by the mean values of the other observations in the cells. Artificially entered missing data, however, should not be taken into account when calculating the number of degrees of freedom /1/.

    2 Application of ANOVA in various processes and studies

    2.1 Using analysis of variance in the study of migration processes

    Migration is a complex social phenomenon that largely determines the economic and political aspects of society. The study of migration processes is associated with identifying factors of interest, satisfaction with working conditions, and assessing the influence of the obtained factors on the intergroup movement of the population. The intensity of intergroup transitions is modeled as

    λ_ij = c_i q_ij a_j,

    where λ_ij is the intensity of transitions from the original group i (exit) to the new group j (entry);

    c_i is the capability and readiness to leave group i (c_i ≥ 0);

    q_ij is the attractiveness of the new group compared with the original one (0 ≤ q_ij ≤ 1);

    a_j is the accessibility of group j (a_j ≥ 0).

    The expected number of transitions from group i to group j is then

    ν_ij ≈ n_i λ_ij = n_i c_i q_ij a_j. (16)

    In practice, for an individual the probability p of transition to another group is small, while the size n of the group considered is large. In this case the law of rare events applies: ν_ij approximately follows the Poisson distribution with parameter μ = np,

    P(ν = k) = (μ^k / k!) e^(−μ), k = 0, 1, 2, ....

    As μ increases, the distribution approaches the normal one. The transformed quantity √ν_ij can therefore be considered approximately normally distributed.
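    A quick simulation (hypothetical intensities) of why the square-root transformation helps: the variance of √ν stays near 1/4 for Poisson counts regardless of the mean:

        import numpy as np

        rng = np.random.default_rng(3)
        for mu in (5, 20, 80):                 # hypothetical Poisson intensities
            nu = rng.poisson(mu, size=100_000)
            # The raw variance grows with mu; the variance of sqrt(nu) does not.
            print(mu, nu.var(), np.sqrt(nu).var())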

    If we take the logarithm of expression (16) and make the necessary changes of variables, we can obtain a model of analysis of variance:

    ln √ν_ij = ½ ln ν_ij = ½ (ln n_i + ln c_i + ln q_ij + ln a_j) + ε_ij,

    X_ij = 2 ln √ν_ij − ln n_i − ln q_ij,

    X_ij = C_i + A_j + ε.

    The quantities C_i and A_j define a two-way ANOVA model with one observation per cell. The inverse transformation of C_i and A_j recovers the coefficients c_i and a_j.

    When conducting the analysis of variance, the following value should be taken as the effective indicator Y:

    X̄ = (X_1,1 + X_1,2 + ... + X_mi,mj) / (mi·mj),

    where X̄ is the estimate of the mathematical expectation of X_i,j;

    mi and mj are, respectively, the numbers of exit and entry groups.

    The levels of factor I are the mi exit groups; the levels of factor J are the mj entry groups. It is assumed that mi = mj = m. The problem arises of testing the hypotheses H_I and H_J about the equality of the mathematical expectations of Y at the levels I_i and at the levels J_j, i, j = 1, ..., m. Testing the hypothesis H_I is based on comparing the values of the unbiased variance estimates s_I² and s_o². If the hypothesis H_I is true, the statistic F(I) = s_I²/s_o² has the Fisher distribution with degrees of freedom k1 = m − 1 and k2 = (m − 1)(m − 1). For a given significance level α, a right-sided critical point x_cr is found. If the numerical value of F(I) falls into the interval (x_cr, +∞), the hypothesis H_I is rejected and factor I is considered to affect the effective trait. The degree of this influence, according to the results of observations, is measured by the sample coefficient of determination, which shows what proportion of the variance of the effective trait in the sample is due to the influence of factor I. If F(I) does not fall into the critical interval, the hypothesis H_I is not rejected and the influence of factor I is considered insignificant.

    2.2 Principles of mathematical and statistical analysis of biomedical research data

    Depending on the task, the volume and nature of the material, and the type of data and their relationships, the methods of mathematical processing are chosen at the stages of both preliminary analysis (to assess the nature of the distribution in the studied sample) and final analysis, in accordance with the objectives of the study. An extremely important aspect is checking the homogeneity of the selected observation groups, including control groups, which can be done either by experts or by methods of multivariate statistics (for example, using cluster analysis). The first step, however, is to compose a questionnaire that provides a standardized description of the characteristics. This matters especially in epidemiological studies, where unity is needed in how different doctors understand and describe the same symptoms, including the ranges of their variation (severity). In the case of significant differences in the registration of the initial data (subjective assessment of the nature of pathological manifestations by different specialists) and the impossibility of bringing them to a single form at the stage of information collection, a so-called covariate correction can be carried out, which implies normalization of the variables, i.e. elimination of anomalies of the indicators in the data matrix. The "consensus of opinions" is formed taking into account the specialty and experience of the doctors, which then allows the results of the examination to be compared with each other. Multivariate analysis of variance and regression analysis can be used for this.

    Characteristics can be of the same type, which is rare, or of different types; this term refers to their different metrological nature. Quantitative, or numerical, characteristics are those measured on interval and ratio scales (group I). Qualitative, rank, or score characteristics express medical terms and concepts that have no numerical values (for example, severity of a condition) and are measured on an ordinal scale (group II). Classification, or nominal, characteristics (for example, profession or blood group) are those measured on a scale of names (group III).

    In many cases an attempt is made to analyze an extremely large number of features in the hope of increasing the information content of the sample. However, the selection of useful information - that is, feature selection - is an absolutely necessary operation, since to solve any classification problem one must select the information that is useful for that task. If, for some reason, the researcher does not do this independently, and there are no sufficiently substantiated substantive criteria for reducing the dimension of the feature space, the fight against information redundancy is carried out by formal methods, by assessing informativeness.

    Analysis of variance makes it possible to determine the influence of different factors (conditions) on the trait (phenomenon) under study, which is achieved by decomposing the total variability (variance expressed as the sum of squares of deviations from the total mean) into individual components caused by the influence of various sources of variability.

    Analysis of variance is used to investigate the threat of a disease in the presence of risk factors. The concept of relative risk compares patients with a particular disease and those without it. The value of the relative risk shows how many times the probability of falling ill increases when a risk factor is present; it can be estimated by the following simplified formula:

    r' = (a·d) / (b·c),

    where a is the presence of the feature in the study group;

    b is the absence of the feature in the study group;

    c is the presence of the feature in the comparison (control) group;

    d is the absence of the feature in the comparison (control) group.

    The attributable risk indicator (r_A) is used to estimate the proportion of morbidity associated with a given risk factor:

    r_A = Q·(r' − 1) / (Q·(r' − 1) + 1),

    where Q is the frequency of the risk marker in the population;

    r' is the relative risk.

    Identification of factors contributing to the onset (manifestation) of a disease, i.e. of risk factors, can be carried out in various ways, for example by assessing informativeness with subsequent ranking of the signs; this, however, does not reveal the cumulative effect of the selected parameters, in contrast to the use of regression and factor analyses and of methods of pattern recognition theory, which make it possible to obtain "symptom complexes" of risk factors. In addition, more sophisticated methods make it possible to analyze indirect links between risk factors and diseases /5/.

    2.3 Soil biotesting

    Various pollutants entering the agrocenosis can undergo various transformations in it, thus increasing their toxic effect. For this reason, methods for the integral assessment of the quality of the components of the agrocenosis turned out to be necessary. The studies were carried out on the basis of multivariate analysis of variance in an 11-field grain-herb-row crop rotation. The experiment studied the influence of the following factors: soil fertility (A), fertilization system (B), plant protection system (C). Soil fertility, fertilization system and plant protection system were studied in doses of 0, 1, 2 and 3. Basic options were represented by the following combinations:

    000 - the initial level of fertility, without the use of fertilizers and plant protection products from pests, diseases and weeds;

    111 - the average level of soil fertility, the minimum dose of fertilizer, biological protection of plants from pests and diseases;

    222 - an elevated level of soil fertility, the average dose of fertilizers, chemical protection of plants from weeds;

    333 - a high level of soil fertility, a high dose of fertilizers, chemical protection of plants from pests and diseases.

    The options were studied where only one factor is presented:

    200 - fertility;

    020 - fertilizers;

    002 - plant protection products.

    And also options with a different combination of factors - 111, 131, 133, 022, 220, 202, 331, 313, 311.

    The aim of the study was to examine the inhibition of chloroplast phototaxis and the coefficient of instantaneous growth as indicators of soil pollution in the various variants of the multifactorial experiment.

    The inhibition of phototaxis of duckweed chloroplasts was studied in different soil horizons: 0-20, 20-40 cm. Analysis of the variability of phototaxis in different variants of the experiment showed a significant effect of each of the factors (soil fertility, fertilization system and plant protection system). The share in the total dispersion of soil fertility was 39.7%, fertilizer systems - 30.7%, plant protection systems - 30.7%.

    To study the combined effect of the factors on the inhibition of chloroplast phototaxis, various combinations of experimental variants were used: in the first case 000, 002, 022, 222, 220, 200, 202, 020; in the second case 111, 333, 331, 313, 133, 311, 131.

    The results of two-way analysis of variance indicate a significant effect of interacting fertilizer systems and plant protection systems on differences in phototaxis for the first case (the share in the total variance was 10.3%). For the second case, a significant influence of the interacting soil fertility and the fertilizer system was found (53.2%).

    Three-factor analysis of variance showed in the first case a significant influence of the interaction of all three factors. The share in the total variance was 47.9%.

    The coefficient of instantaneous growth was studied in experiment variants 000, 111, 222, 333, 002, 200, 220. The first stage of testing was before the application of herbicides on winter wheat crops (April), the second stage after the application of herbicides (May), and the last at the moment of harvesting (July). The preceding crops were sunflower and grain corn.

    The emergence of new fronds was observed after a short lag phase with a period of total doubling of the wet weight of 2–4 days.

    In the control and in each variant, on the basis of the results obtained, the coefficient of instantaneous growth of the population r was calculated, and then the time of doubling the number of fronds (t double) was calculated.

    t double = ln2 / r.
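    For example (hypothetical growth coefficient):

        import math

        r = 0.23                     # instantaneous growth coefficient, 1/day (made up)
        t_double = math.log(2) / r   # ~3.0 days, within the 2-4 day range noted above
        print(t_double)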

    The calculation of these indicators was carried out in dynamics together with the analysis of soil samples. Analysis of the data showed that the doubling time of the duckweed population before the herbicide treatment was the shortest in comparison with the data after the treatment and at the time of harvesting. In the dynamics of observations, the response of the soil after the application of the herbicide and at the time of harvesting is of greater interest, first of all the interaction with fertilizers and the level of fertility.

    Sometimes obtaining a direct response to the introduction of chemicals can be complicated by the interaction of the preparation with fertilizers, both organic and mineral. The data obtained made it possible to trace the dynamics of the response to the introduced preparations: in all variants with chemical means of protection a suspension of the growth of the indicator is observed.

    The data of the one-way analysis of variance showed a significant effect of each indicator on the growth rate of duckweed at the first stage. At the second stage, the effect of differences in soil fertility was 65.0%, and of the fertilizer system and the plant protection system 65.0% each. The factors showed significant differences between the mean coefficient of instantaneous growth of variant 222 and variants 000, 111, 333. At the third stage, the share in the total variance of soil fertility was 42.9%, and of the fertilization system and the plant protection system 42.9% each. A significant difference was found between the mean values of variants 000 and 111, and of variants 333 and 222.

    The studied soil samples from the field monitoring variants differ from each other in the degree of inhibition of phototaxis. The influence of the fertility factor and of the systems of fertilizers and plant protection products, with shares of 30.7-39.7%, was noted in the one-factor analysis; in the two-factor and three-factor analyses the combined influence of the factors was registered.

    The analysis of the results of the experiment showed insignificant differences between the soil horizons in terms of the inhibition of phototaxis. Differences are marked by mean values.

    In all variants where plant protection products are present, changes in the position of the chloroplasts and a suspension of the growth of duckweed are observed /6/.

    2.4 Influenza Causes Increased Histamine Production

    Researchers at the Children's Hospital in Pittsburgh, USA, have obtained the first evidence that histamine levels increase in acute respiratory viral infections, although previously it had only been assumed that histamine plays a role in the onset of symptoms of acute upper respiratory tract infections.

    The scientists were interested in why many people use antihistamines for self-treatment of "colds" and rhinitis; in many countries these are in the OTC category, i.e. available without a doctor's prescription.

    The aim of this study was to determine whether the production of histamine increases with experimental infection caused by the influenza A virus.

    Influenza A virus was injected intranasally into 15 healthy volunteers, and then the infection was monitored. Every day during the course of the disease, the volunteers collected a morning portion of urine, and then the determination of histamine and its metabolites was carried out, and the total amount of histamine and its metabolites excreted per day was calculated.

    The disease developed in all 15 volunteers. Analysis of variance confirmed a significantly higher level of histamine in the urine on days 2-5 of the viral infection (p < 0.02), the period when the "cold" symptoms are most pronounced. Pairwise analysis showed that the histamine level rises most significantly on day 2 of the disease. In addition, it turned out that the daily amount of histamine and its metabolites in the urine during influenza is approximately the same as during an exacerbation of an allergic disease.

    The results of this study are the first direct evidence that histamine levels increase in acute respiratory infections /7/.

    2.5 Analysis of variance in chemistry

    Analysis of variance is a set of methods for determining dispersion, i.e., characteristics of particle sizes in dispersed systems. Analysis of variance includes various methods for determining the size of free particles in liquid and gaseous media, the size of pore channels in fine-porous bodies (in this case, instead of the concept of dispersion, an equivalent concept of porosity is used), as well as the specific surface area. Some of the methods of analysis of variance make it possible to obtain a complete picture of the distribution of particles by size (volume), while others give only an averaged characteristic of dispersion (porosity).

    The first group includes, for example, methods for determining the size of individual particles by direct measurement (sieve analysis, optical and electron microscopy) or by indirect data: the sedimentation rate of particles in a viscous medium (sedimentation analysis in a gravitational field and in centrifuges), the magnitude of electric current pulses, arising from the passage of particles through a hole in a non-conductive partition (conductometric method).

    The second group of methods combines the assessment of the average size of free particles and the determination of the specific surface area of ​​powders and porous bodies. The average particle size is found by the intensity of the scattered light (nephelometry), using an ultramicroscope, diffusion methods, etc., the specific surface - by the adsorption of gases (vapors) or dissolved substances, by gas permeability, dissolution rate, and other methods. Below are the limits of applicability of various methods of analysis of variance (particle sizes in meters):

    Sieve analysis - 10⁻²-10⁻⁴

    Sedimentation analysis in a gravitational field - 10⁻⁴-10⁻⁶

    Conductometric method - 10⁻⁴-10⁻⁶

    Microscopy - 10⁻⁴-10⁻⁷

    Filtration method - 10⁻⁵-10⁻⁷

    Centrifugation - 10⁻⁶-10⁻⁸

    Ultracentrifugation - 10⁻⁷-10⁻⁹

    Ultramicroscopy - 10⁻⁷-10⁻⁹

    Nephelometry - 10⁻⁷-10⁻⁹

    Electron microscopy - 10⁻⁷-10⁻⁹

    Diffusion method - 10⁻⁷-10⁻¹⁰

    Analysis of variance is widely used in various fields of science and industrial production to assess the dispersion of systems (suspensions, emulsions, sols, powders, adsorbents, etc.) with particle sizes from several millimeters (10⁻³ m) to several nanometers (10⁻⁹ m) /8/.

    2.6 The use of direct deliberate suggestion in the waking state in the method of training physical qualities

    Physical training is the fundamental aspect of sports training, since, to a greater extent than other aspects of training, it is characterized by physical loads affecting the morphological and functional properties of the body. The success of technical training, the content of the athlete's tactics, the realization of personal properties in the process of trainings and competitions depend on the level of physical fitness.

    One of the main tasks of physical training is the education of physical qualities. In this regard, there is a need to develop pedagogical means and methods that allow taking into account the age characteristics of young athletes, preserving their health, not requiring additional time and at the same time stimulating the growth of physical qualities and, as a consequence, sportsmanship. The use of verbal heteroinfluence in the training process in groups of initial training is one of the promising areas of research on this problem.

    Analysis of the theory and practice of implementing suggestive verbal heteroinfluence revealed the main contradictions:

    between the proven effectiveness of specific methods of verbal heteroinfluence in the training process and the practical impossibility of their use by the trainer;

    between the recognition of direct deliberate suggestion (hereinafter PPV) in the waking state as one of the main methods of verbal heteroinfluence in the pedagogical activity of a trainer and the lack of a theoretical substantiation of the methodological features of its use in sports training, in particular in the development of physical qualities.

    In connection with the revealed contradictions and the insufficient development of the problem of using a system of methods of verbal heteroinfluence in developing the physical qualities of athletes, the goal of the study was defined: to develop rational, purposeful methods of PPV in the waking state that improve the development of physical qualities, based on an assessment of the mental state and of the manifestation and dynamics of the physical qualities of judokas in initial training groups.

    In order to test and determine the effectiveness of the experimental PPV methods in developing the physical qualities of judokas, a comparative pedagogical experiment was carried out in which four groups took part: three experimental and one control. In the first experimental group (EG) the PPV M1 technique was used, in the second the PPV M2 technique, and in the third the PPV M3 technique. In the control group (CG) PPV techniques were not used.

    To determine the effectiveness of the pedagogical influence of the PPV methods in the process of developing physical qualities in judokas, a one-factor analysis of variance was carried out.

    The degree of influence of the PPV M1 technique in the development of:

    Endurance:

    a) after the third month it was 11.1%;

    Speed abilities:

    a) after the first month - 16.4%;

    b) after the second - 26.5%;

    c) after the third - 34.8%;

    Strength:

    a) after the second month - 26.7%;

    b) after the third - 35.3%;

    Flexibility:

    a) after the third month - 20.8%;

    Coordination abilities:

    a) after the second month of the main pedagogical experiment the degree of influence of the technique was 6.4%;

    b) after the third - 10.2%.

    Consequently, significant changes in the indicators of the level of development of physical qualities with the PPV M1 technique were found for speed abilities and strength; for these the degree of influence of the technique is greatest. The smallest degree of influence was found in the development of endurance, flexibility and coordination abilities, which suggests that the PPV M1 technique is insufficiently effective for these qualities.

    The degree of influence of the PPV M2 technique in the development of:

    Endurance:

    a) after the first month of the experiment - 12.6%;

    b) after the second - 17.8%;

    c) after the third - 20.3%.

    Speed abilities:

    a) after the third month of training sessions - 28%.

    Strength:

    a) after the second month - 27.9%;

    b) after the third - 35.9%.

    Flexibility:

    a) after the third month of training sessions - 14.9%.

    Coordination abilities - 13.1%.

    The result of the one-way analysis of variance obtained for this EG allows us to conclude that the PPV M2 technique is most effective in the development of endurance and strength, and less effective in the development of flexibility, speed and coordination abilities.

    The degree of influence of the PPV M3 technique in the development of:

    Endurance:

    a) after the first month of the experiment, 16.8%;

    b) after the second - 29.5%;

    c) after the third - 37.6%.

    Speed abilities:

    a) after the first month - 26.3%;

    b) after the second - 31.3%;

    c) after the third - 40.9%.

    Strength:

    a) after the first month - 18.7%;

    b) after the second - 26.7%;

    c) after the third - 32.3%.

    Flexibility:

    a) after the first - there are no changes;

    b) after the second - 16.9%;

    c) after the third - 23.5%.

    Coordination abilities:

    a) there are no changes after the first month;

    b) after the second - 23.8%;

    c) after the third - 91%.

    Thus, one-way analysis of variance showed that the use of the PPV M3 technique in the preparatory period is most effective for developing physical qualities, since the degree of its influence increases after each month of the pedagogical experiment /9/.

    2.7 Relief of acute psychotic symptoms in schizophrenic patients with an atypical neuroleptic

    The aim of the study was to explore the possibility of using rispolept for the relief of acute psychoses in patients diagnosed with schizophrenia (paranoid type according to ICD-10) and schizoaffective disorder. The main criterion studied was the duration of preservation of psychotic symptoms under pharmacotherapy with rispolept (main group) as against classical neuroleptics.

    The main objectives of the study were to determine the indicator of the duration of psychosis (the so-called net psychosis), which was understood as the preservation of productive psychotic symptoms from the moment the antipsychotics began to be used, expressed in days. This indicator was calculated separately for the group taking risperidone, and separately for the group taking classical antipsychotics.

    Along with this, the task was set to determine the proportion of the reduction of productive symptoms under the influence of risperidone in comparison with classical antipsychotics at different periods of therapy.

    A total of 89 patients (42 men and 47 women) with acute psychotic symptoms in the framework of the paranoid form of schizophrenia (49 patients) and schizoaffective disorder (40 patients) were studied.

    The first episode and the duration of the disease up to 1 year were recorded in 43 patients, while in the rest of the cases, at the time of the study, there were subsequent episodes of schizophrenia with a disease duration of more than 1 year.

    Rispolept therapy was received by 29 people, among whom there were 15 patients with a so-called first episode. Classical antipsychotic therapy was received by 60 people, including 28 with a first episode. The dosage of rispolept varied in the range from 1 to 6 mg per day and averaged 4 ± 0.4 mg/day. Risperidone was taken exclusively by mouth, after meals, once a day in the evening.

    Therapy with classical antipsychotics included the use of trifluoperazine (triftazine) in a daily dose of up to 30 mg intramuscularly, haloperidol in a daily dose of up to 20 mg intramuscularly, and triperidol in a daily dose of up to 10 mg orally. The vast majority of patients took classical antipsychotics as monotherapy for the first two weeks, after which, if necessary (when delusional, hallucinatory or other productive symptoms persisted), they switched to a combination of several classical antipsychotics. A neuroleptic with a pronounced elective anti-delusional and anti-hallucinatory effect (for example, haloperidol or triftazine) remained the main drug, and a drug with a distinct hypnosedative effect (chlorpromazine, tizercin, chlorprothixene in doses up to 50-100 mg/day) was added to it in the evening.

    In the group taking classical antipsychotics, the administration of anticholinergic correctors (parkopan, cyclodol) was provided in doses up to 10-12 mg / day. Correctors were prescribed in case of distinct extrapyramidal side effects in the form of acute dystonia, drug parkinsonism and akathisia.

    Table 2.1 presents data on the duration of psychosis in the treatment with rispolept and classical antipsychotics.

    Table 2.1 - Duration of psychosis ("net psychosis") during treatment with rispolept and classical antipsychotics

    As follows from the data in the table, when comparing the duration of psychosis during therapy with classical neuroleptics and risperidone, there is an almost twofold reduction in the duration of psychotic symptoms under the influence of rispolept. It is significant that this value of the duration of psychosis was not influenced by either the seizure sequence factors or the nature of the picture of the leading syndrome. In other words, the duration of psychosis was determined exclusively by the factor of therapy, i.e. depended on the type of drug used, regardless of the sequence number of the attack, the duration of the disease and the nature of the leading psychopathological syndrome.

    In order to confirm the obtained patterns, a two-way analysis of variance was carried out. At the same time, the interaction of the therapy factor and the seizure number (stage 1) and the interaction of the therapy factor and the nature of the leading syndrome (stage 2) were taken into account in turn. The results of analysis of variance confirmed the influence of the therapy factor on the duration of psychosis (F = 18.8) in the absence of the influence of the seizure number factor (F = 2.5) and the factor of the type of psychopathological syndrome (F = 1.7). It is important that the combined effect of the therapy factor and the seizure number on the duration of psychosis was also absent, as well as the combined effect of the therapy factor and the factor of psychopathological syndrome.

    Thus, the results of the analysis of variance confirmed the effect of only one factor - the antipsychotic used. Rispolept unambiguously led to a roughly twofold decrease in the duration of psychotic symptoms in comparison with traditional antipsychotics. Notably, this effect was achieved despite the oral administration of rispolept, whereas the classical antipsychotics were used parenterally in the majority of patients /10/.

    2.8 Warping of fancy yarns with roving effect

    At the Kostroma State Technological University a new structure of fancy thread with variable geometric parameters has been developed. This raises the problem of processing fancy yarns in pre-production. This study addressed the warping process with respect to the following questions: the choice of the type of tensioning device giving the minimum tension spread, and the equalization of the tension of threads of different linear density across the width of the warp beam.

    The object of research was a linen fancy thread of four variants of linear density, from 140 to 205 tex. The operation of tensioning devices of three types was investigated: a porcelain washer, a two-zone NS-1P and a single-zone NS-1P. An experimental study of the tension of the threads during warping was carried out on an SP-140-3L warping machine. The warping speed and the mass of the brake discs corresponded to the technological parameters of yarn warping.

    To study the dependence of the tension of the shaped thread on the geometric parameters during warping, an analysis was carried out for two factors: X 1 - the diameter of the effect, X 2 - the length of the effect. The output parameters are tension Y 1 and fluctuation in tension Y 2.

    The obtained regression equations are adequate to the experimental data at a significance level of 0.95, since the calculated Fisher criterion for all equations is less than the tabulated value.

    To determine the degree of influence of factors X 1 and X 2 on the parameters Y 1 and Y 2, an analysis of variance was carried out, which showed that the diameter of the effect has a greater influence on the level and fluctuation of tension.

    Comparative analysis of the obtained tensograms showed that the minimum tension spread during warping of this yarn is provided by the NS-1P two-zone tensioning device.

    It was found that with an increase in linear density from 105 to 205 tex, the NS-1P device gives an increase in the tension level only by 23%, while the porcelain washer - by 37%, the single-zone NS-1P by 53%.

    When forming warp rolls that include fancy and "smooth" threads, the tensioning device must be adjusted individually using the traditional method /11/.

    2.9 Concomitant pathology with complete loss of teeth in elderly and senile people

    Complete loss of teeth and concomitant pathology were studied epidemiologically in the elderly population living in nursing homes in Chuvashia. The survey was carried out by means of a dental examination and the filling in of statistical cards for 784 people. The results of the analysis showed a high percentage of complete loss of teeth, aggravated by the general pathology of the body. This characterizes the examined category of the population as a group of increased dental risk and requires a revision of the entire system of their dental care.

    In elderly people the incidence rate is twice as high, and in senile age six times as high, as the incidence rate of people of younger ages.

    The main diseases of the elderly and senile age are diseases of the circulatory system, nervous system and sensory organs, respiratory organs, digestive organs, bones and organs of movement, neoplasms and trauma.

    The aim of the study is to develop and obtain information about concomitant diseases, the effectiveness of dental prosthetics and the need for orthopedic treatment in elderly and senile people with complete loss of teeth.

    A total of 784 people aged 45 to 90 years were examined. The ratio of women to men is 2.8: 1.

    Evaluation of the statistical relationship using the Pearson rank correlation coefficient made it possible to establish the mutual influence of missing teeth on concomitant morbidity with a reliability level of p = 0.0005. Elderly patients with complete loss of teeth suffer from diseases characteristic of old age, namely, cerebral atherosclerosis and hypertension.

    Analysis of variance showed that in the conditions under study, the specificity of the disease plays a decisive role. The role of nosological forms in different age periods ranges from 52-60%. The greatest statistically significant effect on the absence of teeth is exerted by diseases of the digestive system and diabetes mellitus.

    In general, the group of patients aged 75-89 years was characterized by a large number of pathological diseases.

    In this study, a comparative analysis of the incidence of comorbidities among elderly and senile patients with complete loss of teeth living in nursing homes was carried out. A high percentage of missing teeth was revealed among people of this age category. In patients with complete edentulism, the comorbidity characteristic of this age is observed. Atherosclerosis and hypertension were the most common among the examined persons. The influence of diseases such as those of the gastrointestinal tract and diabetes mellitus on the state of the oral cavity was statistically significant; the proportion of the other nosological forms was in the range of 52-60%. The use of analysis of variance did not confirm a significant role of gender or place of residence in the indicators of oral health.

    Thus, in conclusion, it should be noted that the analysis of the distribution of concomitant diseases in persons with a complete absence of teeth in elderly and senile age showed that this category of citizens belongs to a special group of the population that should receive adequate dental care within the existing dental systems /12/.

    3 Analysis of variance in the context of statistical methods

    Statistical methods of analysis are a methodology for measuring the results of human activity, that is, translating qualitative characteristics into quantitative ones.

    The main steps in statistical analysis:

    Drawing up a plan for collecting the initial data: the values of the input variables (X1, ..., Xp) and the number of observations n. This step is performed when the experiment is actively planned.

    Obtaining the initial data and entering them into a computer. At this stage, arrays of numbers (x_1i, ..., x_pi; y_1i, ..., y_qi), i = 1, ..., n, are formed, where n is the sample size.

    Primary statistical data processing. At this stage a statistical description of the parameters under consideration is formed:

    a) construction and analysis of statistical dependencies;

    b) correlation analysis, designed to assess the significance of the influence of the factors (X1, ..., Xp) on the response Y;

    c) analysis of variance, used to assess the influence on the response Y of non-quantitative factors (X1, ..., Xp) in order to select the most important among them;

    d) regression analysis, designed to determine the analytical dependence of the response Y on the quantitative factors X.

    Interpretation of the results in terms of the problem posed /13/. (Steps b)-d) are illustrated by the sketch below.)

    Table 3.1 shows the statistical methods used to solve analytical problems. The corresponding cells of the table contain the frequencies of applying statistical methods:

    Mark "-" - the method is not applied;

    "+" Mark - the method is applied;

    "++" label - the method is widely used;

    The "+++" label - the application of the method is of particular interest / 14 /.

    Analysis of variance, like the Student's t-test, allows you to assess the differences between sample means; however, unlike the t-test, there are no restrictions on the number of compared means. Thus, instead of asking the question of the difference between the two sample means, one can assess whether two, three, four, five, or k means differ.

    Analysis of variance allows you to deal with two or more independent variables (signs, factors) simultaneously, evaluating not only the effect of each of them separately, but also the effects of interaction between them / 15 /.


Table 3.1 - Application of statistical methods in solving analytical problems

Columns of the table (groups of methods): descriptive statistics methods; methods for testing statistical hypotheses; regression analysis methods; analysis of variance; multivariate analysis methods; discriminant analysis methods; cluster analysis methods; survival analysis methods; methods of analysis and forecasting of time series.

Rows of the table (analytical problems arising in business, finance and management): horizontal (temporal) analysis; vertical (structural) analysis; trend analysis and forecasting; analysis of relative indicators; comparative (spatial) analysis; factor analysis.

The frequency marks filling the cells of the table are not reproduced here.

    The Pareto principle is applicable to most complex systems, according to which 20% of factors determine the properties of the system by 80%. Therefore, the primary task of the researcher of the simulation model is to filter out insignificant factors, which makes it possible to reduce the dimension of the model optimization problem.

Analysis of variance evaluates the deviations of the observations from the overall mean. The variation is then broken down into parts, each of which has its own cause. The residual part of the variation that cannot be associated with the conditions of the experiment is considered its random error. To confirm significance, a special test is used - the F statistic.

    ANOVA determines if there is an effect. Regression analysis allows you to predict the response (the value of the objective function) at some point in the parameter space. The immediate task of the regression analysis is to estimate the regression coefficients / 16 /.

Too large sample sizes make it difficult to perform statistical analyses, so it makes sense to reduce the sample size.

    Using analysis of variance, you can identify the significance of the influence of various factors on the variable under study. If the influence of a factor turns out to be insignificant, then this factor can be excluded from further processing.

    Macroeconometricians must be able to solve four logically different problems:

    Description of the data;

    Macroeconomic forecast;

    Structural Inference;

    Policy analysis.

    Describing data means describing the properties of one or more time series and communicating these properties to a wide range of economists. Macroeconomic forecasting means predicting the course of an economy, usually for two to three years or less (mainly because forecasting over longer horizons is too difficult). Structural inference means testing whether the macroeconomic data is consistent with a particular economic theory. Macroeconometric policy analysis takes place in several directions: on the one hand, it assesses the impact on the economy of a hypothetical change in policy instruments (for example, the tax rate or short-term interest rate), on the other hand, assesses the impact of a change in policy rules (for example, the transition to a new monetary policy regime). An empirical macroeconomic research project may include one or more of these four objectives. Each problem should be solved in such a way that correlations between time series are taken into account.

In the 1970s, these problems were solved using a variety of methods which, evaluated from a modern point of view, were inadequate for several reasons. The dynamics of an individual series were described using univariate time series models, and the joint dynamics of two series using spectral analysis. However, there was no generally accepted language suitable for systematically describing the joint dynamic properties of several time series. Economic forecasts were made either with simplified autoregressive moving average (ARMA) models or with the large structural econometric models popular at the time. Structural inference was based either on small single-equation models or on large models identified by poorly justified exclusion restrictions, and it usually did not include expectations. Policy analysis based on structural models depended on these identifying assumptions.

Finally, the price surge of the 1970s was viewed by many as a major failure of the large models then being used to make policy recommendations. The time was therefore right for the emergence of a new macroeconometric construction that could address these many problems.

In 1980, such a construction was created: vector autoregression (VAR). At first glance, a VAR is nothing more than the generalization of univariate autoregression to the multivariate case, and each equation in a VAR is nothing more than an ordinary least squares regression of one variable on lagged values of itself and of the other variables in the VAR. But this seemingly simple tool made it possible to capture the rich dynamics of multivariate time series systematically and internally consistently, and the statistical toolkit accompanying VARs turned out to be convenient and, very importantly, easy to interpret.

    There are three different VAR models:

Reduced form VAR;

    Recursive VAR;

    Structural VAR.

    All three are dynamic linear models that connect the current and past values ​​of the vector Y t of an n-dimensional time series. The reduced form and recursive VARs are statistical models that do not use any economic considerations other than the choice of variables. These VARs are used to describe data and forecast. Structural VAR includes constraints derived from macroeconomic theory, and this VAR is used for structural inference and policy analysis.

The reduced form VAR expresses Yt as a distributed lag of past values plus a serially uncorrelated error term; that is, it generalizes univariate autoregression to the vector case. Mathematically, the reduced form of the VAR model is a system of n equations that can be written in matrix form as follows:

Yt = c + A1 Yt-1 + A2 Yt-2 + ... + Ap Yt-p + εt,    (17)

where c is an n x 1 vector of constants;

A1, A2, ..., Ap are n x n matrices of coefficients;

εt is an n x 1 vector of serially uncorrelated errors, which are assumed to have zero mean and covariance matrix Σ.

The errors εt in (17) are the unexpected dynamics in Yt remaining after the linear distributed lag of past values has been taken into account.

It is easy to estimate the parameters of the reduced VAR form. Each equation contains the same regressors (Yt-1, ..., Yt-p), and there are no cross-equation restrictions. Thus, efficient estimation (full-information maximum likelihood) reduces to ordinary OLS applied to each equation separately. The error covariance matrix can be consistently estimated by the sample covariance matrix of the OLS residuals.

    The only subtlety is to determine the lag length p, but this can be done using an information criterion such as AIC or BIC.
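As an illustration of the estimation procedure just described, here is a minimal sketch using the VAR implementation in Python's statsmodels; the series names and the simulated data are assumptions made for the example, not taken from the text:

import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
T = 200
# Simulate two loosely related series (illustrative only).
e = rng.normal(size=(T, 2))
y = np.zeros((T, 2))
for t in range(1, T):
    y[t, 0] = 0.5 * y[t - 1, 0] + 0.1 * y[t - 1, 1] + e[t, 0]
    y[t, 1] = 0.2 * y[t - 1, 0] + 0.4 * y[t - 1, 1] + e[t, 1]
data = pd.DataFrame(y, columns=["series1", "series2"])

model = VAR(data)
p = model.select_order(maxlags=8).aic   # lag length chosen by the AIC criterion
results = model.fit(p)                  # equation-by-equation OLS
print(results.summary())
print(results.sigma_u)                  # estimated error covariance matrix

Each equation is fitted by ordinary OLS, exactly as the text indicates, and the lag length is chosen by an information criterion.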

At the level of matrix equations, recursive and structural VARs look the same. These two VAR models explicitly take into account simultaneous interactions between the elements of Yt, which amounts to adding a simultaneous term to the right-hand side of equation (17). Accordingly, recursive and structural VARs are both represented in the following general form:

B0 Yt = c + B1 Yt-1 + ... + Bp Yt-p + εt,

where c is a vector of constants;

B0, ..., Bp are coefficient matrices;

εt is a vector of errors.

The presence of the matrix B0 in the equation means the possibility of simultaneous interaction between the n variables; that is, B0 allows these variables, which refer to the same moment in time, to be determined jointly.

Recursive VAR can be estimated in two ways. The recursive structure gives a set of recursive equations that can be estimated by OLS. An equivalent estimation method is to multiply the reduced form equations (17), considered as a system, on the left by a lower triangular matrix.

The method for estimating a structural VAR depends on how B0 is identified. The partial-information approach entails single-equation estimation techniques, such as two-stage least squares. The full-information approach entails estimation techniques for systems of equations, such as three-stage least squares.

It should be remembered that there are many different types of VARs. The reduced form of a VAR is unique. A given ordering of the variables in Yt corresponds to a single recursive VAR, but there are n! such orderings in total, i.e., n! different recursive VARs. The number of structural VARs - that is, of sets of assumptions identifying the simultaneous relationships between the variables - is limited only by the ingenuity of the researcher.

Since the matrices of estimated VAR coefficients are difficult to interpret directly, the results of VAR estimation are usually represented by certain functions of these matrices. One such statistic is the decomposition of the forecast error variance.

    Decompositions of the variance of the forecast error are calculated mainly for recursive or structural systems. This expansion of the variance shows how important the error in the j-th equation is in explaining unexpected changes in the i-th variable. When the VAR errors are uncorrelated by equations, the variance of the forecast error for h periods ahead can be written as the sum of the components resulting from each of these errors / 17 /.
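Continuing the sketch above, the forecast error variance decomposition can be obtained directly from the fitted statsmodels object (the 8-period horizon is an arbitrary illustrative choice):

# results is the fitted VAR from the sketch above
fevd = results.fevd(8)    # decomposition for 8 periods ahead
print(fevd.summary())     # share of each shock in each variable's forecast error

The decomposition rests on a Cholesky orthogonalization of the errors, so the ordering of the variables matters, as the discussion of recursive VARs suggests.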

    3.2 Factor analysis

    In modern statistics, factor analysis is understood as a set of methods that, on the basis of real-life connections of attributes (or objects), make it possible to identify latent generalizing characteristics of the organizational structure and the mechanism of development of the phenomena and processes under study.

The concept of latency in this definition is key. It means that the characteristics revealed by the methods of factor analysis are implicit. We deal with a set of elementary features Xj, whose interaction presupposes certain causes and special conditions, i.e., the existence of hidden factors. The latter are established by generalizing the elementary features and act as integrated characteristics, or features, of a higher level. Naturally, not only the elementary features Xj but also the observed objects Ni themselves can correlate; therefore, the search for latent factors is theoretically possible using both feature data and object data.

If the objects are characterized by a sufficiently large number of elementary features (m > 3), then another assumption is also logical: the existence of dense clusters of points (features) in the space of n objects. In this case, the new axes generalize not the features Xj but the objects ni, and the latent factors Fr are then recognized by the composition of the observed objects:

Fr = c1 n1 + c2 n2 + ... + cN nN,

where ci is the weight of the object ni in the factor Fr.

Depending on which type of correlation is investigated in factor analysis - between elementary features or between observed objects - two technical methods of data processing are distinguished, the R-technique and the Q-technique.

The R-technique is the analysis of data over m features; it yields r linear combinations (groups) of features: Fr = f(Xj), r = 1, ..., m. The analysis based on proximity (connection) data for the n observed objects is called the Q-technique and makes it possible to determine r linear combinations (groups) of objects: F = f(ni), i = 1, ..., N.

    At present, in practice, more than 90% of problems are solved using the R-technique.

    The set of methods for factor analysis is currently quite large, there are dozens of different approaches and data processing techniques. In order to be guided in research by the correct choice of methods, it is necessary to understand their features. Let's divide all methods of factor analysis into several classification groups:

Principal component method. Strictly speaking, it is not classified as factor analysis, although it has much in common with it. Two things are specific to it: first, in the course of the computational procedures all principal components are obtained at once, and their number initially equals the number of elementary features; second, the possibility of a complete decomposition of the dispersion of the elementary features is postulated, in other words, its full explanation through latent factors (generalized features).

Factor analysis methods. Here the dispersion of the elementary features is not explained in full; it is recognized that part of the dispersion remains unrecognized as characteristic. Factors are usually singled out sequentially: the first explains the largest share of the variation of the elementary features, then the second explains the largest share of the variance remaining after the first latent factor, then the third, and so on. The process of extracting factors can be interrupted at any step, if a decision is made that the proportion of explained variance is sufficient or in view of the interpretability of the latent factors.

    It is advisable to further divide the methods of factor analysis into two classes: simplified and modern approximating methods.

    Simple methods of factor analysis are mainly related to the initial theoretical development. They have limited capabilities in identifying latent factors and approximating factor solutions. These include:

    One-factor model. It allows identifying only one general latent and one characteristic factor. For possibly existing other latent factors, it is assumed that they are insignificant;

    Bifactor model. Allows the influence on the variation of elementary features not one, but several latent factors (usually two) and one characteristic factor;

Centroid method. Correlations between variables are considered as a bundle of vectors, and a latent factor is geometrically represented as a balancing vector passing through the center of this bundle. The method makes it possible to single out several latent and characteristic factors, and for the first time it becomes possible to relate the factor solution to the initial data, i.e., to solve the approximation problem in its simplest form.

Modern approximating methods often assume that an approximate solution has already been found in one of the ways above, and this solution is then optimized in subsequent steps. The methods are computationally demanding. They include:

    Group method. The solution is based on pre-selected groups of elementary features;

    Principal factor method. Closest to the method of principal components, the difference lies in the assumption of the existence of specific features;

The maximum likelihood method, the minimum residuals method, alpha-factor analysis, and canonical factor analysis - all optimizing methods.

    These methods make it possible to consistently improve previously found solutions based on the use of statistical techniques for evaluating a random variable or statistical criteria, and involve a large amount of laborious calculations. The most promising and convenient for working in this group is the maximum likelihood method.

The main task solved by the various methods of factor analysis, including the principal component method, is the compression of information: the transition from a set of values of m elementary features, with amount of information n x m, to a limited set of elements of the factor mapping matrix (m x r), or to the matrix of latent factor values for each observed object, of dimension n x r, where usually r < m.
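As a minimal illustration of this compression, the following sketch applies the principal component method with scikit-learn to a random data matrix (the data and the dimensions are purely illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))          # n = 100 objects, m = 6 elementary features
pca = PCA(n_components=2)              # keep r = 2 < m components
F = pca.fit_transform(X)               # n x r matrix of latent factor values
print(pca.components_.shape)           # (r, m): the factor mapping matrix
print(pca.explained_variance_ratio_)   # share of dispersion each component explains

The n x m data matrix is thus replaced by an n x r matrix of factor values plus an r x m mapping matrix, exactly the compression described above.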

    Factor analysis methods also make it possible to visualize the structure of the studied phenomena and processes, which means to determine their state and predict their development. Finally, the data of factor analysis provide the basis for identifying the object, i.e. solving the problem of image recognition.

    Factor analysis methods have properties that are very attractive for their use as part of other statistical methods, most often in correlation-regression analysis, cluster analysis, multivariate scaling, etc. / 18 /.

    3.3 Pairwise regression. Probabilistic nature of regression models.

If we consider the problem of analyzing food expenditures in groups with the same income, for example $10,000 (x), then x is a deterministic value. But Y, the share of this money spent on food, is random and can change from year to year. Therefore, for each i-th individual:

yi = α + β xi + εi,

where εi is a random error;

α and β are constants (theoretically), although they can vary from model to model.

Prerequisites of pairwise regression:

X and Y are linearly related;

X is a non-random variable with fixed values;

the errors ε are normally distributed, N(0, σ²);

the errors of different observations are independent of one another (in the classical model, cov(εi, εj) = 0 for i ≠ j).

    Figure 3.1 shows a pairwise regression model.

    Figure 3.1 - Pairwise regression model

    These premises describe the classic linear regression model.

If the error has a non-zero mean, the original model is equivalent to a model with a different intercept and a zero-mean error.

If these prerequisites are met, the OLS estimates a and b of the coefficients α and β are efficient linear unbiased estimates.

If we denote

Sxx = Σ(xi - x̄)²,

then the mathematical expectations and variances of the coefficients a and b are as follows:

E(a) = α, E(b) = β; Var(b) = σ²/Sxx; Var(a) = σ² Σxi²/(n Sxx).

The covariance of the coefficients:

Cov(a, b) = -σ² x̄/Sxx.

If the errors εi are normally distributed, then a and b are also distributed normally.

It follows that:

the variation of b is completely determined by the variation of ε;

the higher the variance of X, the better the estimate of β.

The residual variance is determined by the formula:

s² = Σ ei²/(n - 2),

where ei are the deviations of the observations from the fitted regression line. In this form, the variance of the deviations is an unbiased estimate of σ², and its square root is called the standard error of the regression; n - 2 can be interpreted as the number of degrees of freedom.

Analyzing the deviations from the regression line can provide a useful measure of how well the estimated regression reflects the actual data. A good regression explains a large proportion of the variance of Y; conversely, a bad regression does not track most of the variation in the original data. It is intuitively clear that any additional information will improve the model, that is, reduce the unexplained portion of the variation of Y. To analyze a regression model, the variance is decomposed into components and the coefficient of determination R² is computed.

The ratio of the two variances follows the F-distribution; that is, by testing the statistical significance of the difference between the variance explained by the model and the variance of the residuals, we can conclude whether R² is significant.

Testing the hypothesis that the variances of two samples are equal: if hypothesis H0 (the equality of the variances of the samples) is true, the ratio has an F-distribution with (m1, m2) = (n1 - 1, n2 - 1) degrees of freedom.

Having calculated the F-ratio as the ratio of the two variances and compared it with the tabulated value, we can draw a conclusion about the statistical significance of R² / 2 /, / 19 /.
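A minimal sketch of these computations in Python; the data are simulated, and the true coefficients 1.0 and 0.7 are arbitrary illustrative choices:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50
x = np.linspace(0, 10, n)                   # X: non-random, fixed values
y = 1.0 + 0.7 * x + rng.normal(0, 2.0, n)   # errors ~ N(0, sigma^2)

sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # OLS slope
a = y.mean() - b * x.mean()                         # OLS intercept

resid = y - (a + b * x)
s2 = np.sum(resid ** 2) / (n - 2)           # unbiased estimate of sigma^2
r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

F = r2 / ((1 - r2) / (n - 2))               # F-ratio testing the significance of R^2
p_value = stats.f.sf(F, 1, n - 2)
print(a, b, np.sqrt(s2), r2, F, p_value)

Here np.sqrt(s2) is the standard error of the regression with n - 2 degrees of freedom, and the F-ratio compares the explained variance with the residual variance.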

    Conclusion

    Modern applications of analysis of variance cover a wide range of problems in economics, biology, and technology and are usually interpreted in terms of the statistical theory of revealing systematic differences between the results of direct measurements performed under certain changing conditions.

Thanks to the automation of analysis of variance, a researcher can conduct various statistical studies using a computer, spending less time and effort on calculations. Many software packages currently implement the apparatus of analysis of variance.

    Most of the statistical methods are implemented in modern statistical software products. With the development of algorithmic programming languages, it became possible to create additional blocks for processing statistical data.

    Analysis of variance is a powerful modern statistical method for processing and analyzing experimental data in psychology, biology, medicine and other sciences. It is very closely related to a specific methodology for planning and conducting experimental research.

    Analysis of variance is used in all areas of scientific research where it is necessary to analyze the influence of various factors on the variable under study.

    Bibliography

1 Kremer N.Sh. Probability Theory and Mathematical Statistics. M.: Unity-Dana, 2002. 343 p.

2 Gmurman V.E. Probability Theory and Mathematical Statistics. M.: Higher School, 2003. 523 p.

4 www.conf.mitme.ru

5 www.pedklin.ru

6 www.webcenter.ru

7 www.infections.ru

8 www.encycl.yandex.ru

9 www.infosport.ru

10 www.medtrust.ru

11 www.flax.net.ru

12 www.jdc.org.il

13 www.big.spb.ru

14 www.bizcom.ru

15 Gusev A.N. Analysis of Variance in Experimental Psychology. M.: Educational-Methodical Collector "Psychology", 2000. 136 p.

17 www.econometrics.exponenta.ru

18 www.optimizer.by.ru

    ANOVA

    1. The concept of analysis of variance

ANOVA is the analysis of the variability of a trait under the influence of controlled variable factors. In the foreign literature, analysis of variance is often referred to as ANOVA, an abbreviation of Analysis of Variance.

The task of analysis of variance is to isolate, from the overall variability of the trait, variability of different kinds:

    a) variability due to the action of each of the investigated independent variables;

    b) variability due to the interaction of the studied independent variables;

    c) random variability due to all other unknown variables.

The variability due to the action of the studied variables and their interactions is compared with the random variability. The indicator of this ratio is Fisher's F criterion.

    The formula for calculating the criterion F includes estimates of variances, that is, the parameters of the distribution of a feature, therefore, criterion F is a parametric criterion.

The greater the share of the trait's variability that is due to the studied variables (factors) or their interaction, the higher the empirical value of the criterion.

The null hypothesis in analysis of variance states that the average values of the studied effective trait are the same in all gradations.

The alternative hypothesis states that the average values of the effective trait differ across the gradations of the studied factor.

Analysis of variance allows us to establish that the trait changes, but it does not indicate the direction of these changes.

    Let us begin our consideration of the analysis of variance with the simplest case, when the action is investigated only one variable (one factor).

    2. One-way analysis of variance for unrelated samples

    2.1. Method purpose

The method of univariate analysis of variance is used when changes in the effective trait are studied under the influence of changing conditions or gradations of a factor. In this version of the method, a different sample of subjects is exposed to each gradation of the factor. There must be at least three gradations of the factor. (There may be two, but in that case nonlinear dependencies cannot be established, and it seems more reasonable to use simpler methods.)

    A nonparametric version of this type of analysis is the Kruskal-Wallis H test.

    Hypotheses

    H 0: Differences between factor gradations (different conditions) are no more pronounced than random differences within each group.

    H 1: Differences between factor gradations (different conditions) are more pronounced than random differences within each group.

    2.2. Limitations of univariate analysis of variance for unrelated samples

    1. One-way analysis of variance requires at least three gradations of a factor and at least two subjects in each gradation.

2. The effective trait must be normally distributed in the studied sample.

    True, it is usually not indicated whether we are talking about the distribution of a feature in the entire surveyed sample or in that part of it that makes up the dispersion complex.

2.3. An example of solving the problem by the method of one-way analysis of variance for unrelated samples

Three different groups of six subjects each received lists of ten words. The first group was presented the words at a low speed (1 word per 5 seconds), the second group at a medium speed (1 word per 2 seconds), and the third group at a high speed (1 word per second). It was predicted that performance would depend on the speed of presentation of the words. The results are presented in Table 1.

Table 1 - Number of words reproduced

Columns of the table: No. of subject; low speed; medium speed; high speed; the bottom row gives the totals. (The individual values are not reproduced.)

H0: Differences in the volume of word reproduction between groups are no more pronounced than random differences within each group.

H1: Differences in the volume of word reproduction between groups are more pronounced than random differences within each group. Using the experimental values presented in Table 1, we establish some quantities that will be needed to calculate the F criterion.

The calculation of the basic values for one-way analysis of variance is presented in Table 2, and the workflow of univariate ANOVA for unrelated samples in Table 3. (The contents of Tables 2 and 3 are not reproduced.)

    The notation SS, which is often found in this and subsequent tables, is an abbreviation for "sum of squares". This abbreviation is most often used in translated sources.

SS fact means the variability of the trait due to the action of the factor under study;

SS total - the overall variability of the trait;

SS random - the variability due to unaccounted factors, the "random" or "residual" variability;

MS - the "mean square", the sum of squares divided by its number of degrees of freedom, i.e., the averaged value of the corresponding SS;

df - the number of degrees of freedom, which we denoted by the Greek letter ν when considering nonparametric criteria.

    Conclusion: H 0 is rejected. Accepted H 1. Differences in the volume of word reproduction between groups are more pronounced than random differences within each group (α = 0.05). So, the speed of presentation of words affects the volume of their reproduction.

    An example of solving the problem in Excel is presented below:

    Initial data:

Using the command Tools -> Data Analysis -> Anova: Single Factor, we get the following results:
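The same computation can be reproduced outside Excel; below is a minimal sketch using scipy's one-way ANOVA function. The recall scores are illustrative placeholders, since the original table data are not reproduced in the text:

from scipy.stats import f_oneway

low_speed    = [8, 7, 9, 5, 6, 7]      # words recalled per subject (hypothetical)
medium_speed = [7, 8, 5, 4, 6, 7]
high_speed   = [4, 5, 3, 6, 2, 4]

F, p = f_oneway(low_speed, medium_speed, high_speed)
print(F, p)    # reject H0 at significance level 0.05 if p < 0.05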

    A person can learn his abilities only by trying to apply them. (Seneca)

    ANOVA

    Introductory overview

    In this section, we will look at the basic methods, assumptions, and terminology of analysis of variance.

Note that in the English-language literature analysis of variance is usually called ANOVA (ANalysis Of VAriance). For brevity, below we will sometimes use the term ANOVA for ordinary analysis of variance and the term MANOVA for multivariate analysis of variance. In this section we will sequentially consider the main ideas of analysis of variance (ANOVA), analysis of covariance (ANCOVA), multivariate analysis of variance (MANOVA), and multivariate analysis of covariance (MANCOVA). After a brief discussion of the merits of contrast analysis and post hoc tests, we consider the assumptions on which analysis of variance is based. Towards the end of the section, the advantages of the multivariate approach to repeated measures analysis over the traditional univariate approach are explained.

    Key ideas

The purpose of analysis of variance. The main purpose of analysis of variance is to investigate the significance of the difference between means. Chapter 8 provides a brief introduction to the study of statistical significance. If you are merely comparing the means of two samples, analysis of variance will give the same result as the ordinary t-test for independent samples (if two independent groups of objects or observations are compared) or the t-test for dependent samples (if two variables are compared on the same set of objects or observations). If you are not familiar enough with these criteria, we recommend the introductory overview in chapter 9.

Where does the name ANOVA come from? It may seem strange that a procedure for comparing means is called analysis of variance. In fact, this is because when we examine the statistical significance of the difference between means, we are actually analyzing variances.

    Splitting the sum of squares

For a sample of size n, the sample variance is calculated as the sum of squared deviations from the sample mean divided by n - 1 (the sample size minus one). Thus, for a fixed sample size n, the variance is a function of the sum of squares (of deviations), denoted for brevity SS (from the English Sum of Squares). Analysis of variance is based on dividing (or splitting) the variance into parts. Consider the following dataset: group 1 - the values 2, 3, 1; group 2 - the values 6, 7, 5.

The means of the two groups are significantly different (2 and 6, respectively). The sum of squared deviations within each group is 2; adding them, we get 4. If we now repeat these calculations ignoring group membership, that is, calculate SS from the total mean of the two samples together, we get 28. In other words, the variance (sum of squares) based on the within-group variability yields much smaller values than that calculated from the total variability (relative to the total mean). The reason, obviously, lies in the significant difference between the means, and this difference between the means explains the difference between the sums of squares. Indeed, if we use the ANOVA module to analyze these data, the following results are obtained:

As seen from the table, the total sum of squares SS = 28 is split into the sum of squares due to within-group variability (2 + 2 = 4; see the second row of the table) and the sum of squares due to the difference in means (28 - (2 + 2) = 24; see the first row of the table).
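The split can be verified directly; a minimal sketch in Python for the two groups above:

import numpy as np

g1 = np.array([2, 3, 1])
g2 = np.array([6, 7, 5])
both = np.concatenate([g1, g2])

ss_within = ((g1 - g1.mean()) ** 2).sum() + ((g2 - g2.mean()) ** 2).sum()
ss_total = ((both - both.mean()) ** 2).sum()
ss_between = ss_total - ss_within
print(ss_within, ss_between, ss_total)   # 4.0 24.0 28.0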

SS error and SS effect. The within-group variability (SS error) is usually called the error variance. This means that it usually cannot be predicted or explained in the experiment. On the other hand, SS effect (the between-group variability) can be explained by the difference between the means of the studied groups. In other words, belonging to a particular group explains the between-group variability, because we know that the groups have different mean values.

Significance testing. The basic ideas of statistical significance testing are discussed in the chapter Basic Concepts of Statistics (chapter 8), which also explains why many criteria use the ratio of explained to unexplained variance; analysis of variance itself is an example of such use. Significance testing in analysis of variance is based on comparing the variance due to between-group variability (called the mean square effect, or MS effect) with the variance due to within-group spread (called the mean square error, or MS error). If the null hypothesis (equality of the means in the two populations) is true, then only a relatively small difference in the sample means can be expected, due to random variability. Therefore, under the null hypothesis the within-group variance will practically coincide with the total variance calculated without regard to group membership. The between-group and within-group variances can be compared using the F criterion, which tests whether their ratio is actually significantly greater than 1. In the example above, the F criterion shows that the difference between the means is statistically significant.

    Basic logic of analysis of variance. Summing up, we can say that the purpose of ANOVA is to test the statistical significance of the difference between the means (for groups or variables). This check is done using variance analysis, i.e. by dividing the total variance (variation) into parts, one of which is due to random error (that is, intragroup variability), and the second is associated with the difference in mean values. The last component of the variance is then used to analyze the statistical significance of the difference between the means. If this difference is significant, the null hypothesis is rejected and an alternative hypothesis about the existence of a difference between the means is accepted.

    Dependent and independent variables. Variables whose values ​​are determined using measurements during the experiment (for example, the score gained during testing) are called dependent variables. Variables that can be controlled in the experiment (for example, teaching methods or other criteria that allow you to divide observations into groups) are called factors or independent variables. These concepts are described in more detail in the chapter Basic concepts of statistics(chapter 8).

Multifactor ANOVA

    In the simple example above, you could immediately compute the t-test for independent samples using the appropriate module option Basic statistics and tables. The obtained results, naturally, coincide with the results of the analysis of variance. However, analysis of variance contains flexible and powerful technical tools that can be used for much more complex research.

Many factors. The world is inherently complex and multidimensional; situations in which a phenomenon is fully described by a single variable are extremely rare. For example, if we are trying to learn how to grow large tomatoes, we must consider factors related to the genetic structure of the plants, the soil type, light, temperature, etc. Thus, a typical experiment involves many factors. The main reason why analysis of variance is preferable to repeated comparisons of two samples at different factor levels using the t-criterion is that analysis of variance is more efficient and, for small samples, more informative.

Controlling for factors. Suppose that in the above two-sample example we add another factor: Gender. Let each group consist of 3 men and 3 women. The design of this experiment can be presented in the form of a 2 x 2 table:

            Experiment Group 1    Experiment Group 2
Men         2                     6
            3                     7
            1                     5
Mean        2                     6
Women       4                     8
            5                     9
            3                     7
Mean        4                     8

    Before doing the calculations, you will notice that in this example, the total variance has at least three sources:

    (1) random error (intragroup variance),

    (2) treatment variability, and

    (3) variability due to the sex of the objects of observation.

(Note that there is another possible source of variability - the interaction of factors - which we will discuss later.) What happens if we do not include Gender as a factor in the analysis and calculate the ordinary t-criterion? If we compute the sums of squares ignoring Gender (i.e., pooling objects of different genders into one group when calculating the within-group variance, thus obtaining a sum of squares SS = 10 for each group and a total sum of squares SS = 10 + 10 = 20), we get a larger within-group variance than in the more accurate analysis with an additional division into subgroups by Gender (where the within-subgroup sums of squares are each equal to 2, and the total within-group sum of squares is SS = 2 + 2 + 2 + 2 = 8). The difference arises because the mean for men is less than the mean for women, and this difference in means inflates the total within-group variability when gender is ignored. Controlling the error variance increases the sensitivity (power) of the test.

This example shows another advantage of analysis of variance over the ordinary two-sample t-criterion. Analysis of variance allows each factor to be studied while controlling the values of the other factors; this, in fact, is the main reason for its greater statistical power (smaller samples suffice to obtain meaningful results). For this reason, analysis of variance even on small samples gives statistically more significant results than the simple t-criterion.
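A minimal sketch of the 2 x 2 analysis above in Python with statsmodels, using the values from the table; including Gender as a factor removes its contribution from the error term:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "score":  [2, 3, 1, 4, 5, 3, 6, 7, 5, 8, 9, 7],
    "group":  ["g1"] * 6 + ["g2"] * 6,
    "gender": ["m", "m", "m", "f", "f", "f"] * 2,
})

model = ols("score ~ C(group) + C(gender) + C(group):C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # SS for group, gender, interaction, residual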

    Interaction effects

There is another advantage of analysis of variance over the ordinary t-criterion: it allows interactions between factors to be detected and therefore makes it possible to study more complex models. To illustrate, consider another example.

Main effects, pairwise (two-factor) interactions. Suppose there are two groups of students: psychologically, the students of the first group are set on accomplishing their tasks and are more purposeful, while the second group consists of lazier students. We split each group in half at random and give one half of each group a hard task and the other half an easy one. We then measure how hard the students work on these tasks. The mean values for this (fictional) study are shown in the table:

What conclusion can be drawn from these results? Can we conclude that (1) students work harder on a difficult task, and (2) motivated students work harder than lazy ones? Neither statement reflects the systematic nature of the means shown in the table. Analyzing the results, it would be more correct to say that only purposeful students work harder on complex tasks, while only lazy students work harder on easy tasks. In other words, the character of the students and the complexity of the task interact in their effect on the effort expended. This is an example of a pairwise interaction between the character of the students and the complexity of the task. Note that statements 1 and 2 describe main effects.

Higher-order interactions. While pairwise interactions are still relatively easy to explain, higher-order interactions are much more difficult. Imagine that in the example considered above one more factor, Gender, is introduced, and we obtain the following table of means:

What conclusions can now be drawn from the results obtained? Plots of means make complex effects easy to interpret, and the ANOVA module allows these plots to be built with a single click.

    The image in the graphs below represents the studied three-factor interaction.

Looking at the graphs, we can say that for women there is an interaction between character and the difficulty of the task: motivated women work more intensely on a difficult task than on an easy one. For men the same interaction is reversed. It is clear that describing interactions between factors becomes increasingly involved.

General way of describing interactions. In general, an interaction between factors is described as a change in one effect under the influence of another. In the example considered above, the two-factor interaction can be described as a change in the main effect of the factor characterizing the complexity of the task under the influence of the factor describing the student's character. For the interaction of the three factors from the previous paragraph, we can say that the interaction of two factors (task complexity and the student's character) changes under the influence of Gender. If an interaction of four factors is studied, we can say that the interaction of three factors changes under the influence of the fourth factor, i.e., different types of interaction exist at different levels of the fourth factor. It turns out that in many areas an interaction of five or even more factors is not unusual.

    Complex plans

    Intergroup and intragroup plans (repeated measures plans)

When comparing two different groups, the t-criterion for independent samples is usually used (from the Basic Statistics and Tables module). When two variables are compared on the same set of objects (observations), the t-criterion for dependent samples is used. For analysis of variance, it is likewise important whether the samples are dependent or not. If there are repeated measurements of the same variables (under different conditions or at different times) on the same objects, one speaks of the presence of a repeated measures factor (also called a within-group factor, since the within-group sum of squares is calculated to assess its significance). If different groups of objects are compared (for example, men and women, or three strains of bacteria), the difference between the groups is described by a between-group factor. The methods of calculating the significance criteria for the two types of factors differ, but their general logic and interpretation are the same.

Between- and within-group designs. In many cases the experiment requires including both a between-group factor and a repeated measures factor in the design. For example, the math skills of female and male students are measured (Gender being the between-group factor) at the beginning and at the end of the semester. The two measurements of each student's skills form the within-group (repeated measures) factor. The interpretation of main effects and interactions is the same for between-group factors and repeated measures factors, and both types of factors can obviously interact with each other (for example, women acquire skills during the semester while men lose them).

    Incomplete (nested) plans

In many cases, the interaction effect can be neglected. This occurs either when it is known that there is no interaction effect in the population, or when a complete factorial design cannot be implemented. For example, the effect of four fuel additives on fuel consumption is studied. Four cars and four drivers are selected. A full factorial experiment requires that every combination of additive, driver, and car appear at least once. This requires at least 4 x 4 x 4 = 64 test groups, which is too time-consuming. Moreover, there is hardly any interaction between the driver and the fuel additive. With this in mind, a Latin square design can be used, which contains only 16 test groups (the four additives are designated A, B, C and D):

            Car 1   Car 2   Car 3   Car 4
Driver 1    A       B       C       D
Driver 2    B       C       D       A
Driver 3    C       D       A       B
Driver 4    D       A       B       C

Latin squares are described in most books on experimental design (e.g., Hays, 1988; Lindman, 1974; Milliken and Johnson, 1984; Winer, 1962) and will not be discussed in detail here. Note that Latin squares are incomplete designs, in which not all combinations of factor levels appear. For example, driver 1 drives car 1 only with additive A, and driver 3 drives car 1 only with additive C. The levels of the additive factor (A, B, C and D) are nested in the cells of the car x driver table, like eggs in nests. This mnemonic is useful for understanding the nature of nested designs. The ANOVA module provides simple ways to analyze designs of this type.
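The cyclic square shown above is easy to generate programmatically; a minimal sketch:

additives = ["A", "B", "C", "D"]
# Cell (driver, car) receives additive (driver + car) mod 4: each additive
# appears exactly once in every row and every column.
square = [[additives[(driver + car) % 4] for car in range(4)]
          for driver in range(4)]
for driver, row in enumerate(square, start=1):
    print("driver", driver, ":", " ".join(row))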

    Analysis of Covariance

    Main idea

In the section Key Ideas, the idea of controlling for factors was briefly discussed, along with how the inclusion of additional factors can reduce the sum of squared errors and increase the statistical power of the design. All this can be extended to variables with a continuous set of values. When such continuous variables are included in a design as factors, they are called covariates.

    Fixed covariates

Suppose we compare the math skills of two groups of students who studied from two different textbooks. Suppose also that intelligence quotient (IQ) data are available for each student. It can be assumed that IQ is related to math skills, and this information can be used. For each of the two groups, the correlation coefficient between IQ and math skills can be calculated. Using this correlation coefficient, it is possible to separate the share of group variance explained by the influence of IQ from the unexplained share of variance (see also Basic Concepts of Statistics (chapter 8) and Basic Statistics and Tables (chapter 9)). The remaining share of the variance is used in the analysis as the error variance. If there is a correlation between IQ and math skills, the error variance SS/(n - 1) can be significantly reduced.

The impact of covariates on the F-criterion. The F-criterion evaluates the statistical significance of the difference between group means by calculating the ratio of the between-group variance (MS effect) to the error variance (MS error). If MS error decreases, for example when the IQ factor is taken into account, the value of F increases.

Many covariates. The reasoning used above for a single covariate (IQ) extends easily to several covariates. For example, in addition to IQ one can include measurements of motivation, spatial thinking, etc. Instead of the ordinary correlation coefficient, the multiple correlation coefficient is then used.

When the value of the F-criterion decreases. Sometimes introducing covariates into an experimental design decreases the value of the F-criterion. This usually indicates that the covariates are correlated not only with the dependent variable (such as math skills) but also with the factors (such as the different textbooks). Suppose IQ is measured at the end of the semester, after two groups of students have spent almost a year studying from two different textbooks. Although the students were assigned to groups at random, it may turn out that the difference between the textbooks is so large that both IQ and math skills differ greatly between the groups. In this case the covariate reduces not only the error variance but also the between-group variance. In other words, after controlling for the difference in IQ between the groups, the difference in math skills is no longer significant. Put differently, after "eliminating" the influence of IQ, the influence of the textbook on the development of math skills is inadvertently eliminated as well.

Adjusted means. When a covariate is correlated with the between-group factor, adjusted means should be used, i.e., the means obtained after the influence of the covariates has been removed.

Interaction between covariates and factors. Just as interactions between factors can be investigated, so can interactions between covariates and between-group factors. Suppose one of the textbooks is especially suitable for smart students, while the other is boring for smart students and hard for less smart ones. The result is a positive correlation between IQ and learning outcome in the first group (smarter students, better outcomes) and a zero or slightly negative correlation in the second group (the smarter the student, the less likely he is to acquire math skills from the second textbook). Some studies discuss this situation as an example of a violation of the assumptions of analysis of covariance. However, since the ANOVA module uses the most general methods of analysis of covariance, one can, in particular, evaluate the statistical significance of the interaction between factors and covariates.
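A minimal ANCOVA sketch in Python with statsmodels; the variable names and the simulated data are assumptions made for illustration:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)
n = 40
iq = rng.normal(100, 15, n)
textbook = np.repeat(["book1", "book2"], n // 2)
# Skill depends on IQ plus a textbook effect plus noise (simulated).
skill = 0.4 * iq + np.where(textbook == "book1", 5.0, 0.0) + rng.normal(0, 3, n)
df = pd.DataFrame({"skill": skill, "iq": iq, "textbook": textbook})

ancova = ols("skill ~ C(textbook) + iq", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))  # textbook effect after removing IQ

Adding the interaction term C(textbook):iq to the formula would test whether the covariate interacts with the factor, as discussed above.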

    Variable covariates

    While fixed covariates are discussed quite often in textbooks, variable covariates are mentioned much less often. Usually, when conducting experiments with repeated measurements, we are interested in differences in measurements of the same quantities at different points in time. Namely, we are interested in the significance of these differences. If you measure covariates at the same time as measuring the dependent variables, you can calculate the correlation between the covariate and the dependent variable.

    For example, you can study your interest in math and math skills at the beginning and end of the semester. It would be interesting to test whether changes in interest in mathematics are correlated with changes in mathematics skills.

The ANOVA module in STATISTICA automatically evaluates the statistical significance of changes in covariates in those designs where this is possible.

    Multivariate Designs: Multivariate ANOVA and Covariance Analysis

    Intergroup plans

    All of the previous examples included only one dependent variable. When there are several dependent variables at the same time, only the computational complexity increases, and the content and basic principles do not change.

For example, suppose a study compares two different textbooks, examining students' success in physics and in mathematics at the same time. In this case there are two dependent variables, and we need to find out how the two textbooks affect them simultaneously. For this, multivariate analysis of variance (MANOVA) can be used. Instead of the univariate F criterion, a multivariate F test (Wilks' lambda test) is used, based on comparing the error covariance matrix with the between-group covariance matrix.

    If the dependent variables are correlated with each other, then this correlation should be taken into account when calculating the significance test. Obviously, if the same measurement is repeated twice, then nothing new can be obtained in this case. If a correlated dimension is added to an existing dimension, then some new information is obtained, but the new variable contains redundant information, which is reflected in the covariance between the variables.

Interpretation of results. If the overall multivariate criterion is significant, we can conclude that the corresponding effect (e.g., the type of textbook) is significant. However, the following questions arise: does the type of textbook improve only math skills, only physics skills, or both at once? In fact, after obtaining a significant multivariate criterion, the univariate F criterion is examined for each dependent variable. In other words, the dependent variables that contribute to the significance of the multivariate test are examined separately.
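A minimal sketch of such a multivariate test in Python using statsmodels' MANOVA; the data are simulated and purely illustrative, and the output includes Wilks' lambda for the textbook effect:

import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(4)
n = 30
textbook = np.repeat(["book1", "book2"], n // 2)
math = rng.normal(70, 5, n) + np.where(textbook == "book1", 4, 0)
physics = rng.normal(65, 5, n) + np.where(textbook == "book1", 3, 0)
df = pd.DataFrame({"math": math, "physics": physics, "textbook": textbook})

mv = MANOVA.from_formula("math + physics ~ textbook", data=df)
print(mv.mv_test())   # multivariate tests, including Wilks' lambda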

    Repeated measures plans

    If the mathematical and physical skills of students are measured at the beginning of the semester and at the end, then these are repeated measurements. The study of the criterion of significance in such plans is a logical development of the one-dimensional case. Note that multivariate ANOVA methods are also commonly used to investigate the significance of univariate repeated measures factors that have more than two levels. The corresponding applications will be discussed later in this section.

    Summation of variable values ​​and multivariate analysis of variance

Even experienced users of univariate and multivariate ANOVA are often puzzled at getting different results when multivariate ANOVA is applied to, say, three variables and when univariate ANOVA is applied to the sum of these three variables treated as a single variable.

The idea of summing variables is that each variable contains some true value, which is what is being investigated, plus random measurement error. Therefore, when the values of the variables are averaged, the measurement error will be closer to 0 across the measurements and the averaged values will be more reliable. In such a case, applying ANOVA to the sum of the variables is sensible and is a powerful technique. However, if the dependent variables are multidimensional in nature, summing their values is inappropriate.

For example, suppose the dependent variables consist of four indicators of success in society, each characterizing a quite independent side of human activity (for example, professional success, business success, family well-being, etc.). Adding these variables is like adding an apple and an orange: their sum would not be a suitable univariate measure. Therefore, such data must be treated as multivariate indicators in multivariate analysis of variance.

    Analysis of contrasts and post hoc tests

    Why are individual sets of averages being compared?

Usually, hypotheses about experimental data are not formulated simply in terms of main effects or interactions. An example is the following hypothesis: a certain textbook improves math skills only in male students, while another textbook is approximately equally effective for both genders but still less effective for men. It can be predicted that the effectiveness of the textbook interacts with the gender of the student. However, this prediction also concerns the nature of the interaction: significant gender differences are expected for students using one textbook, and practically gender-independent results for students using the other. Hypotheses of this type are usually explored using contrast analysis.

    Analysis of contrasts

In short, contrast analysis allows one to assess the statistical significance of certain linear combinations of complex effects. Contrast analysis is the main and indispensable element of any complex analysis of variance design; the ANOVA module has quite varied contrast analysis capabilities, allowing any types of comparisons of means to be isolated and analyzed.

    A posteriori comparisons

Sometimes processing an experiment reveals an unexpected effect. Although in most cases a creative researcher can explain any result, this provides no basis for further analysis or predictive estimates. This is one of the problems for which a posteriori criteria are used, that is, criteria that do not rely on a priori hypotheses. To illustrate, consider the following experiment. Suppose 100 cards carry the numbers from 1 to 10. Having dropped all the cards into a hat, we randomly draw 5 cards 20 times and calculate the mean for each sample (the mean of the numbers written on the cards). Can we expect to find two samples whose means differ significantly? This is very likely! Choosing the two samples with the maximum and the minimum mean, we can obtain a difference of means very different from the difference between, say, the means of the first two samples. Such a difference can be investigated, for example, using contrast analysis. Without going into details, there are several so-called a posteriori criteria based on exactly the first scenario (taking the extreme means of the 20 samples), that is, on choosing the most differing means in order to compare all the means in the design. These criteria are applied precisely so as not to obtain an artificial effect purely by chance, for example, to avoid declaring a significant difference between means where there is none. The ANOVA module offers a wide choice of such criteria. When unexpected results are encountered in an experiment with several groups, a posteriori procedures are used to investigate the statistical significance of the results obtained.

    Sum of Squares Type I, II, III, and IV

    Multivariate regression and analysis of variance

There is a close relationship between multivariate regression and analysis of variance. In both methods a linear model is investigated; in fact, almost all experimental designs can be analyzed by means of multivariate regression. Consider the following simple 2 x 2 between-group design.

Dv   A    B    AxB
 3   1    1     1
 4   1    1     1
 4   1   -1    -1
 5   1   -1    -1
 6  -1    1    -1
 6  -1    1    -1
 3  -1   -1     1
 2  -1   -1     1

Columns A and B contain the codes characterizing the levels of factors A and B; the AxB column contains the product of columns A and B. We can analyze these data by multivariate regression, with the variable Dv defined as the dependent variable and the variables from A through AxB as independent variables. The significance tests for the regression coefficients will coincide with the tests in the analysis of variance for the significance of the main effects of factors A and B and of the interaction effect AxB.
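A minimal sketch reproducing this with Python's statsmodels, using exactly the table above; the t-tests on the effect-coded regression coefficients correspond to the ANOVA tests of the main effects and the interaction:

import numpy as np
import statsmodels.api as sm

dv = np.array([3, 4, 4, 5, 6, 6, 3, 2])
a  = np.array([1, 1, 1, 1, -1, -1, -1, -1])
b  = np.array([1, 1, -1, -1, 1, 1, -1, -1])
X = sm.add_constant(np.column_stack([a, b, a * b]))   # constant, A, B, AxB

fit = sm.OLS(dv, X).fit()
print(fit.summary())   # t-tests on the A, B and AxB coefficients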

    Unbalanced and balanced plans

    If the correlation matrix is computed for all the variables, for example for the data shown above, one can see that the main effects of factors A and B and the A x B interaction effect are uncorrelated. This property of the effects is called orthogonality: the effects A and B are said to be orthogonal, or independent, of each other. If all the effects in a design are mutually orthogonal, as in the example above, the design is said to be balanced.

    Balanced designs have a "good quality": the computations required to analyze them are very simple. They all reduce to computing correlations between the effects and the dependent variable; since the effects are orthogonal, no partial correlations (as in a full multivariate regression) need to be computed. However, in real life designs are not always balanced.

    Consider real data with an unequal number of observations in the cells.

                      Factor B
                      B1           B2
    Factor A   A1     3            4, 5
               A2     6, 6, 7      2

    If these data are coded as above and the correlation matrix is computed for all variables, it turns out that the design factors are correlated with each other. The factors are no longer orthogonal, and such designs are called unbalanced. Note that in this example the correlation between the factors is entirely due to the different frequencies of 1 and -1 in the columns of the data matrix. In other words, experimental designs with unequal cell sizes (more precisely, disproportionate cell sizes) are unbalanced, which means that the main effects and interactions mix. In this case the full multivariate regression must be computed to assess the statistical significance of the effects. There are several strategies for doing so.
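
    A small numeric check (again a hypothetical Python sketch) makes the loss of orthogonality visible: coding the unbalanced table above and correlating the effect columns gives nonzero correlations.

```python
# Effect codes for the unbalanced table: A1/B1 -> +1, A2/B2 -> -1.
import numpy as np

A   = np.array([ 1,  1,  1, -1, -1, -1, -1])
B   = np.array([ 1, -1, -1,  1,  1,  1, -1])
AxB = A * B

print(np.corrcoef(A, B)[0, 1])    # nonzero: A and B are confounded
print(np.corrcoef(A, AxB)[0, 1])  # main effects also mix with the interaction
```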

    Sum of Squares Type I, II, III, and IV

    Sums of squares of Types I and III. To study the significance of each factor in a multivariate model, one can compute the partial correlation of each factor, given that all other factors are already included in the model. One can also enter the factors into the model step by step, fixing all factors already entered and ignoring all the rest. In general, this is the difference between Type III and Type I sums of squares (this terminology was introduced in SAS; see, e.g., SAS, 1982; for a detailed discussion see Searle, 1987, p. 461; Woodward, Bonett, and Brecht, 1990, p. 216; or Milliken and Johnson, 1984, p. 138).

    Type II sums of squares. The next, "intermediate" model-building strategy consists of: controlling for all main effects when testing the significance of an individual main effect; controlling for all main effects and all pairwise interactions when testing the significance of an individual pairwise interaction; controlling for all main effects, all pairwise interactions, and all three-factor interactions when testing an individual three-factor interaction; and so on. The sums of squares for effects computed in this way are called Type II sums of squares. Thus, Type II sums of squares control for all effects of the same and lower order, ignoring all higher-order effects.

    Type IV sums of squares. Finally, for some special designs with missing cells (incomplete designs), one can compute so-called Type IV sums of squares. This method is discussed later in connection with incomplete designs (designs with missing cells).
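
    The following sketch (assuming the statsmodels library as a stand-in for the module discussed here) shows how the first three strategies give different tables on the unbalanced 2 x 2 data above; sum-to-zero coding is used so that the Type III tests correspond to the usual hypotheses:

```python
# Type I, II, and III sums of squares on the unbalanced 2 x 2 table above.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "y": [3, 4, 5, 6, 6, 7, 2],
    "A": ["a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "B": ["b1", "b2", "b2", "b1", "b1", "b1", "b2"],
})
# Sum (effect) coding is needed for Type III to test the usual hypotheses.
model = smf.ols("y ~ C(A, Sum) * C(B, Sum)", data=df).fit()
for typ in (1, 2, 3):
    print(f"--- Type {typ} ---")
    print(anova_lm(model, typ=typ))
```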

    Interpreting the Type I, II, and III sums of squares hypotheses

    Type III sums of squares are the easiest to interpret. Recall that Type III sums of squares test effects after controlling for all other effects. For example, after finding a statistically significant Type III effect for factor A in the ANOVA module, one can say that there is a significant effect of factor A after all other effects (factors) have been introduced, and interpret the effect accordingly. Probably in 99% of all applications of analysis of variance this is the type of test the researcher is interested in. Type III sums of squares are computed in the ANOVA module by default, regardless of whether the Regression approach option is selected (the standard approaches adopted in the ANOVA module are discussed below).

    Significant effects obtained with Type I or Type II sums of squares are not as easy to interpret; they are best interpreted in the context of stepwise multivariate regression. If, using Type I sums of squares, the main effect of factor B turned out to be significant (after including factor A in the model, but before adding the A x B interaction), one can conclude that there is a significant main effect of factor B, provided there is no interaction between factors A and B. (If, using the Type III criterion, factor B also turned out to be significant, then one can conclude that there is a significant main effect of factor B after all the other factors and their interactions have been introduced into the model.)

    In terms of the marginal means, Type I and Type II hypotheses usually have no simple interpretation. In these cases one cannot interpret the significance of effects by looking at the marginal means alone; rather, the estimated means correspond to a complex hypothesis that combines means and sample sizes. For example, the Type II hypotheses for factor A in the simple 2 x 2 design discussed earlier would be (see Woodward, Bonett, and Brecht, 1990, p. 219):

    n_ij — the number of observations in a cell;

    u_ij — the mean value in a cell;

    n_.j — the marginal mean.

    Without going into details (see Milliken and Johnson, 1984, chapter 10, for more), it is clear that these are not simple hypotheses, and in most cases none of them is of particular interest to the researcher. However, there are cases where Type I hypotheses may be of interest.

    The default computational approach in the ANOVA module

    By default, if the Regression approach option is not checked, the ANOVA module uses the cell means model. It is characteristic of this model that the sums of squares for the various effects are computed for linear combinations of cell means. In a full factorial experiment this yields sums of squares identical to the Type III sums of squares discussed earlier. However, using the Scheduled comparisons option (in the ANOVA Results window), the user can test a hypothesis about any linear combination of weighted or unweighted cell means. Thus the user can test not only Type III hypotheses but hypotheses of any type (including Type IV). This general approach is especially useful when examining designs with missing cells (so-called incomplete designs).

    For full factorial designs this approach is also useful when one wants to analyze weighted marginal means. For example, suppose that in the simple 2 x 2 design considered earlier we want to compare the marginal means for factor A weighted by the levels of factor B. This is useful when the distribution of observations over the cells was not planned by the experimenter but arose randomly, and this randomness is reflected in the distribution of the numbers of observations over the levels of factor B in the population.

    For example, suppose one factor is the age of widows. The sample of respondents is divided into two groups: under 40 and over 40 (factor B). The second factor (factor A) indicates whether or not a widow received social support from a certain agency (some widows were randomly selected to receive support; the others served as controls). In this case the age distribution of widows in the sample reflects the actual age distribution of widows in the population. Assessing the effectiveness of the social support group for widows of all ages therefore corresponds to the weighted mean over the two age groups (with weights corresponding to the number of observations in each group).

    Scheduled comparisons

    Note that the contrast coefficients entered need not sum to 0 (zero); instead, the program automatically makes corrections so that the corresponding hypotheses are not confounded with the overall mean.

    To illustrate, let us return to the simple 2 x 2 design discussed earlier. Recall that the numbers of observations in the cells of this unbalanced design are 1, 2, 3, and 1. Suppose we want to compare the weighted marginal means for factor A (weighted by the frequencies of the levels of factor B). The contrast coefficients can be entered, for example, as the cell frequencies themselves:

    1 2 -3 -1

    Note that these coefficients do not sum to 0. The program rescales the coefficients so that they sum to 0 while their relative values are preserved, i.e.:

    1/3 2/3 -3/4 -1/4

    These contrasts will compare the weighted means for factor A.
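
    The rescaling rule described above can be sketched as follows (a hypothetical implementation of the described behavior, not the program's actual code): positive weights are rescaled to sum to +1 and negative weights to sum to -1, preserving their relative sizes.

```python
import numpy as np

def normalize_contrast(c):
    """Rescale positive weights to sum to +1 and negative ones to -1."""
    c = np.asarray(c, dtype=float)
    pos = c[c > 0].sum()
    neg = -c[c < 0].sum()
    out = np.where(c > 0, c / pos, c)
    out = np.where(c < 0, c / neg, out)
    return out

print(normalize_contrast([1, 2, -3, -1]))  # -> [ 0.333  0.667 -0.75  -0.25 ]
```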

    Hypotheses about the principal mean. The hypothesis that the unweighted principal mean is 0 can be tested with the coefficients:

    1/4 1/4 1/4 1/4

    The hypothesis that the weighted principal mean is 0 is tested with the coefficients:

    1/7 2/7 3/7 1/7

    In neither case does the program adjust the contrast coefficients.

    Analysis of designs with missing cells (incomplete designs)

    Factorial designs containing empty cells (combinations of cells with no observations) are called incomplete. In such designs some factors are usually not orthogonal and some interactions cannot be computed. There is no single best method for analyzing such designs.

    Regression approach

    In some older programs based on analyzing ANOVA designs by multivariate regression, the factors of an incomplete design are coded by default in the usual way (as if the design were complete). A multivariate regression analysis is then performed on these dummy-coded factors. Unfortunately, this method yields results that are very difficult, if not impossible, to interpret, since it is unclear how each effect enters the linear combination of means. Consider the following simple example.

                      Factor B
                      B1           B2
    Factor A   A1     3            4, 5
               A2     6, 6, 7      Missing

    If a multivariate regression of the form Dependent Variable = Constant + Factor A + Factor B is performed, then the hypotheses about the significance of factors A and B, expressed as linear combinations of means, look like this:

    Factor A: Cell A1, B1 = Cell A2, B1

    Factor B: Cell A1, B1 = Cell A1, B2

    This case is simple. In more complex designs it is impossible to determine what exactly is being tested.

    Cell means, the ANOVA approach, and Type IV hypotheses

    The approach recommended in the literature, and which seems preferable, is to study meaningful (in terms of the research objectives) a priori hypotheses about the means observed in the cells of the design. A detailed discussion of this approach can be found in Dodge (1985), Heiberger (1989), Milliken and Johnson (1984), Searle (1987), or Woodward, Bonett, and Brecht (1990). The sums of squares associated with hypotheses about linear combinations of means in incomplete designs, which test estimates of parts of the effects, are also called Type IV sums of squares.

    Automatic generation of Type IV hypotheses. When multivariate designs have a complex pattern of missing cells, it is desirable to define orthogonal (independent) hypotheses whose testing is equivalent to testing main effects or interactions. Algorithmic (computational) strategies (based on the pseudoinverse of the design matrix) have been developed to generate appropriate weights for such comparisons. Unfortunately, the resulting hypotheses are not uniquely determined: they depend on the order in which the effects were defined and rarely allow a simple interpretation. It is therefore recommended to study the pattern of missing cells carefully, then formulate the Type IV hypotheses that correspond most meaningfully to the objectives of the study, and then test those hypotheses using the Scheduled comparisons option in the Results window. The easiest way to specify comparisons in this case is to require entry of a vector of contrasts for all factors together in the Scheduled comparisons window. After the Scheduled comparisons dialog is invoked, all groups of the current design are shown and the missing ones are marked.

    Missing cells and tests of specific effects

    There are several types of designs in which the placement of the missing cells is not accidental but carefully planned, allowing a simple analysis of the main effects without affecting other effects. For example, when the required number of cells is not available, Latin squares are often used to estimate the main effects of several factors with a large number of levels. A 4 x 4 x 4 x 4 factorial design, for instance, requires 256 cells, whereas a Greco-Latin square lets you estimate the main effects with only 16 cells in the design (the chapter Experiment Planning, Volume IV, contains a detailed description of such designs). Incomplete designs in which the main effects (and some interactions) can be estimated by simple linear combinations of means are called balanced incomplete designs.

    In balanced designs, the standard (default) method of generating contrasts (weights) for the main effects and interactions produces an analysis of variance table in which the sums of squares for the respective effects do not mix with each other. The Specific effects option in the Results window generates the missing contrasts by writing zeros into the missing cells of the design. Immediately after the Specific effects option is requested for the hypothesis under study, a table of results with the actual weights appears. Note that in a balanced design the sums of squares of the respective effects are computed only if those effects are orthogonal (independent) to all other main effects and interactions. Otherwise the Scheduled comparisons option must be used to explore meaningful comparisons between means.

    Missing cells and combined effects / error terms

    If the Regression approach option in the start panel of the ANOVA module is not selected, the cell means model is used to compute the sums of squares for the effects (the default setting). If the design is not balanced, then, when non-orthogonal effects are combined (see the discussion above under Missing cells and tests of specific effects), one can obtain sums of squares consisting of non-orthogonal (overlapping) components. The results obtained this way are usually not interpretable. Therefore one must be very careful when choosing and implementing complex incomplete experimental designs.

    Many books discuss the various types of designs in detail (Dodge, 1985; Heiberger, 1989; Lindman, 1974; Milliken and Johnson, 1984; Searle, 1987; Woodward and Bonett, 1990), but this kind of information is beyond the scope of this textbook. Analyses of various types of designs are, however, demonstrated later in this section.

    Assumptions and the Effects of Breaking Assumptions

    Deviation from the Assumption of Normality of Distributions

    Suppose the dependent variable is measured on a numerical scale, and suppose also that it is normally distributed within each group. The ANOVA module contains a wide range of graphs and statistics for checking this assumption.

    Effects of violation. In general, the F test is quite robust to deviations from normality (see Lindman, 1974, for detailed results). If the kurtosis is greater than 0, the value of the F statistic can become very small, so the null hypothesis is accepted even though it may be false; the situation is reversed when the kurtosis is less than 0. Skewness of the distribution usually has little effect on the F statistic. If the number of observations in a cell is large enough, deviations from normality matter little, owing to the central limit theorem, by which the distribution of the mean is close to normal regardless of the original distribution. A detailed discussion of the robustness of the F statistic can be found in Box and Anderson (1955) or Lindman (1974).
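
    This robustness is easy to probe by simulation. The following sketch (using scipy.stats, an assumption of this illustration) estimates the Type I error rate of the F test under a strongly skewed (exponential) distribution; it stays near the nominal 0.05 level for moderate group sizes:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n, groups, reps = 30, 3, 5_000
rejections = 0
for _ in range(reps):
    # All groups drawn from the same skewed distribution: H0 is true.
    samples = [rng.exponential(scale=1.0, size=n) for _ in range(groups)]
    if f_oneway(*samples).pvalue < 0.05:
        rejections += 1
print(rejections / reps)  # close to 0.05 despite the non-normal data
```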

    Homogeneity of variance

    Assumptions. It is assumed that the variances of the different groups in the design are the same. This is called the homogeneity-of-variance assumption. Recall that at the beginning of this section, in describing the computation of the error sum of squares, we summed within each group. If the variances of two groups differ, adding them is not very natural and does not give an estimate of the common within-group variance (since in that case no common variance exists at all). The ANOVA/MANOVA module contains a large set of statistical criteria for detecting departures from the homogeneity-of-variance assumption.

    Effects of violation. Lindman (1974, p. 33) shows that the F test is quite robust to violation of the homogeneity-of-variance assumption (heterogeneity of variances; see also Box, 1954a, 1954b; Hsu, 1938).

    Special case: correlation between means and variances. There are cases when the F statistic can mislead, namely when the cell means in the design are correlated with the variances. The ANOVA module allows scatterplots of the variances or standard deviations against the means to be drawn to detect such a correlation. The reason this correlation is dangerous is the following. Imagine a design with 8 cells, 7 of which have almost identical means while in one cell the mean is much larger than in the others. Then the F test may detect a statistically significant effect. But suppose that in the cell with the large mean the variance is also much larger than in the rest, i.e. the means and variances of the cells are dependent (the larger the mean, the larger the variance). In this case the large mean is unreliable, since it may be caused by the large variance of the data. Yet the F statistic, based on the pooled within-cell variance, will capture the large mean, although tests based on the variance in each individual cell would not find all the differences between the means significant.

    Data of this kind (a large mean together with a large variance) are often encountered when there are outliers. One or two outlying observations strongly shift the mean and greatly increase the variance.
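
    The diagnostic described above can also be approximated numerically: compute the per-cell means and standard deviations and correlate them (a sketch with illustrative data, where one cell is inflated by an outlier):

```python
import numpy as np

cells = {
    "c1": [3.1, 2.9, 3.0], "c2": [3.0, 3.2, 2.8], "c3": [2.9, 3.1, 3.0],
    "c4": [3.2, 2.8, 3.0], "c5": [3.0, 3.0, 3.1], "c6": [2.8, 3.1, 2.9],
    "c7": [3.1, 3.0, 2.9], "c8": [2.0, 9.5, 4.0],  # inflated by an outlier
}
means = np.array([np.mean(v) for v in cells.values()])
sds = np.array([np.std(v, ddof=1) for v in cells.values()])
# A correlation near 1 warns that large means go with large spread.
print(np.corrcoef(means, sds)[0, 1])
```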

    Homogeneity of variance and covariance

    Assumptions. In multivariate designs, with multivariate dependent measures, the homogeneity-of-variance assumption described earlier also applies. However, since there are several dependent variables, it is also required that their cross-correlations (covariances) be homogeneous across all cells of the design. The ANOVA module offers various ways of testing these assumptions.

    Effects of violation. The multivariate analog of the F test is Wilks' lambda. Not much is known about the robustness of Wilks' lambda to violations of the above assumptions. However, since the interpretation of the ANOVA module's results is usually based on the significance of the univariate effects (after the significance of the overall criterion has been established), the discussion of robustness mainly concerns univariate analysis of variance. The significance of univariate effects should therefore be examined carefully.

    Special case: analysis of covariance. Particularly serious violations of variance/covariance homogeneity can occur when covariates are included in the design. In particular, if the correlation between the covariates and the dependent measures differs across cells of the design, misinterpretation of the results may follow. Remember that analysis of covariance essentially performs a regression analysis within each cell in order to partition out the part of the variance that corresponds to the covariate. The homogeneity-of-variance/covariance assumption implies that this regression analysis is carried out under the constraint that all the regression equations (slopes) are the same for all cells. If this does not hold, large errors can appear. The ANOVA module has several specific criteria for testing this assumption, and it is advisable to use them to make sure that the regression equations for the different cells are approximately the same.

    Sphericity and Compound Symmetry: Reasons for Using a Multivariate Approach to Repeated Measures in ANOVA

    In designs containing repeated measures factors with more than two levels, the application of univariate ANOVA requires additional assumptions: the compound symmetry assumption and the sphericity assumption. These assumptions are rarely satisfied (see below), so in recent years multivariate analysis of variance has gained popularity for such designs (both approaches are combined in the ANOVA module).

    Compound symmetry assumption. The compound symmetry assumption states that the variances (pooled within groups) and covariances (across groups) of the different repeated measurements are homogeneous (the same). This is a sufficient condition for the univariate F test for repeated measures to be valid (i.e., for the reported F values to be, on average, consistent with the F distribution); however, it is not a necessary condition.

    Sphericity assumption. The sphericity assumption is a necessary and sufficient condition for the F test to be valid. It states that, within groups, all observations are independent and identically distributed. The nature of these assumptions, and the effects of violating them, are usually not well described in books on ANOVA; they are described in the following paragraphs, which also show that the results of the univariate approach may differ from those of the multivariate approach and explain what that means.

    The need for independence of hypotheses. The general way of analyzing data in ANOVA is model fitting. If, relative to the model, there are some a priori hypotheses, the variance is partitioned to test them (tests of main effects and interactions). Computationally, this approach generates a set of contrasts (a set of comparisons of the design means). However, if the contrasts are not independent of one another, the partition of the variance becomes meaningless. For example, if two contrasts A and B are identical, and the corresponding part of the variance is extracted for each, then the same part is extracted twice. It is, for instance, silly and pointless to single out the two hypotheses "the mean of cell 1 is higher than the mean of cell 2" and "the mean of cell 1 is higher than the mean of cell 2": they are simply the same hypothesis. Thus, hypotheses must be independent, or orthogonal.

    Independent hypotheses for repeated measures. The general algorithm implemented in the ANOVA module tries to generate independent (orthogonal) contrasts for each effect. For a repeated measures factor these contrasts specify a set of hypotheses about the differences between the levels of the factor. However, if these differences are correlated within groups, the resulting contrasts are no longer independent. For example, in a course where learners are measured three times in one semester, it may happen that the change between measurements 1 and 2 is negatively correlated with the change between measurements 2 and 3: those who mastered most of the material between the 1st and 2nd measurements master a smaller part in the time between the 2nd and 3rd. Indeed, in most cases where ANOVA is applied to repeated measures, it can be assumed that changes across levels are correlated across subjects. When this happens, the compound symmetry and sphericity assumptions do not hold, and independent contrasts cannot be computed.

    Effects of violations and ways to correct them. When the compound symmetry or sphericity assumptions are not satisfied, ANOVA can produce erroneous results. Before multivariate procedures were sufficiently developed, several approximations were proposed to compensate for violations of these assumptions (see, for example, Greenhouse and Geisser, 1959, and Huynh and Feldt, 1970). These methods are still widely used today (which is why they are available in the ANOVA module).
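
    As an illustration of one such correction, the following sketch computes the Greenhouse-Geisser epsilon from the covariance matrix of the repeated measures, using the standard published formula (an assumption of this illustration, not the module's internal code); the univariate degrees of freedom are then multiplied by this epsilon:

```python
import numpy as np

def gg_epsilon(data):
    """Greenhouse-Geisser epsilon; data is a (subjects x k) matrix."""
    k = data.shape[1]
    S = np.cov(data, rowvar=False)        # k x k covariance of the measures
    C = np.eye(k) - np.ones((k, k)) / k   # centering matrix
    V = C @ S @ C                         # double-centered covariance
    return np.trace(V) ** 2 / ((k - 1) * np.trace(V @ V))

rng = np.random.default_rng(1)
data = rng.normal(size=(12, 4))           # 12 subjects, 4 repeated measures
print(gg_epsilon(data))                   # bounded: 1/(k-1) <= epsilon <= 1
```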

    The multivariate approach to repeated measures in ANOVA. In general, the problems of compound symmetry and sphericity arise because the sets of contrasts involved in testing the effects of repeated measures factors (with more than 2 levels) are not independent of one another. They need not be independent, however, if a multivariate test is used to check the statistical significance of two or more contrasts of the repeated measures factor simultaneously. This is why methods of multivariate analysis of variance have come to be used increasingly often to test the significance of univariate repeated measures factors with more than 2 levels. This approach is widely applicable because, in general, it requires neither the compound symmetry assumption nor the sphericity assumption.

    Cases in which the multivariate approach cannot be used. There are examples (designs) where the multivariate analysis of variance approach cannot be applied. These are usually cases with a small number of subjects in the design and many levels of the repeated measures factor: there may be too few observations to carry out the multivariate analysis. Suppose, for example, that there are 12 subjects and p = 4 repeated measures factors, each with k = 3 levels. Then the interaction of the 4 factors "spends" (k − 1)^p = 2^4 = 16 degrees of freedom. There are, however, only 12 subjects, so the multivariate test cannot be performed in this example. The ANOVA module detects such situations on its own and computes only the univariate criteria.

    Differences between univariate and multivariate results. If the study involves a large number of repeated measures, there can be cases where the univariate repeated measures ANOVA approach yields results very different from those of the multivariate approach. This means that the differences between the levels of the corresponding repeated measures are correlated across subjects. Sometimes this fact is of independent interest.

    Multivariate ANOVA and Structural Equation Modeling

    In recent years, structural equation modeling has gained popularity as an alternative to multivariate analysis of variance (see, for example, Bagozzi and Yi, 1989; Bagozzi, Yi, and Singh, 1991; Cole, Maxwell, Arvey, and Salas, 1993). This approach allows hypotheses to be tested not only about the means in different groups but also about the correlation matrices of the dependent variables. For example, one can relax the assumptions of homogeneity of variances and covariances and explicitly include error variances and covariances in the model for each group. The STATISTICA Structural Equation Modeling (SEPATH) module (see Volume III) allows such analyses.

    Analysis of variance is the analysis of the variability of an effective (dependent) trait under the influence of controlled variable factors. (In the foreign literature it is called ANOVA, "Analysis of Variance".)

    The effective trait is also called a dependent trait, and the influencing factors are called independent traits.

    A limitation of the method: the independent traits can be measured on a nominal, ordinal, or metric scale, whereas the dependent trait only on a metric scale. To carry out the analysis of variance, several gradations of the factor traits are distinguished, and all elements of the sample are grouped according to these gradations.

    Formulation of hypotheses in analysis of variance.

    Null hypothesis: "The average values ​​of the effective trait in all conditions of the factor (or gradations of the factor) are the same."

    Alternative hypothesis: "The average values ​​of the effective trait are different in different conditions of the factor."

    ANOVA can be divided into several categories depending on:

    on the number of considered independent factors;

    on the number of effective variables affected by factors;

    on the nature of the compared samples of values, how they were obtained, and whether they are related.

    When there is one factor whose influence is investigated, the analysis of variance is called one-way and falls into two types:

    - Analysis of unrelated (that is, different) samples. For example, one group of respondents solves a problem in silence, a second in a noisy room. (In this case, incidentally, the null hypothesis would read: "the average time for solving problems of this type is the same in silence and in a noisy room", that is, it does not depend on the noise factor.)

    - Analysis of related samples, that is, of two measurements carried out on the same group of respondents under different conditions. The same example: the first time the problem is solved in silence, the second time a similar problem is solved amid noise interference. (In practice such experiments should be approached with caution, since an unaccounted-for "learning" factor may come into play, whose influence the researcher risks attributing to the change in conditions, namely to the noise.)

    If the simultaneous influence of two or more factors is investigated, we are dealing with multivariate analysis of variance, which can also be subdivided by sample type.

    If the factors affect several variables at once, we speak of multivariate analysis. Multivariate analysis of variance is preferable to univariate analysis only when the dependent variables are not independent of one another and correlate with each other.

    In general, the task of analysis of variance is to single out three particular components of the overall variability of a trait:

      variability due to the action of each of the investigated independent variables (factors).

      variability due to the interaction of the studied independent variables.

      random variability due to all unaccounted for circumstances.

    To assess the variability due to the studied variables and their interaction, the ratio of the corresponding variability indicator to the random variability is computed. The indicator of this ratio is Fisher's F criterion.

    The more of the trait's variability that is due to the influencing factors or their interaction, the higher the empirical value of the F criterion.

    Since variance estimates enter the formula for computing the criterion, this method belongs to the category of parametric methods.
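
    The construction of this ratio is easy to show by hand for a one-way layout (a sketch with illustrative data; scipy is assumed only for the p-value):

```python
import numpy as np
from scipy.stats import f as f_dist

groups = [np.array([3.0, 4.0, 5.0]),
          np.array([6.0, 6.0, 7.0]),
          np.array([2.0, 3.0, 2.0])]

grand = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)

F = (ss_between / df_between) / (ss_within / df_within)  # Fisher's F ratio
p = f_dist.sf(F, df_between, df_within)                  # right-tail p-value
print(F, p)
```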

    The Kruskal-Wallis test is a nonparametric analogue of one-way analysis of variance for independent samples. It is similar to the Mann-Whitney test for two independent samples, except that it sums the ranks for each of the groups.

    In addition, the median test can be applied in the analysis of variance. With it, for each group one determines the number of observations exceeding the median computed over all groups and the number of observations below that median, after which a two-way contingency table is built.

    The Friedman test is a nonparametric generalization of the paired t-test for repeated measures samples when the number of compared variables is more than two.
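
    For reference, the scipy library (assumed here as a stand-in for the criteria just listed) exposes the first and last of these tests directly:

```python
from scipy.stats import kruskal, friedmanchisquare

# Kruskal-Wallis: one-way layout with independent samples.
g1, g2, g3 = [3, 4, 5], [6, 6, 7], [2, 3, 2]
print(kruskal(g1, g2, g3))

# Friedman: 4 subjects measured under 3 conditions (repeated measures).
c1, c2, c3 = [5, 6, 7, 5], [4, 5, 6, 4], [3, 4, 5, 3]
print(friedmanchisquare(c1, c2, c3))
```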

    Unlike correlation analysis, in analysis of variance, the researcher proceeds from the assumption that some variables act as influencing (called factors or independent variables), while others (effective indicators or dependent variables) are influenced by these factors. While this assumption underlies mathematical calculation procedures, it nevertheless requires caution in drawing conclusions about cause and effect.

    For example, if we hypothesize that an official's success at work depends on factor H (social courage in Cattell's sense), the reverse is also possible: the respondent's social courage may arise (or increase) precisely because of success at work; that is one side of it. On the other side, one should ask how "success" was measured. If it was based not on objective characteristics (the now-fashionable "sales volumes" and the like) but on expert assessments by colleagues, there is a chance that "success" was replaced by behavioral or personal characteristics (volitional or communicative traits, external manifestations of aggressiveness, etc.).


    1 Analysis of variance

    1.1 Basic concepts of ANOVA

    In the course of observing the object under study, the qualitative factors are varied arbitrarily or in a prescribed way. A specific realization of a factor (for example, a particular temperature regime or a chosen piece of equipment or material) is called a factor level, or a processing method. An ANOVA model with fixed factor levels is called model I; a model with random factors is called model II. By varying a factor, one can investigate its influence on the magnitude of the response. At present the general theory of analysis of variance has been developed for model I.

    Depending on the number of factors that determine the variation of the effective trait, analysis of variance is subdivided into univariate and multivariate.

    The main schemes for organizing the source data with two or more factors are:

    Cross-classification, typical of model I, in which each level of one factor is combined, in planning the experiment, with each gradation of the other factor;

    Hierarchical (nested) classification, characteristic of model II, in which each randomly chosen value of one factor corresponds to its own subset of values of the second factor.

    If the dependence of the response on qualitative and quantitative factors is simultaneously investigated, i.e. factors of mixed nature, then the analysis of covariance is used / 3 /.

    Thus, these models differ from each other in the method of choosing the factor levels, which, obviously, primarily affects the possibility of generalizing the obtained experimental results. For analysis of variance in one-way experiments, the difference between these two models is not so significant, but in multivariate analysis of variance it can turn out to be very important.

    When carrying out an analysis of variance, the following statistical assumptions must hold: regardless of the factor level, the response values have a normal (Gaussian) distribution and the same variance. This equality of variances is called homogeneity. Thus, changing the processing method affects only the position of the response random variable, which is characterized by its mean or median. Therefore all observations of the response belong to a shift family of normal distributions.

    The ANOVA technique is said to be "robust". This term, used by statisticians, means that the assumptions can be violated to some extent and the technique can nevertheless be used.

    When the law of distribution of the response values ​​is unknown, nonparametric (most often rank) methods of analysis are used.

    Analysis of variance is based on partitioning the variance into parts, or components. The variation caused by the factor underlying the grouping is characterized by the between-group variance σ². It is a measure of the variation of the group means around the overall mean and is determined by the formula

    σ² = ( Σ_{j=1..k} n_j (x̄_j − x̄)² ) / ( Σ_{j=1..k} n_j ),

    where k is the number of groups;

    n_j is the number of units in the j-th group;

    x̄_j is the partial (group) mean for the j-th group;

    x̄ is the overall mean for the whole set of units.

    The variation due to the influence of other factors is characterized in each group by the within-group variance σ_j²:

    σ_j² = ( Σ_{i=1..n_j} (x_ij − x̄_j)² ) / n_j .

    Between the total variance σ₀², the within-group variances σ_j², and the between-group variance σ², the variance addition rule holds: σ₀² = σ̄² + σ², where σ̄² = ( Σ_j n_j σ_j² ) / ( Σ_j n_j ) is the mean of the within-group variances.
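
    The decomposition can be verified numerically (a sketch with illustrative data; population variances, ddof=0, are used throughout):

```python
import numpy as np

groups = [np.array([3.0, 4.0, 5.0]), np.array([6.0, 6.0, 7.0, 9.0])]
allx = np.concatenate(groups)
n = len(allx)
grand = allx.mean()

# Mean of the within-group variances, weighted by group size.
within = sum(len(g) * g.var(ddof=0) for g in groups) / n
# Between-group variance of the group means around the grand mean.
between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / n

print(allx.var(ddof=0))   # total variance sigma_0^2
print(within + between)   # coincides with the total variance
```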

    1.2 One-way ANOVA

    The one-way (one-factor) ANOVA model has the form:

    x_ij = μ + F_i + ε_ij ,    (1)

    where x_ij is the value of the variable under study obtained at the i-th level of the factor (i = 1, 2, ..., t) with the j-th serial number (j = 1, 2, ..., n);

    μ is the overall mean;

    F_i is the effect due to the influence of the i-th level of the factor;

    ε_ij is the random component, or disturbance, caused by the influence of uncontrollable factors, i.e. by variation within an individual level.

    The main prerequisites for analysis of variance:

    The mathematical expectation of the disturbance ε_ij is zero for any i, i.e.

    M(ε_ij) = 0; (2)

    The disturbances ε_ij are mutually independent;

    The variance of the variable x_ij (or of the disturbance ε_ij) is constant for any i, j, i.e.

    D(ε_ij) = σ²; (3)

    The variable x_ij (or the disturbance ε_ij) has a normal distribution law; for the disturbance it is N(0; σ²).
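
    A short simulation sketch of model (1) under prerequisites (2)-(3) (with hypothetical parameter values) shows the level means estimating μ + F_i and the within-level variances estimating σ²:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 10.0                       # overall mean
F = np.array([-1.0, 0.0, 1.0])  # fixed level effects F_i (model I)
sigma, n = 2.0, 50              # common disturbance s.d., observations per level

# x_ij = mu + F_i + eps_ij, with eps_ij ~ N(0, sigma^2), independent.
x = np.array([mu + Fi + rng.normal(0.0, sigma, size=n) for Fi in F])

print(x.mean(axis=1))           # approximately mu + F_i = [9, 10, 11]
print(x.var(axis=1, ddof=1))    # each approximately sigma^2 = 4
```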

    The influence of factor levels can be either fixed or systematic (model I) or random (model II).

    Suppose, for example, that we need to find out whether there are significant differences between batches of some product with respect to a quality indicator, i.e. to check the influence on quality of a single factor, the product batch. If all batches of raw material are included in the study, the influence of the levels of this factor is systematic (model I), and the conclusions obtained apply only to the individual batches involved in the study. If only a randomly selected subset of batches is included, the influence of the factor is random (model II). In multifactor designs a mixed model III is possible, in which some factors have random levels while others are fixed.