
    Basics of data analysis. Defining regression. Types of regression by the number of factors

    1. The term "regression" was first introduced by the founder of biometrics, F. Galton (19th century), whose ideas were developed by his follower K. Pearson.

    Regression analysis is a method of statistical data processing that measures the relationship between one or more causes (factor indicators) and a consequence (the effective indicator).

    A feature is the main distinguishing property or characteristic of the phenomenon or process under study.

    The effective feature is the indicator under investigation.

    A factor feature is an indicator that affects the value of the effective feature.

    The purpose of regression analysis is to estimate the functional dependence of the average value of the effective feature (y) on the factor features (x1, x2, ..., xn), expressed as a regression equation

    y = f(x1, x2, ..., xn). (6.1)

    There are two types of regression: paired and multiple.

    Paired (simple) regression is an equation of the form:

    y = f(x). (6.2)

    In paired regression the effective feature is considered a function of one argument, i.e. of a single factor feature.

    Regression analysis includes the following steps:

    · defining the type of function;

    · determining the regression coefficients;

    · calculating the theoretical values of the effective indicator;

    · testing the statistical significance of the regression coefficients;

    · testing the statistical significance of the regression equation.

    Multiple regression is an equation of the form:

    y = f(x1, x2, ..., xn). (6.3)

    Here the effective feature is considered a function of several arguments, i.e. of many factor features.

    2. To determine the type of function correctly, the direction of the relationship must first be established on the basis of theoretical considerations.

    According to the direction of the relationship, regression is divided into:

    · direct regression, which arises when an increase or decrease in the independent quantity x is accompanied by a corresponding increase or decrease in the dependent quantity y;

    · inverse regression, which arises when an increase or decrease in the independent quantity x is accompanied, respectively, by a decrease or increase in the dependent quantity y.

    To characterize relationships, the following types of paired regression equations are used:

    · y = a + bx - linear;

    · y = e^(ax + b) - exponential;

    · y = a + b/x - hyperbolic;

    · y = a + b1·x + b2·x² - parabolic;

    · y = a·b^x - exponential, etc.,

    where a, b1, b2 are the coefficients (parameters) of the equation; y is the effective feature; x is the factor feature.

    3. Constructing the regression equation reduces to estimating its coefficients (parameters), for which the least squares method (OLS) is used.

    The least squares method yields parameter estimates for which the sum of squared deviations of the actual values y of the effective indicator from the theoretical values y_x is minimal, that is

    Σ(y − y_x)² → min.

    The parameters of the regression equation y = a + bx are estimated by least squares from the formulas

    b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²),  a = ȳ − b·x̄,

    where a is the free (intercept) coefficient and b is the regression coefficient, which shows by how much the effective feature y changes on average when the factor feature x changes by one unit of measurement.
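    As a rough numerical sketch of these formulas (Python with NumPy; the data vectors are invented purely for illustration):

```python
import numpy as np

# Invented sample data, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8])

n = len(x)
# Least-squares estimates for the paired regression y = a + b*x
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
a = y.mean() - b * x.mean()
print(f"a = {a:.3f}, b = {b:.3f}")

# Cross-check: np.polyfit returns [b, a] for a degree-1 fit
print(np.polyfit(x, y, 1))
```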

    4. To assess the statistical significance of the regression coefficients, Student's t-test is used.

    Scheme for testing the significance of the regression coefficients:

    1) H0: a = 0, b = 0 - the regression coefficients do not differ significantly from zero.

    H1: a ≠ 0, b ≠ 0 - the regression coefficients differ significantly from zero.

    2) p = 0.05 - significance level.

    3) Observed values of the test statistics: t_a = a / m_a, t_b = b / m_b,

    where m_a, m_b are the random (standard) errors of the coefficients:

    m_b = √( Σ(y − y_x)² / (n − 2) ) / √( Σ(x − x̄)² );  m_a = m_b · √( Σx² / n ). (6.7)

    4) t_tab(p; f),

    where f = n − k − 1 is the number of degrees of freedom (for the tabular value), n is the number of observations, and k is the number of factor features x in the equation.

    5) If |t_calc| > t_tab, then H0 is rejected, i.e. the coefficient is significant.

    If |t_calc| < t_tab, then H0 is accepted, i.e. the coefficient is insignificant.
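    A minimal sketch of this testing scheme (Python; SciPy is assumed to be available, with x and y as in the previous sketch):

```python
import numpy as np
from scipy import stats

def coefficient_t_test(x, y, alpha=0.05):
    """Student's t-test for the coefficients of the paired regression y = a + b*x."""
    n = len(x)
    b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    s2 = np.sum(resid**2) / (n - 2)                  # residual variance estimate
    m_b = np.sqrt(s2 / np.sum((x - x.mean())**2))    # random error of b
    m_a = m_b * np.sqrt(np.sum(x**2) / n)            # random error of a
    t_a, t_b = a / m_a, b / m_b
    t_tab = stats.t.ppf(1 - alpha / 2, df=n - 2)     # two-sided critical value, f = n - k - 1
    return {"t_a": t_a, "t_b": t_b, "t_tab": t_tab,
            "a_significant": abs(t_a) > t_tab,
            "b_significant": abs(t_b) > t_tab}
```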

    5. To check the correctness of the constructed regression equation as a whole, Fisher's F-test is applied.

    Scheme for testing the significance of the regression equation:

    1) H0: the regression equation is insignificant.

    H1: the regression equation is significant.

    2) p = 0.05 - significance level.

    3) F_calc = [ Σ(y_x − ȳ)² / k ] / [ Σ(y − y_x)² / (n − k − 1) ], (6.8)

    where n is the number of observations; k is the number of parameters for the variables x; y is the actual value of the effective feature; y_x is the theoretical value of the effective feature; r_xy is the paired correlation coefficient (for paired linear regression, equivalently F_calc = (r²_xy / (1 − r²_xy)) · (n − k − 1)/k).

    4) F_tab(p; f1; f2),

    where f1 = k and f2 = n − k − 1 are the numbers of degrees of freedom (for the tabular value).

    5) If F_calc > F_tab, the regression equation is chosen correctly and can be applied in practice.

    If F_calc < F_tab, the regression equation is chosen incorrectly.
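    A sketch of the F-test in the same spirit (Python; y is the vector of actual values, y_x the theoretical values from the fitted equation):

```python
import numpy as np
from scipy import stats

def regression_f_test(y, y_x, k=1, alpha=0.05):
    """Fisher's F-test for the significance of the regression equation.
    k is the number of parameters for the variables x (k = 1 for paired regression)."""
    n = len(y)
    f1, f2 = k, n - k - 1
    explained = np.sum((y_x - np.mean(y))**2) / f1   # variance explained by the regression
    residual = np.sum((y - y_x)**2) / f2             # residual variance
    f_calc = explained / residual
    f_tab = stats.f.ppf(1 - alpha, f1, f2)           # tabular value F_tab(p; f1; f2)
    return f_calc, f_tab, f_calc > f_tab             # True: equation is significant
```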

    6. The main indicator of the quality of a regression model is the coefficient of determination (R²).

    The coefficient of determination shows what proportion of the variation of the dependent variable y is accounted for in the analysis and caused by the factors included in it.

    The coefficient of determination (R²) takes values between 0 and 1. The regression equation is of good quality if R² ≥ 0.8.

    The coefficient of determination is equal to the square of the correlation coefficient, i.e. R² = (r_xy)².
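    A short numerical check of this identity (continuing the sketches above):

```python
import numpy as np

def r_squared(y, y_x):
    """Coefficient of determination from the variance decomposition."""
    return 1 - np.sum((y - y_x)**2) / np.sum((y - np.mean(y))**2)

# For paired linear regression this equals the squared correlation coefficient:
# r_squared(y, y_x) == np.corrcoef(x, y)[0, 1] ** 2  (up to rounding)
```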

    Example 6.1. Using the following data, construct and analyze the regression equation:

    Solution.

    1) Calculate the correlation coefficient: r_xy = 0.47. The relationship between the features is direct and moderate.

    2) Construct the paired linear regression equation.

    2.1) Set up a calculation table.

    x   y   xy   x²   y_x   (y − y_x)²
    55,89 47,54 65,70
    45,07 15,42 222,83
    54,85 34,19 8,11
    51,36 5,55 11,27
    42,28 45,16 13,84
    47,69 1,71 44,77
    45,86 9,87 192,05
    Sum 159,45 558,55
    Average 77519,6 22,78 79,79 2990,6

    From the least-squares formulas, b = 0.087 and a = 25.17.

    The paired linear regression equation: y_x = 25.17 + 0.087x.

    3) Find the theoretical values y_x by substituting the actual values of x into the regression equation.

    4) Plot the actual values y and the theoretical values y_x of the effective feature (Figure 6.1).

    5)-6) Check the significance of the regression coefficients and of the equation as a whole; the equation turns out to be statistically insignificant, which is explained by the moderate strength of the relationship (r_xy = 0.47) and the small number of observations.

    7) Calculate the coefficient of determination: R² = (0.47)² = 0.22. The constructed equation is of poor quality.

    Since the calculations in regression analysis are quite voluminous, it is recommended to use specialized software (Statistica 10, SPSS, etc.).
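    As a free alternative to these packages, the whole sequence of Example 6.1 can also be sketched in Python with SciPy; the data below are placeholders, since the example's original data table did not survive extraction:

```python
import numpy as np
from scipy import stats

# Placeholder data standing in for the example's lost data table
x = np.array([210.0, 240, 270, 300, 330, 360, 390])
y = np.array([42.0, 48, 45, 52, 50, 55, 49])

res = stats.linregress(x, y)
print(f"equation: y_x = {res.intercept:.2f} + {res.slope:.3f}x")
print(f"r = {res.rvalue:.2f}, R^2 = {res.rvalue**2:.2f}")
print(f"p-value of the slope (Student's t-test): {res.pvalue:.3f}")
```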

    Figure 6.2 shows a table with the results of regression analysis carried out using the Statistica 10 program.

    Figure 6.2. Results of regression analysis carried out using the program "Statistica 10"


    Regression analysis is a method for establishing an analytical expression for the stochastic relationship between the features under study. The regression equation shows how y changes on average when any of the x_i changes, and has the form

    y = f(x1, x2, ..., xn),

    where y is the dependent variable (there is always one);

    x_i are the independent variables (factors), of which there may be several.

    If there is only one independent variable, this is simple regression analysis. If there are several (n ≥ 2), the analysis is called multivariate.

    In the course of regression analysis, two main tasks are solved:

    • building the regression equation, i.e. finding the type of relationship between the final indicator and the independent factors x1, x2, ..., xn;

    • estimating the significance of the resulting equation, i.e. determining to what extent the selected factor features explain the variation of the feature y.

    Regression analysis is used mainly for planning, as well as for the development of a regulatory framework.

    Unlike correlation analysis, which only answers the question of whether a relationship exists between the analyzed features, regression analysis also gives a formalized expression of it. Moreover, whereas correlation analysis studies any mutual relationship between factors, regression analysis studies one-sided dependence, i.e. a relationship showing how a change in the factor features affects the effective feature.

    Regression analysis is one of the most developed methods of mathematical statistics. Strictly speaking, implementing regression analysis requires a number of special conditions to be met (in particular, x1, x2, ..., xn and y must be independent, normally distributed random variables with constant variances). In real life, strict compliance with the requirements of regression and correlation analysis is rare, but both methods are quite common in economic research. Dependencies in the economy can be not only direct but also inverse and nonlinear. A regression model can be built in the presence of any dependence; however, in multivariate analysis only linear models of the form

    y = a + b1·x1 + b2·x2 + ... + bn·xn

    are used.

    The regression equation is usually constructed by the least squares method, the essence of which is to minimize the sum of squared deviations of the actual values of the effective feature from its calculated values, i.e.

    S = Σ (y_j − ŷ_j)² → min,

    where m is the number of observations;

    ŷ_j = a + b1·x1j + b2·x2j + ... + bn·xnj is the calculated value of the effective feature.

    It is recommended to determine the regression coefficients using analytical software packages for a personal computer or a special financial calculator. In the simplest case, the regression coefficients of a one-factor linear regression equation y = a + bx can be found from the formulas

    b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²),  a = ȳ − b·x̄.
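    A minimal multiple-regression sketch using NumPy's least-squares solver (the observations are invented; any of the packages mentioned above would serve equally well):

```python
import numpy as np

# Invented observations: two factors x1, x2 and a response y;
# the first column of ones produces the free coefficient a
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 5.0],
              [1.0, 4.0, 2.0],
              [1.0, 3.0, 4.0],
              [1.0, 5.0, 1.0]])
y = np.array([20.0, 22.0, 25.0, 26.0, 24.0])

coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared residuals
a, b1, b2 = coef
print(f"y = {a:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")
```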

    Cluster Analysis

    Cluster analysis is one of the multivariate analysis methods intended for grouping (clustering) a population whose elements are characterized by many features. The values of each feature serve as the coordinates of each unit of the studied population in the multidimensional space of features. Each observation, characterized by the values of several indicators, can thus be represented as a point in the space of these indicators. The distance between points p and q with k coordinates is defined as

    d(p, q) = √( Σ_{i=1..k} (p_i − q_i)² ).

    The main criterion for clustering is that the differences between clusters should be more significant than the differences between observations assigned to the same cluster; i.e., in the multidimensional space the inequality

    r_{1,2} > r_internal

    must be observed, where r_{1,2} is the distance between clusters 1 and 2, and r_internal is the distance between observations within a cluster.

    Just like the regression analysis procedures, the clustering procedure is quite laborious, and it is advisable to perform it on a computer.
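    For instance, the distance computation and a basic clustering step might look like this (Python; SciPy assumed available, data invented):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Invented data: 6 observations described by k = 3 features (coordinates)
data = np.array([[1.0, 2.0, 1.5],
                 [1.2, 1.9, 1.4],
                 [5.0, 5.5, 6.0],
                 [5.2, 5.1, 5.8],
                 [9.0, 0.5, 0.2],
                 [8.8, 0.7, 0.4]])

# Pairwise Euclidean distances d(p, q) between all points
d = squareform(pdist(data, metric="euclidean"))
print(d.round(2))

# Hierarchical clustering into 3 clusters; within-cluster distances
# come out smaller than between-cluster distances, as the criterion requires
labels = fcluster(linkage(data, method="ward"), t=3, criterion="maxclust")
print(labels)
```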

    What is regression?

    Consider two continuous variables x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).

    Plotting the points on a two-dimensional scatter plot, we say that the variables have a linear relationship if the data are well approximated by a straight line.

    If we believe that y depends on x, and that changes in y are caused precisely by changes in x, we can determine the regression line (the regression of y on x), which best describes the linear relationship between these two variables.

    The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

    He showed that although tall fathers tend to have tall sons, the average height of the sons is smaller than that of their tall fathers. The average height of the sons "regressed" back toward the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

    Regression line

    The mathematical equation that estimates a simple (paired) linear regression line:

    Y = a + bx.

    x is called the independent variable or predictor.

    Y is the dependent or response variable; it is the value we expect for y (on average) if we know the value of x, i.e. the "predicted value of y".

    • a is the free term (intercept) of the estimated line: the value of Y when x = 0 (Fig. 1).
    • b is the slope (gradient) of the estimated line: the amount by which Y increases on average when x increases by one unit.
    • a and b are called the regression coefficients of the estimated line, although the term is often used for b alone.

    Paired linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

    Fig. 1. Linear regression line, showing the intercept a and the slope b (the amount by which Y increases when x increases by one unit).

    Least squares method

    We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

    The simplest method for determining the coefficients a and b is the least squares method (OLS).

    The fit is assessed by considering the residuals (the vertical distance of each point from the line: residual = observed y − predicted y; Fig. 2).

    The best fit line is chosen so that the sum of the squares of the residuals is minimal.

    Fig. 2. Linear regression line with the residuals shown (vertical dashed lines) for each point.

    Linear Regression Assumptions

    So, for each observed value, the residual is equal to the difference between the observed value of y and the corresponding predicted value ŷ. Each residual can be positive or negative.

    You can use the residuals to test the following assumptions underlying linear regression:

    • the relationship between x and y is linear;
    • the residuals are normally distributed with zero mean;
    • the variance of the residuals is constant.

    If the assumptions of linearity, normality and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, use a log transformation; see the sketch below).
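    For example, a log transformation of y can be tried when the relationship looks multiplicative; a sketch with invented data (SciPy assumed):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4, 1096.6, 2981.0])  # roughly e**x

raw = stats.linregress(x, y)             # linearity assumption clearly violated
logged = stats.linregress(x, np.log(y))  # refit on the transformed scale
print(f"raw R^2 = {raw.rvalue**2:.3f}, log-scale R^2 = {logged.rvalue**2:.3f}")
```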

    Outliers and influential points

    An "influential" observation, if omitted, changes one or more estimates of model parameters (ie, slope or intercept).

    An outlier (an observation that contradicts most of the values ​​in a dataset) can be an “influential” observation and can be well detected visually when viewed from a 2D scatter plot or a residual plot.

    Both for outliers and for "influential" observations (points), models are used, both with and without them, paying attention to the change in the estimate (regression coefficients).

    When performing analysis, do not automatically discard outliers or influence points, as simple ignoring can affect the results obtained. Always investigate and analyze the causes of these outliers.

    Linear regression hypothesis

    When constructing a linear regression, the null hypothesis is tested that the general slope of the regression line β is equal to zero.

    If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

    To test the null hypothesis that the true slope is zero, the following algorithm can be used.

    Calculate the test statistic t = b / SE(b), which follows a t-distribution with n − 2 degrees of freedom, where the standard error of the coefficient is

    SE(b) = s_res / √( Σ(x − x̄)² ),

    and s²_res = Σ(y − ŷ)² / (n − 2) is the estimate of the variance of the residuals.

    Usually, if the achieved significance level is p < 0.05, the null hypothesis is rejected.

    A 95% confidence interval for the slope is

    b ± t* · SE(b),

    where t* is the percentage point of the t-distribution with n − 2 degrees of freedom giving a two-sided probability of 0.05. This is the interval that contains the population slope with 95% probability.

    For large samples, t* can be approximated by 1.96 (that is, the test statistic tends to a normal distribution).

    Evaluating the quality of linear regression: the coefficient of determination R²

    Because of the linear relationship between y and x, we expect y to change as x changes; we call this the variation that is caused or explained by the regression. The residual variation should be as small as possible.

    If this is the case, then most of the variation will be explained by the regression, and the points will lie close to the regression line, i.e. the line fits the data well.

    The proportion of the total variance that is explained by the regression is called the coefficient of determination; it is usually expressed as a percentage and denoted R² (in paired linear regression it equals r², the square of the correlation coefficient). It allows a subjective assessment of the quality of the regression equation.

    The difference (100% − R²) is the percentage of variance that cannot be explained by the regression.

    There is no formal test for R²; we have to rely on subjective judgment to determine the quality of the regression line's fit.

    Applying a regression line to forecast

    You can use the regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

    We predict the mean of y for observations that have a particular value of x by substituting that value into the equation of the regression line.

    So, if we predict y at some value x0, we use this predicted value and its standard error to estimate a confidence interval for the true population mean (see the sketch below).

    Repeating this procedure for different values of x allows confidence limits to be built for the whole line. This is the band or region that contains the true line, for example, with 95% confidence.
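    A sketch of this prediction step with a confidence interval for the mean (Python; the statsmodels package is assumed, data invented):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9])

model = sm.OLS(y, sm.add_constant(x)).fit()

# Predict the mean of y at x = 4.5 (inside the observed range: no extrapolation)
new_X = np.column_stack([np.ones(1), [4.5]])
pred = model.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_lower, mean_ci_upper
```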

    Simple regression designs

    Simple regression designs contain one continuous predictor. If there are 3 cases with predictor values P of, for example, 7, 4, and 9, and the design includes the first-order effect of P, then the design matrix X will have the form

    X = | 1  7 |
        | 1  4 |
        | 1  9 |

    and the regression equation using P for X1 looks like

    Y = b0 + b1 P

    If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix will be raised to the second power:

    X = | 1  49 |
        | 1  16 |
        | 1  81 |

    and the equation takes the form

    Y = b0 + b1·P²

    Sigma-restricted and overparameterized coding methods do not apply to simple regression designs or to other designs containing only continuous predictors (since there are simply no categorical predictors). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values for the X variables; no recoding is performed. In addition, when describing regression designs, one can omit the design matrix X and work only with the regression equation.
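    A sketch of the two design matrices described above, for the predictor values P = 7, 4, 9 (NumPy):

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])

# First-order design: Y = b0 + b1*P
X_linear = np.column_stack([np.ones_like(P), P])

# Quadratic effect: the X1 column raised to the second power, Y = b0 + b1*P**2
X_quad = np.column_stack([np.ones_like(P), P**2])

print(X_linear)
print(X_quad)
```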

    Example: Simple Regression Analysis

    This example uses the data presented in the table:

    Fig. 3. Table of initial data.

    The data compare the 1960 and 1970 censuses in 30 randomly selected districts. District names are used as observation names. Information on each variable is presented below:

    Fig. 4. Table of variable specifications.

    Research task

    In this example we will analyze the correlates of the poverty rate, i.e. the variables that predict the percentage of families below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

    It can be hypothesized that population change and the percentage of families below the poverty line are related. It seems reasonable to expect that poverty leads to population outflow, so there should be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as the predictor variable.

    Viewing Results

    Regression coefficients

    Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

    At the intersection of the Pop_Chng row and the Param. column, the non-standardized coefficient for the regression of Pt_Poor on Pop_Chng is −0.40374. This means that for every unit decrease in population change there is a 0.40374 increase in the poverty rate. The upper and lower (default) 95% confidence limits for this non-standardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note also the standardized coefficient, which for simple regression designs is the Pearson correlation coefficient: it equals −.65, which means that for every standard-deviation decrease in population change there is a .65 standard-deviation increase in the poverty rate.

    Distribution of variables

    Correlation coefficients can become significantly overestimated or underestimated if there are large outliers in the data. Let us examine the distribution of the dependent variable Pt_Poor by district. To do this, let's build a histogram of the Pt_Poor variable.

    Fig. 6. Histogram of the Pt_Poor variable.

    As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two right-hand bars) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within the range."

    Fig. 7. Histogram of the Pt_Poor variable.

    This judgment is somewhat subjective. As a rule of thumb, outliers should be dealt with if the observation (or observations) does not fall within the interval mean ± 3 standard deviations. In this case, it is worth repeating the analysis with and without the outliers to make sure that they do not seriously affect the correlation between members of the population.

    Scatter plot

    If there is an a priori hypothesis about the relationship between the given variables, it is useful to check it on the corresponding scatterplot.

    Fig. 8. Scatterplot.

    The scatterplot shows a clear negative correlation (−.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e. with 95% probability the regression line passes between the two dashed curves.

    Significance criteria

    Fig. 9. Table of significance tests.

    The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

    Outcome

    This example showed how to analyze a simple regression design. Interpretations of the non-standardized and standardized regression coefficients were presented. The importance of studying the distribution of the dependent variable was discussed, and a technique for determining the direction and strength of the relationship between the predictor and the dependent variable was demonstrated.

    In statistical modeling, regression analysis is a set of methods used to estimate relationships between variables. It includes many techniques for modeling and analyzing several variables, where the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when one of the independent variables is varied while the other independent variables are held fixed.

    In all cases, the estimation target is a function of the independent variables, called the regression function. In regression analysis it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

    Regression Analysis Tasks

    This statistical research method is widely used for forecasting, where its use gives a significant advantage, but it can sometimes lead to illusions or false relationships, so it should be used carefully: correlation, for example, does not imply causation.

    A large number of methods have been developed for performing regression analysis, such as linear and ordinary least squares regression, which are parametric. Their essence is that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression allows its functions to lie in a certain set of functions, which can be infinite-dimensional.

    In practice, regression analysis depends on the form of the data-generating process and on how it relates to the regression approach. Since the true form of the data-generating process is usually unknown, regression analysis of data often depends to some extent on assumptions about that process. These assumptions are sometimes testable if sufficient data are available. Regression models are often useful even when the assumptions are moderately violated, although they may then not perform with maximum efficiency.

    In a narrower sense, regression can refer specifically to the estimation of continuous response variables, as opposed to discrete response variables used in classification. The case of a continuous output variable is also called metric regression to distinguish it from related problems.

    History

    The earliest form of regression is the well-known least squares method. It was published by Legendre in 1805 and Gauss in 1809. Legendre and Gauss applied the method to the problem of determining from astronomical observations the orbits of bodies around the Sun (mainly comets, but later also newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821, including a variant of the Gauss-Markov theorem.

    The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon: the heights of the descendants of tall ancestors tend to regress down toward the normal mean. For Galton, regression had only this biological meaning, but his work was later continued by Udny Yule and Karl Pearson and brought into a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was rejected by Fisher in works of 1922 and 1925: Fisher suggested that the conditional distribution of the response variable is Gaussian, but that the joint distribution need not be. In this respect, Fisher's formulation is closer to Gauss's of 1821. Before 1970, it sometimes took up to 24 hours to obtain the result of a regression analysis.

    Regression analysis methods continue to be an area of ​​active research. In recent decades, new methods have been developed for robust regression; regression with correlated responses; regression methods accommodating different types of missing data; nonparametric regression; Bayesian regression methods; regressions in which predictor variables are measured with error; regression with more predictors than observations; and causal inferences with regression.

    Regression models

    Regression analysis models include the following variables:

    • Unknown parameters, denoted beta, which can be a scalar or vector.
    • Independent variables, X.
    • Dependent variables, Y.

    In various fields of science where regression analysis is applied, different terms are used instead of dependent and independent variables, but in all cases the regression model assigns Y to a function of X and β.

    The approximation is usually written as E(Y | X) = F(X, β). To carry out regression analysis, the form of the function F must be determined. Sometimes it is based on knowledge of the relationship between Y and X that does not rely on the data. If such knowledge is not available, a flexible or convenient form of F is chosen.

    Dependent variable Y

    Suppose now that the vector of unknown parameters β has length k. To perform regression analysis, the user must provide information about the dependent variable Y:

    • If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be carried out: since the system of equations defining the regression model is underdetermined, there is not enough data to recover β.
    • If exactly N = k points are observed and the function F is linear, then the equation Y = F(X, β) can be solved exactly rather than approximately. This reduces to solving a set of N equations with N unknowns (the elements of β), which has a unique solution as long as the columns of X are linearly independent. If F is nonlinear, a solution may not exist, or many solutions may exist.
    • The most common situation is that N > k data points are observed. In this case, there is enough information in the data to estimate a unique value for β that best fits the data, and the regression model applied to the data can be viewed as an overdetermined system in β.

    In the latter case, regression analysis provides tools for:

    • Search for a solution for unknown parameters β, which will, for example, minimize the distance between the measured and predicted value of Y.
    • Under certain statistical assumptions, regression analysis uses excess information to provide statistical information about unknown β parameters and predicted values ​​of the dependent variable Y.

    Required number of independent measurements

    Consider a regression model that has three unknown parameters: β0, β1 and β2. Suppose the experimenter makes 10 measurements at the same value of the independent variable X. In this case, regression analysis does not yield a unique set of estimates: the best one can do is estimate the mean and standard deviation of the dependent variable Y. Similarly, by measuring at two different values of X, one can get enough data for a regression with two unknowns, but not with three or more.

    If the experimenter's measurements were made at three different values ​​of the independent variable of the vector X, then the regression analysis will provide a unique set of estimates for the three unknown parameters in β.

    In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX be invertible.
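    This requirement is easy to check numerically by computing the rank of XᵀX; the sketch below reproduces the repeated-measurements case from above (NumPy):

```python
import numpy as np

# 10 measurements all taken at the same value x = 3 for a three-parameter
# model with columns 1, x, x^2: X^T X is singular (rank 1 < 3)
x = np.full(10, 3.0)
X = np.column_stack([np.ones_like(x), x, x**2])
print(np.linalg.matrix_rank(X.T @ X))  # 1 -> no unique estimate of beta

# Measurements at three distinct values make X^T X invertible (rank 3)
x3 = np.repeat([1.0, 2.0, 3.0], 4)
X3 = np.column_stack([np.ones_like(x3), x3, x3**2])
print(np.linalg.matrix_rank(X3.T @ X3))  # 3 -> unique estimates exist
```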

    Statistical assumptions

    When the number of measurements N is greater than the number of unknown parameters k, and the measurement errors εᵢ are random, the excess information contained in the measurements is used for statistical inference about the unknown parameters. This excess of information is called the degrees of freedom of the regression.

    Underlying assumptions

    Classic assumptions for regression analysis include:

    • The sample is representative of the population for which inference and prediction are made.
    • The error is a random variable with zero mean conditional on the explanatory variables.
    • The explanatory variables are measured without error.
    • The predictors are linearly independent, i.e. it is not possible to express any predictor as a linear combination of the others.
    • The errors are uncorrelated, i.e. the error covariance matrix is diagonal, and each nonzero element is the variance of an error.
    • The variance of the errors is constant across observations (homoscedasticity). If not, weighted least squares or other methods can be used.

    Under these conditions the least squares estimator has the required properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent and efficient, especially within the class of linear estimators. It is important to note that real data rarely satisfy the conditions exactly; that is, the method is used even when the assumptions are not strictly correct. Deviation from the assumptions can sometimes be used as a measure of how useful the model is. Many of these assumptions can be relaxed in more advanced methods. Reports of statistical analyses typically include tests against the sample data and an assessment of the model's usefulness.

    In addition, variables in some cases refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions. Geographically weighted regression is one technique developed to deal with such data.

    The defining feature of linear regression is that the dependent variable Yᵢ is a linear combination of the parameters. For example, simple linear regression uses one independent variable, xᵢ, and two parameters, β0 and β1, to model n points.

    In multiple linear regression, there are several independent variables or their functions.

    Random sampling from a population yields sample estimates of the parameters of the linear regression model.

    The least squares method is the most popular in this respect. It yields parameter estimates that minimize the sum of the squared residuals. This kind of minimization (characteristic of linear regression) leads to a set of normal equations, a set of linear equations in the parameters, which are solved to obtain the parameter estimates.

    Assuming further that the population errors are normally distributed, the researcher can use these estimates of the standard errors to create confidence intervals and test hypotheses about the parameters.

    Nonlinear Regression Analysis

    When the function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications, which distinguish nonlinear least squares from linear least squares. Consequently, the results of a regression analysis using a nonlinear method are sometimes unpredictable.

    Calculation of power and sample size

    There are generally no agreed rules relating the number of observations to the number of explanatory variables in the model. One rule was proposed by Good and Hardin and looks like N = t^n, where N is the sample size, n is the number of independent variables, and t is the number of observations needed to reach the desired accuracy if the model had only one independent variable. For example, suppose a researcher builds a linear regression model using a dataset that contains 1000 patients (N). If the researcher decides that five observations (t) are needed to accurately determine a straight line, then the maximum number of independent variables the model can support is 4, since 5⁴ = 625 ≤ 1000 < 5⁵ = 3125.
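    The arithmetic of this rule of thumb is easy to reproduce (a sketch; the rule itself is only a rough heuristic):

```python
# Rule N = t**n: with N = 1000 patients and t = 5 observations per unit of
# accuracy, find the largest n such that 5**n <= 1000
N, t = 1000, 5
n = 0
while t ** (n + 1) <= N:
    n += 1
print(n)  # 4, matching the example in the text
```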

    Other methods

    Although the parameters of a regression model are usually estimated using the least squares method, there are other methods that are used much less frequently. For example, these are the following methods:

    • Bayesian methods (for example, Bayesian linear regression).
    • Percentage regression, used in situations where reducing percentage errors is considered more appropriate.
    • Least absolute deviations, which is more robust in the presence of outliers and leads to quantile regression.
    • Nonparametric regression, requiring a large number of observations and calculations.
    • Distance metric learning, where a meaningful distance metric is learned for a given input space.

    Software

    All major statistical software packages can perform least squares regression analysis. Simple linear regression and multiple regression analysis are available in some spreadsheet applications as well as on some calculators. Although many statistical software packages can perform various types of nonparametric and robust regression, these methods are less standardized; different software packages implement different methods. Specialized regression software has been developed for use in fields such as survey analysis and neuroimaging.

    Modern political science proceeds from the position of the relationship of all phenomena and processes in society. It is impossible to understand events and processes, predict and manage the phenomena of political life without studying the connections and dependencies that exist in the political sphere of the life of society. One of the most common objectives of policy research is to examine the relationship between certain observable variables. A whole class of statistical methods of analysis, united by the general name "regression analysis" (or, as it is also called, "correlation-regression analysis"), helps to solve this problem. However, if correlation analysis allows one to assess the strength of the relationship between two variables, then using regression analysis it is possible to determine the type of this relationship, to predict the dependence of the value of any variable on the value of another variable.

    First, let us recall what correlation is. Correlation is the most important special case of statistical relationship; it consists in the fact that different values of one variable correspond to different average values of another. As the value of the feature x changes, the average value of the feature y changes in a regular way, while in each individual case the feature y can (with different probabilities) take many different values.

    The appearance of the term "correlation" in statistics (and political science draws on the achievements of statistics, which is therefore a discipline allied to political science) is associated with the name of the English biologist and statistician Francis Galton, who in the 19th century proposed the theoretical foundations of correlation and regression analysis. The term "correlation" was known in science earlier. In particular, in paleontology it was applied as early as the 18th century by the French scientist Georges Cuvier, who introduced the so-called law of correlation, with the help of which the appearance of animals could be reconstructed from remains found in excavations.

    There is a well-known story associated with this scientist and his law of correlation. During a university celebration, students who had decided to play a trick on the famous professor dressed one of their number in a goat skin with horns and hooves. He climbed through the window of Cuvier's bedroom and shouted: "I'll eat you." The professor woke up, looked at the silhouette and replied: "If you have horns and hooves, then by the law of correlation you are a herbivore and cannot eat me. And for not knowing the law of correlation you get a failing grade." Then he turned over and fell asleep. A joke, but in this example we observe a special case of multiple correlation-regression analysis: the professor, proceeding from the known values of two observed traits (the presence of horns and hooves), derived, on the basis of the law of correlation, the average value of a third trait (the class to which the animal belongs: herbivore). In this case we are not talking about a specific value of this variable (the animal could take different values on the nominal scale: it could be a goat, a ram, or a bull...).

    Now let us turn to the term "regression". Strictly speaking, it is not connected with the meaning of the statistical problems that are solved by this method; an explanation of the term can only be given from the history of methods for studying relationships between features. One of the first examples of such research was the work of the statisticians F. Galton and K. Pearson, who tried to find a pattern relating the heights of fathers and of their children by two observable features (where X is the father's height and Y the child's height). In their research they confirmed the initial hypothesis that, on average, tall fathers have tall children, and the same principle applies to short fathers and their children. However, if the scientists had stopped there, their works would never have been mentioned in statistics textbooks. Within the already confirmed hypothesis they found another pattern: they showed that very tall fathers have children who are tall on average, but whose height does not differ much from that of children whose fathers, while above average, are not far from the average height. The same holds for fathers of very short stature (deviating from the average of the short group): their children, on average, did not differ in height from peers whose fathers were simply short. They called the function describing this pattern the regression function. After this study, all equations describing such functions and constructed in a similar way came to be called regression equations.

    Regression analysis is one of the methods of multivariate statistical data analysis, combining a set of statistical techniques designed to study or model relationships between one dependent and several (or one) independent variables. The dependent variable, by the tradition accepted in statistics, is called the response and is denoted Y. The independent variables are called predictors and are denoted X. In the course of the analysis, some variables will turn out to be weakly related to the response and will eventually be excluded; the remaining variables associated with the dependent one may also be called factors.

    Regression analysis makes it possible to predict the values of one or more variables from another variable (for example, the propensity for unconventional political behavior from the level of education) or from several variables. The calculations are performed on a computer. Drawing up a regression equation that measures the degree of dependence of the feature of interest on the factor features requires professional mathematician-programmers. Regression analysis can be invaluable in building predictive models of the development of a political situation, in assessing the causes of social tension, and in theoretical experiments. It is actively used to study the influence of a number of socio-demographic parameters on the electoral behavior of citizens: gender, age, profession, place of residence, nationality, level and nature of income.

    Regression analysis uses the concepts of independent and dependent variables. An independent variable is a variable that explains or causes a change in another variable. A dependent variable is a variable whose value is explained by the influence of the first. For example, in the 2004 presidential elections the determining factors, i.e. the independent variables, were indicators such as the stabilization of the population's material situation, the level of popularity of the candidates, and the incumbency factor. The dependent variable in this case is the percentage of votes cast for the candidates. Similarly, in the pair of variables "voter age" and "level of electoral activity", the first is independent and the second dependent.

    Regression analysis allows you to solve the following tasks:

    • 1) establish the very fact of the presence or absence of a statistically significant relationship between Y and X;
    • 2) construct the best (in the statistical sense) estimates of the regression function;
    • 3) predict the unknown Y from the given values of X;
    • 4) estimate the specific weight of the influence of each factor X on Y and, accordingly, exclude insignificant features from the model;
    • 5) by identifying causal relationships between the variables, partially control the values of Y by regulating the values of the explanatory variables X.

    Regression analysis involves selecting mutually independent variables that affect the value of the indicator under study, determining the form of the regression equation, and estimating its parameters using statistical methods for processing primary sociological data. This type of analysis is based on the idea of the form, direction and closeness (density) of the relationship. Depending on the number of features investigated, paired and multiple regression are distinguished. In practice, regression analysis is usually performed together with correlation analysis. The regression equation describes the numerical relationship between quantities, expressed as a tendency of one variable to increase or decrease as another increases or decreases. A distinction is also made between linear and nonlinear regression; in descriptions of political processes both variants occur equally often.

    A scatter plot of the interdependence between interest in political articles (Y) and respondents' education (X) suggests a linear regression (Fig. 30).

    Fig. 30.

    A scatter plot of the level of electoral activity (Y) against the respondent's age (X) (a conditional example) suggests a nonlinear regression (Fig. 31).


    Fig. 31.

    To describe the relationship between two features (X and Y) in the paired regression model, the linear equation

    y = a + bx + ε

    is used, where ε is the random error of the equation, reflecting the variation of the features, i.e. the deviation from strict "linearity".

    To estimate the coefficients a and b, the least squares method is used, which assumes that the sum of squared deviations of each point on the scatter plot from the regression line should be minimal. The coefficients a and b can be calculated from the system of equations:

    Σy = n·a + b·Σx,
    Σxy = a·Σx + b·Σx².

    Least squares estimation gives estimates of the coefficients a and b for which the line passes through the point with coordinates x̄ and ȳ, i.e. the relation ȳ = a + b·x̄ holds. The graphical representation of the regression equation is called the theoretical regression line. With a linear relationship, the regression coefficient represents the tangent of the angle of inclination of the theoretical regression line to the x-axis. The sign of the coefficient shows the direction of the relationship: if it is greater than zero the relationship is direct, if less, inverse.

    The example below, from the study "Political Petersburg-2006" (Table 56), shows a linear relationship between citizens' perceptions of the degree of satisfaction with their present life and their expectations of changes in the quality of life in the future. The relationship is direct and linear (the standardized regression coefficient is 0.233, the significance level 0.000). In this case, the regression coefficient is low, but it exceeds the lower bound for a statistically significant indicator (the lower bound of the square of the statistically significant Pearson coefficient).

    Table 56

    Impact of the quality of life of citizens in the present on expectations

    (St. Petersburg, 2006)

    * Dependent variable: "How do you think your life will change in the next 2-3 years?"

    In political life, the value of the variable under study most often depends on several features simultaneously. For example, the level and character of political activity are simultaneously influenced by the political regime of the state, political traditions, the peculiarities of political behavior in a given area and in the respondent's social micro-group, age, education, income level, political orientation, and so on. In this case one must use the multiple regression equation, which has the form:

    y = a + b1·x1 + b2·x2 + ... + bn·xn,

    where the coefficients b1, ..., bn are partial regression coefficients. Each shows the contribution of its independent variable to determining the values of the dependent (resulting) variable. If a partial regression coefficient is close to 0, one can conclude that there is no direct relationship between that independent variable and the dependent variable.

    The calculation of such a model can be performed on a PC using matrix algebra. Multiple regression allows you to reflect the multifactorial nature of social ties and to clarify the degree of influence of each factor individually and collectively on the resulting feature.

    The coefficient denoted b is called the linear regression coefficient; it shows the strength of the relationship between the variation of the factor feature X and the variation of the effective feature Y, measured in the absolute units of the features. The closeness of the relationship can, however, also be expressed in fractions of the standard deviation of the effective feature (such a coefficient is called the correlation coefficient). Unlike the regression coefficient b, the correlation coefficient does not depend on the accepted units of measurement and is therefore comparable across any features. The relationship is usually considered strong if r > 0.7, of medium closeness if 0.5 < r < 0.7, and weak if r < 0.5.

    As is well known, the closest connection is a functional one, in which each individual value of Y can be unambiguously assigned to a value of X. Thus, the closer the correlation coefficient is to 1, the closer the relationship is to a functional one. The significance level for regression analysis should not exceed 0.001.

    The correlation coefficient was long regarded as the main indicator of the closeness of the relationship between features. Later, however, the coefficient of determination became such an indicator. Its meaning is as follows: it reflects the share of the total variance of the resulting feature Y that is explained by the variance of the feature X. It is found by simply squaring the correlation coefficient (it varies from 0 to 1) and, in turn, for a linear relationship reflects the share, from 0 (0%) to 1 (100%), of the values of the feature Y determined by the values of the feature X. It is written as R²; in the output tables of regression analysis in the SPSS package it appears without the square sign.

    Let us outline the main problems in constructing a multiple regression equation.

    • 1. Selection of the factors included in the regression equation. At this stage the researcher first draws up a general list of the main causes that, according to theory, determine the phenomenon under study, and then selects the features for the regression equation. The basic selection rule: the factors included in the analysis should correlate with one another as little as possible; only then can a quantitative measure of influence be assigned to a particular factor feature.
    • 2. Selection of the form of the multiple regression equation (in practice, linear or linear-logarithmic forms are most often used). Thus, to use multiple regression, the researcher must first build a hypothetical model of the effect of several independent variables on the resulting one. For the results to be reliable, the model must correspond exactly to the real process: the relationship between the variables must be linear, no significant independent variable may be ignored, and no variable unrelated to the process under study may be included in the analysis. In addition, all measurements of the variables must be extremely accurate.

    The above description implies a number of conditions for applying this method, without which one cannot proceed to the procedure of multiple regression analysis (MRA) itself. Only observance of all the listed points makes it possible to carry out regression analysis correctly.
