
    For which scales is regression analysis used? The regression equation. The multiple regression equation. A problem solved with a linear regression equation

    Regression and correlation analysis are statistical research methods. They are the most common ways to show how a parameter depends on one or more independent variables.

    Below, using specific practical examples, we will look at these two analyses, which are very popular among economists, and give an example of combining the two.

    Regression analysis in Excel

    Regression analysis shows the influence of one or more independent variables on a dependent variable. For example: how does the size of the economically active population depend on the number of enterprises, the level of wages, and other parameters? Or: how do foreign investment, energy prices, and similar factors affect GDP?

    The results of such an analysis let you set priorities and, based on the main factors, forecast, plan the development of priority areas, and make management decisions.

    Regression can be:

    • linear (y = a + bx);
    • parabolic (y = a + bx + cx²);
    • exponential (y = a·exp(bx));
    • power (y = a·x^b);
    • hyperbolic (y = b/x + a);
    • logarithmic (y = b·ln(x) + a);
    • exponential with an arbitrary base (y = a·b^x).
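    None of these fits has to be done by hand. As a quick illustration outside the Excel walkthrough, here is a minimal Python sketch (with made-up data, not figures from this article) that fits two of the forms above and compares them by residual sum of squares:

        # A sketch (hypothetical data): fit the linear and exponential forms above
        # with scipy.optimize.curve_fit and compare residual sums of squares.
        import numpy as np
        from scipy.optimize import curve_fit

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
        y = np.array([2.1, 4.3, 8.2, 16.5, 33.0, 64.8])  # roughly doubles each step

        def linear(x, a, b):
            return a + b * x

        def exponential(x, a, b):
            return a * np.exp(b * x)

        for name, f, p0 in [("linear", linear, (1.0, 1.0)),
                            ("exponential", exponential, (1.0, 0.5))]:
            params, _ = curve_fit(f, x, y, p0=p0)
            rss = np.sum((y - f(x, *params)) ** 2)
            print(name, np.round(params, 3), "RSS =", round(rss, 2))

    Here the exponential form wins by a wide margin, which is exactly the kind of comparison the choice of regression type rests on.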

    Let's look at an example of building a regression model in Excel and interpreting the results, using the linear regression type.

    Task. For 6 enterprises, the average monthly salary and the number of employees who quit were recorded. We need to determine how the number of employees who quit depends on the average salary.

    The linear regression model is as follows:

    Y = a₀ + a₁x₁ + … + aₖxₖ,

    where the aᵢ are regression coefficients, the xᵢ are the influencing variables, and k is the number of factors.

    In our example, Y is the number of employees who quit, and the influencing factor is the salary (x).

    Excel has built-in functions that can compute the parameters of a linear regression model, but the Analysis ToolPak add-in does it faster.

    We activate this powerful analytical tool (File → Options → Add-ins → Manage: Excel Add-ins → Go → check "Analysis ToolPak").

    Upon activation, the add-in is available on the Data tab.

    Now let's go directly to the regression analysis: on the Data tab, open Data Analysis, choose Regression, specify the input ranges for Y (quits) and X (salary), and run the tool. Excel prints a summary output.



    First of all, pay attention to the R-square and the coefficients.

    R-square is the coefficient of determination. In our example it is 0.755, or 75.5%: the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model. Above 0.8 is good; below 0.5 is bad (such an analysis can hardly be considered sound). In our example it is "not bad".

    The coefficient 64.1428 shows what Y will be if all variables in the model under consideration equal 0. That is, the value of the analyzed parameter is also affected by other factors not described in the model.

    The coefficient -0.16285 shows the weight of the variable X on Y: within this model, the average monthly salary affects the number of quits with a weight of -0.16285 (a small degree of influence). The "-" sign indicates a negative influence: the higher the salary, the fewer people quit. Which is fair.
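    The same output can be reproduced outside Excel in a few lines. A minimal sketch with scipy; the salary and quits figures below are hypothetical, since the article's source table is not shown:

        # Reproducing the key numbers of Excel's "Regression" output
        # (hypothetical salary/quits data for 6 enterprises).
        import numpy as np
        from scipy.stats import linregress

        salary = np.array([25.0, 30.0, 35.0, 40.0, 45.0, 50.0])  # average monthly salary
        quits = np.array([60, 59, 58, 57, 56, 55])                # employees who quit

        res = linregress(salary, quits)
        print("Y-intercept:", res.intercept)   # analogue of the 64.1428 coefficient
        print("slope:", res.slope)             # analogue of the -0.16285 coefficient
        print("R-square:", res.rvalue ** 2)    # coefficient of determination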

    

    Correlation analysis in Excel

    Correlation analysis helps establish whether there is a relationship between indicators in one or two samples: for example, between the operating time of a machine and the cost of repairs, the price of equipment and its length of service, or the height and weight of children.

    If a relationship exists, the next question is whether an increase in one parameter leads to an increase (positive correlation) or a decrease (negative correlation) in the other. Correlation analysis helps the analyst determine whether the value of one indicator can predict the possible value of another.

    The correlation coefficient is denoted r and varies from +1 to -1. The classification of correlation strength differs from field to field. When the coefficient is 0, there is no linear relationship between the samples.

    Let's take a look at how to use Excel tools to find the correlation coefficient.

    To find paired coefficients, the CORREL function is used.

    Objective: Determine if there is a relationship between the operating time of the lathe and the cost of its maintenance.

    We put the cursor in any cell and press the fx button.

    1. In the "Statistical" category, select the CORREL function.
    2. Argument "Array 1" is the first range of values, the machine's operating time: A2:A14.
    3. Argument "Array 2" is the second range of values, the cost of repairs: B2:B14. Click OK.

    To determine the type of relationship, look at the absolute value of the coefficient (each field of activity has its own scale).

    For correlation analysis of several parameters (more than 2), it is more convenient to use Data Analysis (the Analysis ToolPak add-in): select Correlation from the list and designate the data array. That's all.

    The obtained coefficients will be displayed in the correlation matrix. Like this:
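    The same matrix can be computed outside Excel. A minimal numpy sketch with hypothetical columns (numpy's analogue of Data Analysis → Correlation):

        # Correlation matrix for several indicators at once (hypothetical data).
        import numpy as np

        hours = np.array([10, 12, 15, 18, 20, 21, 25, 28, 30, 33, 35, 38, 40])
        repair = np.array([5, 6, 6, 8, 9, 10, 12, 13, 15, 16, 18, 19, 22])
        age = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7])

        matrix = np.corrcoef([hours, repair, age])  # pairwise r for every pair of rows
        print(np.round(matrix, 2))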

    Correlation-regression analysis

    In practice, these two techniques are often used together.

    Example:


    Now the regression data is also visible.

    During their studies, students very often come across a variety of equations. One of them, the regression equation, is discussed in this article. This type of equation is used specifically to describe the relationship between mathematical parameters. It is used in statistics and econometrics.

    Defining Regression

    In mathematics, regression is a quantity describing the dependence of the average value of one quantity on the values of another quantity. The regression equation shows the average value of one characteristic as a function of another characteristic. The regression function has the form of a simple equation y = f(x), in which y is the dependent variable and x is the independent one (the factor attribute).

    What are the types of relationships between variables

    In general, there are two opposite types of relationship: correlation and regression.

    The first is characterized by the equal status of the variables: it is not known for certain which variable depends on the other.

    If the variables are not on an equal footing and the problem statement specifies which variable is explanatory and which is dependent, then we can speak of a relationship of the second type. To build a linear regression equation, it is necessary to find out which type of relationship is observed.

    Regression types

    To date, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, and log-linear.

    Hyperbolic, linear and logarithmic

    The linear regression equation is used in statistics to explain the parameters of the equation clearly. It looks like y = c + m·x + E. The hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. The log-linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

    Multiple and nonlinear

    Two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x₁, x₂, …, xₘ) + E. In this situation, y is the dependent variable and the x's are explanatory ones; the variable E is stochastic and absorbs the influence of factors outside the equation. The nonlinear regression equation is a bit controversial: on the one hand it is not linear with respect to the indicators included in it, but on the other hand it may be linear in the parameters being estimated.

    Inverse and Paired Regressions

    Inverse regression is a kind of function that must be converted to a linear form. In the most traditional applications it takes the form of the function y = 1/(c + m·x + E). The paired regression equation demonstrates the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

    Correlation concept

    This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value fluctuates within the interval [-1; +1]. A negative value indicates an inverse relationship, a positive one a direct relationship. If the coefficient equals 0, there is no relationship. The closer the value is to 1, the stronger the relationship between the parameters; the closer to 0, the weaker.

    Methods

    Parametric methods of correlation analysis can assess the strength of the relationship. They are based on distribution estimates and are used to study parameters that obey the normal distribution law.

    The parameters of the linear regression equation are needed to identify the type of dependence and the regression function, and to evaluate the indicators of the chosen relationship formula. The correlation field is used as a method of identifying a relationship: all available data are plotted in a rectangular two-dimensional coordinate system, with the value of the explanatory factor marked along the abscissa and the values of the dependent factor along the ordinate. If there is a functional relationship between the parameters, the points line up in the form of a line.

    If the correlation coefficient of such data is less than 30%, we can speak of an almost complete absence of a relationship. If it is between 30% and 70%, this indicates a relationship of medium closeness. A 100% indicator is evidence of a functional relationship.

    The nonlinear regression equation, as well as the linear one, must be supplemented with the correlation index (R).

    Correlation for multiple regression

    The coefficient of determination is the square of the multiple correlation coefficient. It indicates the closeness of the relationship between the presented set of indicators and the trait under study, and it can also describe the nature of the influence of the parameters on the result. The multiple regression equation is assessed using this indicator.

    To calculate the multiple correlation index, one takes the square root of the coefficient of determination.

    Least squares method

    This method is a way of estimating regression factors. Its essence lies in minimizing the sum of the squared deviations of the observed values from those given by the function.

    A paired linear regression equation can be estimated by this method. This type of equation is used when a paired linear relationship is detected between the indicators.
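    A minimal sketch of the idea with hypothetical data: for paired linear regression y = c + m·x, minimizing the sum of squared deviations gives the closed-form estimates m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and c = ȳ − m·x̄.

        # Least squares for paired linear regression y = c + m*x (hypothetical data).
        import numpy as np

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

        m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        c = y.mean() - m * x.mean()

        print("m =", m, "c =", c)
        print("check:", np.polyfit(x, y, 1))  # numpy's own least squares fit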

    Equation parameters

    Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator y when the variable x decreases (increases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x is nonzero, the factor c has no economic meaning; the only thing that matters is the sign in front of it. A minus means the result changes more slowly than the factor; a plus indicates an accelerated change in the result.

    Each parameter of the regression equation can itself be expressed through an equation. For example, the factor c has the form c = ȳ − m·x̄.

    Grouped data

    There are problem settings in which all the information is grouped by the attribute x, and for each group the corresponding average value of the dependent indicator is given. In this case, the averages characterize how the indicator depends on x, and the grouped information helps to find the regression equation. However, this method has drawbacks: average indicators are often subject to external fluctuations, which do not reflect the underlying pattern of the relationship but merely mask its "noise". Averages show the relationship pattern much worse than a linear regression equation would, but they can be used as a basis for finding the equation. Multiplying the size of each group by the corresponding average gives the sum of y within the group; adding up all these sums gives the overall total of y. It is a little harder to compute the sum of the products xy. If the intervals are small, the x value can conventionally be taken as the same for all units within a group; multiplying it by the group's sum of y gives the group's sum of products of x and y. Finally, all group sums are added together to obtain the total sum xy.
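    A small numeric sketch of this bookkeeping, with hypothetical group sizes, representative x values, and group means of y:

        # Grouped data (hypothetical): group size n, representative x, group mean of y.
        import numpy as np

        n = np.array([10, 15, 20])          # units in each group
        x_rep = np.array([1.0, 2.0, 3.0])   # x taken as common within each group
        y_mean = np.array([4.0, 5.5, 7.1])  # group averages of y

        sum_y = np.sum(n * y_mean)           # total of y over all units
        sum_xy = np.sum(n * x_rep * y_mean)  # total of x*y over all units
        print("sum of y =", sum_y, "sum of xy =", sum_xy)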

    Multiple Regression Equation: Assessing the Significance of the Relationship

    As discussed earlier, multiple regression has a function of the form y = f(x₁, x₂, …, xₘ) + E. Most often, such an equation is used to solve problems of supply and demand for a product or of interest income on repurchased shares, and to study the causes and shape of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations; at the level of microeconomics it is applied somewhat less often.

    The main task of multiple regression is to build a model from a large amount of data and then determine what influence each of the factors, individually and in combination, has on the indicator being modeled and on its coefficients. The regression equation can take a wide variety of forms; to assess the relationship, two types of functions are usually used: linear and nonlinear.

    The linear function is depicted as the relationship y = a₀ + a₁x₁ + a₂x₂ + … + aₘxₘ, in which a₁, a₂, …, aₘ are the coefficients of "pure" regression. They characterize the average change in the parameter y when the corresponding parameter x changes (decreases or increases) by one unit, the other indicators being held constant.

    Nonlinear equations include, for example, the power function y = a·x₁^b₁ · x₂^b₂ · … · xₘ^bₘ. Here the exponents b₁, b₂, …, bₘ are called elasticity coefficients: they show how the result changes (by what percentage) when the corresponding indicator x increases (decreases) by 1%, the other factors being held constant.
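    A sketch of how such a power function is usually estimated in practice (hypothetical, noise-free data): taking logarithms makes the equation linear in the b's, so it can be fitted by ordinary least squares.

        # Estimating y = a * x1**b1 * x2**b2 via a log-log least squares fit
        # (hypothetical data; b1 and b2 are the elasticity coefficients).
        import numpy as np

        x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        x2 = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
        y = 1.3 * x1**0.7 * x2**1.2  # synthetic response with known elasticities

        X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
        coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
        print("a =", np.exp(coef[0]), "b1 =", coef[1], "b2 =", coef[2])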

    What factors need to be considered when constructing multiple regression

    In order to correctly construct multiple regression, it is necessary to find out which factors should be paid special attention to.

    It is necessary to have a certain understanding of the nature of the relationship between the economic factors and the indicator being modeled. The factors to be included must meet the following criteria:

    • They must be quantifiable. To use a factor describing the quality of an object, it must in any case be given a quantitative form.
    • There should be no intercorrelation between factors, and no functional relationship between them: otherwise the system of normal equations becomes ill-conditioned, which entails unreliable and indistinct estimates.
    • If the correlation between factors is very high, there is no way to isolate their individual influence on the final result, and the coefficients become uninterpretable.

    Construction methods

    There are a huge number of methods and techniques for choosing the factors of the equation, all based on selecting coefficients using the correlation indicator. Among them are:

    • Exclusion method.
    • Method of inclusion.
    • Stepwise regression analysis.

    The first method involves filtering out factors from the full set; the second involves introducing additional factors one by one; the third eliminates factors that were previously included in the equation. Each of these methods has a right to exist. They have their pros and cons, but they can all resolve, in their own way, the issue of dropping unnecessary indicators. As a rule, the results obtained by each individual method are fairly close.

    Multivariate analysis methods

    Such methods of determining factors are based on considering individual combinations of interrelated features. They include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. There is also factor analysis, which emerged from the development of the principal component method. All of them are applied in certain circumstances, subject to certain conditions and factors.

    The purpose of regression analysis is to measure the relationship between a dependent variable and one (in paired regression analysis) or several (in multiple regression analysis) independent variables. The independent variables are also called factor, explanatory, or determining variables, regressors, and predictors.

    The dependent variable is sometimes called the determined, explained, or "response" variable. The extremely widespread use of regression analysis in empirical research is not only because it is a convenient tool for testing hypotheses: regression, especially multiple regression, is an effective technique for modeling and forecasting.

    To explain the principles of working with regression analysis, we will start with a simpler one - the pairwise method.

    Paired Regression Analysis

    The first steps in regression analysis are almost identical to those we took when calculating the correlation coefficient. The three main conditions for the effectiveness of correlation analysis by Pearson's method (normal distribution of the variables, interval measurement of the variables, and a linear relationship between the variables) are also relevant for multiple regression. Accordingly, at the first stage scatterplots are built, descriptive statistics of the variables are computed, and the regression line is calculated. As in correlation analysis, regression lines are fitted by the least squares method.

    To illustrate the differences between the two methods of data analysis more clearly, let us turn to the example already considered, with the variables "SPS support" and "share of rural population". The original data are identical. The difference in the scatterplots is that in regression analysis it matters which axis carries the dependent variable: "SPS support" must be plotted along the Y axis, whereas in correlation analysis this does not matter. After cleaning out the outliers, the scatter diagram looks like this:

    The basic idea of regression analysis is that, given the general tendency of the variables in the form of a regression line, one can predict the value of the dependent variable from the values of the independent one.

    Let's recall the usual mathematical linear function. Any straight line in Euclidean space can be described by the formula

    y = a + bx,

    where a is a constant specifying the displacement along the ordinate, and b is a coefficient that determines the slope of the line.

    Knowing the slope and the constant, you can calculate (predict) the value of y for any x.

    This simplest function formed the basis of the regression analysis model, with the proviso that we will not predict the value of y exactly, but within a certain confidence interval, i.e. approximately.

    The constant is the point where the regression line crosses the ordinate axis (the Y-intercept, usually labeled "intercept" in statistical packages). In our example with the vote for the SPS, its rounded value is 10.55. The slope b is approximately -0.1 (as in correlation analysis, the sign shows the type of relationship: direct or inverse). Thus the resulting model has the form SPS = -0.1 × Rural pop. + 10.55.

    So, for the case of the Republic of Adygea, with a rural population share of 47%, the predicted value is 5.63:

    SPS = -0.10 × 47 + 10.55 = 5.63.

    The difference between the original and predicted values is called the residual (we have already encountered this term, fundamental for statistics, when analyzing contingency tables). So, for the case of the Republic of Adygea the residual is 3.92 − 5.63 = −1.71. The larger the absolute value of the residual, the worse the value is predicted.

    We calculate the predicted values and residuals for all cases:

    Case                        Rural pop., %   SPS (original)   SPS (predicted)   Residual
    Republic of Adygea          47              3.92             5.63              -1.71
    Altai Republic              76              5.40             2.59              2.81
    Republic of Bashkortostan   36              6.04             6.78              -0.74
    Republic of Buryatia        41              8.36             6.25              2.11
    Republic of Dagestan        59              1.22             4.37              -3.15
    Republic of Ingushetia      59              0.38             4.37              -3.99
    Etc.
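    The same bookkeeping in Python, using the figures from the table; the slope -0.1047 below is an assumption that approximately reproduces the table's predictions and is consistent with the "approximately -0.1" reported above:

        # Predicted values and residuals for the SPS example.
        import numpy as np

        rural = np.array([47, 76, 36, 41, 59, 59])  # rural population share, %
        sps = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])

        predicted = -0.1047 * rural + 10.55  # model from the text (slope assumed)
        residuals = sps - predicted
        print(np.round(predicted, 2))   # approx. 5.63, 2.59, 6.78, ...
        print(np.round(residuals, 2))   # approx. -1.71, 2.81, -0.74, ...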

    Analyzing the relationship between the original and predicted values serves to assess the quality of the resulting model and its predictive power. One of the main indicators of regression statistics is the multiple correlation coefficient R: the correlation between the original and predicted values of the dependent variable. In paired regression analysis it equals Pearson's ordinary correlation coefficient between the dependent and independent variables, in our case 0.63. To interpret multiple R meaningfully, it must be converted into the coefficient of determination, obtained, as in correlation analysis, by squaring. The coefficient of determination R-square (R²) shows the proportion of variation in the dependent variable that is explained by the independent variable(s).

    In our case R² = 0.39 (0.63²); this means that the variable "rural population share" explains about 40% of the variation in the variable "SPS support". The greater the coefficient of determination, the higher the quality of the model.

    Another measure of model quality is the standard error of estimate. It measures how widely the points are "scattered" around the regression line. The standard deviation is the measure of spread for interval variables; accordingly, the standard error of estimate is the standard deviation of the distribution of residuals. The higher its value, the greater the spread and the worse the model. In our case the standard error is 2.18: this is the amount by which our model "errs on average" when predicting the value of the variable "SPS support".

    Regression statistics also include analysis of variance. It is used to find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable falls on the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). The variance statistic is especially important for sample studies: it shows how likely it is that a relationship between the independent and dependent variables exists in the general population. However, for complete-population studies (as in our example) the results of analysis of variance are not useless either. In that case one checks whether the revealed statistical pattern is due to a coincidence of random circumstances and how characteristic it is of the set of conditions in which the population under study finds itself; i.e., what is established is not the truth of the result for some wider general population, but the degree of its regularity and freedom from random influences.

    In our case, the analysis of variance statistics is as follows:

    Source       SS       df   MS       F       p-value
    Regression   258.77   1    258.77   54.29   0.000000001
    Residual     395.59   83   4.77
    Total        654.36   84

    The F-ratio of 54.29 is significant at p = 0.000000001. Accordingly, we can confidently reject the null hypothesis (that the relationship we discovered is due to chance).

    A similar function is performed by the t-test, but with respect to the regression coefficients (the slope and the Y-intercept). Using the t-test we test the hypothesis that the regression coefficients are equal to zero in the general population. In our case we can again confidently reject the null hypothesis.

    Multiple regression analysis

    The multiple regression model is almost identical to the paired regression model; the only difference is that several independent variables are sequentially included in the linear function:

    Y = b₁X₁ + b₂X₂ + … + bₚXₚ + a.

    If there are more than two independent variables, we cannot get a visual picture of their relationship; in this respect, multiple regression is less "visual" than paired regression. When there are two independent variables, it can be useful to display the data in a 3D scatterplot. In professional statistical software packages (for example, Statistica) there is an option for rotating a three-dimensional diagram, which gives a good visual representation of the data structure.

    When working with multiple regression, unlike paired regression, it is necessary to define the analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The stepwise algorithm sequentially includes (or excludes) independent variables based on their explanatory "weight". The stepwise method is good when there are many independent variables: it "cleans" the model of frankly weak predictors, making it more compact and laconic.

    An additional condition for the correctness of multiple regression (along with interval measurement, normality, and linearity) is the absence of multicollinearity, i.e. of strong correlations between the independent variables.
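    One common way to screen for multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute VIF = 1/(1 − R²). A minimal numpy sketch with hypothetical predictors (values above roughly 10 are a frequent rule of thumb for trouble):

        # Variance inflation factors for a hypothetical predictor matrix.
        import numpy as np

        rng = np.random.default_rng(0)
        x1 = rng.normal(size=50)
        x2 = 0.9 * x1 + rng.normal(scale=0.3, size=50)  # deliberately correlated with x1
        x3 = rng.normal(size=50)
        X = np.column_stack([x1, x2, x3])

        for j in range(X.shape[1]):
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
            resid = X[:, j] - A @ beta
            r2 = 1 - resid.var() / X[:, j].var()
            print(f"VIF for x{j + 1}: {1 / (1 - r2):.2f}")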

    The interpretation of multiple regression statistics includes all the elements that we considered for the case of paired regression. In addition, there are other important components to the statistics of multiple regression analysis.

    We will illustrate the work with multiple regression by the example of testing hypotheses explaining the differences in the level of electoral activity in the regions of Russia. Specific empirical studies have suggested that voter turnout is influenced by:

    The national factor (the variable "Russian population", operationalized as the share of the Russian population in the constituent entities of the Russian Federation). It is assumed that an increase in the share of the Russian population leads to a decrease in voter turnout;

    The urbanization factor (the variable "urban population", operationalized as the share of the urban population in the constituent entities of the Russian Federation; we have already worked with this factor in the correlation analysis). It is assumed that an increase in the share of the urban population also leads to a decrease in voter turnout.

    The dependent variable, "intensity of electoral activity" ("activity"), is operationalized through averaged turnout data by region in the federal elections from 1995 to 2003. The initial data table for the two independent variables and one dependent variable has the following form:

    Case                         Activity   Urban pop., %   Russian pop., %
    Republic of Adygea           64.92      53              68
    Altai Republic               68.60      24              60
    Republic of Buryatia         60.75      59              70
    Republic of Dagestan         79.92      41              9
    Republic of Ingushetia       75.05      41              23
    Republic of Kalmykia         68.52      39              37
    Karachay-Cherkess Republic   66.68      44              42
    Republic of Karelia          61.70      73              73
    Komi Republic                59.60      74              57
    Mari El Republic             65.19      62              47

    Etc. (after cleaning out outliers, 83 cases out of 88 remain)

    Statistics describing the quality of the model:

    1. Multiple R = 0.62; R-square = 0.38. Consequently, the national factor and the urbanization factor together explain about 38% of the variation in the variable "electoral activity".

    2. The average error is 3.38. This is how much the model is "wrong on average" when predicting the turnout level.

    3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis that the identified links are random is rejected.

    4. The t-test for the constant and the regression coefficients of the variables "urban population" and "Russian population" is significant at the 0.0000001, 0.00005, and 0.007 levels respectively. The null hypothesis that the coefficients are random is rejected.

    Additional useful statistics for analyzing the relationship between the original and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of a case (it shows how far the combination of values of all independent variables for a given case deviates from their means simultaneously). The second is a measure of the case's influence. Different observations affect the slope of the regression line differently, and Cook's distance lets you compare them on this indicator. This is useful when cleaning out outliers (an outlier can be thought of as an overly influential case).

    In our example, Dagestan is one of the unique and influential cases.

    Case                     Original value   Predicted value   Residual   Mahalanobis distance   Cook's distance
    Adygea                   64.92            66.33             -1.40      0.69                   0.00
    Altai Republic           68.60            69.91             -1.31      6.80                   0.01
    Republic of Buryatia     60.75            65.56             -4.81      0.23                   0.01
    Republic of Dagestan     79.92            71.01             8.91       10.57                  0.44
    Republic of Ingushetia   75.05            70.21             4.84       6.73                   0.08
    Republic of Kalmykia     68.52            69.59             -1.07      4.20                   0.00

    The regression model itself has the following parameters: Y-intercept (constant) = 75.99; b (urban pop.) = -0.1; b (Russian pop.) = -0.06. Final formula:

    Activity = -0.1 × Urban pop. − 0.06 × Russian pop. + 75.99.

    Can we compare the "explanatory power" of the predictors from the values of the coefficients b? In this case yes, since both explanatory variables are in the same percentage format. However, multiple regression usually deals with variables measured on different scales (for example, income in rubles and age in years), so in the general case it is incorrect to compare the predictive power of variables by their regression coefficients. Multiple regression statistics have a special beta coefficient (β) for this purpose, calculated separately for each independent variable. It is the partial correlation coefficient of the factor and the response (computed after accounting for the influence of all other predictors) and shows the factor's independent contribution to predicting the response. In paired regression analysis the beta coefficient is, understandably, equal to the paired correlation coefficient between the dependent and independent variables.
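    A sketch of one way such standardized coefficients can be computed by hand (hypothetical data): fit the raw regression, then rescale each slope by the ratio of standard deviations, beta = b·s(x)/s(y).

        # Standardized (beta) coefficients from a raw two-predictor fit
        # (hypothetical data loosely shaped like the turnout example).
        import numpy as np

        rng = np.random.default_rng(1)
        urban = rng.normal(50, 15, size=80)
        russian = rng.normal(55, 20, size=80)
        activity = 76 - 0.1 * urban - 0.06 * russian + rng.normal(scale=3, size=80)

        A = np.column_stack([np.ones_like(urban), urban, russian])
        b, *_ = np.linalg.lstsq(A, activity, rcond=None)  # [const, b_urban, b_russian]

        beta_urban = b[1] * urban.std() / activity.std()
        beta_russian = b[2] * russian.std() / activity.std()
        print(np.round(b, 3), round(beta_urban, 3), round(beta_russian, 3))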

    In our example, beta (urban pop.) = -0.43 and beta (Russian pop.) = -0.28. Thus both factors negatively affect the level of electoral activity, and the significance of the urbanization factor is noticeably higher than that of the national factor. Together the two factors determine about 38% of the variation in the variable "electoral activity" (see the R-square value).

    In statistical modeling, regression analysis is a set of methods used to estimate relationships between variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when one of the explanatory variables changes while the other explanatory variables are held fixed.

    In all cases, the target estimate is a function of the explanatory variables called the regression function. In regression analysis it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

    Regression Analysis Tasks

    This statistical research method is widely used for forecasting, where it offers a significant advantage, but it can sometimes create illusions or false relationships, so it is recommended to use it carefully: for example, correlation does not imply causation.

    A large number of methods have been developed for performing regression analysis, such as linear regression and ordinary least squares regression, which are parametric: the regression function is defined in terms of a finite number of unknown parameters estimated from the data. Nonparametric regression allows the regression function to lie in a specified set of functions, which can be infinite-dimensional.

    As a statistical research method, regression analysis in practice depends on the form of the data-generating process and on how it relates to the regression approach. Since the true form of the data-generating process is usually unknown, regression analysis often depends to some extent on assumptions about that process. These assumptions are sometimes testable if enough data are available. Regression models are often useful even when the assumptions are moderately violated, although they may then not perform at maximum efficiency.

    In a narrower sense, regression can refer specifically to the estimation of continuous response variables, as opposed to discrete response variables used in classification. The case of a continuous output variable is also called metric regression to distinguish it from related problems.

    History

    The earliest form of regression is the well-known least squares method, published by Legendre in 1805 and by Gauss in 1809. Legendre and Gauss applied the method to the problem of determining, from astronomical observations, the orbits of bodies around the Sun (mainly comets, but later also the newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821, including a variant of the Gauss-Markov theorem.

    The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon: the heights of the offspring of tall ancestors tend to regress down toward the normal mean. For Galton, regression had only this biological meaning, but his work was later continued by Udny Yule and Karl Pearson and brought into a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was rejected by Fisher in works of 1922 and 1925. Fisher suggested that the conditional distribution of the response variable is Gaussian, but that the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821. Before 1970, it sometimes took up to 24 hours to get the result of a regression analysis.

    Regression analysis methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression; regression with correlated responses; regression methods accommodating various types of missing data; nonparametric regression; Bayesian regression methods; regression with predictor variables measured with error; regression with more predictors than observations; and causal inference with regression.

    Regression models

    Regression analysis models include the following variables:

    • Unknown parameters, denoted beta, which can be a scalar or vector.
    • Independent variables, X.
    • Dependent variables, Y.

    In various fields of science where regression analysis is applied, different terms are used instead of dependent and independent variables, but in all cases the regression model expresses Y as a function of X and β.

    The approximation is usually written as E(Y | X) = F(X, β). To carry out the regression analysis, the form of the function F must be determined. Sometimes it is based on knowledge of the relationship between Y and X that does not rely on the data; if no such knowledge is available, a flexible or convenient form for F is chosen.

    Dependent variable Y

    Suppose now that the vector of unknown parameters β has length k. To perform regression analysis, the user must provide information about the dependent variable Y:

    • If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be carried out: the system of equations defining the regression model is underdetermined, and there is not enough data to recover β.
    • If exactly N = k points are observed and the function F is linear, the equation Y = F(X, β) can be solved exactly rather than approximately. This reduces to solving a system of N equations with N unknowns (the elements of β), which has a unique solution as long as the X values are linearly independent. If F is nonlinear, a solution may not exist, or many solutions may exist.
    • The most common situation is when N > k data points are observed. In this case there is enough information in the data to estimate a unique value of β that best fits the data, and the regression model applied to the data can be viewed as an overdetermined system in β.

    In the latter case, regression analysis provides tools for:

    • Finding a solution for the unknown parameters β that will, for example, minimize the distance between the measured and predicted values of Y.
    • Under certain statistical assumptions, using the excess information to provide statistical information about the unknown parameters β and the predicted values of the dependent variable Y.

    Required number of independent measurements

    Consider a regression model with three unknown parameters: β₀, β₁, and β₂. Suppose the experimenter makes 10 measurements at the same value of the independent vector X. In this case, regression analysis does not yield a unique set of estimates; the best one can do is estimate the mean and standard deviation of the dependent variable Y. Similarly, measuring at two different values of X gives enough data for a regression with two unknowns, but not for three or more.

    If the experimenter's measurements were made at three different values ​​of the independent variable of the vector X, then the regression analysis will provide a unique set of estimates for the three unknown parameters in β.

    In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX be invertible.

    Statistical assumptions

    When the number of measurements N is greater than the number of unknown parameters k, and the measurement errors are εᵢ, the excess information contained in the measurements is used for statistical predictions about the unknown parameters. This excess is called the degrees of freedom of the regression.

    Underlying assumptions

    Classic assumptions for regression analysis include:

    • The sample is representative of the population about which inference is made.
    • The error is a random variable with a mean of zero conditional on the explanatory variables.
    • The explanatory variables are measured without error.
    • The independent variables (predictors) are linearly independent, i.e. no predictor can be expressed as a linear combination of the others.
    • The errors are uncorrelated, i.e. the error covariance matrix is diagonal and each nonzero element is the variance of the error.
    • The variance of the error is constant across observations (homoscedasticity). If not, weighted least squares or other methods can be used (see the sketch after this list).
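    As a small illustration of checking the last assumption, a rough sketch with synthetic data: fit a line, then see whether the spread of the residuals grows with the fitted values.

        # Rough homoscedasticity check (synthetic data with growing variance):
        # correlate |residuals| with fitted values; a clearly positive value
        # suggests the error variance is not constant.
        import numpy as np

        rng = np.random.default_rng(2)
        x = np.linspace(1, 10, 60)
        y = 3 + 2 * x + rng.normal(scale=0.5 * x)  # noise grows with x

        slope, intercept = np.polyfit(x, y, 1)
        fitted = intercept + slope * x
        resid = y - fitted
        r = np.corrcoef(fitted, np.abs(resid))[0, 1]
        print("corr(|resid|, fitted) =", round(r, 2))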

    These are sufficient conditions for a least squares estimate to have the required properties; in particular, the assumptions imply that parameter estimates will be unbiased, consistent, and efficient, especially in the class of linear estimators. It is important to note that real data rarely satisfy all the conditions, so the method is used even when the assumptions do not hold exactly. Variation from the assumptions can sometimes be used as a measure of how useful the model is. Many of these assumptions can be relaxed in more advanced methods. Reports of statistical analyses typically include tests on the sample data and an assessment of the model's usefulness.

    In addition, variables in some cases refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions. Geographically weighted regression is one technique developed to deal with such data.

    In linear regression, the defining feature is that the dependent variable Yᵢ is a linear combination of the parameters. For example, simple linear regression uses one independent variable, xᵢ, and two parameters, β₀ and β₁, to model n points.

    In multiple linear regression, there are several independent variables or their functions.

    When a random sample is drawn from a population, its parameters make it possible to build a sample linear regression model.

    In this setting, the least squares method is the most popular. It yields parameter estimates that minimize the sum of the squared residuals. This kind of minimization (typical of linear regression) leads to a set of normal equations: a set of linear equations in the parameters, which are solved to obtain the parameter estimates.

    Assuming further that the population errors are normally distributed, a researcher can use these estimated standard errors to create confidence intervals and test hypotheses about the parameters.

    Nonlinear Regression Analysis

    When the function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications that distinguish linear from nonlinear least squares; consequently, the results of a regression analysis using a nonlinear method are sometimes unpredictable.

    Calculation of power and sample size

    There is no universally agreed method relating the number of observations to the number of explanatory variables in the model. One rule of thumb was proposed by Good and Hardin and looks like N = m^n, where N is the sample size, n is the number of independent variables, and m is the number of observations needed to achieve the desired accuracy if the model had only one independent variable. For example, suppose a researcher builds a linear regression model using a dataset containing 1000 patients (N). If the researcher decides that five observations are needed to determine a straight line accurately (m = 5), then the maximum number of independent variables the model can support is 4.
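    A one-line check of that arithmetic, under the rule and the m = 5 given above:

        # Maximum number of predictors under the N = m**n rule of thumb.
        import math

        N, m = 1000, 5
        print(math.floor(math.log(N, m)))  # -> 4, since 5**4 = 625 <= 1000 < 5**5 = 3125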

    Other methods

    Although the parameters of a regression model are usually estimated using the least squares method, there are other methods that are used much less frequently. For example, these are the following methods:

    • Bayesian methods (for example, Bayesian linear regression).
    • Percentage regression, used in situations where reducing percentage errors is considered more appropriate.
    • Least absolute deviations, which is more robust to outliers and leads to quantile regression.
    • Nonparametric regression, which requires a large number of observations and computations.
    • Distance metric learning, in which a meaningful distance metric is learned in a given input space.

    Software

    All major statistical software packages perform least squares regression analysis. Simple linear regression and multiple regression analysis are also available in some spreadsheet applications and on some calculators. Although many statistical packages can perform various kinds of nonparametric and robust regression, these methods are less standardized: different software packages implement different methods. Specialized regression software has been developed for use in fields such as survey analysis and neuroimaging.

    Regression analysis is a method for establishing an analytical expression for the stochastic relationship between the studied features. The regression equation shows how the average of y changes when any of the x_i changes, and has the form

    y = f(x₁, x₂, …, xₙ),

    where y is the dependent variable (there is always one);

    x_i are the independent variables (factors) (there may be several of them).

    If there is only one explanatory variable, this is simple regression analysis. If there are several (n ≥ 2), the analysis is called multivariate.

    In the course of regression analysis, two main tasks are solved:

    • building the regression equation, i.e. finding the type of relationship between the final indicator and the independent factors x₁, x₂, …, xₙ;

    • assessing the significance of the resulting equation, i.e. determining to what extent the selected factor attributes explain the variation of the attribute y.

    Regression analysis is used mainly for planning, as well as for the development of a regulatory framework.

    Unlike correlation analysis, which only answers the question of whether a relationship exists between the analyzed features, regression analysis also gives the relationship a formalized expression. Moreover, while correlation analysis studies any interrelation of factors, regression analysis studies one-sided dependence, i.e. a relationship showing how a change in the factor attributes affects the resultant attribute.

    Regression analysis is one of the most developed methods of mathematical statistics. Strictly speaking, implementing regression analysis requires fulfilling a number of special requirements (in particular, x₁, x₂, …, xₙ and y must be independent, normally distributed random variables with constant variances). In real life, strict compliance with the requirements of regression and correlation analysis is very rare, but both of these methods are quite common in economic research. Dependencies in the economy can be not only direct but also inverse and nonlinear. A regression model can be built in the presence of any dependence; however, in multivariate analysis only linear models of the following form are used:

    y = a + b₁x₁ + b₂x₂ + … + bₙxₙ.

    The regression equation is constructed, as a rule, by the least squares method, whose essence is minimizing the sum of squared deviations of the actual values of the resultant attribute from its calculated values, i.e.:

    S = Σⱼ (yⱼ − ŷⱼ)² → min, j = 1, …, m,

    where m is the number of observations;

    ŷⱼ = a + b₁x₁ⱼ + b₂x₂ⱼ + … + bₙxₙⱼ is the calculated value of the resultant factor.

    It is recommended to determine the regression coefficients using analytical software packages or a special financial calculator. In the simplest case, the regression coefficients of a one-factor linear regression equation of the form y = a + bx can be found by the formulas:

    b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,  a = ȳ − b·x̄.

    Cluster Analysis

    Cluster analysis is one of the multivariate analysis methods, designed for grouping (clustering) a population whose elements are characterized by many features. The values of each feature serve as the coordinates of each unit of the studied population in the multidimensional feature space: each observation, characterized by the values of several indicators, can be represented as a point in the space of those indicators. The distance between points p and q with k coordinates is defined as:

    d(p, q) = √((p₁ − q₁)² + … + (pₖ − qₖ)²).

    The main criterion for clustering is that the differences between clusters should be more significant than those between observations assigned to the same cluster, i.e. in the multidimensional space the corresponding inequality must hold:

    where r₁,₂ is the distance between clusters 1 and 2.

    Like the regression analysis procedures, the clustering procedure is quite laborious, so it is advisable to perform it on a computer.
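    A minimal sketch of the distance computation and the clustering criterion, with hypothetical points:

        # Euclidean distance in k-dimensional feature space (hypothetical points),
        # plus a tiny check of the clustering criterion: the between-cluster
        # distance should exceed the within-cluster distances.
        import numpy as np

        def dist(p, q):
            return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

        cluster1 = np.array([[1.0, 2.0], [1.2, 1.8]])
        cluster2 = np.array([[5.0, 6.0], [5.3, 5.9]])

        within1 = dist(cluster1[0], cluster1[1])
        within2 = dist(cluster2[0], cluster2[1])
        between = dist(cluster1.mean(axis=0), cluster2.mean(axis=0))
        print(within1, within2, between)  # between is much larger than within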