Research Guides: R Studio guide: Regression

Linear Regression

A linear regression is one type of regression test used to analyze the direct association between a continuous dependent variable and one or more independent variables, which can be at any level of measurement: nominal, ordinal, interval, or ratio. A linear regression tests how the mean of the dependent variable changes with the predictors included in our model, the independent variable(s).

In the example below, our research question is:

What are the predictors of individuals' wages in the dataset?

We are going to include the variables age, sex, education, and language in our model to test their direct association with wages.

Below is a breakdown of the variables included in our model to help us keep track of the types of variables we are working with.

Dependent variable

Wages of respondent (wages). This is a continuous variable that ranges from 2.30 to 49.92, which is a large range! If you would like to investigate this variable further, use the code for descriptive statistics to better understand its distribution, which is very important for a linear regression model.

Independent variables

Age of respondent in years (age). This is a continuous variable measuring the age of each respondent.
Sex of respondent (sex). This is a nominal variable measuring the sex of each respondent and is coded as 1 = Female and 2 = Male.
Education of respondent in years (education). This is a continuous variable measuring the number of years of education each respondent has.
Language of respondent (language). This is a nominal variable measuring the language that each respondent speaks. Language is coded as 1 = English, 2 = French, and 3 = Other.
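As a rough illustration of how variables coded this way could be passed to a regression in R, here is a minimal sketch. It does not use the guide's dataset: the data frame `dat` and every value in it are simulated, and only the variable names and codings are taken from the list above. Leaving sex in its 1/2 coding and converting language to a labelled factor is an assumption, chosen because it produces coefficient rows named age, sex, education, languageFrench, and languageOther.

```r
# Simulated stand-in data: `dat` and all of its values are made up for illustration.
set.seed(1)
n <- 200
dat <- data.frame(
  age       = sample(18:65, n, replace = TRUE),
  sex       = sample(1:2, n, replace = TRUE),   # 1 = Female, 2 = Male
  education = sample(8:20, n, replace = TRUE),
  language  = sample(1:3, n, replace = TRUE)    # 1 = English, 2 = French, 3 = Other
)
dat$wages <- 2 + 0.2 * dat$age + 0.9 * dat$education + rnorm(n, sd = 5)

# Label the language codes so R treats them as categories rather than numbers.
# English (the first level) becomes the reference category.
dat$language <- factor(dat$language, levels = 1:3,
                       labels = c("English", "French", "Other"))

# Fit the linear regression and print the summary output.
model <- lm(wages ~ age + sex + education + language, data = dat)
summary(model)
```

Because the data are simulated, the numbers printed by this sketch will not match the results discussed in this guide.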

Formula

There are two formulas below: a general linear regression formula and the specific formula for our example.
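The formulas themselves are not reproduced in this text, so the following is a sketch in standard textbook notation rather than the guide's original rendering. The symbols $x_{1}, \dots, x_{k}$, $\varepsilon$, and the numbering of the coefficients are assumptions; the only symbol taken from the guide is $β_{0}$ for the intercept. A general linear regression with $k$ independent variables can be written as:

$$y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{k}x_{k} + \varepsilon$$

For our example, assuming sex enters as a single term and language enters as two indicator terms (French and Other, with English as the reference category), this would become:

$$\text{wages} = \beta_{0} + \beta_{1}\,\text{age} + \beta_{2}\,\text{sex} + \beta_{3}\,\text{education} + \beta_{4}\,\text{languageFrench} + \beta_{5}\,\text{languageOther} + \varepsilon$$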

The first section shows us descriptive statistics of the residuals of the model. Residuals are the differences between the observed values of the dependent variable and the values predicted by the model. RStudio provides us with the Min (minimum), 1Q (first quartile), Median, 3Q (third quartile), and Max (maximum) values of the residuals. We can use these to gauge how well our independent variables are predicting the dependent variable.
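To make the idea of a residual concrete, here is a small sketch on simulated data (the data frame `dat` and its values are made up and are not the guide's dataset). It computes the residuals by hand as observed minus predicted values and then produces the same five summary numbers that appear in the first section of the output.

```r
# Simulated stand-in data: made up for illustration only.
set.seed(2)
dat <- data.frame(education = sample(8:20, 100, replace = TRUE))
dat$wages <- 3 + 0.8 * dat$education + rnorm(100, sd = 4)

model <- lm(wages ~ education, data = dat)

# A residual is the observed value minus the value the model predicts.
res_by_hand <- dat$wages - fitted(model)

# The hand-computed residuals match the ones R stores in the model.
all.equal(unname(res_by_hand), unname(residuals(model)))

# Minimum, first quartile, median, third quartile, and maximum of the residuals.
quantile(residuals(model))
```

A median close to zero and first and third quartiles of similar size suggest that the residuals are spread fairly evenly around the predictions.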

The second section, Coefficients:, shows us the results from our regression analysis for each independent variable included. There are six rows of results: (Intercept), age, sex, education, languageFrench, and languageOther. The (Intercept) corresponds to our $β_{0}$ in the regression formula, which can be thought of as our ‘starting’ point on the graph: the predicted value of wages when every independent variable is equal to zero.

For the columns, Estimate gives the unstandardized beta coefficient for each variable, which is often reported in studies and publications. In a multiple linear regression, we interpret each coefficient as the change in the dependent variable, wages, associated with a one-unit increase in that independent variable while the other independent variables are held constant. Lastly, the column on the right end of the table, Pr(>|t|), gives the significance of each independent variable, which indicates whether it is a significant predictor of wages. We can see that all the independent variables except language (p = .6887) are significant predictors of wages.
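The interpretation of an unstandardized coefficient can be checked directly. The sketch below uses simulated data (the data frame `dat`, the two hypothetical respondents `person_a` and `person_b`, and all values are made up) to show that raising one independent variable by one unit, with the other held constant, changes the predicted wage by exactly that variable's Estimate.

```r
# Simulated stand-in data: made up for illustration only.
set.seed(3)
dat <- data.frame(age       = sample(18:65, 150, replace = TRUE),
                  education = sample(8:20, 150, replace = TRUE))
dat$wages <- 1 + 0.15 * dat$age + 0.9 * dat$education + rnorm(150, sd = 4)

model <- lm(wages ~ age + education, data = dat)

# Two hypothetical respondents who differ only by one year of education.
person_a <- data.frame(age = 40, education = 12)
person_b <- data.frame(age = 40, education = 13)

# The difference in predicted wages equals the education coefficient.
diff_pred <- predict(model, person_b) - predict(model, person_a)
all.equal(unname(diff_pred), unname(coef(model)["education"]))

# The Estimate and Pr(>|t|) columns can be pulled out of the summary table.
summary(model)$coefficients[, c("Estimate", "Pr(>|t|)")]
```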

The third section shows us a host of information relating to the regression model. We will break it down line by line.

First, there is the Residual standard error, which is 6.6 on 3981 degrees of freedom. This tells us that observed wages typically fall about 6.6 units away from the wages predicted by the model. The 3981 degrees of freedom are the 3987 observations used in the analysis minus the six coefficients estimated.

Second, RStudio tells us that 3438 observations were deleted due to missing values (listwise deletion). We know there are 7425 observations in the dataset, so after deletion 3987 observations are used in the regression analysis.

The third line gives our model fit statistics, which judge how well our independent variables explain the variance of wages. A Multiple R-squared value of 0.2973 is interpreted as the variables age, sex, education, and language explaining 29.73% of the variance of individuals' wages in this dataset. This is a high value! However, we also need to look at the Adjusted R-squared, which accounts for the number of independent variables in our model. Adjusted R-squared is important because the R-squared value never decreases when we add independent variables to our model, even if they explain very little. The Adjusted R-squared corrects for this inflation. We can see that the Adjusted R-squared value, 0.2964, is slightly lower than the Multiple R-squared.

Fourth, RStudio shows us the results from the ANOVA test. An ANOVA is used to test the statistical significance of the overall regression model, telling us whether our model is significant or not. We can see the F-statistic of the ANOVA test, which is often reported in publications along with the DF, or degrees of freedom. The p-value is the statistical significance of the ANOVA test, which we can see is <2.2e-16, far below our .05 threshold. We can interpret this as our regression model being statistically significant: taken together, the independent variables predict wages better than a model with no predictors, so what we are examining ‘matters’.
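The statistics in this part of the output can be recomputed by hand, which may help show where each number comes from. The sketch below uses simulated data with some wages deliberately set to missing (the data frame `dat` and all values are made up), and the formulas are the standard textbook ones. The last line applies the Adjusted R-squared formula to the figures quoted in this guide, assuming 3987 observations and 3981 residual degrees of freedom.

```r
# Simulated stand-in data with missing wages: made up for illustration only.
set.seed(4)
n <- 300
dat <- data.frame(age       = sample(18:65, n, replace = TRUE),
                  education = sample(8:20, n, replace = TRUE))
dat$wages <- 1 + 0.15 * dat$age + 0.9 * dat$education + rnorm(n, sd = 5)
dat$wages[sample(n, 40)] <- NA          # 40 respondents have no wage recorded

model <- lm(wages ~ age + education, data = dat)
s <- summary(model)

# Listwise deletion: rows with a missing value are dropped before fitting.
n_used    <- nobs(model)                # observations actually used
n_dropped <- length(model$na.action)    # observations deleted

k        <- length(coef(model)) - 1     # number of predictors (intercept excluded)
df_resid <- n_used - k - 1              # residual degrees of freedom

y   <- model.frame(model)$wages         # wages for the rows that were used
rss <- sum(residuals(model)^2)          # unexplained variation
tss <- sum((y - mean(y))^2)             # total variation

rse     <- sqrt(rss / df_resid)                           # Residual standard error
r2      <- 1 - rss / tss                                  # Multiple R-squared
adj_r2  <- 1 - (1 - r2) * (n_used - 1) / df_resid         # Adjusted R-squared
f_stat  <- (r2 / k) / ((1 - r2) / df_resid)               # F-statistic
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)    # p-value of the F-test

# Adjusted R-squared from the figures quoted in this guide.
guide_adj_r2 <- 1 - (1 - 0.2973) * (3987 - 1) / 3981
round(guide_adj_r2, 4)
```

Each hand-computed value can be compared with the corresponding element of `s`, for example `s$sigma`, `s$r.squared`, `s$adj.r.squared`, and `s$fstatistic`.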