While solving analytical problems, have you ever wondered how one variable is affected by or related to another? Or what the value of one variable will be in the future, based on the value of another? If you have been troubled by such questions, then you are at the right place. All of them can be answered by a simple concept named "regression". Regression analysis in statistical modeling is the process of estimating the relationship between a dependent variable and one or more independent variables. Simply put, regression helps you predict the value of one variable based on information about other variables. Fascinating, isn't it? Using a simple technique and a few lines of code, it becomes very convenient to answer questions such as:
- How many likes will a profile picture get?
- What is the admission rate into a university?
So you get the flow, right? Now that you have a fair understanding of what regression can accomplish, let us dive into a few of its techniques. I would recommend some preliminary knowledge of the basic functions of R and of statistical analysis. In this post, I will be talking about multiple linear regression.
Multiple linear regression: Linear regression is the most basic and commonly used regression model for predictive analytics. Multiple linear regression involves finding the relationship between:
- A dependent (response) variable
- Two or more independent (explanatory or predictor) variables
For example, estimating a person's BMI (Body Mass Index) using the height and weight variables.
In general, a multiple linear regression model is as follows:
Y = a + a1X1 + a2X2 + … + anXn + e
where:
Y is the dependent variable
X1, X2, …, Xn represent the independent variables
a, a1, …, an represent fixed (but unknown) parameters
e is a random variable representing the errors, or residuals
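To make the notation concrete, here is a minimal simulated sketch in R: we generate data from a known equation and check that lm() recovers coefficients close to the true values. All the numbers and variable names below are invented purely for illustration.

```r
# Simulate data from Y = a + a1*X1 + a2*X2 + e with known parameters
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
e  <- rnorm(n, sd = 0.5)            # random error term
y  <- 2 + 3 * x1 - 1.5 * x2 + e     # true a = 2, a1 = 3, a2 = -1.5

fit <- lm(y ~ x1 + x2)              # fit a multiple linear regression
coef(fit)                           # estimates should land close to 2, 3 and -1.5
```

Because the data were generated from the model itself, the estimated coefficients will sit very near the true parameters; with real data, the error term is larger and the fit is never this clean.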
Implementing multiple regression model in R involves the following steps:
Step 1: Loading the input data
I have taken the "Prestige" dataset as input, which ships with the "car" package.
To load the input, type the following snippet:
library(car)
head(Prestige, 5)
Here, each row corresponds to a particular occupation.
Step 2: Refining the input:
As you can see in Fig 1, the dataset has a number of columns. However, for the sake of simplicity, I will focus only on the first 4 columns by subsetting the input as:
newdata = Prestige[, c(1:4)]
Here, education refers to the average number of years of education in each profession, income refers to the average income in each profession (in $), women refers to the percentage of women in each profession, and prestige refers to a prestige score for each profession.
Step 3: Analyzing the input:
Now that the input is loaded, the next step is to obtain a matrix plot to visualize the relationship between all the variables in a single image. To do this, type:
plot(newdata, pch=16, col="blue", main="Matrix Scatterplot of Income, Education, Women and Prestige")
You will see the following image on your screen:
You can make a number of inferences, such as the relationship between income and the percentage of women (third column from the left, second row from the top in Fig 3). As you can see, as the percentage of women increases, the average income in the profession declines.
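The visual impression from the matrix plot can be cross-checked numerically with cor(), which returns the pairwise correlations between all four variables (this assumes the car package is installed so that the Prestige data is available):

```r
library(car)                    # provides the Prestige dataset
newdata <- Prestige[, c(1:4)]

# pairwise correlations; income vs. women should come out negative,
# matching what the scatterplot suggested
round(cor(newdata), 2)
```

A negative entry in the income/women cell confirms the downward trend seen in the corresponding panel of the matrix plot.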
The aim of this model is to solve the following equation:
income = B0 + B1*education + B2*prestige + B3*women + e
Where B0 represents the intercept and B1, B2, B3 represent the coefficient values predicted by the regression model.
Step 4: Building a model using lm ()
In this step, once the dataset is loaded, the lm() command is used to build a statistical model. It takes as input a formula relating the dependent variable to the independent variables, along with the input data.
In this example,
Dependent/response variable – income
Independent/predictor variables – education, women and prestige
Before proceeding with building the model, let us center all the predictor variables at their means. This step is not mandatory, but it often makes the intercept easier to interpret and helps draw better inferences from the output.
education.c = scale(newdata$education, center=TRUE, scale=FALSE)
prestige.c = scale(newdata$prestige, center=TRUE, scale=FALSE)
women.c = scale(newdata$women, center=TRUE, scale=FALSE)
# bind these new variables to newdata and display a summary
new.c.vars = cbind(education.c, prestige.c, women.c)
newdata = cbind(newdata, new.c.vars)
names(newdata)[5:7] = c("education.c", "prestige.c", "women.c")
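The three scale() calls above can also be collapsed into one, and it is worth sanity-checking that the centered columns really do average to zero. A self-contained sketch (again assuming the car package is installed):

```r
library(car)                              # provides the Prestige dataset
newdata <- Prestige[, c(1:4)]

# center all three predictors in one scale() call
centered <- scale(newdata[, c("education", "prestige", "women")],
                  center = TRUE, scale = FALSE)
colnames(centered) <- c("education.c", "prestige.c", "women.c")
newdata <- cbind(newdata, centered)

# each centered column should have mean (numerically) zero;
# tiny values like 1e-16 are just floating-point noise
round(colMeans(newdata[, 5:7]), 10)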
Now that the vegetables are cut, it is time to put them together and cook the dish (Feel free to ignore the PJ):
mod1 = lm(income ~ education.c + prestige.c + women.c, data = newdata)
Step 5: Analyzing the model
This step involves analyzing the regression model to extract critical information such as the coefficient values, error values and the results of various hypothesis tests. The final equation is also built in this step from the obtained values of the coefficients and the intercept. The R code for this step is:
summary(mod1)
If you execute all these lines in your R console, you will get an output like this:
I am sure you are overwhelmed by the various fancy terms on your screen right now. Let me help you decipher the output.
Call: The first line of the output shows the formula used to build the regression model.
Residuals: This portion reports the minimum, 1st quartile (25%), median, 3rd quartile (75%) and maximum values of the residuals, i.e. the differences between the actual values of the variable you are predicting and the values predicted by the regression model.
Coefficients: The intercept value represents the estimated mean Y value when all the Xs are 0 (here, since we centered the predictors, when all predictors are at their means). The estimated value of each parameter is the coefficient that is fitted into the equation. Note that a larger coefficient does not by itself mean a more significant parameter; significance is judged by the t-value and p-value reported alongside each estimate.
Significance codes: These indicate the level of significance of the p-value for each coefficient.
Residual standard error: It represents how far, on average, the observed Y values are from the values predicted by the model.
Multiple R-squared: It signifies the proportion of variation in Y that can be explained by the X variables; here, approximately 91%.
F-statistic and p-value: These values report the result of the overall goodness-of-fit test for the model. A low p-value indicates that the predictors, taken together, explain a significant share of the variation in Y.
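All of the quantities discussed above can also be pulled out of the fitted model programmatically, rather than read off the printed summary. A sketch (shown with the uncentered predictors for brevity; centering changes only the intercept, not the slopes, so the conclusions are the same):

```r
library(car)                    # provides the Prestige dataset
newdata <- Prestige[, c(1:4)]
mod1 <- lm(income ~ education + prestige + women, data = newdata)

coef(mod1)                      # intercept and slope estimates
confint(mod1)                   # 95% confidence intervals for each coefficient
summary(mod1)$r.squared         # the multiple R-squared from the summary
summary(mod1)$sigma             # the residual standard error
```

Accessing these values directly is handy when the model is part of a larger script and you want to reuse the estimates instead of copying them by hand.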
So, this multiple regression model can be mathematically expressed as:
income = B0 + B1*education.c + B2*prestige.c + B3*women.c
where B0, B1, B2 and B3 take the values from the Estimate column of the summary output.
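Once the model is built, the natural next step is prediction. A sketch using predict() — the profession profile below is entirely made up for illustration, and uncentered predictors are used so the new values can be given on their original scales:

```r
library(car)                    # provides the Prestige dataset
newdata <- Prestige[, c(1:4)]
mod1 <- lm(income ~ education + prestige + women, data = newdata)

# predicted average income for a hypothetical profession with
# 12 years of education, a prestige score of 50 and 25% women
new_prof <- data.frame(education = 12, prestige = 50, women = 25)
predict(mod1, new_prof, interval = "prediction")
```

The interval = "prediction" argument returns a point estimate together with lower and upper bounds, which is usually more honest than reporting the point prediction alone.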
- Multiple regression example in R – http://www.stat.columbia.edu/~martin/W2024/R6.pdf
- To understand and interpret the output of the summary command in detail, visit: output of summary command
So folks, this was a brief overview of how to develop a multiple regression model using R. Do watch this space for the next blog, which will deal with polynomial regression.