Logistic regression

Hi all, in my previous blog I wrote about creating a simple linear regression model for prediction. Now let's learn about logistic regression. It is used in scenarios where there are one or more independent variables and only two outcomes are possible, like true or false..!!

Note: Logistic regression predicts only the probability of a certain outcome rather than the outcome itself.
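The note above can be made concrete with the logistic (sigmoid) function, which squeezes any linear combination of the predictors into a probability between 0 and 1. A minimal sketch (the eta values below are illustrative, not taken from the GRE data):

```r
# The logistic function maps a linear predictor eta = b0 + b1*x1 + ...
# to a probability in (0, 1): p = 1 / (1 + exp(-eta))
eta  <- c(-4, -1, 0, 1, 4)      # illustrative linear-predictor values
prob <- 1 / (1 + exp(-eta))     # equivalent to plogis(eta)
round(prob, 3)
# an eta of 0 corresponds to a probability of exactly 0.5
```

This is why the model's raw output is always a probability: whatever value the linear predictor takes, the logistic function keeps the result strictly between 0 and 1.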

Let's create a logistic regression model using R.

I am trying to predict the probability of a student getting admission to an institute based on various parameters.

Dataset: you can get the data from the following link.

Step 1: Load the data and run numerical summaries.

mydata <- read.csv("GRE.csv")

head(mydata)

[Image: head data.png — first few rows of the data]

Note that the data set has a binary response variable called admit and three predictor variables named gre, gpa and rank.
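Beyond head, a couple of quick numerical summaries help sanity-check the data before modelling. A sketch using a simulated stand-in for the GRE data (the real analysis would use mydata loaded from GRE.csv; only the column names match the post, the values are made up):

```r
# Simulated stand-in with the same columns as GRE.csv: admit, gre, gpa, rank
set.seed(1)
mydata <- data.frame(
  admit = rbinom(400, 1, 0.3),                      # binary response
  gre   = round(rnorm(400, mean = 580, sd = 115)),  # test score
  gpa   = round(runif(400, min = 2.3, max = 4.0), 2),
  rank  = sample(1:4, 400, replace = TRUE)          # institute rank
)
summary(mydata)                       # per-column numerical summaries
sapply(mydata, sd)                    # standard deviation of each column
xtabs(~ admit + rank, data = mydata)  # cross-tab: admissions by rank
```

The cross-tab is a useful check that every rank level has both admitted and rejected students; a cell with a zero count can make the model unstable.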

Step 2: Fit a logistic regression model.

R must be told that rank is a categorical variable since it takes only discrete values.

mydata$rank <- factor(mydata$rank)

The glm command is used to fit a logistic regression model.

mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

The family is specified as binomial because the output is either 0 or 1.

summary(mylogit)

[Image: summary.png — output of summary(mylogit)]

To understand the above output, note the following:

  • Null deviance shows how well the response variable is predicted by a model that includes only the intercept. In our example it is 499.98 on 399 degrees of freedom.
  • Residual deviance is 458.52 on 394 degrees of freedom, i.e. after adding the independent variables there is a loss of 5 degrees of freedom (one each for gre and gpa, and three for the dummy variables encoding rank).
  • The Akaike Information Criterion (AIC) assesses the quality of a model through comparison with related models; when choosing among several models, prefer the one with the smallest AIC value.
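One practical use of these numbers: the drop in deviance from the null model to the fitted model can be compared to a chi-squared distribution with degrees of freedom equal to the degrees of freedom lost, giving an overall significance test for the model. A sketch using the values reported above:

```r
# Deviances reported in the summary output above
null_dev  <- 499.98; null_df  <- 399  # intercept-only model
resid_dev <- 458.52; resid_df <- 394  # model with gre, gpa and rank
# Overall model test: is the drop in deviance larger than chance?
p_value <- pchisq(null_dev - resid_dev,
                  df = null_df - resid_df,
                  lower.tail = FALSE)
p_value  # well below 0.05, so the predictors improve on the null model
```

The same comparison can be obtained directly from a fitted model with anova(mylogit, test = "Chisq").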

Step 3: Use the fitted model to make predictions.

To predict the probabilities we first create a data frame. Let's start by holding gre and gpa at their means.

newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))

Now, using the predict function, a new column rankP is added to the above data frame, holding the predicted probabilities.

newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")

[Image: predict.png — newdata1 with the predicted probabilities]

The values of rankP suggest that students from a rank 1 institute are more likely to get admission than students from any other institute.
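Since the model returns probabilities rather than a yes/no answer (see the note at the top), a hard classification requires choosing a cutoff. A sketch with illustrative probabilities, not the actual rankP values (0.5 is a common but not mandatory threshold):

```r
# Illustrative predicted probabilities, one per institute rank
probs <- c(0.52, 0.35, 0.24, 0.18)
# Threshold at 0.5 to turn probabilities into admit / reject labels
labels <- ifelse(probs > 0.5, "admit", "reject")
labels
```

The threshold can be moved away from 0.5 when the costs of false positives and false negatives differ.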

In this prediction we kept gre and gpa at their means; now let's try it another way, i.e. by generating a range of values for gre and predicting again.

newdata2 <- with(mydata, data.frame(
  gre  = rep(seq(from = 200, to = 800, length.out = 100), 4),
  gpa  = mean(gpa),
  rank = factor(rep(1:4, each = 100))
))

Here type = "link" returns predictions on the log-odds scale along with their standard errors; plogis converts the fitted values and the 95% confidence bounds back to probabilities.

newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link", se = TRUE))
newdata3 <- within(newdata3, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})

Finally, the first few records with the predicted probabilities are as shown.

[Image: final predict.png — first few rows of newdata3]

The plot of the probabilities is obtained using the ggplot function from the ggplot2 package.

ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
  geom_line(aes(colour = rank), size = 1)

[Image: gg plot.png — predicted probability curves by rank with confidence ribbons]

References:

Data analysis using Logistic regression

Introduction to Logistic regression
