Hi all, in my previous blog I wrote about creating a simple linear regression model for prediction. Now, let's learn about **Logistic regression**. It is used in scenarios where there are **one or more** independent variables and **only two** possible outcomes, like true or false!

**Note:** Logistic regression predicts the **probability** of a certain outcome rather than the outcome itself.
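Under the hood, the model passes a linear combination of the predictors through the logistic (sigmoid) function, which squashes any real number into the (0, 1) range. A minimal sketch in R; the coefficient values here are made up purely for illustration:

# logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# hypothetical linear predictor: intercept + b1*gre + b2*gpa (made-up coefficients)
z <- -3 + 0.004 * 600 + 0.8 * 3.5
sigmoid(z)    # equivalent to base R's plogis(z)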

Let's create a logistic regression model using R.

I am trying to predict the probability of a student getting admission into an institution based on various parameters.

Dataset: the file GRE.csv used below contains one row per applicant, with the columns admit, gre, gpa, and rank.
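This looks like the well-known graduate admissions dataset distributed by UCLA's statistical consulting group; assuming that is indeed the source (an assumption, since the original link is not available here), you can also read it straight from the web:

# assumption: the data is UCLA IDRE's admissions dataset (binary.csv)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")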

**Step 1**: Load the data and run numerical summaries.

mydata <- read.csv("GRE.csv")

head(mydata)

Note that the data set has a binary response variable called admit and three predictor variables named gre, gpa, and rank.
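Step 1 also calls for numerical summaries, and base R covers these. A quick sketch:

summary(mydata)                        # quartiles and means for every column
sapply(mydata, sd)                     # standard deviation of each variable
xtabs(~ admit + rank, data = mydata)   # contingency table of admission by rank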

**Step 2**: Fit a **logistic regression** model.

R must be told that rank is a categorical variable, since it takes only discrete values.

mydata$rank <- factor(mydata$rank)
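You can confirm the conversion worked before fitting the model:

str(mydata$rank)       # should now report a factor with 4 levels: "1".."4"
levels(mydata$rank)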

The **glm** function is used to fit the logistic regression model.

mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

The family is specified as **binomial** because the output is either 0 or 1.

summary(mylogit)
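The coefficients reported by summary are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. A short sketch (confint.default gives the simpler Wald-type confidence intervals):

# odds ratios with Wald 95% confidence intervals
exp(cbind(OR = coef(mylogit), confint.default(mylogit)))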

To understand the rest of the summary output, follow the links in the references below. The key quantities are:

- **Null deviance** shows how well the response variable is predicted by a model that includes only the intercept. In our example it is 499.98 on 399 degrees of freedom.
- **Residual deviance** is 458.52 on 394 degrees of freedom, i.e. adding the independent variables costs 5 degrees of freedom (one each for gre and gpa, and three for the dummy variables of the four-level factor rank).
- The Akaike Information Criterion (**AIC**) provides a method for assessing the quality of a model through comparison of related models; when comparing several models, go with the one that has the **smallest** AIC value.
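The drop in deviance can itself be turned into an overall significance test for the model, a standard chi-square test on the deviance difference:

with(mylogit, null.deviance - deviance)    # 499.98 - 458.52
with(mylogit, df.null - df.residual)       # 399 - 394 = 5

# p-value for the overall model fit
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))

AIC(mylogit)    # the same AIC value reported by summary(mylogit)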

**Step 3**: Use the fitted model to make predictions.

To predict the probabilities, we first create a data frame of new observations. Let's start by **holding gre and gpa at their means** while letting rank vary.

newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))

Now, using the predict function, a new column rankP is created in the above data frame to hold the predicted probabilities.

newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")

The values of rankP suggest that students from a rank 1 institution are more likely to get admission than students from any other institution.

In that prediction we kept gre and gpa at their means; now let's try it the other way, i.e. **by generating a range of values for gre** and predicting again.

newdata2 <- with(mydata, data.frame(
  gre = rep(seq(from = 200, to = 800, length.out = 100), 4),
  gpa = mean(gpa),
  rank = factor(rep(1:4, each = 100))
))

newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link", se.fit = TRUE))

newdata3 <- within(newdata3, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})

Finally, the first few records with the predicted probabilities look as shown below.
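A quick way to inspect them:

head(newdata3)    # first rows, including fit, se.fit, LL, UL, and PredictedProb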

The **plot of the probabilities**, with confidence bands, is obtained using the ggplot function from the ggplot2 package.

library(ggplot2)

ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
  geom_line(aes(colour = rank), size = 1)

References:

- Data Analysis Using Logistic Regression
- Introduction to Logistic Regression