
Logistic Regression

Logistic regression is a type of multiple regression whose general purpose is to analyze the relationship between several independent variables (also called regressors or predictors) and a dependent variable. Binary logistic regression is applied when the dependent variable is binary, that is, when it can take only two values. In other words, logistic regression is used to estimate the probability that an event occurs for a particular subject (sick or healthy, loan repaid or defaulted, and so on).

All regression models can be written as a formula:

$$y = F(x_1, x_2, \dots, x_n)$$

In multiple linear regression, the dependent variable is a linear function of the independent variables:

$$y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

In principle, this regression can be used to estimate the probability of an event outcome once the standard regression coefficients have been calculated. For example, when modeling the outcome of a loan, define a variable y that takes the value 1 if the loan is repaid and 0 if the borrower defaults. The problem is that multiple regression does not take into account that the response variable is binary. As a result, the model can produce predicted values greater than 1 or less than 0, which are not valid for the original problem. Multiple regression simply ignores the restriction on the range of values of y.
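The range problem can be illustrated with a minimal sketch. The data below is hypothetical (a single predictor x and a binary outcome y), and the helper names `fit_simple_ols` and `predict_linear` are introduced only for this illustration. The code fits an ordinary least-squares line to the 0/1 outcomes and evaluates it at the ends of the observed range.

```python
# Hypothetical data: x is a single predictor,
# y is 1 if the loan was repaid and 0 if it defaulted.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the sample covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


a, b = fit_simple_ols(xs, ys)


def predict_linear(x):
    """Predicted value of the linear model at x (not restricted to [0, 1])."""
    return a + b * x


for x in (1, 8):
    print(x, round(predict_linear(x), 3))
```

With this data the fitted line falls below 0 at x = 1 and rises above 1 at x = 8, so the predictions cannot be read as probabilities.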

To solve this problem, the task can be formulated differently: instead of predicting the binary variable itself, the model predicts a continuous variable whose values lie in the interval [0, 1] for any values of the independent variables. This is achieved by applying the following regression equation (logit transformation):

$$P = \frac{1}{1 + e^{-y}}$$

where:

P. The probability that the event of interest occurs.
e. The base of the natural logarithm, approximately 2.71….
y. The value of the standard regression equation, that is, the linear function of the independent variables.
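The transformation can be sketched in a few lines, assuming the standard logistic form P = 1 / (1 + e^(-y)). The function names and the coefficients `a` and `b` below are hypothetical values chosen for illustration, not estimates from real data. The sketch shows that any value of the linear predictor y, however large or small, maps to a value between 0 and 1.

```python
import math


def logistic(y):
    """Map a linear predictor y to P = 1 / (1 + e^(-y))."""
    # Two algebraically equivalent branches avoid overflow in math.exp
    # when y is a large negative or large positive number.
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    z = math.exp(y)
    return z / (1.0 + z)


# Hypothetical coefficients of the standard regression equation y = a + b*x.
a, b = -4.5, 1.0


def probability_of_event(x):
    """Probability that the event occurs for a subject with predictor value x."""
    return logistic(a + b * x)


for x in (-10, 0, 4.5, 10, 100):
    print(x, probability_of_event(x))
```

When the linear predictor equals 0 (here at x = 4.5), the probability is exactly 0.5. Larger values of y push P toward 1, and smaller values push it toward 0.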