Linear and Logistic Regression
In this section we will learn about linear regression and logistic regression. These are two fundamental statistical modeling techniques that are commonly used by researchers in many different fields to test theoretical models of the relationship between a set of independent variables (variables that are thought to be causally related to another variable, called the dependent variable) and a dependent variable (the variable that is hypothesized to be influenced by the independent variables).
The fundamental difference between linear and logistic regression is that in linear regression the dependent variable is a numeric variable, whereas in logistic regression the dependent variable is a categorical variable. For example, if we were building a model to test a proposed relationship between the independent variables advertising expenditures and investment in research and development and the dependent variable product sales we would use linear regression because the dependent variable, product sales, is a numeric quantity. On the other hand, if we were building a model to test the relationship between the independent variables credit score and yearly income and a dependent variable that represents whether a borrower defaults on a loan we would use logistic regression because the dependent variable is a categorical variable consisting of two categories: will default and won’t default.
The way in which we have described linear and logistic regression models being used so far can be thought of as explanatory modeling; the purpose of an explanatory model is to explain the relationship between the independent variables and the dependent variable. For example, the linear regression models we run will produce estimates of how much the dependent variable changes in response to a one-unit change in each of the independent variables when all other variables are held constant. In other words, for the regression model proposed above that has advertising expenditures and research and development investment as independent variables and product sales as the dependent variable the regression model will estimate how much a one unit increase in advertising expenditures affects product sales and how much a one unit increase in research and development investment affects product sales. Moreover, the linear regression model will also conduct a formal hypothesis test for each independent variable in which the null hypothesis is that the variable is actually unrelated to the dependent variable (and thus any relationship estimated by the regression model is the result of random chance).
You can see how such a model could be used for testing hypotheses about the causal relationship between the independent and dependent variables. If our hypothesis is that both advertising expenditures and research and development investments will lead to higher product sales, we can test our hypotheses by building and running a regression model with advertising expenditures and research and development investments as the independent variables and product sales as the dependent variable. If the linear regression model estimates a positive relationship between each independent variable and the dependent variable, and if the hypothesis test for each independent variable rejects the null hypothesis that the independent variable is actually unrelated to the dependent variable, then our hypotheses are supported by the linear regression model. However, if the relationship between one of the variables and product sales is found to be negative, or if the hypothesis test for one of the independent variables fails to reject the null hypothesis that the variable is unrelated to the dependent variable, then our hypothesis for that variable is not supported.
The equations produced by linear regression and logistic regression relate the values of the independent variables to the values of the dependent variables. Because of this these models may also be used to make predictions about the value of the dependent variable depending on the values of the independent variables. Using models to make predictions is called predictive modeling. We will take a closer look at how linear and logistic regression models are used for predictive modeling in the next section of the book, which covers machine learning for predictive analytics.