Chapter 13 - Categorical predictors and non-linear effects in Regression Models
In this chapter we cover the following types of multiple regression models:
Regression with categorical predictors (X variables)
Interaction effects in multiple regression
Quadratic effects in multiple regression
13.1 Introduction
Until now we have included only continuous independent (explanatory) variables in our regression models. A categorical variable is usually a non-numeric variable that represents categories or groups into which observations can be classified. When we have a categorical variable, we need to code it in a special way before we include it in the regression model.
A categorical variable might have no numeric or ranking meaning, but it is useful for classification. For example, for a sample of companies, industry is a categorical variable: we cannot sum or average its values; we can only count how many firms belong to each industry.
Nevertheless, some numeric variables can also be treated as categorical. For example, year can enter a regression either as a numeric variable or as a categorical variable.
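To illustrate the two treatments of year, here is a minimal sketch with pandas (the `year` and `sales` columns are hypothetical, not from the chapter's dataset): as a numeric variable, `year` enters the regression directly; as a categorical variable, it is expanded into one dummy column per year.

```python
import pandas as pd

# Hypothetical panel with a year column
data = pd.DataFrame({'year': [2020, 2021, 2022, 2020],
                     'sales': [10.0, 12.0, 15.0, 9.0]})

# Numeric treatment: use data['year'] directly as a regressor.
# Categorical treatment: expand year into one dummy column per year
year_dummies = pd.get_dummies(data['year'], prefix='year')
print(year_dummies.columns.tolist())
```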
When we are interested in evaluating the effect of one numeric variable on a dependent variable, it is recommended to include control variables to make our results more robust. A control variable is a variable that might not be the focus of study, but in previous research or analysis, this control variable has been shown to be related to the dependent variable of study. Categorical variables usually are included in a regression as control variables.
If we run a regression to examine whether one numeric explanatory (independent) variable has an effect on the dependent variable, and we include one or two control variables, then a significant relationship between the explanatory variable and the dependent variable holds even after considering the effect of the controls. In this case, our result is more robust, and we have stronger statistical evidence about the effect of the independent variable on the dependent variable.
Now we will use categorical variables as control variables in multiple regression models. Before doing this, I will briefly explain what we need to do before we include a categorical variable in a regression model.
Imagine a data set where each observation has several numeric variables, which represent features of a company, and one categorical variable that classifies the firms into two groups: manufacturing and non-manufacturing companies. We cannot include this categorical variable directly as an X variable since it is not numeric. We need to code a dummy variable for it. A dummy variable is a variable that only takes two values: 0 or 1. We can code the dummy variable by assigning 1 to manufacturing firms and 0 to non-manufacturing firms (or vice-versa).
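This dummy coding can be sketched in pandas as follows (the `firms` data frame and its `industry` column are hypothetical, just to show the pattern):

```python
import pandas as pd

# Hypothetical sample of firms with a categorical industry variable
firms = pd.DataFrame({
    'firm': ['A', 'B', 'C', 'D'],
    'industry': ['manufacturing', 'services', 'manufacturing', 'retail']
})

# Dummy variable: 1 for manufacturing firms, 0 for non-manufacturing firms
firms['manuf_dummy'] = (firms['industry'] == 'manufacturing').astype(int)
print(firms)
```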
Let’s do an example.
The data set d1 is a random sample of 60 students with the following variables:
Y_grade: Grade of a test
X1_hrs: Number of hours dedicated to study for a specific test
X2_method: Whether the student followed a specific didactic method based on writing, or did not follow a didactic method
Downloading the dataset:
import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

# Download the first file
url1 = 'https://www.apradie.com/datos/ch11_d1.csv'
response1 = requests.get(url1, headers=headers)
with open('ch11_d1.csv', 'wb') as f:
    f.write(response1.content)

# Import as a data frame
df = pd.read_csv('ch11_d1.csv')

# Display the first few rows of the data
print(df.head())
# Display the last few rows of the data
print(df.tail())
We can do a scatter plot to see the relationship between Y and X1:
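A minimal sketch of such a scatter plot with matplotlib is shown below. To keep the example self-contained it uses a small hypothetical stand-in for the student data; when working with the downloaded file, replace the stand-in with `df = pd.read_csv('ch11_d1.csv')`.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line when running interactively
import matplotlib.pyplot as plt

# Hypothetical stand-in for the student data (replace with the real df)
df = pd.DataFrame({
    'X1_hrs':  [2, 4, 6, 8, 10],
    'Y_grade': [55, 62, 70, 81, 90],
})

# Scatter plot of test grade (Y) against hours of study (X1)
fig, ax = plt.subplots()
ax.scatter(df['X1_hrs'], df['Y_grade'])
ax.set_xlabel('Hours of study (X1_hrs)')
ax.set_ylabel('Test grade (Y_grade)')
ax.set_title('Grade vs. hours of study')
fig.savefig('scatter_grade_vs_hours.png')
```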
Pending:
Explain how to do the dummy coding and why one-hot coding creates multicollinearity
Equations for each categorical group
Example with price elasticity controlling for seasonality
Interaction effects
What happens when we believe that each group not only has a different intercept, but also that the effect of an important X variable differs across groups?
Example with finance dataset: effect of size and interaction between size and earnings per share on firm productivity
Equations for each categorical group
Quadratic effects
Modelling quadratic effects with a linear model: just square an X variable