Chapter 13 - Categorical predictors and non-linear effects in Regression Models
In this chapter we cover the following types of multiple regression models:
Regression with categorical predictors (X variables)
Interaction effects in multiple regression
Quadratic effects in multiple regression
13.1 Introduction
Until now we have included only continuous independent (explanatory) variables in our regression models. A categorical variable is usually a non-numeric variable that represents categories or groups into which observations can be classified. When we have a categorical variable, we need to code it in a special way before we include it in the regression model.
A categorical variable might have no numeric or ranking meaning, but it is useful for classification. For example, for a sample of companies, industry is a categorical variable: we cannot sum or average its values; we can only count how many firms belong to each industry.
Nevertheless, some numeric variables can also be treated as categorical. For example, year can enter a regression either as a numeric variable or as a categorical variable.
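To illustrate the two treatments of year, here is a minimal sketch with pandas (the `year` and `sales` columns are hypothetical, not from the chapter's dataset): as a numeric variable, `year` enters the regression directly; as a categorical variable, it is expanded into one dummy column per year.

```python
import pandas as pd

# Hypothetical panel with a year column
data = pd.DataFrame({'year': [2020, 2021, 2022, 2020],
                     'sales': [10.0, 12.0, 15.0, 9.0]})

# Numeric treatment: use data['year'] directly as a regressor.
# Categorical treatment: expand year into one dummy column per year
year_dummies = pd.get_dummies(data['year'], prefix='year')
print(year_dummies.columns.tolist())
```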
When we are interested in evaluating the effect of one numeric variable on a dependent variable, it is recommended to include control variables to make our results more robust. A control variable is a variable that might not be the focus of study, but in previous research or analysis, this control variable has been shown to be related to the dependent variable of study. Categorical variables usually are included in a regression as control variables.
If we run a regression to examine whether one numeric explanatory (independent) variable has an effect on the dependent variable, and we include one or two control variables, then a significant relationship between the explanatory variable and the dependent variable holds even after considering the effect of the controls. In this case, our result is more robust, and we have stronger statistical evidence about the effect of the independent variable on the dependent variable.
Now we will use categorical variables as control variables in multiple regression models. Before doing this, I will briefly explain what we need to do before we include a categorical variable in a regression model.
Imagine a data set where each observation has several numeric variables, which represent features of a company, and one categorical variable that classifies the firms into two groups: manufacturing and non-manufacturing companies. We cannot include this categorical variable directly as an X variable since it is not numeric. We need to code a dummy variable for it. A dummy variable is a variable that only takes two values: 0 or 1. We can code the dummy variable by assigning 1 to manufacturing firms and 0 to non-manufacturing firms (or vice-versa).
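This dummy coding can be sketched in pandas as follows (the `firms` data frame and its `industry` column are hypothetical, just to show the pattern):

```python
import pandas as pd

# Hypothetical sample of firms with a categorical industry variable
firms = pd.DataFrame({
    'firm': ['A', 'B', 'C', 'D'],
    'industry': ['manufacturing', 'services', 'manufacturing', 'retail']
})

# Dummy variable: 1 for manufacturing firms, 0 for non-manufacturing firms
firms['manuf_dummy'] = (firms['industry'] == 'manufacturing').astype(int)
print(firms)
```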
Let’s do an example.
The data set d1 is a random sample of 60 students with the following variables:
Y_grade: Grade of a test
X1_hrs: Number of hours dedicated to study for a specific test
X2_method: Whether the student followed a specific didactic method based on writing, or did not follow a didactic method
Downloading the dataset:
import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

# Download the first file
url1 = 'https://www.apradie.com/datos/ch11_d1.csv'
response1 = requests.get(url1, headers=headers)
with open('ch11_d1.csv', 'wb') as f:
    f.write(response1.content)

# Import as a data frame
df = pd.read_csv('ch11_d1.csv')

# Display the first few rows of the data
print(df.head())
# Display the last few rows of the data
print(df.tail())
We can do a scatter plot to see the relationship between Y and X1:
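A minimal sketch of such a scatter plot with matplotlib is shown below. To keep the example self-contained it uses a small hypothetical stand-in for the student data; when working with the downloaded file, replace the stand-in with `df = pd.read_csv('ch11_d1.csv')`.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line when running interactively
import matplotlib.pyplot as plt

# Hypothetical stand-in for the student data (replace with the real df)
df = pd.DataFrame({
    'X1_hrs':  [2, 4, 6, 8, 10],
    'Y_grade': [55, 62, 70, 81, 90],
})

# Scatter plot of test grade (Y) against hours of study (X1)
fig, ax = plt.subplots()
ax.scatter(df['X1_hrs'], df['Y_grade'])
ax.set_xlabel('Hours of study (X1_hrs)')
ax.set_ylabel('Test grade (Y_grade)')
ax.set_title('Grade vs. hours of study')
fig.savefig('scatter_grade_vs_hours.png')
```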
Pending:
Explain how to do the dummy coding and why one-hot coding creates multicollinearity
Equations for each categorical group
Example with price elasticity controlling for seasonality
Interaction effects
What happens when we believe that each group not only has a different intercept, but also that the effect of an important X variable differs across groups?
Example with finance dataset: effect of size and interaction between size and earnings per share on firm productivity
Equations for each categorical group
Quadratic effects
Modelling quadratic effects with a linear model: just square an X variable