4 Chapter 4 - Basic Data Treatments
Before we run any statistical or machine learning model, we need data that represent the phenomenon we want to model. We usually work with samples drawn from populations, measured on several variables. We need to prepare the data so that models can train and run efficiently, and so that we avoid bias from scale or encoding issues.
The most common data treatments/transformations for numerical variables are:
Scaling and Normalization
Standardization
Min-Max scaling
Robust scaling
Imputation for missing values
Imputation of outliers/extreme values
Winsorization: cap values at certain percentiles
Clipping: limit values to a fixed range
Mathematical transformations:
Logarithmic transformation
First (or annual) difference of log value (growth factor)
Power transformation - makes data more Gaussian-like
Polynomial features - adds interactions and higher-order terms
The most common data treatments/transformations for categorical variables are:
One-Hot Encoding: One binary column per category
Dummy Encoding: Similar to one-hot encoding, but one category (binary column) is dropped and used as the reference category. This is used in regression models to avoid multicollinearity
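A minimal sketch of the difference between the two encodings, using pandas (my choice of library, not one the chapter prescribes; the toy data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one categorical variable
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot encoding: one binary column per category (3 columns)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy encoding: drop the first category ("blue"), which becomes
# the reference category (2 columns)
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
```

In a regression, each dummy coefficient is then interpreted relative to the dropped reference category.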
From the previous data transformations, I will explain in more detail the ones most used in statistical modeling.
4.1 Normalization and Scaling
4.1.1 Standardization as normalization method
For each variable x_i, we normalize by calculating its corresponding z_i value. The z_i value is the number of standard deviations each value of x_i lies away from its mean \bar{x_i}:
z_i = \frac{(x_i - \bar{x_i})}{\sigma_{x_i}}
Where:
\bar{x_i}= arithmetic mean of x_i
\sigma_{x_i} is the standard deviation of x_i
The mean of any z_i variable will always be zero and its variance and standard deviation will always be equal to 1.
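A minimal sketch of standardization with NumPy (an assumption on my part; the chapter does not prescribe a library; the sample values are hypothetical):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical sample

# z-score: distance from the mean in units of standard deviations
z = (x - x.mean()) / x.std()

# By construction, z has mean 0 and standard deviation 1
```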
4.1.2 Min-Max scaling
x'= \frac{x-x_{min}}{x_{max}-x_{min}}
x' is a scaled version of x, and its value can be any number from 0 to 1 (inclusive)
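A minimal NumPy sketch of min-max scaling (hypothetical sample values):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])  # hypothetical sample

# Min-max scaling: the minimum maps to 0, the maximum to 1
x_scaled = (x - x.min()) / (x.max() - x.min())
# → [0.0, 0.25, 0.5, 1.0]
```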
4.1.3 Robust scaling
Similar to standardization, but instead of using the arithmetic mean, the median is used as a central measure, and instead of using the standard deviation, the interquartile range is used as a measure of dispersion:
IQR = Q_{75} - Q_{25}
where Q_{75} is the 75th percentile and Q_{25} is the 25th percentile.
The robust scale measure is calculated as:
x' = \frac{x-median(x)}{IQR}
Why use percentiles?
Median and IQR are robust statistics, since they are much less sensitive to extreme values than the mean and standard deviation
Outliers don’t distort the scaling
The transformed data typically has a median of zero and an IQR of 1
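A minimal NumPy sketch of robust scaling (the sample, including its outlier, is hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an outlier

q25, q75 = np.percentile(x, [25, 75])  # 2.0 and 4.0 here
iqr = q75 - q25

# Robust scaling: center with the median, scale with the IQR
x_robust = (x - np.median(x)) / iqr

# The outlier does not distort the center: the scaled median is 0
```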
4.2 Imputation
4.2.1 Imputation for missing values
Depending on the context, if a variable has missing values, it is possible to fill in a numeric value to avoid losing observations while maintaining the main patterns of the data. The most common imputation values for numeric variables are the mean, the median, or a linear regression/interpolation estimate.
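A minimal sketch of the three imputation options using pandas (my choice of library; the series is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])  # hypothetical series

mean_filled = s.fillna(s.mean())      # fills the NaN with 3.0
median_filled = s.fillna(s.median())  # fills the NaN with 3.0
interpolated = s.interpolate()        # linear interpolation: 3.0
```

In this symmetric example all three methods happen to produce the same value (3.0); on skewed data they will differ.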
4.2.2 Imputation for outliers/extreme values
4.2.2.1 Winsorization
Winsorization is the process of capping extreme (outlier) observations of a variable. The winsorization can be applied to high or to low (negative) values of the variable. For high values, we indicate the percentile above which we consider values to be outliers; values above that percentile are replaced with the value at that percentile. For low values, we indicate the percentile below which we consider values to be outliers; values below that percentile are replaced with the value at that percentile.
Winsorization is commonly used for independent variables of multiple regression models to avoid unreliable estimations of regression coefficients.
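The capping at both tails can be sketched with NumPy (the data and the 5th/95th percentile choice are hypothetical):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Winsorize at the 5th and 95th percentiles: values outside
# that range are replaced with the percentile value itself
lo, hi = np.percentile(x, [5, 95])
x_wins = np.clip(x, lo, hi)
```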
4.2.2.2 Clipping
Clipping is similar to winsorization, but instead of limiting values by percentiles, fixed minimum and maximum values are used to replace extreme values.
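A minimal NumPy sketch of clipping (the data and the [0, 100] bounds are hypothetical):

```python
import numpy as np

x = np.array([-10.0, 0.5, 3.0, 250.0])

# Clipping: fixed bounds instead of percentiles
x_clipped = np.clip(x, 0.0, 100.0)
# → [0.0, 0.5, 3.0, 100.0]
```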
4.3 Mathematical transformations
4.3.1 Logarithmic transformation
The logarithm of a variable is a very useful mathematical transformation for statistical analysis. Applying the natural log to a numeric variable reduces skewness and stabilizes variance compared to the original variable.
What is a natural logarithm?
The natural logarithm of a number is the exponent to which the number e (≈2.71…) must be raised to obtain that number. For example, let x be the natural logarithm of a stock price p. Then:
e^x = p
The way to get the value of x that satisfies this equality is to take the natural log of p:
x = log_e(p)
Then, we have to remember that the natural logarithm is actually an exponent: the power to which we need to raise the number e to get a specific number.
The natural log is the logarithm with base e (≈2.71…). The number e is an irrational number (it cannot be expressed as a ratio of two integers), and it is also called Euler's number. Leonhard Euler (1707-1783) took the idea of the logarithm from the great mathematician Jacob Bernoulli and discovered very astonishing properties of the number e. Euler is considered one of the most productive mathematicians of all time. Some historians believe that Jacob Bernoulli discovered the number e around 1690 while doing calculations to understand how an amount of money grows over time with an interest rate.
How is e related to the growth of any amount over time? It is mainly related to the concept of compounding.
Next I give an example of the effect of compounding when calculating percentage growth rates.
4.3.1.1 The effect of compounding in calculating percentage growth rates
Here is a simple example:
If I invest $100.00 today (t=0) with an annual interest rate of 50%, then the end balance of my investment at the end of the first year will be:
I_1=100*(1+0.50)=150
If the interest rate is 100%, then I would get:
I_1=100*(1+1)=200
Then, the general formula to get the final amount of my investment at the end of year 1 (the beginning of year 2), for any interest rate R, is:
I_1=I_0*(1+R)
The (1+R) is the growth factor of my investment.
In Finance, the investment amount is called principal. If the interests are calculated (compounded) each month instead of each year, then I would end up with a higher amount at the end of the year.
Monthly compounding means that a monthly interest rate is applied to the amount to get the interest of the month, and then the interest of the month is added to the investment (principal). Then, at the beginning of month 2 the principal will be higher than the initial investment. At the end of month 2 the interest will be calculated using the updated principal amount. Putting it in simple math terms, the final balance of an investment at the end of month 1 when doing monthly compounding will be:
I_1=I_0*\left(1+\frac{R}{12}\right)
We can do the same for month 2:
I_2=I_1*\left(1+\frac{R}{12}\right)
We can plug the calculation for I_1 in this formula to express I_2 in terms of the initial investment:
I_2=I_0*\left(1+\frac{R}{12}\right)\left(1+\frac{R}{12}\right)
We group the growth factor using an exponent:
I_2=I_0*\left(1+\frac{R}{12}\right)^{2}
We can see the pattern to calculate the end balance of the investment in month 12 when compounding monthly. The monthly interest rate equals the annual interest rate R divided by 12 (R/N, with N=12). Then, with an annual rate of 100% and monthly compounding (N=12), the end value of the investment will be:
I_{12}=100*\left(1+\frac{1}{12}\right)^{1*12}=100*(2.613..)
In this case, the growth factor is (1+1/12)^{12}, which is equal to 2.613.
Instead of compounding each month, if the compounding happens at every moment, then we are calculating a continuously compounded rate.
If we apply continuous compounding to the previous example, the growth factor for one year becomes the astonishing Euler number e:
Let’s do an example compounding every second (1 year has 31,536,000 seconds). The investment at the end of year 1 (or month 12) will be:
I_{12}=100*\left(1+\frac{1}{31536000}\right)^{1*31536000}=100*(2.718282..)\cong100*e^1
Now we see that e^1 is the GROWTH FACTOR after 1 year if we do the compounding of the interests every moment!
We can generalize to any other annual interest rate R, so that e^R is the growth factor for an annual nominal rate R when the interest is compounded every moment.
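The convergence of the growth factor toward e^R can be checked numerically. A short Python sketch, reusing the $100 principal and 100% rate from the examples above:

```python
import math

principal = 100.0
R = 1.0  # 100% annual nominal rate

# Annual, monthly, and every-second compounding
for n in [1, 12, 31_536_000]:
    balance = principal * (1 + R / n) ** n
    print(n, balance)  # 200.0, then ~261.30, then ~271.83

# Continuous compounding: the growth factor converges to e^R
print(principal * math.exp(R))  # ≈ 271.828
```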
When compounding at every instant, we use lowercase r instead of R for the interest rate. Then, the growth factor will be e^r.
Then we can relate this growth factor to an equivalent effective rate:
\left(1+EffectiveRate\right)=e^{r}
If we apply the natural logarithm to both sides of the equation:
ln\left(1+EffectiveRate\right)=ln\left(e^r\right)
Since the natural logarithm function is the inverse of the exponential function, then:
ln\left(1+EffectiveRate\right)=r
In the previous example with a nominal rate of 100% (r = 1), under continuous compounding the effective rate will be:
\left(1+EffectiveRate\right)=e^{r}=2.7182
EffectiveRate=e^{r}-1
Doing the calculation of the effective rate for this example:
EffectiveRate=e^{1}-1 = 2.7182.. - 1 = 1.7182 = 171.82\%
Then, when compounding every moment, starting with a nominal rate of 100% annual interest rate, the actual effective annual rate would be 171.82%!
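The effective-rate calculation and its inverse via the natural log can be sketched in Python:

```python
import math

r = 1.0  # continuously compounded nominal rate of 100%

# Effective annual rate implied by continuous compounding
effective = math.exp(r) - 1       # ≈ 1.7183, i.e. about 171.83%

# Taking logs inverts the relationship: ln(1 + effective) = r
r_back = math.log(1 + effective)  # ≈ 1.0
```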
4.3.2 First difference of the log - continuously compounded growth rate
One way to calculate continuously compounded (cc) growth rates is by subtracting the log of the previous value of the variable (at t-1) from the log of the current value (at t):
Let’s assume we want to know the growth rate of a stock price of a company. In this case, the growth rate of the price is called return.
r_{t}=log(price_{t})-log(price_{t-1})
This is also called the first difference of the log of the value (the price, in this example).
We can also calculate cc returns as the log of the ratio of the current price (at t) to the previous price (at t-1):
r_{t}=log\left(\frac{price_{t}}{price_{t-1}}\right)
cc returns are usually represented by lowercase r, while simple returns are represented by capital R.
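Both formulas for cc returns can be sketched with NumPy and confirmed to agree (the price series is hypothetical):

```python
import numpy as np

prices = np.array([100.0, 110.0, 105.0])  # hypothetical prices

# cc return: first difference of the log price
r = np.diff(np.log(prices))

# Equivalent: log of the ratio of consecutive prices
r_alt = np.log(prices[1:] / prices[:-1])

# r and r_alt are identical (up to floating-point error)
```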