3  Chapter 3 - Descriptive Statistics

Descriptive statistics is like the highlight reel of your data. Instead of staring at endless rows of numbers, we boil them down into quick, meaningful summaries that tell us the “story so far.”

It’s our first step in understanding a phenomenon: finding out what’s typical and how much things change.

💡 Imagine this:

Descriptive statistics is a set of summaries of raw data for one or several variables of a phenomenon. Summaries such as averages and measures of variability give us a first general idea of the phenomenon, with each variable representing a different aspect of it.

So, descriptive statistics is basically asking two big questions about your data:

  1. What’s the “typical” value of a variable? → Measures of central tendency

  2. How much does it vary? → Measures of dispersion

3.1 Central tendency measures

The main central tendency measures are:

  • Arithmetic mean

  • Median

  • Mode

3.1.1 Arithmetic mean

The arithmetic mean of a variable X is a simple measure that tells us the average of all valid values of X, assuming that each value has the same importance (or weight). The variable X can represent any attribute of a subject. A subject can be an individual, a group, a team, a business unit, a company, a financial portfolio, an industry, a region, a country, etc.

An example of a variable X is the monthly sales amount of a company for the last 3 years. In this case, the variable X will have 36 observations (36 monthly sales). The subject here is a company and the variable or attribute is the company sales over time. Another example is a variable that represents the daily returns of a financial portfolio over the last 2 years. In this case, the variable will have about 500 observations, considering 250 business days each year. The subject in this example is a financial portfolio, which might be composed of more than one stock and/or bond.

To calculate the arithmetic mean of a variable X we simply sum all the non-missing values of the variable and then divide that sum by the number of non-missing values. The calculation is as follows:

\bar{X}=\frac{\sum_{i=1}^{N}X_{i}}{N}

Where N is the number of non-missing values (observations) of X. A missing value happens when the variable X has no value for a specific observation. It is important to note that a missing value is not a zero value. When we work with real-world datasets, it is very common to find missing values in many variables.
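The difference between a missing value and a zero can be sketched in a few lines of pandas. The sales figures below are hypothetical and only illustrate the point: the mean is computed over the non-missing values, and replacing a missing value with zero gives a different answer.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales; the third month was never recorded (missing).
sales = pd.Series([120.0, 80.0, np.nan, 100.0])

# N in the formula: the number of non-missing values.
n_valid = sales.count()

# pandas skips missing values by default: (120 + 80 + 100) / 3
mean_skipping_missing = sales.mean()

# Treating the missing value as zero answers a different question:
# (120 + 80 + 0 + 100) / 4
mean_missing_as_zero = sales.fillna(0).mean()

print(n_valid, mean_skipping_missing, mean_missing_as_zero)
```

Here the first mean is 100 and the second is 75, so deciding how to treat missing values is part of the calculation, not a detail.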

One disadvantage of the arithmetic mean is that it is very sensitive to extreme values. If a variable has a few extreme values, the arithmetic mean might not be a good representation of an average or mid point. In the presence of a few very extreme values, the median is usually a better measure of central tendency than the arithmetic mean.

3.1.2 The Median

Another measure of central tendency is the median. The median of a variable is its 50th percentile, which is the mid point of its values when the values are sorted in ascending order. When we have an even number of observations, there will be 2 mid points, so the median will be equal to the arithmetic mean of these 2 mid points. When we have an odd number of observations there will be only 1 value in the middle, which is the median.
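The odd and even cases can be sketched as a small function. The function name `median_by_sorting` and the sample values are hypothetical, chosen only for illustration; in practice `numpy.median` does the same job.

```python
import numpy as np

def median_by_sorting(values):
    """Median as the mid point of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the 2 middle values

odd_sample = [7, 1, 5, 3, 9]           # sorted: 1, 3, 5, 7, 9
even_sample = [7, 1, 5, 3, 9, 11]      # sorted: 1, 3, 5, 7, 9, 11
with_extreme = [7, 1, 5, 3, 9, 1100]   # same as even_sample, but with one extreme value

print(median_by_sorting(odd_sample), np.median(odd_sample))
print(median_by_sorting(even_sample), np.median(even_sample))

# The extreme value moves the mean a lot, but leaves the median unchanged.
print(np.mean(even_sample), np.mean(with_extreme))
print(median_by_sorting(even_sample), median_by_sorting(with_extreme))
```

Replacing 11 with 1100 moves the mean from 6 to 187.5, while the median stays at 6.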

For example, if we want to know the typical size of all companies that trade shares in the Mexican stock market, we can calculate the median of firm size. These firms are called public firms. Firm size can be measured with different variables: the total value of the firm’s assets (total assets), its market value, or its number of employees. In this example we will use total assets at the end of 2018 for all public Mexican firms. At the end of 2018 there were 146 Mexican public firms in the Mexican stock exchange (“Bolsa Mexicana de Valores”, BMV). I will show how to calculate the median total assets of these 146 firms.

The 2018 total assets of the 146 Mexican public firms are shown below (sorted alphabetically):

Mexican firms in the BMV

Firm         Row #  Industry            2018 Total Assets (thousand pesos)
ACCEL            1  Services            $6,454,560.00
AEROMEXICO       2  Transport Services  $76,772,848.00
…
VOLARIS        148  Transport Services  $22,310,652.00
WALMART        149  Retail              $306,528,832.00

We sort the list from the lowest to the highest value of 2018 total assets:

Firm           Row #  Industry            Size Rank  2018 Total Assets (thousand pesos)
INGEAL            98  Food & Beverages            1  $171,104.00
HIMEXSA           88  Textile                     2  $494,378.00
…
FHIPO14           45  Real Estate                73  $27,979,184.00
TVAZTECA         139  Telecommunications         74  $27,988,054.00
…
AMERICA MOVIL      8  Telecommunications        145  $1,429,223,392.00
GFBANORTE         69  Financial Services        146  $1,620,470,400.00

The median total assets is the mid point of the list. However, in this case I have 146 firms, an even number, so there is no single mid point. I therefore calculate the arithmetic mean of the total assets of the 2 firms in the middle (positions 73 and 74). The median is then $27,983,619.00 thousand pesos (about 28 billion pesos), which is the average of the total assets of FHIPO14 and TVAZTECA. The arithmetic mean of total assets for the 146 firms is $97,860,896.23 thousand pesos (about 97.9 billion pesos), which is much bigger than the median. Which measure better represents the typical size of Mexican public firms? In this case it is the median, so we can say that the typical size of a Mexican public firm is about $28 thousand million pesos (28 billion pesos).

Then, what is the difference between the mean and the median? When the distribution of the values of a variable is very close to a normal distribution, the mean and the median will be very similar, so we can use either one to represent the typical value of the variable. When the variable has a few very extreme values, the distribution will not be similar to a normal distribution; it will have fat tails due to the presence of those extreme values. In this case the better measure of central tendency is the median, not the mean.

What is a normal distribution? It is a very common probability distribution of random variables. We will explain probability distributions in more detail later. For now, just consider that many variables across disciplines, and in nature, follow a close-to-normal distribution.

The median gives a better representation of the “average” value of a variable than the arithmetic mean when the distribution of the values does NOT follow a normal distribution. In the case of 2018 total assets we can explore its distribution using a histogram:

I will later explain in more detail what a histogram is.

In a histogram we see how often different ranges of values of a variable appear. This histogram does not look like that of a normally distributed variable. It is said to be “skewed” to the right, since there are very few firms with very high values of total assets. A normally distributed variable looks like a bell-shaped curve where most of the values are around the arithmetic mean. In this case, we can see that most of the firms (about 100 firms) have total assets between 0 and $25 thousand million pesos. Since there are 146 firms in total, only about 46 firms have assets higher than $25 thousand million (25 billion pesos). Actually, I can see that there are very few firms with assets greater than $1,000 thousand million (1 trillion pesos), and one above $1,500 thousand million (1.5 trillion pesos). Looking at the previous table, we can see that AMERICA MOVIL and GFBANORTE have assets greater than $1,400 thousand million (1.4 trillion pesos).

With the histogram we can see that most of the firms (about 68%, 100 out of 146) have assets of less than 25 billion pesos. The arithmetic mean of total assets is more than $97 billion pesos, and the median (the 50th percentile) is about $28 billion pesos. The arithmetic mean is very sensitive to extreme values, while the median is not. If we used the mean as a measure of the typical size of a Mexican public firm, we would be very far from the most common values of total assets. The better measure of typical size is therefore the median, about $28 billion pesos.
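The counting behind a histogram can be sketched with `numpy.histogram`. The values below are hypothetical firm sizes in billion pesos, not the BMV data, and the bin edges are an arbitrary choice made for illustration. The sketch counts how many values fall in each range and shows how a right-skewed variable pulls the mean above the median.

```python
import numpy as np

# Hypothetical firm sizes in billion pesos (NOT the BMV data from the text).
assets = np.array([1, 2, 3, 4, 6, 8, 12, 18, 30, 60, 250, 1500], dtype=float)

# Arbitrary bin edges; each bin is [left, right), except the last one,
# which also includes its right edge.
edges = [0, 25, 50, 100, 500, 2000]
counts, _ = np.histogram(assets, bins=edges)

# Text version of the histogram: one row per bin, one '#' per firm.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>5} to {right:<5} {'#' * int(count)} ({int(count)})")

print("mean:  ", assets.mean())
print("median:", np.median(assets))
```

Most of the hypothetical firms fall in the first bin, a long tail stretches to the right, and the mean (about 157.8) ends up far above the median (10).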

In sum, for skewed distributions the median is usually the better measure of central tendency, because the arithmetic mean is pulled toward the extreme values and no longer represents the central or typical value. For normally distributed variables the median will be very close to the mean, so the median is a good measure of central tendency in both cases.

Examples of business variables with a skewed distribution similar to total assets are employee salaries, the income of families in a region or country, and variables from the income statement such as firm sales or firm profits.

3.1.3 Mode

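The mode is the value that appears most often in a variable. It is most useful for discrete or categorical variables; for continuous variables, exact values rarely repeat, so a mode is only meaningful after grouping the values into ranges. The mode is rarely used as a central tendency measure.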
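As a minimal sketch, the mode of a discrete variable can be obtained with pandas. The industry labels and ratings below are hypothetical. Note that `Series.mode()` returns a Series rather than a single value, because a variable can have more than one mode when there is a tie.

```python
import pandas as pd

# Hypothetical industry labels for a handful of firms.
industry = pd.Series(["Retail", "Services", "Retail", "Textile", "Retail", "Services"])

print(industry.value_counts())    # how often each category appears
print(industry.mode().tolist())   # the most frequent category

# Hypothetical ratings where two values are tied as the most frequent.
ratings = pd.Series([1, 1, 2, 2, 3])
print(ratings.mode().tolist())    # both tied values are returned
```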

3.2 Dispersion measures

3.2.1 Variance and standard deviation

Standard deviation is used to measure how much, on average, the individual values of a variable deviate from the mean.

The variance of a variable X is the average of the squared deviations of each individual value X_i from the mean:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

Where:

X_i = Value i of the variable X

\bar{X}=\frac{1}{N}\sum_{i=1}^{N}X_{i} = Arithmetic average of X

Why is the variance the average of squared deviations? If we did not square the deviations, they would cancel each other out, since some deviations are positive and others negative; in fact, the deviations from the mean always sum to zero. Squaring is just a trick to avoid canceling the positive with the negative deviations.

The variance is expressed in squared units of X, so it is a number that our brain cannot easily interpret. To get a measure of deviation in the original units, we take the square root of the variance, which we can interpret roughly as the typical deviation of the points from their mean. This measure is called standard deviation:

SD(X)=\sqrt{Var(X)}= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}=\sigma_{X}
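The two formulas above can be followed step by step in NumPy. The values in `x` are hypothetical, chosen so the arithmetic is easy to check by hand; this sketch uses the population version (dividing by N), which is also NumPy’s default.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical values

mean = x.mean()
deviations = x - mean                 # -3, -1, -1, -1, 0, 0, 2, 4

# Without squaring, positive and negative deviations cancel out.
print(deviations.sum())

variance = (deviations ** 2).mean()   # average of squared deviations (divide by N)
sd = np.sqrt(variance)                # back to the original units of x

print(mean, variance, sd)
print(np.var(x), np.std(x))           # NumPy defaults (ddof=0) also divide by N
```

For these values the mean is 5, the variance is 4 and the standard deviation is 2.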

The variance can also be expressed as the expected value of squared deviations:

Var(X)=E[(X-\bar{X})^2]

Doing the multiplication of the squared term:

Var(X)=E[(X^2-X\bar{X}-\bar{X}X+\bar{X}^2)]

Since \bar{X} is a constant, I can take it out of the expectation:

Var(X)=E[X^2]-\bar{X}E[X]-\bar{X}E[X]+\bar{X}^2

Since E[X]=\bar{X}, then:

Var(X)=E[X^2]-\bar{X}^2

Then, the variance can be defined as the expected value of X squared minus its squared mean.

Also, we can express the variance of X as:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}\right)^2-\bar{X}^2
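As a quick numerical check of this identity, the sketch below computes the variance both ways on a small set of hypothetical values. Both lines use the population version (dividing by N), as in the derivation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # hypothetical values, mean = 4

# Definition: average of squared deviations from the mean.
var_definition = np.mean((x - x.mean()) ** 2)

# Shortcut: mean of the squares minus the square of the mean.
var_shortcut = np.mean(x ** 2) - x.mean() ** 2

print(var_definition, var_shortcut)
```

Both expressions give 10 here. The shortcut is convenient on paper, but in floating-point arithmetic it can lose precision when the mean is very large relative to the spread of the data, so the definition is usually the safer choice in code.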

Most Statistics books and Statistics software use (N-1) instead of N as the denominator of the variance formula. This measure is called sample variance. When we divide by N, we are calculating the population variance. For large N both formulas provide very similar results, but the sample variance will be a bit bigger than the population variance, so it is a more conservative value.

In Statistics, the sample variance is an unbiased estimator of the underlying (real) population variance when the data are a sample drawn from that population.

Then, we can re-write the formula for sample variance of X as:

Var(X)=\frac{1}{N-1}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

And the sample standard deviation of X can be written as:

SD(X)=\sqrt{Var(X)}=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}

SD(X)=\frac{\sqrt{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}}{\sqrt{N-1}}=\sigma_{X}
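In NumPy and pandas the choice between the two denominators is controlled by the `ddof` argument ("delta degrees of freedom"): the denominator is N - ddof. The sketch below uses hypothetical values. One thing to watch is that the defaults differ: NumPy divides by N unless told otherwise, while pandas divides by N-1.

```python
import numpy as np
import pandas as pd

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])   # hypothetical values, mean = 14
n = len(x)

pop_var = np.var(x, ddof=0)       # divide by N      -> 40 / 5
sample_var = np.var(x, ddof=1)    # divide by N - 1  -> 40 / 4

print(pop_var, sample_var)
print(np.sqrt(pop_var), np.sqrt(sample_var))   # the matching standard deviations

# Defaults differ between libraries.
print(np.var(x))                  # population variance (ddof=0)
print(pd.Series(x).var())         # sample variance (ddof=1)

# The two versions differ only by the factor N / (N - 1).
print(pop_var * n / (n - 1))
```

With only 5 observations the sample variance (10) is noticeably larger than the population variance (8); the gap shrinks as N grows.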

3.3 Illustrating measures of descriptive statistics

Let’s work with an interesting dataset about the Gross Domestic Product (GDP) per capita for all countries. GDP is the total monetary value of all final goods and services produced in a country in a given period, usually a quarter or a year. GDP per capita is GDP divided by the country’s population, that is, the average GDP per person.

The World Bank maintains a historical dataset about GDP and GDP per capita.

Let’s download the GDP per capita for all countries for 2024:

import re

import numpy as np
import pandas as pd
import requests
from scipy import stats

IND = "NY.GDP.PCAP.CD"   # GDP per capita (current US$)
YEAR = 2024              # pick a year

# World Bank API: the JSON response is a 2-element list (paging metadata, then the records)
url = f"https://api.worldbank.org/v2/country/all/indicator/{IND}?format=json&date={YEAR}:{YEAR}&per_page=20000"
response = requests.get(url, timeout=30)
response.raise_for_status()   # stop early if the request failed
j = response.json()
rows = [r for r in j[1] if r['value'] is not None]   # keep only non-missing values
gdp = pd.DataFrame([{"country": r["country"]["value"], "gdp_pc": float(r["value"])} for r in rows])

# Filter out regions/aggregates by name.
# re.escape makes each keyword match literally, so the parentheses in "(HIPC)"
# are not read as a regular-expression group.
# Note: this is substring matching, so it also drops countries whose names
# contain a keyword (for example "South Africa").
regional_keywords = ['Heavily indebted poor countries (HIPC)', 'Least developed countries: UN classification', 'Low & middle income', 'Middle income', 'Africa', 'Arab World', 'Caribbean', 'Central', 'Europe', 'Euro Area', 'Latin America', 'Middle East', 'North America', 'OECD', 'Small states', 'South Asia', 'Sub-Saharan Africa', 'Upper middle income', 'Lower middle income', 'High income', 'Low income', 'World', 'IDA', 'IBRD', 'East Asia', 'Pacific', 'Post-demographic', 'Pre-demographic', 'Early-demographic', 'Late-demographic', 'Fragile and conflict', 'Developing World']
pattern = '|'.join(re.escape(k) for k in regional_keywords)
gdp = gdp[~gdp['country'].str.contains(pattern, case=False)]

def quick_stats(x):
    """Central tendency and dispersion summaries of the non-missing values of x."""
    x = pd.Series(x).dropna()
    return pd.Series({
        "n": len(x),
        "mean": x.mean(),
        "median": x.median(),
        "trimmed_mean_20%": stats.trim_mean(x.to_numpy(), 0.2),   # drops lowest and highest 20%
        "std": x.std(ddof=1),                                     # sample standard deviation
        "IQR": x.quantile(0.75) - x.quantile(0.25),
        "MAD(normalized)": stats.median_abs_deviation(x, scale='normal'),
        "min": x.min(), "Q1": x.quantile(0.25), "Q3": x.quantile(0.75), "max": x.max()
    })

# Compare all countries with the same data after dropping the top 1% of values
pd.DataFrame({
    "GDP per capita": quick_stats(gdp["gdp_pc"]),
    "GDP per capita (drop top 1%)": quick_stats(gdp["gdp_pc"][gdp["gdp_pc"] <= gdp["gdp_pc"].quantile(0.99)])
})

                  GDP per capita  GDP per capita (drop top 1%)
n                     189.000000                    187.000000
mean                21515.496863                  19441.562027
median               8230.043115                   7919.208868
trimmed_mean_20%    12299.686008                  11893.942121
std                 32466.462560                  24497.414627
IQR                 26689.281151                  26572.411084
MAD(normalized)     10213.410689                   9752.567146
min                   219.424831                    219.424831
Q1                   2694.737809                   2680.306128
Q3                  29384.018960                  29252.717212
max                288001.433369                 137781.681659
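The table includes three summaries that were not defined earlier in the chapter: the trimmed mean, the interquartile range (IQR) and the normalized median absolute deviation (MAD). The sketch below shows, on a small hypothetical array with one extreme value, what each of them computes. The "by hand" lines are only an illustration of the idea, not the exact internals of SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical values with one extreme observation.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# 20% trimmed mean: drop the lowest 20% and the highest 20% of the values,
# then average what is left (here: 3, 4, 5, 6, 7, 8).
trimmed = stats.trim_mean(x, 0.2)
trimmed_by_hand = np.sort(x)[2:-2].mean()

# IQR: distance between the 75th and the 25th percentiles.
q1, q3 = np.quantile(x, [0.25, 0.75])
iqr = q3 - q1

# MAD: median of the absolute deviations from the median.
mad_raw = np.median(np.abs(x - np.median(x)))
# scale='normal' rescales the MAD so that it is comparable to a standard
# deviation when the data are normally distributed.
mad_normal = stats.median_abs_deviation(x, scale="normal")

print("mean:", x.mean(), "median:", np.median(x), "trimmed mean:", trimmed)
print("std:", x.std(ddof=1), "IQR:", iqr, "MAD (raw):", mad_raw, "MAD (normalized):", mad_normal)
```

In this small example the single extreme value inflates the mean and the standard deviation, while the median, the trimmed mean, the IQR and the MAD barely react. The GDP per capita table shows the same pattern: dropping the top 1% of countries changes the mean and the standard deviation much more than the median, the IQR or the MAD.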