3 Chapter 3 - Descriptive Statistics

Descriptive statistics is like the highlight reel of your data. Instead of staring at endless rows of numbers, we boil them down into quick, meaningful summaries that tell us the “story so far.”

It’s our first step in understanding a phenomenon: finding out what’s typical and how much things change.

💡 Imagine this:

In Economics, you want to know how a country’s economy has been doing over the last decade. You could calculate the average annual GDP growth to see the overall trend.
In Finance, you might look at the average yearly return of an investment over 5 years, and check how bumpy the ride was by measuring the variability of those returns.

More formally, descriptive statistics is a set of summaries of raw data for one or several variables of a phenomenon. These summaries, such as averages and measures of variability, give us a first general idea of the phenomenon by describing different aspects of it.

So, descriptive statistics is basically asking two big questions about your data:

What’s the “typical” value of a variable? → Measures of central tendency
How much does it vary? → Measures of dispersion

3.1 Central tendency measures

The main central tendency measures are:

Arithmetic mean
Median
Mode

3.1.1 Arithmetic mean

The arithmetic mean of a variable X is a simple measure that tells us the average of all valid values of X, assuming that each value has the same importance (or weight). The variable X can represent any attribute of a subject. A subject can be an individual, a group, a team, a business unit, a company, a financial portfolio, an industry, a region, a country, etc.

An example of a variable X is the monthly sales amount of a company for the last 3 years. In this case, the variable X has 36 observations (36 monthly sales). The subject here is a company and the variable or attribute is the company's sales over time. Another example is a variable that represents the daily returns of a financial portfolio over the last 2 years. In this case, the variable has about 500 observations, assuming about 250 business days per year. The subject in this example is a financial portfolio, which might be composed of more than one stock and/or bond.

To calculate the arithmetic mean of a variable X we simply sum all the non-missing values of the variable and then divide the sum by the number of non-missing values. The calculation is as follows:

\bar{X}=\frac{\sum_{i=1}^{N}X_{i}}{N}

Where N is the number of non-missing values (observations) of X. A missing value happens when the variable X has no value for a specific observation. It is important to note that a missing value is not a zero value. When we work with real-world datasets, it is very common to find missing values in many variables.
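
As a small illustration of the formula, and of the difference between a missing value and a zero, the sketch below uses made-up monthly sales figures (the numbers and variable names are hypothetical and do not come from any dataset in this chapter). A missing month is marked with `math.nan` and is excluded from both the sum and the count N, while the month with zero sales is kept.

```python
import math

# Hypothetical monthly sales (in thousands); math.nan marks a missing month
sales = [120.0, 135.0, math.nan, 150.0, 0.0, 145.0]

# Keep only the non-missing observations; note that 0.0 is a valid value
valid = [x for x in sales if not math.isnan(x)]

n_valid = len(valid)                  # N in the formula
mean_sales = sum(valid) / n_valid     # arithmetic mean of non-missing values

# Treating the missing month as a zero would change the answer
mean_if_zero = sum(valid) / len(sales)

print(n_valid, mean_sales, round(mean_if_zero, 2))
```

Here the mean over the 5 valid months is 110.0, while wrongly counting the missing month as a zero would pull it down to about 91.67. In pandas, `Series.mean()` skips missing values by default, which matches the definition above.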

One of the disadvantages of the arithmetic mean is that it is very sensitive to extreme values. If a variable has a few extreme values, the arithmetic mean might not be a good representation of an average or mid point. In the presence of a few very extreme values in a variable, the best measure of central tendency is the median, not the arithmetic mean.

3.1.2 The Median

Another measure of central tendency is the median. The median of a variable is its 50th percentile, which is the mid point of its values when the values are sorted in ascending order. When we have an even number of observations, there will be 2 mid points, so the median will be equal to the arithmetic mean of these 2 mid points. When we have an odd number of observations there will be only 1 value in the middle, which is the median.
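
The odd and even cases described above can be sketched in a few lines. The function name `median_by_hand` and the two short lists are hypothetical, chosen only for illustration; the result is cross-checked against `statistics.median` from the Python standard library.

```python
import statistics

def median_by_hand(values):
    """Median as the mid point of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value
        return ordered[mid]
    # Even number of observations: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [7, 1, 5, 3, 9]         # sorted: 1, 3, 5, 7, 9
even_sample = [7, 1, 5, 3, 9, 11]    # sorted: 1, 3, 5, 7, 9, 11

print(median_by_hand(odd_sample), statistics.median(odd_sample))
print(median_by_hand(even_sample), statistics.median(even_sample))
```

For the odd sample the median is the single middle value, 5; for the even sample it is the average of 5 and 7, which is 6.0.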

For example, if we want to know the typical size of all companies that trade shares in the Mexican stock market, we can calculate the median of firm size. These firms are called public firms. Firm size can be measured with different variables: the total value of a firm’s assets (total assets), its market value, or its number of employees. In this example we will use total assets at the end of 2018 for all public Mexican firms. At the end of 2018 there were 146 Mexican public firms listed on the stock exchange (“Bolsa Mexicana de Valores”). I will show how to calculate the median total assets of these 146 firms.

The 2018 total assets of the 146 Mexican public firms are shown below (sorted alphabetically):

Mexican firms in the BMV
Firm	row #	Industry	2018 Total Assets (in thousand pesos)
ACCEL	1	Services	$6,454,560.00
AEROMEXICO	2	Transport Services	$76,772,848.00
…	…	…	…
VOLARIS	148	Transport Services	$22,310,652.00
WALMART	149	Retail	$306,528,832.00

We sort the list from the lowest to the highest value of 2018 total assets:

Firm	row #	Industry	Size Rank	2018 Total Assets (in thousand pesos)
INGEAL	98	Food & Beverages	1	$171,104.00
HIMEXSA	88	Textile	2	$494,378.00
…	…	…	…	…
FHIPO14	45	Real Estate	73	$27,979,184.00
TVAZTECA	139	Telecommunications	74	$27,988,054.00
…	…	…	…	…
AMERICA MOVIL	8	Telecommunications	145	$1,429,223,392.00
GFBANORTE	69	Financial Services	146	$1,620,470,400.00

The median total assets is the mid point of the sorted list. However, in this case there are 146 firms, an even number, so there is no single mid point. Instead, I calculate the arithmetic average of the assets of the 2 firms in the middle (the firms in positions 73 and 74). The median is then $27,983,619.00 thousand pesos (roughly 28 billion pesos), which is the average of the FHIPO14 and TVAZTECA assets. The arithmetic mean of total assets for the 146 firms is $97,860,896.23 thousand pesos (about 97.9 billion pesos), which is much bigger than the median. Which measure better represents the typical size of Mexican firms? In this case it is the median, so we can say that the typical size of a Mexican public firm is about $28 thousand million pesos.
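
The arithmetic of this example can be checked with a short sketch. The TVAZTECA figure is taken from the sorted table; the FHIPO14 figure used here (27,979,184 thousand pesos) is an assumption, namely the value implied by the median reported in the text. The variable names are illustrative.

```python
n_firms = 146

# With an even number of observations there are two middle positions (1-based ranks)
lower_rank = n_firms // 2        # position 73
upper_rank = n_firms // 2 + 1    # position 74

# Total assets in thousand pesos of the firms ranked 73 and 74.
# Assumption: the FHIPO14 value is the one implied by the reported median.
fhipo14_assets = 27_979_184.00
tvazteca_assets = 27_988_054.00

# The median is the arithmetic mean of the two middle values
median_assets = (fhipo14_assets + tvazteca_assets) / 2

print(lower_rank, upper_rank, median_assets)
```

Under that assumption the result is 27,983,619.0 thousand pesos, the median reported above.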

So, what is the difference between the mean and the median? When the distribution of the values of a variable is very close to a normal distribution, the mean and the median will be very similar, so we can use either one to represent the typical value of the variable. When the variable has a few very extreme values, its distribution will not look like a normal distribution; it will typically be skewed or have fat tails due to the presence of the extreme values. In this case the best measure of central tendency is the median, not the mean.

What is a normal distribution? It is a very common probability distribution of random variables. We will explain probability distributions in more detail later. For now, just consider that many variables across disciplines and in nature follow an approximately normal distribution.
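
To see how differently the two measures react to an extreme value, consider the following minimal sketch. The asset figures are invented for illustration (they are not taken from the BMV table), and the variable names are hypothetical.

```python
import statistics

# Hypothetical total assets (in billion pesos) of seven firms
assets = [5.0, 8.0, 12.0, 20.0, 28.0, 35.0, 40.0]

# The same firms plus one very large firm
assets_with_giant = assets + [1500.0]

mean_before = statistics.mean(assets)
median_before = statistics.median(assets)
mean_after = statistics.mean(assets_with_giant)
median_after = statistics.median(assets_with_giant)

print(round(mean_before, 2), median_before)
print(round(mean_after, 2), median_after)
```

Adding a single giant firm moves the mean from about 21.14 to 206.0, a value larger than every firm but one, while the median only moves from 20.0 to 24.0.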

The median gives a better representation of the “average” value of a variable than the arithmetic mean when the distribution of the values does NOT follow a normal distribution. In the case of 2018 total assets we can explore its distribution using a histogram:

I will later explain in more detail what a histogram is.

In a histogram we see how often different ranges of values of a variable appear. This histogram does not look like that of a normally distributed variable. It is said to be “skewed” to the right, since there are very few firms with very high values of total assets. A normally distributed variable looks like a bell-shaped curve where most of the values are around the arithmetic mean. In this case, we can see that most of the firms (about 100 firms) have total assets between 0 and $25 thousand million pesos. Since the total number of firms is 146, only about 46 firms have assets higher than $25 thousand million (or 25 billion pesos). In fact, there are very few firms with assets greater than $1,000 thousand million (that is, greater than $1 trillion pesos), and one above $1,500 thousand million ($1.5 trillion pesos). Looking at the previous table we can see that AMERICA MOVIL and GFBANORTE have assets greater than $1,400 thousand million pesos ($1.4 trillion pesos).

With the histogram we can see that most of the firms (about 68%, 100 out of 146) have assets of less than 25 billion pesos. The arithmetic mean of total assets is more than $97 billion, and the median total assets (the 50th percentile) is about $28 billion. The arithmetic mean is very sensitive to extreme values, while the median is not. If we used the mean as a measure of the typical size of a Mexican firm, we would be very far from the most common values of total assets. The better measure of typical size is therefore the median, about $28 billion pesos.
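
To make the idea of a histogram concrete, the sketch below counts how many values fall into consecutive ranges (bins) of width 25 and prints one `#` per firm. The list of assets is hypothetical: only the two largest values are rounded from the AMERICA MOVIL and GFBANORTE rows of the table, and the rest are invented to give a right-skewed shape.

```python
from collections import Counter

# Hypothetical total assets, in thousand million (billion) pesos
assets = [0.2, 0.5, 3, 7, 12, 18, 22, 24, 31, 47, 60, 95, 140, 310, 1429, 1620]
bin_width = 25

# A value a falls in bin k when k*bin_width <= a < (k+1)*bin_width
counts = Counter(int(a // bin_width) for a in assets)

# Text histogram: one row per non-empty bin
for k in sorted(counts):
    low, high = k * bin_width, (k + 1) * bin_width
    print(f"{low:>5} to {high:>5}: {'#' * counts[k]}")
```

Half of these made-up firms land in the first bin (0 to 25), and the few very large firms sit alone in bins far to the right, which is the pattern described as right skew.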

In sum, for skewed distributions the median is generally a better measure of central tendency than the arithmetic mean, which is pulled toward the extreme values and may not represent the central or typical value. For normally distributed variables the median will be very close to the mean, so the median is a good measure of central tendency in both cases.

Examples of business variables with a skewed distribution similar to total assets are employee salaries, income of families in a region or country, and variables from the income statement such as firm sales and firm profits.

3.1.3 Mode

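The mode is the value that appears most frequently in a variable. The mode is most useful for discrete (or categorical) variables; for continuous variables, where exact values rarely repeat, it is usually not informative unless the values are first grouped into ranges. The mode is rarely used as a central tendency measure.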
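
A minimal sketch of the mode, using the standard library. Both lists are hypothetical: the industry labels echo the ones in the table above, but the counts are invented.

```python
import statistics

# Hypothetical industry labels of six firms (a discrete, categorical variable)
industries = ["Retail", "Services", "Telecommunications",
              "Services", "Retail", "Services"]

# The mode is the most frequent value
most_common_industry = statistics.mode(industries)

# A variable can have more than one mode; multimode returns all of them
ratings = [1, 1, 2, 2, 3]
all_modes = statistics.multimode(ratings)

print(most_common_industry, all_modes)
```

Here "Services" appears three times and is the mode, while the ratings list has two modes, 1 and 2.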

3.2 Dispersion measures

3.2.1 Variance and standard deviation

Standard deviation is used to measure how much, on average, the individual values of a variable deviate from the mean.

The variance of a variable X is the average of the squared deviations of each individual value X_i from its mean:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

Where:

X_i = Value i of the variable X

\bar{X}=\frac{1}{N}\sum_{i=1}^{N}X_{i} = Arithmetic average of X

Why is the variance the average of squared deviations? If we do not square the deviations, they cancel each other out, since some deviations are positive and others negative; in fact, the deviations from the mean always sum to zero. Squaring is just a trick to avoid canceling the positive deviations with the negative ones.
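
The cancellation is easy to verify on a small example. The five values below are hypothetical and chosen so that the arithmetic is exact.

```python
# Hypothetical values of a variable X
x = [2.0, 4.0, 4.0, 6.0, 9.0]

mean_x = sum(x) / len(x)

# Deviations of each value from the mean: some negative, some positive
deviations = [xi - mean_x for xi in x]

# Without squaring, the deviations cancel out
sum_of_deviations = sum(deviations)

# After squaring, every term is non-negative, so nothing cancels
sum_of_squared_deviations = sum(d ** 2 for d in deviations)

print(mean_x, deviations)
print(sum_of_deviations, sum_of_squared_deviations)
```

The mean is 5.0, the deviations are -3, -1, -1, 1 and 4, their sum is 0.0, and the sum of their squares is 28.0.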

The variance is expressed in squared units of X, so it is a number that our brain cannot easily interpret. To get a measure in the same units as X, we take the square root of the variance, which we can interpret roughly as the typical deviation of the points from their mean. This measure is called the standard deviation:

SD(X)=\sqrt{Var(X)}= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}=\sigma_{X}
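
The two formulas above (dividing by N) can be computed step by step and compared with `statistics.pvariance` and `statistics.pstdev` from the Python standard library, which compute the population versions. The data are hypothetical.

```python
import math
import statistics

# Hypothetical values of a variable X
x = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(x)
mean_x = sum(x) / n

# Variance: average of the squared deviations from the mean (divide by N)
var_x = sum((xi - mean_x) ** 2 for xi in x) / n

# Standard deviation: square root of the variance, in the same units as X
sd_x = math.sqrt(var_x)

print(var_x, round(sd_x, 4))
print(statistics.pvariance(x), round(statistics.pstdev(x), 4))
```

For these values the variance is 5.6 and the standard deviation is about 2.3664, in the same units as X.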

The variance can also be expressed as the expected value of squared deviations:

Var(X)=E[(X-\bar{X})^2]

Expanding the squared term:

Var(X)=E[(X^2-X\bar{X}-\bar{X}X+\bar{X}^2)]

Since \bar{X} is a constant, I can take it out of the expectation (and E[\bar{X}^2]=\bar{X}^2):

Var(X)=E[X^2]-\bar{X}E[X]-\bar{X}E[X]+\bar{X}^2

Since E[X]=\bar{X}, the two middle terms add up to -2\bar{X}^2, so:

Var(X)=E[X^2]-\bar{X}^2

Then, the variance can be defined as the expected value of X squared minus its squared mean.

Also, we can express the variance of X as:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}\right)^2-\bar{X}^2
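
As a numerical check of this identity, the sketch below computes the variance both ways on a small hypothetical sample: once from the definition (average squared deviation) and once as the mean of the squares minus the squared mean.

```python
import math

# Hypothetical values of a variable X
x = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(x)
mean_x = sum(x) / n

# Definition: average of squared deviations from the mean
var_definition = sum((xi - mean_x) ** 2 for xi in x) / n

# Shortcut: mean of X squared minus the squared mean
mean_of_squares = sum(xi ** 2 for xi in x) / n
var_shortcut = mean_of_squares - mean_x ** 2

print(var_definition, var_shortcut, math.isclose(var_definition, var_shortcut))
```

Both routes give 5.6, up to floating-point rounding: the mean of the squares is 30.6 and the squared mean is 25.0.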

Most Statistics books and Statistics software use (N-1) instead of N as the denominator of the variance formula. This measure is called the sample variance. When we divide by N in the variance formula, we are calculating the population variance. Both formulas give very similar results when N is large, but the sample variance will be a bit bigger than the population variance, so it is a more conservative value.

In Statistics, the sample variance is an unbiased estimator of the underlying (real) variance.

Then, we can re-write the formula for sample variance of X as:

Var(X)=\frac{1}{N-1}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

And the sample standard deviation of X can be written as:

SD(X)=\sqrt{Var(X)}=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}

SD(X)=\frac{\sqrt{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}}{\sqrt{N-1}}=\sigma_{X}
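
The difference between the two denominators can be seen on a small hypothetical sample. In the Python standard library, `statistics.variance` and `statistics.stdev` divide by N-1 (sample versions), while `statistics.pvariance` and `statistics.pstdev` divide by N (population versions).

```python
import statistics

# Hypothetical values of a variable X
x = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(x)
mean_x = sum(x) / n
sum_sq_dev = sum((xi - mean_x) ** 2 for xi in x)

population_var = sum_sq_dev / n        # divide by N
sample_var = sum_sq_dev / (n - 1)      # divide by N-1

print(population_var, statistics.pvariance(x))
print(sample_var, statistics.variance(x))
print(round(statistics.pstdev(x), 4), round(statistics.stdev(x), 4))
```

With only 5 observations the gap is noticeable (5.6 versus 7.0); it shrinks as N grows. When using other libraries, check which denominator is the default: pandas' `Series.std()` uses N-1 (`ddof=1`), whereas NumPy's `np.std` uses N (`ddof=0`) unless told otherwise.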

3.3 Illustrating measures of descriptive statistics

Let’s work with an interesting dataset about the Gross Domestic Product (GDP) per capita for all countries. GDP is the total monetary value of all final goods and services produced in a country. It is usually measured by quarter or by year. GDP per capita is the GDP divided by the country’s population, that is, the average GDP per person.

The World Bank maintains a historical dataset about GDP and GDP per capita.

Let’s download the GDP per capita for all countries for 2024:

import re

import numpy as np
import pandas as pd
import requests
from scipy import stats

IND = "NY.GDP.PCAP.CD"   # GDP per capita (current US$)
YEAR = 2024              # pick a year

# Query the World Bank API; the JSON response is a list: [metadata, records]
url = f"https://api.worldbank.org/v2/country/all/indicator/{IND}?format=json&date={YEAR}:{YEAR}&per_page=20000"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
j = resp.json()

# Keep only the records with a non-missing value
rows = [r for r in j[1] if r["value"] is not None]
gdp = pd.DataFrame([{"country": r["country"]["value"], "gdp_pc": float(r["value"])} for r in rows])

# Filter out regions/aggregates by name.
# re.escape makes each keyword literal, so parentheses such as "(HIPC)" are not read as regex groups.
# Caveat: substring matching is coarse; keywords such as 'Africa' or 'Central' also drop
# countries like South Africa or the Central African Republic.
regional_keywords = ['Heavily indebted poor countries (HIPC)', 'Least developed countries: UN classification', 'Low & middle income', 'Middle income', 'Africa', 'Arab World', 'Caribbean', 'Central', 'Europe', 'Euro Area', 'Latin America', 'Middle East', 'North America', 'OECD', 'Small states', 'South Asia', 'Sub-Saharan Africa', 'Upper middle income', 'Lower middle income', 'High income', 'Low income', 'World', 'IDA', 'IBRD', 'East Asia', 'Pacific', 'Post-demographic', 'Pre-demographic', 'Early-demographic', 'Late-demographic', 'Fragile and conflict', 'Developing World']
pattern = "|".join(re.escape(k) for k in regional_keywords)
gdp = gdp[~gdp["country"].str.contains(pattern, case=False)]

def quick_stats(x):
    x = pd.Series(x).dropna()
    return pd.Series({
        "n": len(x),
        "mean": x.mean(),
        "median": x.median(),
        "trimmed_mean_20%": stats.trim_mean(x.to_numpy(), 0.2),
        "std": x.std(ddof=1),
        "IQR": x.quantile(0.75) - x.quantile(0.25),
        "MAD(normalized)": stats.median_abs_deviation(x, scale='normal'),
        "min": x.min(), "Q1": x.quantile(0.25), "Q3": x.quantile(0.75), "max": x.max()
    })

pd.DataFrame({
    "GDP per capita": quick_stats(gdp["gdp_pc"]),
    "GDP per capita (drop top 1%)": quick_stats(gdp["gdp_pc"][gdp["gdp_pc"] <= gdp["gdp_pc"].quantile(0.99)])
})


	GDP per capita	GDP per capita (drop top 1%)
n	189.000000	187.000000
mean	21515.496863	19441.562027
median	8230.043115	7919.208868
trimmed_mean_20%	12299.686008	11893.942121
std	32466.462560	24497.414627
IQR	26689.281151	26572.411084
MAD(normalized)	10213.410689	9752.567146
min	219.424831	219.424831
Q1	2694.737809	2680.306128
Q3	29384.018960	29252.717212
max	288001.433369	137781.681659
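
The table includes three robust summaries that are less sensitive to extreme values than the mean and the standard deviation: the 20% trimmed mean, the interquartile range (IQR) and the normalized median absolute deviation (MAD). The sketch below computes them by hand on a small hypothetical sample using only the standard library. It is meant to mirror what `stats.trim_mean(x, 0.2)`, the difference of the 0.75 and 0.25 quantiles, and `stats.median_abs_deviation(x, scale='normal')` compute in the code above; treat that correspondence as my reading of those functions rather than a guarantee.

```python
import statistics

# Hypothetical data: nine ordinary values and one extreme value
x = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
n = len(x)

# 20% trimmed mean: drop the lowest 20% and the highest 20%, average the rest
k = int(0.2 * n)
trimmed_mean = statistics.mean(x[k:n - k])

# IQR: distance between the 75th and 25th percentiles (linear interpolation)
q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

# MAD: median of the absolute deviations from the median, rescaled so that it
# is comparable to a standard deviation when the data are normally distributed
med = statistics.median(x)
mad = statistics.median([abs(v - med) for v in x])
mad_normalized = mad / statistics.NormalDist().inv_cdf(0.75)

print(statistics.mean(x), med, trimmed_mean)
print(round(statistics.stdev(x), 2), iqr, round(mad_normalized, 4))
```

In this toy sample the single extreme value lifts the mean to 14.5 and inflates the standard deviation, while the median and the trimmed mean both stay at 5.5 and the IQR (4.5) and normalized MAD (about 3.71) remain small. The same pattern appears in the GDP per capita table: the mean is far above the median, and the standard deviation is much larger than the normalized MAD.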