Wednesday, September 11, 2024

# Gini Coefficient and how to calculate it

The Gini Coefficient is a way to measure the income distribution of a population. The value for the Gini coefficient ranges from 0 to 1 where higher values represent greater income inequality and where: 0 represents perfect income equality (everyone has the same income) and 1 represents perfect income inequality (one individual has all the income). It was developed by Italian statistician Corrado Gini in 1912.

Mathematically, the Gini coefficient is defined based on the Lorenz curve, which plots the percentiles of the population on the graph’s horizontal axis according to income or wealth, whichever is being measured. The cumulative income or wealth of the population is plotted on the vertical axis. In this post, I’ll show you how I calcualted gini coefficient in Excel and how to interpret its results.

Let’s look at a simple example with a small dataset. We have two companies and their list of employee codes and salary for each. The number of employees vary between the companies. We want to find out how these companies compare with each other in terms of income equality. In other words, we want to see if the pay among employees are spread more or less evenly or is there a large difference between the top earners and the low earners. This is exactly where we can employ Gini coefficient concept.

The data tables look like this:

Company ABC

Company XYZ

To calculate Gini Coeff in Excel, we need to take several steps as explained below in order.

1. Sort the data by Salary in ascending order (small->large)
2. Calculate the Cumulative Population % for each row.
Add a new column: Cumulative Population %
To calculate the Cumulative Population % column, we need to divide the row number (position in relation to the table…so, first row should be 1, second
row should be 2, and so on down the column) by the total number of rows in the datarange. To make it easier and flexible, I added a new helper column ‘Row Position’ using INDEX() to get the value from Emp column (e.g. ‘f’, ‘a’…) then using MATCH() to get the position (index). The formula: =MATCH(INDEX(B22:\$B\$27,1), \$B\$22:\$B\$27,0)
where B is Emp column and datarange is B22:B27. Notice, we do not anchor the INDEX array’s starting position as we want it to increment it as we copy the formula down, but the bottom ranges are anchored.
This returns a number 1…n (where n is the last row). Drag the formula down the ‘Row Position’ column till last row. Then in Cumulative Population % column, enter the formula: =D22/COUNTA(\$B\$22:\$B\$27)
where D contains the Row Position value (1…n) and COUNTA() simply gets the non-empty rows total for B column in the range B22:27 so we get the total number of rows in our dataset.
Notice: We do not anchor the D cells as it should increment as we copy down the formula, but anchor the range for COUNTA(). Copy the formula down the Cumulative Population % ‘ column till last row.
Change the column’s Number format to % with 2 decimals. To calculate the Cumulative Income % column, divide the cumulative sum of the Salary column by the total sum of the Salary column.
3. Calculate the and Cumulative Income % for each row.
But in order to do that, we first need another column and Cumulative Sum of Salary
Add a new column: Cumulative Sum of Salary

But now, the rows in the column will have different formulas and should NOT be automatically filled in as a table. To avoid it, convert the table back to a simple datarange. R-click on the table, then Table->Convert to Range. Then continue entering the formulas (be sure to not disturb the sorting of Salary ASC!)

In the first row of the “Cumulative Sum of Salary” column, enter the value from the first row of the Salary column. In the second row of the “Cumulative Sum of Salary” column, enter a formula to add the value from the first row of the “Cumulative Sum of Salary” column to the value from the second row of the Salary column. So, formula in first row: =C22 (where C22 is the first row of Salary column)
Formula in 2nd row: = F22+C23 (where F22 is the first row of Cumulative Sum of Salary, C23 is the second row of Salary). Drag this formula down to fill in the rest of the “Cumulative Sum of Salary” column.
Notice: From 2nd row of Cumulative Sum of Salary, the formula will be F22+C23, then F23+C24 and so on.

1. Next, calculate the Cumulative Income % column by dividing each value in the “Cumulative Sum of Salary” column by the total sum of the Salary column:
So, the formula would be: =F22/SUM(\$C\$22:\$C\$27) …only the sum datarange is anchored. Where F is Cumulative Sum of Salary column and each will row will increment as we drag the formula down.
Drag the formula down the Cumulative Income % ‘ column till last row.
Change the column’s Number format to % with 2 decimals.
2. Now, we need to calculate the ‘Area under Lorenz Curve’ column…
We need to calculate the area of the trapezoid formed by two data points on the Lorenz curve before finding Gini Coefficient.
Add a new column Area under Lorenz Curve
In its first row, enter 0.
In the second row, enter this formula (to find the area under the curve):
=(E23-E22)*(G23+G22) * 0.5
where E is the column with Cumulative Population % and G is the column with Cumulative Income %
And E22 is the first row of Cumulative Population %, E23 is 2nd row of Cumulative Population %
G22 is the first row of Cumulative Income %, G23 is 2nd row of Cumulative Income %
Drag the formula down the column. Keep the precision of Area under Lorenz Curve column to 4 decimlas.
3. In a new cell, enter Gini Coefficient formula: = 1-2*SUM(H22:H27) where H is the Area under Lorenz Curve column.

— Repeat for the other company to get its Gini Coeff, then compare the two.

The calculations for the companies should look something like this:

As you can see, Company XYZ has a much higher income INEQUALITY (Gini Coeff: 0.667) than Company ABC, which has a moderate inequality with 0.200. So, we can tell that XYZ has some people making a lot more money than the rest compared to ABC.

Gini Coeff can be used for many other applications, such as GDP per capita amongst countries, comparing home values across cities, comparing household incomes across cities, comparing income inequalities across countries over time (see my chart below).

Remember, Gini Coefficient value is a relative number showing its proximity to 0 (for equality) or to 1 (for inequality), and does NOT given any percentage or value indication of what percentage is high income, etc. For that, we can use quartile measures or histogram or box plots. For example, we can get more information about salary spread by doing some additional calculations as shown below:

We can even get the percentage of people who are the top 10% wage earners, or bottom 50% etc. as shown below.

However, those are beyond what Gini Coefficient can provide as Gini Coeff is just high-level understanding of inequality tendency in a dataset. See my posts on reshaping data and finding outliers in Related Posts section below for more.

Here is a chart showing Gini Coeffs of income inequalities across some selected countries over time.

It’s evident from this chart that Denmark and Sweden have relatively very low income inequality. They have been initially socialist countries although they utilize many of the capitalistic principles in the modern times which has raised their income gap (inequality) over time, but with the socialistic principles inherent in the system, their income gaps remain much lower than many developed nations. In comparison, you can see Ecuador and USA have relatively high income inequality meaning, their rich is much richer than the rest of the population (“income top-heavy” if you will), although it has come down a bit since 2002. The chart shows the Gini Coefficient values at highest and most recent data points for each selected country. I hope this was educational as well as insightful.

Related Posts: