Thursday, September 28, 2023

# Analyzing the Age, Height, Weight, and BMI Connection

I pulled down some real data from the CDC (https://wwwn.cdc.gov/). The data are relatively recent and are published in two-year cycles. They cover people’s age, height, and BMI, broken out by gender.

I wanted to see how age and height correlate, how height and weight correlate, and how BMI varies with age for both males and females. The data are for the U.S. only and cover ages 2 to 20 only. Still, based on the data, we can find correlations and should be able to predict someone’s height, BMI, age, etc. given another parameter.

The downloaded dataset looks something like the one below (this is for males 2-20 years old). A second dataset contains the same parameters for females 2-20 years old.

First, the Age-Height relationship:

Using the sample data, we can see the pattern visually. Using some statistical formulas, we can find the trendline and the R^2 value. We see a pretty good relationship between Age and Height. As expected, it’s not terribly strong, but the dataset is certainly strong enough to show some connection between the two, with R^2 at 69+%.
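For readers who want to reproduce the trendline outside a spreadsheet, here is a minimal sketch of an ordinary least-squares fit and its R^2 in plain Python. The function name `fit_trendline` and the age/height numbers are made up for illustration; they are not the CDC values used in this post.

```python
def fit_trendline(xs, ys):
    """Least-squares line y = slope * x + intercept, plus its R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical sample points (age in years, height in cm), NOT the CDC data
ages = [2, 5, 8, 11, 14, 17, 20]
heights_cm = [87, 109, 128, 144, 164, 175, 177]

slope, intercept, r_squared = fit_trendline(ages, heights_cm)
print(f"height = {slope:.2f} * age + {intercept:.2f}, R^2 = {r_squared:.3f}")
```

An R^2 near 1 means the straight line explains most of the variation in height; a value like the 69+% above means a fair amount of scatter remains around the line.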

Using the trendline, we can now estimate, reasonably well, the height of a person (with age as the input), or their age (with height as the input), as shown below. If there were a definite, perfect correlation between age and height, it would look as below (I use an experimental dataset to prove the point). The actual correlation of Age and Height happens to be a respectable 0.833 (with standard deviations of 56.45 and 79.84 for population and sample respectively).
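As a sanity check, for a straight-line fit the R^2 is the square of the correlation coefficient, and 0.833 squared is about 0.694, which agrees with the 69+% quoted earlier. The sketch below shows one way to compute the Pearson correlation and to use a trendline in both directions. The helper names and every number in it (including the slope and intercept) are hypothetical and only there to illustrate the idea.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def height_from_age(age, slope, intercept):
    """Read the trendline forwards: age in, height out."""
    return slope * age + intercept


def age_from_height(height, slope, intercept):
    """Read the trendline backwards: height in, age out."""
    return (height - intercept) / slope


# A perfectly linear, made-up dataset: every point sits on one line
perfect_ages = [2, 4, 6, 8]
perfect_heights = [90, 100, 110, 120]
print("perfect r:", round(pearson_r(perfect_ages, perfect_heights), 6))

# Hypothetical trendline coefficients, NOT the ones fitted in this post
SLOPE, INTERCEPT = 5.0, 80.0
print("height at age 10:", height_from_age(10, SLOPE, INTERCEPT))
print("age at height 130:", age_from_height(130, SLOPE, INTERCEPT))
```

Inverting the line is only as trustworthy as the fit itself, so the further the correlation is from 1, the wider the margin on any predicted age or height.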

Below are some interesting (and necessary) numbers to chew on. They combine the male and female data and show the mean weights and heights in the dataset and, more importantly, the variance values. The standard deviation in weight is 11.96 and the standard deviation in height is 1.94. Instead of calling the stdev() function directly, I use var() first and then compute the standard deviation as the square root of the variance, rather than relying on the built-in Excel function. (The results should be the same, of course.)
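The same variance-first route can be sketched in Python with the standard `statistics` module. The helper name `stdev_from_variance` and the numbers are made up for illustration. The thing to watch is the sample versus population distinction: the square root only matches the built-in result when both use the same divisor (n - 1 for a sample, n for a population).

```python
import math
import statistics


def stdev_from_variance(values, sample=True):
    """Standard deviation computed as sqrt(variance) instead of a direct call."""
    if sample:
        var = statistics.variance(values)   # divides by n - 1
    else:
        var = statistics.pvariance(values)  # divides by n
    return math.sqrt(var)


# Hypothetical measurements, NOT the CDC data
values = [2, 4, 4, 4, 5, 5, 7, 9]

print("sample stdev via variance:    ", stdev_from_variance(values))
print("sample stdev built-in:        ", statistics.stdev(values))
print("population stdev via variance:", stdev_from_variance(values, sample=False))
print("population stdev built-in:    ", statistics.pstdev(values))
```

As far as I know, Excel’s plain VAR and STDEV are the sample versions, so they pair with `statistics.variance` and `statistics.stdev` here.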

It helps to visualize this in a chart, which shows a very good correlation.

Let’s get back to the original two datasets: one for males, another for females. We’ll tackle each one separately (as mixing the BMIs together wouldn’t be fair or meaningful here). Also, for the purposes of this post, I’ll use only the 50th percentile values.

For males, the basic statistics are tabulated below:

| Statistic | BMI |
| --- | --- |
| Min | 15.38 |
| Max | 23.04 |
| Median | 17.2 |
| Modes | 15.68, 21.44 |

This means 15.68 and 21.44 are the most common BMIs in our dataset. We also find a strong correlation between age and BMI: 0.94.
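A summary like this one (min, max, median, and possibly several modes) can be sketched with Python’s standard library. `statistics.multimode` returns every value tied for most frequent, which is what a multi-modal column needs. The function name `summarize` and the BMI list are invented for illustration and are not the CDC values.

```python
import statistics


def summarize(values):
    """Min, max, median, and all modes of a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        # multimode lists every most-frequent value, in first-seen order
        "modes": statistics.multimode(values),
    }


# Hypothetical 50th-percentile BMI readings, NOT the CDC data
bmis = [15.4, 15.7, 15.7, 16.2, 17.2, 18.9, 21.4, 21.4, 23.0]

for name, value in summarize(bmis).items():
    print(f"{name}: {value}")
```

If every value in the column is unique, `multimode` returns all of them, so modes are only informative when some readings actually repeat.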

The progression can be visualized in a chart. As before, we can now compute the BMI given the age, and vice versa.

For the female dataset, we find:

| Statistic | BMI |
| --- | --- |
| Min | 15.15 |
| Max | 21.72 |
| Median | 17.47 |
| Modes | 14.58, 14.5, 17.28 |

Again, the BMIs are multimodal. We also find a strong correlation between age and BMI: 0.96.

And the progression is charted the same way. So, for both males and females, BMI is very closely correlated with age. However, BMI rises slightly faster with age for males than for females (as evidenced by the slopes of the charts above), although the intercepts tell us that, at x=0, the female trendline starts at a slightly higher BMI than the male one (13.26 vs 13.14).
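A steeper line that starts lower must eventually cross a flatter line that starts higher. The sketch below works out where two trendlines meet by setting slope_a * x + intercept_a equal to slope_b * x + intercept_b and solving for x. The intercepts are the ones quoted above; the slopes are hypothetical placeholders, since the fitted values are only visible on the charts, so the printed crossover age is illustrative only.

```python
def crossover_age(slope_a, intercept_a, slope_b, intercept_b):
    """x at which line a and line b give the same y."""
    if slope_a == slope_b:
        raise ValueError("parallel trendlines never cross")
    return (intercept_b - intercept_a) / (slope_a - slope_b)


# Intercepts taken from the post
MALE_INTERCEPT = 13.14
FEMALE_INTERCEPT = 13.26

# Hypothetical slopes (BMI units per year), NOT the fitted values
MALE_SLOPE = 0.45
FEMALE_SLOPE = 0.42

age = crossover_age(MALE_SLOPE, MALE_INTERCEPT, FEMALE_SLOPE, FEMALE_INTERCEPT)
print(f"trendlines cross at roughly age {age:.1f} (with the assumed slopes)")
```

Before that age the female trendline sits above the male one, and after it the male trendline is higher, which is the pattern the slopes and intercepts describe.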