Concepts addressed: Linear regression, critical analysis
Grade level: 9th – 10th
In the age of startups and self-made influencers, you might wonder what the point is of studying beyond high school. We keep hearing that education increases our chances of getting a better job, but is it true?
The table below shows the number of years a person studied (starting in grade school) and the associated median yearly income in the USA. Let’s see what we can learn from it.
- Plot the data on a graph. Without calculating anything, do you think the data is correlated? Why? Can you draw by hand the regression line for this dataset? Explain your choice of dependent/independent variable.
- Assuming there is a linear correlation between the two variables, calculate the regression line. Compare this regression line to the guess you drew in the previous question. How well does the calculated regression line represent the data? Use mathematical parameters if possible and interpret them.
- Calculate the expected salaries for a person who studied for 24 years and for someone who studied for 6 years. Do both results make sense? Explain why or why not.
- [Bonus] Looking at the results for question 3, is the regression line a good way to predict the salary of ANY person? For which values of “years studied” are the predictions most accurate?
|Studies (in years)||Yearly income ($)|
- Pearson Correlation calculator – https://www.omnicalculator.com/statistics/pearson-correlation
- Least Squares Regression Line Calculator – https://www.omnicalculator.com/math/least-squares-regression
- Coefficient of Determination Calculator (R-squared) – https://www.omnicalculator.com/statistics/coefficient-of-determination
Question 1 hints:
Question 2 hints:
Question 3 hints:
Bonus question hints:
y = mx + b => y = 5786.51x - 35925.35 => yearly_income = 5786.51*years_studied - 35925.35
There are many ways to assess the fit of a linear regression line. Two of the most common methods are the correlation factor and the determination coefficient (R-Squared or R2).
For the correlation factor => we have r = 0.9983, which means a strong correlation between the variables.
For R-squared => we have R2 = 0.9864, which means that the regression line describes the data very accurately.
Other possible parameters of analysis are the residuals.
After 24 years of studies, your predicted salary is: 102951 USD/year
After 6 years of studies, your predicted salary is: -1206.29 USD/year
While for 24 years of study the prediction seems reasonable, for 6 years of studies it is clearly wrong. Salaries cannot be negative and there are laws that put a lower limit on salaries.
On the lower end, salaries cannot be negative. Jobs that require no formal education still have to comply with the minimum wage regulations. For this reason, the linear relationship won’t describe the real situation accurately.
Since we want to judge whether there is a correlation between our two variables without any calculations, we plot the points on a graph. We can do it by hand or with one of the many tools available online (search for “Scatter plot”). The result will be something similar to the image next to the text. We can see that the points roughly align along a straight line, which suggests a linear correlation between them.
The correlation will be apparent no matter what we choose as the dependent/independent variable. However, we can safely assume that the money we earn cannot change the number of years we have studied. The number of years we have studied, on the other hand, can affect the amount of money we make, since higher-paying jobs usually require a higher level of education (which means more years studying). That is why we have selected the years studied as the independent variable (x-axis) and the income as the dependent one (y-axis).
There are many techniques to calculate the regression line (or trend line) of a collection of points. The most commonly used is the least-squares method. This method minimizes the sum of the squared vertical distances from each point to the line. Using the Least Squares Regression Line Calculator, we obtain the values for the slope and the y-intercept together with their respective errors. These values are all we need to write the equation of the line in slope-intercept form.
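For readers who prefer to see the calculation rather than use the online calculator, here is a minimal sketch of the least-squares formulas in plain Python. The data points are hypothetical placeholders, not the values from the worksheet's table, and the function name `least_squares` is our own choice.

```python
# Hypothetical sample points (NOT the worksheet's table):
# years studied and a made-up yearly income for each.
years = [12, 14, 16, 18, 21]
income = [30000, 40000, 52000, 65000, 85000]


def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx              # slope
    b = mean_y - m * mean_x    # the line passes through (mean_x, mean_y)
    return m, b


m, b = least_squares(years, income)
print(f"yearly_income = {m:.2f}*years_studied + ({b:.2f})")
```

Replacing the two lists with the values from the table should reproduce the slope and intercept given by the calculator, up to rounding.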
Rounded, the result is a slope of m ≈ 5790 and an intercept of b ≈ -36000.
The equation of the line is:
y = mx + b => y = 5786.51x - 35925.35
Using our variables, we obtain:
yearly_income = 5786.51*years_studied - 35925.35
We can assess the goodness of fit of a regression line by using many different parameters. We will focus on the correlation factor and the coefficient of determination (R2).
To obtain the numerical answer we want, we can simply use the Coefficient of Determination Calculator (R-squared) and the Pearson Correlation calculator. The results are:
Correlation factor: 0.9983
R-squared: 0.9864
The correlation factor (Pearson's r) measures how strongly, and in which direction, two variables are linearly related. It always has a value between -1 and +1 (both inclusive), with a higher absolute value indicating a stronger correlation. In our case, 0.9983 indicates a very strong, almost perfect, positive correlation. Keep in mind that a correlation alone does not prove that one variable causes the other.
R-squared, on the other hand, takes a value between 0 and +1 and measures the fraction of the variation in income that the regression line accounts for. The closer it is to 1, the better the line describes the data points. A value of 0.9864 means that our points lie almost exactly on the line, so the trend line fits the data very well.
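As a rough illustration of what the two calculators do, the sketch below computes both parameters from their textbook definitions. The data and the line coefficients are hypothetical placeholders (not the worksheet's table or its fitted line), and the function names are our own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation factor between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys, m, b):
    """Coefficient of determination of the line y = m*x + b."""
    mean_y = sum(ys) / len(ys)
    # Residual = observed value minus the value predicted by the line.
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical points and a hypothetical trend line, for illustration only.
years = [12, 14, 16, 18, 21]
income = [30000, 40000, 52000, 65000, 85000]
print("r  =", round(pearson_r(years, income), 4))
print("R2 =", round(r_squared(years, income, 6000.0, -43000.0), 4))
```

The residuals appear inside `r_squared`: the smaller they are compared with the overall spread of the incomes, the closer R2 gets to 1.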
Now, we want to obtain the expected income for someone who has studied for 24 years. Since the trend line is what helps us predict values for missing data, we simply plug the number into the equation for the line:
yearly_income = 5786.51*years_studied - 35925.35 => yearly_income = 5786.51*24 - 35925.35 => yearly_income = 102951 USD
We can do the same for 6 years:
yearly_income = 5786.51*years_studied - 35925.35 => yearly_income = 5786.51*6 - 35925.35 => yearly_income = -1206.29 USD
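The two substitutions above can be double-checked with a few lines of Python. The slope and intercept are the ones from the regression line in this answer; the function name `predicted_income` is just a label we chose for this sketch.

```python
# Coefficients of the regression line obtained in Question 2.
SLOPE = 5786.51
INTERCEPT = -35925.35


def predicted_income(years_studied):
    """Evaluate the trend line for a given number of years studied."""
    return SLOPE * years_studied + INTERCEPT


for years in (24, 6):
    print(years, round(predicted_income(years), 2))
# 24 102950.89
# 6 -1206.29
```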
While the value for 24 years of study seems reasonable, the expected salary after 6 years of studies is clearly incorrect. Salaries cannot be negative (at worst they are zero for someone who is unemployed), and they also have to comply with minimum wage regulations, which put a lower limit on yearly earnings.
This is our first peek at the importance of analyzing the results we get from the trendline and seeing if they make sense. Knowing the range in which we can make accurate predictions is as important as the predictions themselves.
We have seen from Question 3 that not all of our predictions are correct. To estimate the range of validity we first look at our data range. Within 12 to 21 years of education, we can be confident that our predictions will match with reality fairly accurately (see Question 2 for how accurately). Beyond those limits, we have to play a guessing game.
Here are some possible arguments:
- At the lower levels, not having a high school degree is generally the cutoff point. We can assume that the jobs available for someone with no high school diploma are the same irrespective of when they dropped out.
- The validity of our predictions can be expected to hold for values slightly above 21 years. However, as we move further above that value, the accuracy of our predictions will suffer: job experience becomes a very important factor, there is an upper limit to how many years a person can study, and at some point salaries depend more on the responsibilities and power inherent to the position than on the qualifications required.
- Minimum wage regulations and job availability can have an impact on the correlation above and below our data range.
More arguments can be made, so just think about it and come up with your own.
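One simple way to put the range-of-validity idea into practice is to label every prediction as falling inside or outside the data range. The sketch below assumes the 12 to 21 year range mentioned in this answer and the regression line from Question 2; the function name and the wording of the labels are our own.

```python
# Regression line from Question 2 and the data range quoted in this answer.
SLOPE = 5786.51
INTERCEPT = -35925.35
DATA_MIN_YEARS = 12
DATA_MAX_YEARS = 21


def predict_with_range_check(years_studied):
    """Return (predicted income, True if inside the data range)."""
    prediction = SLOPE * years_studied + INTERCEPT
    inside = DATA_MIN_YEARS <= years_studied <= DATA_MAX_YEARS
    return prediction, inside


for years in (6, 16, 24):
    value, inside = predict_with_range_check(years)
    label = "inside data range" if inside else "outside data range, treat with caution"
    print(f"{years} years: {value:.2f} USD ({label})")
```

The check does not make a prediction outside the range wrong by itself; it only flags the values for which we have no data to back the line up.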