Executive Summary

This comprehensive statistical analysis investigates various aspects of gender and income dynamics in the United States spanning several decades. Four distinct parts explore different dimensions of the relationship between gender and income, utilizing different statistical tests to derive meaningful conclusions.

Part 2: Gender Income Disparity (1967 - 2021)

Research Question: Is there a significant difference in Median Male and Female Income from 1967 to 2021?

Statistical Test: T-Test for Independent Samples

Key Findings: The study uncovers a significant disparity in incomes between genders in the US: the resultant p-value is below 0.05, and the median income for males has consistently been higher than that of females over the period examined. Despite legislative efforts such as the Equal Pay Act, a substantial gender pay gap persists, indicative of underlying structural and societal factors influencing income distribution.

Part 3: Gender, Income, and Degrees (1991 - 2021)

Research Question: Is there an interaction between Gender and Degree Type on Median Income Level from 1991 to 2021?

Statistical Test: Two-Way ANOVA

Key Findings: This analysis reveals a significant p-value (< 0.05) with regards to median income levels in both the main effects—gender and degree type—as well as their interactions. While educational attainment demonstrates a positive influence on income, significant gender disparities persist across degree types, suggesting the presence of complex factors beyond education contributing to income differentials.

Part 4: Employment Status and Gender (1991 - 2021)

Research Question: Is there an association between Gender and Employment Status for those with a Bachelor’s Degree or higher from 1991 to 2021?

Statistical Test: Chi-Square Test for Independence

Key Findings: The study identifies a significant p-value (< 0.05) indicating an association between gender and employment status, particularly in the context of unemployment. While gender alone may not determine employment outcomes, social factors and norms are likely to contribute to the observed disparities, which underscores the need for further exploration into the multifaceted determinants of workforce participation.

Overall Implications:

These findings collectively highlight the persistence of gender-based income disparities in the US despite legislative efforts and societal advancements. While educational attainment positively correlates with income, significant gender differences persist across educational levels and employment statuses. This emphasizes the necessity for ongoing endeavors aimed at addressing the underlying societal inequalities. Further research into external factors influencing income dynamics and workforce participation is needed to explain why the wage gap persists in the United States. Such insights are essential for developing targeted interventions that promote gender equity in income distribution and economic opportunities.







Part 1 - Female Income in the US from 1967 - 2021.

Introduction

Since the early 1900s, women in the US have gained a more equal standing in society. 1920 was a major year for women's rights: the Nineteenth Amendment to the U.S. Constitution was ratified (NWHP, 2023), giving women the right to vote. Further progress towards an equal society was made throughout the 1900s. For example, the Pregnancy Discrimination Act of 1978 (NWHP, 2023) banned employment discrimination against pregnant women.

The feminist movement gained momentum through the 1960s, and 1967 in particular fell within the second wave of feminism (Binghamton University, 2024). This movement set out to fight for equal rights between genders, as women had historically faced much hardship.

Part one of this study is therefore set out to determine whether any progress has been made since the second wave of feminism in 1967. This will be done by analyzing whether there has been a significant increase in Median Female Income in the US from 1967 - 2021.

Therefore, the research question to be answered in this study is:

Has there been a significant increase in Median Female Income from 1967 - 2021?

Linear Regression:

The statistical test which will be used in this study is linear regression, as we are looking to determine whether there is a significant relationship between our two variables:

  • Time in Years

  • Median Female Income ($)

The formula for Linear regression is:

  • Y = β0 + β1X

Where:

  • Y is the dependent variable

  • X is the explanatory variable

  • β0 is the intercept, the value of Y when X = 0

  • β1 is the slope of the linear regression line

Our Formula:

  • Median Female Income = β0 + β1(Time in Years)

Hypothesis:

In order to answer this question we need to set our Null and Research Hypothesis, and level of significance.

Null Hypothesis:

  • H0: β1 = 0

β1 represents the slope coefficient associated with X (Time in Years).

Our Null Hypothesis is that the slope is equal to 0, meaning that there is no linear relationship between X (Time in Years) and Y (Median Female Income).

Research Hypothesis:

  • H1: β1 > 0

Since we are testing whether there has been a significant increase in Median Female Income from 1967-2021, our Research Hypothesis is that the slope coefficient β1 is greater than 0.

For this study we set a significance level of 0.05, which means we are willing to accept a 5% chance of rejecting the Null Hypothesis when the results are in fact due to chance alone. Only when our p-value is lower than our significance level of 0.05 can we reject the Null Hypothesis and determine that the Research Hypothesis is more attractive.

Method

To start our analysis the relevant packages are loaded into R Studio.

#Packages
library(knitr)
library(kableExtra)
library(ggplot2)
library(dplyr)

Then the data is sourced and downloaded from www.census.gov: Table A-7, Number and Real Median Earnings of Total Workers and Full-Time, Year-Round Workers by Sex and Female-to-Male Earnings Ratio: 1960 - 2021.

For purpose of understanding: The word “Earnings” will be used interchangeably with “Income” as they are both used to describe the money earned from direct employment. Furthermore, the data used in this study is for Total Workers not Full-Time, Year-Round Workers.

The data was subsequently transformed in Excel for data cleaning purposes and ease of loading into R.

Rows for 2017 and 2013 are repeated in the data set downloaded from the US Census, so when importing into Excel I calculated the average of these rows instead of allowing two instances of the same year.

After further data cleaning in R, a number was given to each year value, starting with 1967 as 1, 1968 as 2, and so on. This makes the X value easier to interpret when conducting linear regression and avoids skewing our analysis.
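The two cleaning steps described above can be sketched in R as follows. This is only an illustration: the averaging was done in Excel in this study, and the data frame `raw` and its values are made-up toy rows, not Census figures.

```r
# Toy rows only: these values are hypothetical, not Census figures.
raw <- data.frame(
  Year            = c(2012, 2013, 2013, 2014),
  Female.Earnings = c(100, 110, 130, 140)
)

# Average any repeated year so that each year appears exactly once.
clean <- aggregate(Female.Earnings ~ Year, data = raw, FUN = mean)

# Number the years so that 1967 maps to 1, 1968 to 2, and so on.
clean$YearNum <- clean$Year - 1966

print(clean)
```

Here the two toy rows for 2013 collapse into a single row holding their mean, and `YearNum` is simply the year minus 1966.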

Table 1.1: US Census data regarding Median earnings for Total Female Workers from 1967 - 2021. The Years are numbered in order to not skew the analysis later. Median female Earnings in Dollars US ($).
Year Female.Earnings YearNum
1967 16721 1
1968 17192 2
1969 16799 3
1970 17046 4
1971 17862 5
1972 18481 6
... ... ...
2016 34867 50
2017 35091 51
2018 35232 52
2019 37967 53
2020 37527 54
2021 39201 55

At first glance we can see that Median Female Earnings are a lot higher in 2021 than in 1967, with values of $39,201 and $16,721 respectively.

In order to determine whether there has been a significant increase or not a linear regression model will be made and the assumptions will be tested to determine the fit of the model.

Model 1:

Our Linear Regression Model, follows our original formula of Y = β0 + β1X. Where Y is Median Female Income and X is Time in years.

The coefficients for Model 1 are as follows:

Table 1.2: Linear Regression Model 1. The intercept (constant) β0 and the slope (coefficient) β1, which are used with the predictor X to achieve the response variable Y.
β0 (Intercept) β1 (Slope)
15383.08 406.6405

For our Linear Regression Model 1 our constant β0 (Intercept) is 15383.08 and the slope (coefficient) β1 is 406.6405.

Making our Formula:

Predicted Y = 15383.08 + 406.6405(X)

Where:

  • Predicted Y is the Predicted Median Female Income Given X

  • X is the Year Number starting with 1967 as 1

This means that β1 is the amount in $ added to predicted Median Female Income for each one-unit increase in Year Number, with the starting value (at X = 0) equal to the intercept β0 of 15383.08.

Visualizing model 1:

Figure 1.1: Scatter Plot of our US Census data regarding Median Earnings for Women plotted over Time in Years. With a Regression line or line of best fit. This graph shows a linear relationship between our X and Y variables.

In order to determine whether this model can be used for our data we must check whether the assumptions of linear regression are met.

From our graph we can determine that the relationship between our two variables is linear. Yet, we must further check for homoscedasticity (variance of residuals is constant) and normality of residuals (residuals follow a normal distribution). These checks can be done with diagnostic plots.

Diagnostic Plots

Figure 1.2: Diagnostic plots for our Linear Regression Model 1, with Y equal to Median Female Earnings and X equal to Time in Years. The plots are used to check the assumptions of homoscedasticity and normality of residuals. The Residuals vs Fitted plot shows a slight pattern, yet the residuals could still be considered homoscedastic, whereas the QQ Plot shows that the residuals are normally distributed.

The Residuals vs Fitted plot is used to visualize the homoscedasticity of our model, which we have described as the variance of our residuals, which we want to be constant. Our residuals are the Y value - the Predicted Y value. This plot shows our residuals as a function of the fitted values.

Here we want to see that the Residuals bounce randomly around the 0 line and should be roughly equally spaced around the regression line.

The QQ Plot is used to visualize the normality of the residuals, a straight line indicates that the residuals are normally distributed.

Here the Residuals vs Fitted plot shows somewhat of a pattern, but the QQ Plot shows us a normal distribution. We will transform the data and create a new model to see if we can improve this before we continue with our analysis.
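As a rough sketch of how a model like Model 1 and its two diagnostic plots might be produced in R, the example below fits `lm()` to only the twelve rows printed in Table 1.1. Because the full series has 55 rows, the coefficients from this sketch will not match Table 1.2; the object names `model1` and `res` are illustrative assumptions.

```r
# Only the twelve rows printed in Table 1.1; the full data set has 55 rows,
# so these coefficients will NOT match Table 1.2.
FemaleEarnings <- data.frame(
  YearNum = c(1:6, 50:55),
  Female.Earnings = c(16721, 17192, 16799, 17046, 17862, 18481,
                      34867, 35091, 35232, 37967, 37527, 39201)
)

# Simple linear regression of earnings on year number.
model1 <- lm(Female.Earnings ~ YearNum, data = FemaleEarnings)

# Residuals are the observed Y minus the predicted Y.
res <- FemaleEarnings$Female.Earnings - fitted(model1)

# Panel 1: Residuals vs Fitted (homoscedasticity).
# Panel 2: Normal Q-Q (normality of residuals).
op <- par(mfrow = c(1, 2))
plot(model1, which = 1:2)
par(op)
```

The `which = 1:2` argument selects only the Residuals vs Fitted and Normal Q-Q panels from the standard set of `lm` diagnostic plots.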

Model 2:

Our second Linear Regression Model, Model 2, follows a new formula of log10(Y) = β0 + β1X, where Y is Median Female Income and X is Time in years.

The coefficients for Model 2 are as follows:

Table 1.3: Linear Regression Model 2 with log10() transformation. The intercept (constant) β0 and the slope (coefficient) β1, which are used with the predictor X to achieve the response variable Y.
β0 (Intercept) β1 (Slope)
4.221912 0.0068563

For our Linear Regression Model 2 our constant β0 (Intercept) is 4.221912 and the slope (coefficient) β1 is 0.0068563.

Making our Formula:

log10(Predicted Y) = 4.221912 + 0.0068563(X)

Where:

  • log10(Predicted Y) is the base-10 logarithm of Predicted Median Female Income given X

  • X is the Year Number starting with 1967 as 1

Using a log10 transformation on our Y variable means that each one-unit increase in Year Number multiplies predicted income by 10^β1 = 10^0.0068563, which is roughly a 1.6% increase per year. The intercept β0 of 4.221912 is the log10 of predicted income at X = 0, which corresponds to roughly $16,670.
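To make the log10 coefficients easier to read, they can be converted back to the dollar scale. The short sketch below uses the two coefficients from Table 1.3; the variable names are illustrative only.

```r
b0 <- 4.221912   # intercept from Table 1.3 (log10 dollars)
b1 <- 0.0068563  # slope from Table 1.3 (log10 dollars per year)

# Predicted income at X = 0, back on the dollar scale.
baseline <- 10^b0

# Each one-year step multiplies predicted income by this factor.
growth_factor <- 10^b1

# The same step expressed as a percentage change per year.
pct_per_year <- (growth_factor - 1) * 100

round(baseline)
round(pct_per_year, 2)
```

Because the model is linear in log10(Y), the slope describes a constant proportional change per year rather than a constant dollar amount.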

Diagnostic plots 2:

Figure 1.3: Diagnostic plots for our Linear Regression Model 2: With a logarithmic transformation of our Y variable equal to log10(Median Female Earnings) and X equal to Time in Years. The plot is done to check for the assumptions of homoscedasticity and normality of residuals.

The Residuals vs Fitted plot shows the same pattern as before, yet again it could be considered homoscedastic, whereas the QQ Plot shows that the residuals are normally distributed.

Since almost the same outcome has been shown with a logarithmic transformation, we will see if adding another variable makes any improvement to our model.

Model 3:

Inflation is added to the model and the data is scaled before the model is created. This is done by subtracting the mean of each variable and dividing by the standard deviation for each observation.
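The scaling step described above can be sketched as follows. The vector `x` is a hypothetical predictor, and the sketch checks that the manual calculation agrees with base R's `scale()`.

```r
x <- c(1, 2, 3, 4, 5)  # hypothetical predictor values

# Manual standardisation: subtract the mean, divide by the standard deviation.
manual <- (x - mean(x)) / sd(x)

# Base R equivalent; scale() returns a matrix, so convert it to a plain vector.
builtin <- as.numeric(scale(x))

all.equal(manual, builtin)
```

After scaling, each predictor has mean 0 and standard deviation 1, so the slopes of a scaled model are expressed per standard deviation of the predictor.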

The coefficients for Model 3 are as follows:

Table 1.4: Linear Regression Model 3 with new predictor Inflation (x2). The intercept (constant) β0 and the slopes (coefficients) β1 and β2, which are used with the scaled predictors x1 and x2 to achieve the response variable Y.
β0 (Intercept) β1 (Slope for x1) β2 (Slope for x2)
26769.02 6656.602 -230.7875

For our Linear Regression Model 3 our constant β0 (Intercept) is 26769.02 and the slope (coefficient for x1) β1 is 6656.602, and the slope (coefficient for x2) β2 is -230.7875.

Making our Formula:

Predicted Y = 26769.02 + 6656.602(X1) - 230.7875(X2)

Where:

  • Predicted Y is the Predicted Median Female Income given X1 and X2

  • X1 is the Scaled Transformation of Year Number starting with 1967 as 1

  • X2 is the Scaled Transformation of Inflation for each year

Diagnostic plots 3:

Figure 1.4: Diagnostic plots for our Linear Regression Model 3, with our Y variable equal to Median Female Earnings and X1 equal to Time in Years. An additional predictor variable (X2), inflation, has been added, making our model a multiple linear regression. The plots are used to check the assumptions of homoscedasticity and normality of residuals.

The Residuals vs fitted again shows the same pattern as before, yet again it could be considered homoscedastic, meaning there has not been improvements on homoscedasticity when adding the variable of inflation to our model. The QQ Plot shows that the residuals are normally distributed.

Since the same pattern has been visible the third time, we will determine which model to use by comparing the Multiple R squared and Adjusted R squared values.

R Squared Values

Table 1.5: Comparing the Multiple R Squared and Adjusted R Squared of Models 1, 2 and 3 to determine which is the best fit for our data.
Model 1 Model 2 Model 3
MultipleR^2 0.9697445 0.9639512 0.9705015
AdjustedR^2 0.9691736 0.9632710 0.9693669

Multiple R squared is used to determine how much of the variance in Y is explained by X. Adjusted R squared is a modified version which takes into account the number of predictors in the model.

The R squared values are highest in Model 1 and 3. When adding Inflation to Model 1 (Model 3), we get an increase in both values.

To determine whether we will use this extra predictor in our model or not we will conduct an ANOVA.

ANOVA

Analysis of Variance Table

Model 1: Female.Earnings ~ YearNum
Model 2: Female.Earnings ~ YearScaled + InflationScaled
  Res.Df      RSS Df Sum of Sq      F Pr(>F)
1     53 71504295                           
2     52 69715240  1   1789056 1.3344 0.2533

The p-value of 0.2533 is greater than our significance level of 0.05, so we can determine that adding Inflation does not significantly reduce the residual sum of squares: the fit of the two models is not significantly different.
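The F statistic in the table above can be reproduced by hand from the two residual sums of squares. The sketch below uses only the RSS and Res.Df values printed in the ANOVA table; the variable names are illustrative.

```r
# Values copied from the Analysis of Variance Table.
rss1 <- 71504295; df1 <- 53   # model with YearNum only
rss2 <- 69715240; df2 <- 52   # model with the additional Inflation predictor

# F compares the drop in RSS per added parameter with the
# residual variance of the larger model.
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail probability of the F distribution with (1, 52) degrees of freedom.
p_val <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

round(f_stat, 4)
round(p_val, 4)
```

A small F means that the extra predictor removes little residual variation relative to the noise that remains, which is why the simpler model is retained.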

Therefore, we will conduct our analysis with our linear regression Model 1.

Results

We have determined that we will be using our linear regression Model 1.

Model 1:

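Predicted Y = 15383.08 + 406.6405(X)

Where:

  • Predicted Y is the Predicted Median Female Income Given X

  • X is the Year Number starting with 1967 as 1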

Report: F(1,53) = 1699, P-Value < 0.05


Call:
lm(formula = Female.Earnings ~ YearNum, data = FemaleEarnings)

Residuals:
     Min       1Q   Median       3Q      Max 
-2385.19  -780.07   -40.66   753.00  2766.14 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15383.085    317.560   48.44   <2e-16 ***
YearNum       406.640      9.866   41.22   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1162 on 53 degrees of freedom
Multiple R-squared:  0.9697,    Adjusted R-squared:  0.9692 
F-statistic:  1699 on 1 and 53 DF,  p-value: < 2.2e-16

Residual Standard Error: 1162

Standard Error: 9.866

Multiple R-Squared: 0.9697 and Adjusted R-Squared: 0.9692

The output of the model shows us the p-value to be a very significant value lower than 0.05, with an F stat of 1699 on 1 and 53 Degrees of Freedom.

As we are looking at whether there has been a significant increase in Median Female Income from 1967 - 2021, this is a one tailed test.

P-Value for this test is as follows:

[1] 3.035456e-42

This p-value is far below our significance level of 0.05: if the Null Hypothesis were true, a slope at least this large would be extremely unlikely to arise by chance alone. Therefore we can reject the Null Hypothesis and determine that the Research Hypothesis is more attractive.
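A one-tailed p-value of this kind can be obtained from the t value of the slope. The sketch below uses the rounded t value (41.22) and degrees of freedom (53) printed in the model summary, so its result will be of the same order as, but not identical to, the p-value reported above, which presumably came from the unrounded t value.

```r
t_val <- 41.22  # t value for YearNum, rounded as printed in the summary
df    <- 53     # residual degrees of freedom

# Two-tailed p-value, as reported in the Pr(>|t|) column.
p_two <- 2 * pt(abs(t_val), df, lower.tail = FALSE)

# One-tailed p-value for H1: slope > 0 (upper tail only).
p_one <- pt(t_val, df, lower.tail = FALSE)

p_one
```

When the estimated slope is positive, the one-tailed p-value for H1: β1 > 0 is half of the two-tailed value.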

Discussion

As we have rejected the Null Hypothesis, we can say that there has been a significant increase in Median Female Income in the US from 1967 - 2021.

With our formula (Predicted Y = 15383.08 + 406.6405(X)) we can also predict the value of Y if we know the value of X. As we have used Year Number as a count starting with 1967 as 1, to predict Median Female Income in the US for 2024 we would simply replace X with 58 (2024 - 1966).
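As a worked sketch of such a prediction, the code below uses the coefficients reported in Table 1.2 (intercept 15383.08, slope 406.6405) and the numbering from Table 1.1, where 2021 is Year Number 55, so that Year Number = Year - 1966. The helper functions `year_num()` and `predict_income()` are hypothetical names introduced for illustration.

```r
b0 <- 15383.08  # intercept from Table 1.2
b1 <- 406.6405  # slope from Table 1.2

# Convert a calendar year to the Year Number used in the model (1967 -> 1).
year_num <- function(year) year - 1966

# Predicted Median Female Income for a given calendar year.
predict_income <- function(year) b0 + b1 * year_num(year)

year_num(2021)        # 55, matching Table 1.1
year_num(2024)        # 58
predict_income(2024)  # about 38968
```

Note that 2024 lies outside the observed range of 1967 - 2021, so this is an extrapolation that assumes the linear trend continues.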

Visualization:

Figure 1.5: Median Female Earnings plotted against time in Years, with line of best fit and Residuals plotted to the predicted values. With higher residual values shown with red and a larger circle and the smaller values shown with a smaller circle and blue.

The Multiple R squared value of 0.9697 for the regression Model 1 indicates that the predictor X (Time in Years) explains 97% of the variation in Median Female Income in the US. The Adjusted R squared value provides a more conservative estimate but still indicates a similar result of 0.9692.

The Residual Standard Error of 1162 means that the observed Median Female Income values deviate from the regression line by about $1,162 on average. If the residuals are approximately normal, roughly 95% of observations fall within about two Residual Standard Errors (±$2,324) of the predicted value.

The Std.Error β1 of 9.866 shows the precision of the slope. The small value indicates less variability in the slope coefficient for this linear regression model.

As we have determined that Median Female Income has significantly increased in the US from 1967-2021, we can say that things are going in the right direction compared to how they were at the start of the study period in 1967. This analysis suggests that the standard of living has increased for women in the US over these years.

However, this study does not account for inflation in the final model, as the ANOVA test showed that it did not significantly improve our regression model. Unemployment and other socioeconomic factors have also not been accounted for.

This study has only analysed the total working population, so analysing the standard of living for those without work could also be an interesting topic.

Further analysis and research should be conducted to get a better understanding of how equality is evolving in society and the effect it has on female quality of life, not simply through economic gain but through other factors such as education and life expectancy.







Part 2 - Gender and Income in the US from 1967 - 2021.

Introduction

Feminism can be defined as the advocacy of women's rights on the basis of gender equality (Future Learn, 2021). To further understand the positive impact feminism has had on women's rights and on society in the US from 1967-2021, the difference in median incomes by gender is an interesting topic to cover.

In Part 1 of this study we found that there had been a significant increase in Female Median Income in the US from 1967-2021. This is a step in the right direction, showing that women fighting for their rights has not gone unnoticed over the years. Yet, looking deeper into the equality between genders, how does this figure compare to what men earned during the same time period?

In 1963, the Equal Pay Act was put in place to protect against wage discrimination based on sex (DOL, 2024). With this act in place, women should by law be paid the same as men for the same work. The US Department of Labor website (2024) states that employers must raise wages to equalize pay but may not reduce the wages of other individuals, meaning two people working the same job should both be paid the higher amount and not the lower.

Part two of this study is therefore set out to determine whether there is a significant difference in Male and Female Median Income from the years of 1967 - 2021.

Therefore, the research question to be answered in this study is as follows:

Is there a significant difference in Median Male and Female Income from 1967 - 2021?

If there is a significant difference in the Median Income by Gender from 1967 - 2021 in the US and men are earning more, then we can determine that there is a pay gap between the two genders during this time period.

Since the Equal Pay Act was put into place in 1963, before the time period covered by the data, a pay gap would not be due to the absence of such a law but rather to other outside factors which have had a negative effect on the Median Income of Females.

If there is not a significant difference in the Median Income between the two Genders or if there is but it shows that women are earning more, then we can say that the Equal Pay Act of 1963 solved the issues regarding equal pay based on sex in the US.

T-Test for Independent Samples

The statistical test which will be used in this study is a t-test for independent samples, as we are looking to compare the central tendencies of the two independent groups:

  • Group 1: Median Male Income ($) 1967-2021

  • Group 2: Median Female Income ($) 1967-2021

Hypothesis:

In order to answer this question we need to set our Null and Research Hypothesis and level of significance.

As we are only interested to see if there is a difference this will be a two tailed test.

Null Hypothesis:

H0: Female Median Income US 1967-2021 (Central Tendency) = Male Median Income US 1967-2021 (Central Tendency)

Research Hypothesis:

H1: Female Median Income US 1967-2021 (Central Tendency) ≠ Male Median Income US 1967-2021 (Central Tendency)

For this study we set a significance level of 0.05, which means we are willing to accept a 5% chance of rejecting the Null Hypothesis when the results are in fact due to chance alone. Only when our p-value is lower than our significance level of 0.05 can we reject the Null Hypothesis and determine that the Research Hypothesis is more attractive.

Method

To start our analysis the relevant packages are loaded into R Studio.

#Packages
library(ggplot2)
library(ggpubr)
library(knitr)
library(kableExtra)
library(Rcmdr)
RcmdrMsg: [1] NOTE: R Commander Version 2.9-0: Wed Feb 7 17:46:52 2024
RcmdrMsg: [2] NOTE: R Version 4.3.1
library(dplyr)
library(rstatix)

Then the data is sourced and downloaded from www.census.gov. Table P-8. Age–All People, by Median Income and Sex: 1947 to 2022. The data used in this study are in 2022 dollars, so inflation is accounted for and the table used out of the multiple available in this data set (P-8) is for all people 15 years and older.

The data is then loaded into Excel and transformed, keeping only the necessary columns and the values for each year. As before, rows for 2017 and 2013 are repeated in the data set downloaded from the US Census, so when importing into Excel I calculated the average of these rows instead of allowing two instances of the same year.

Summary of the Data:

Table 2.1: Initial look at how our data is laid out. Providing the Median Income for both Genders 15 Years and Over in the US from 1967 - 2021. The Income in this dataset accounts for inflation and is in 2022 Dollars ($).
Year Gender Median.Income
1 2021 Male 49520
2 2020 Male 48130
3 2019 Male 50470
4 2018 Male 48100
5 2017 Male 47630
6 2016 Male 46640
7 ... ... ...
105 1972 Female 15320
106 1971 Female 14670
107 1970 Female 14210
108 1969 Female 14210
109 1968 Female 14020
110 1967 Female 13000

The data is laid out with Gender in one column and “Median.Income” in its own column. The Year values are therefore repeated, running from 2021 to 1967 twice to match the two genders, making our n = 110, or 55 for each gender, double what it was in the previous study. Degrees of freedom for this study is therefore (n1 + n2) - 2 = 108.

Table 2.2: Summary Statistics for our data regarding Median Income for the separate Genders 15 Years and Over in the US from 1967-2021. The count is the number of years in the study, which is our n; the median of the median income and the interquartile range are provided for each gender.
Gender count median IQR
Female 55 21160 10785
Male 55 42510 3735

The Summary Statistics here confirm that n1 and n2 = 55, so our DF is 108. The Median of Median Income is a lot higher for Males than Females in the US: at $42,510 against $21,160 it is just over double. For Females there is also a high Interquartile Range (IQR), which means that the middle 50% of values are widely spread. As we are working with yearly data, this large Interquartile Range may reflect improvements in Median Female Income over the years.

In order to determine whether the difference between the two genders is significant a T Test for Independent samples will be conducted to measure whether there is statistical evidence that the central tendencies of our two groups are statistically different.

Assumptions Check:

Firstly, we must check that all the assumptions are met before we can continue with our analysis. For a T Test for Independent Samples, as stated by Amanda J. Shaker (2024) the assumptions required to conduct the parametric version of this test are that the data are independent (which they are), the data is normally distributed and that there are equal variances between groups.

Normality:

Using a Histogram, Density Plot and QQ-Plot we can determine whether the data is normally distributed or not. The first two plots (Histogram and Density Plot) should show a symmetric distribution around the mean, and the QQ-Plot should show the data in a straight line.


Figure 2.1: Histogram, Density Plot and QQ-Plot combined to check the assumption of normal distribution for our data set. The histogram shows the median further to the right than the mean, which suggests a negative skew. The Density Plot also shows that the data is more centered around the right hand side rather than the mean, again showing that the data is skewed and not normally distributed. The QQ-Plot does not follow a straight line and therefore does not show a normal distribution.

Shapiro-Wilk Test

Looking at the three different plots visualizing our data, we can determine that the data does not follow a normal distribution. We can further check this with the Shapiro-Wilk Test, which is a test of normality. The Null Hypothesis states that the data does not differ from a normal distribution. Therefore, if we reject the Null we must conclude that the data is not normally distributed. For this test we will use a significance level of 0.05.


    Shapiro-Wilk normality test

data:  MaleFemaleEarnings$Median.Income
W = 0.89134, p-value = 0.0000001983

The p-value from this test is extremely low and < 0.05. Therefore, we must conclude that our data is not normally distributed.
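The output above tests the pooled Median.Income column. As a hedged sketch, the same base R function `shapiro.test()` can also be applied to each gender separately, which may be a useful complementary check since the two groups have very different centres. The data frame `toy` below is simulated and hypothetical, not the Census data.

```r
set.seed(42)  # reproducible toy data

# Hypothetical incomes, NOT the Census figures: two groups with different centres.
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 55),
  Median.Income = c(rnorm(55, mean = 21000, sd = 5000),
                    rnorm(55, mean = 43000, sd = 2000))
)

# Pooled check on the whole column, in the same way as the output above.
pooled <- shapiro.test(toy$Median.Income)

# Per-group check: one Shapiro-Wilk p-value for each gender.
by_group <- tapply(toy$Median.Income, toy$Gender,
                   function(v) shapiro.test(v)$p.value)

pooled$p.value
by_group
```

In this toy example the pooled column is clearly non-normal simply because it mixes two groups with different centres, which is one reason a per-group check can be informative.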

Equal Variances:

The next assumption for the parametric version of this test is Equal Variances between groups. Strictly, we do not need this assumption check, as we already know the data is not normally distributed and will continue with the non-parametric version of the test, the Mann-Whitney U Test.

Yet we will do it anyway, as it is very simple. We use the Levene Test for equal variances, whose Null Hypothesis is that there are equal variances between groups. Therefore a p-value < 0.05 (our level of significance) means that we can conclude that the variances of the groups in our sample are not equal.

Levene's Test for Homogeneity of Variance (center = median)
       Df F value         Pr(>F)    
group   1  42.683 0.000000002178 ***
      108                           
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value here is extremely low and < 0.05. Therefore, we can confirm that the variances of the two groups in our sample are not equal.
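To show what Levene's test with center = median is doing, the sketch below computes it by hand in base R as a one-way ANOVA on the absolute deviations from each group's median. The `income` values are hypothetical, and this is an illustration of the idea rather than the exact call used to produce the output above.

```r
# Hypothetical data: five values per group, NOT the Census figures.
income <- c(10, 12, 15, 20, 26, 40, 41, 42, 43, 44)
gender <- factor(rep(c("Female", "Male"), each = 5))

# Absolute deviation of each value from its own group median.
abs_dev <- abs(income - ave(income, gender, FUN = median))

# One-way ANOVA on those deviations: Levene's test with center = median.
levene <- anova(lm(abs_dev ~ gender))

levene
```

If one group's values sit much further from their median than the other's, the mean absolute deviations differ and the F value becomes large.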

Mann-Whitney U test:

The Mann-Whitney U test is used as a non-parametric version of the T-test for Independent samples, it is also known as the Wilcoxon rank sum test. As we did not meet the assumptions needed for the parametric version of our test, we will continue with the non-parametric version of our test.

Results

Report: W = 0, P-Value < 0.05


    Wilcoxon rank sum test with continuity correction

data:  Median.Income by Gender
W = 0, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

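The output of our non-parametric test does not include degrees of freedom, as the test is based on the ranks of the observations rather than on a t distribution. Its purpose is to evaluate whether or not the location shift between the two groups is equal to 0. A W of 0 means that no Female value in the sample exceeds any Male value.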
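A small toy example may help in reading a W of 0. In base R's `wilcox.test()`, W counts the pairs in which a value from the first group exceeds a value from the second group. The two vectors below are hypothetical and completely separated.

```r
# Hypothetical values: every "female" value is below every "male" value.
female <- c(13, 14, 15, 21)
male   <- c(42, 45, 48, 50)

wt <- wilcox.test(female, male)

# W for the first group.
unname(wt$statistic)

# The same quantity counted directly: pairs where female > male.
sum(outer(female, male, ">"))
```

Both expressions give 0 here because the two toy groups do not overlap at all.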

The P-Value for this test is < 0.05, which means we can reject the Null Hypothesis that there is no difference between the central tendencies of Median Income of Male and Females 15 years and Over in the US from 1967 - 2021, and determine that the Research Hypothesis that there is a significant difference is more attractive.

Discussion

The median value of the Median Income for those 15 Years and Over in the US from 1967-2021 was $21,160 and $42,510 for Women and Men respectively. We previously determined that this was a large difference, and now we can say the difference is significant, as the P-Value is < 0.05: if the Null Hypothesis were true, a difference this large would be very unlikely to arise by chance alone.

Visualization

Figure 2.2: Box Plot for Median Income by Gender in the US throughout the years 1967-2021, for all people aged 15 years and older, in 2022 Dollars (inflation accounted for). The interquartile range is shown as the box and the whiskers show the total range, with the line through the middle as the median. The Female plot shows a larger interquartile range, and therefore a larger spread of values, whereas the Male plot does not.

Comparing both of these plots shows that Women have had a larger increase over the years, which is visualized by the greater overall range (minimum and maximum values). The interquartile range for Females is also larger than that of the Males, which shows a greater spread of values and also indicates progression over the years, as there is more variability.

Yet the plot for Males is much more clustered together, showing less change but also more concentration around a higher value. The minimum value for Males is also higher than the maximum for Females. Since we are looking at this data in 2022 Dollars, we see that Female Median Income is yet to catch up to the minimum Male Median Income from 1967-2021.

Effect Size

The effect size tells us the magnitude of the difference between the two groups with regard to our y variable, which is Median.Income in this case. An r less than 0.3 indicates a small effect, between 0.3 and 0.5 a medium effect, and greater than 0.5 a large effect (DATAtab, 2024).

The effect size for our test is r = 0.862, meaning that the effect is large. This shows a substantial difference between the two groups, Male and Female, in the variable Median Income. We have already stated the difference to be significant, and we can now describe the effect of gender on Median Income as being of large magnitude.
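One common way to obtain an effect size r for a Mann-Whitney U test is r = |Z| / sqrt(N), where Z comes from the normal approximation of W. The sketch below applies this to W = 0 with 55 observations per group. It ignores tied values and the continuity correction, which is an assumption of this illustration and not necessarily how the value above was computed.

```r
n1 <- 55  # number of Female observations
n2 <- 55  # number of Male observations
W  <- 0   # test statistic from the Wilcoxon rank sum output

# Mean and standard deviation of W under the Null Hypothesis (no ties).
mu    <- n1 * n2 / 2
sigma <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Standardised statistic and effect size r = |Z| / sqrt(N).
z <- (W - mu) / sigma
r <- abs(z) / sqrt(n1 + n2)

round(r, 3)
```

Under these assumptions the result rounds to 0.862, in line with the effect size reported above.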

The significant difference in Median Income between the two Genders shows that there is still a pay gap even with the Equal Pay Act in place, which we therefore attribute to factors other than the law of equal pay itself.

One may consider that, women are likely to have a lower risk tolerance than men which may cause them to not take on higher paying jobs. Yet, with risk there is also failure and that should lead to greater variability in the median income for men.

Other factors which should be considered are segregation in the work place and work experience (Aragao, 2023). Since, the improvements regarding Woman’s rights are not super new (Equal Pay Act only happened 61 years ago in time of writing this paper), there may still be some prejudice in certain fields towards women which make it hard for them to find work. More research should be done to analyse this topic.

Educational attainment is another outside factor which will have an effect on the pay gap between Men and Women as stated by Carolina Aragao (2023). Socially-constructed norms that define the roles that Men and Women should play is a reason why girls are more likely to be out of school than boys across the world (Sida, 2017). Pressures of society and life outside of work have an effect on both men and women but could also be a reason why women have a lower Median Income than Men in the US during these years.

Further data collection and analysis regarding; educational attainment for both genders, differences in social pressures (both in and out of work) by gender, and the differences in risk tolerance for both genders, may help us further answer the “why?” to our pay gap question.







Part 3 - Gender, Income and Degrees in the US from 1991 - 2021.

Introduction

The previous studies in this project have given us a better understanding of Feminism and its positive effect on Female Median Income in the US. As shown in part one of this study, the improvements in US society regarding women’s rights have led to a significant increase in Median Female Income.

Part two of this study concluded that, even though the Equal Pay Act was put in place in 1963 (U.S. Department of Labor, 2024), there was still a significant difference in median income between the two genders in the US.

Therefore, this study will delve deeper into the pay gap between genders in the US and analyse more of the factors that might have had an effect on the difference in median income, one factor in particular being educational attainment, specifically that of a degree.

More laws have been put into place in the US to foster a better learning environment for women, such as the Women’s Educational Equity Act of 1974 (NWHP, 2023), which set out to encourage full educational opportunities for girls and women, and the Gender Equity Act of 1994 (NWHP, 2023), which was created to promote gender equality and eliminate discrimination. One would therefore suspect that more women are likely to take on a degree than before.

With data for both genders and the different degree types, we will be able to uncover which degree type is the highest paying and whether there is still a difference in median income between the genders when both genders have the same degree type.

Therefore, part three of this project will cover an analysis of the median income for men and women in the US from 1991-2021, across three different types of degrees: bachelors, masters and doctorate. Professional degrees will not be included in this study as data was lacking for 2007 and 2008.

Our research question for this study is as follows:

Is there an interaction between Gender and Degree Type on Median Income Level in the US from 1991-2021?

Two-Way ANOVA:

This research question will be answered with a Two-Way ANOVA. The Two-Way Analysis of Variance will be used as we are comparing the effect that our two independent variables (Gender, Degree Type) have on a dependent variable (Median Income in the US).

Firstly, we will be able to analyse whether Median Income is equal across the two genders, which we discovered in part 2 is not the case for the total population of men and women 15 years and over in the US. Now, however, we will be analyzing whether there is a difference in median income specifically for those who have a bachelors degree or over (not including professional degrees).

With our Two-Way ANOVA, we will also be able to analyse whether the median income is equal across the different degrees, or whether there is a significant difference.

Furthermore, we can do a cross treatment analysis to uncover the differences in median income between genders across the different degrees, and also between the different degrees across the different genders.

Hypothesis 1, 2 + 3:

1st Hypothesis and Research Question:

  • H0: Median Income is equal across genders with a bachelors or over.

  • H1: Median Income is different across genders with a bachelors or over

2nd Hypothesis and Research Question:

  • H0: Median Income is equal across the degree types

  • H1: Median Income is different across the degree types

3rd Hypothesis and Research Question:

  • H0: All 6 groups created by cross treatment have equal means

  • H1: Not all 6 groups created by cross treatment have equal means

For this study we set a significance level of 0.05, which means we are willing to accept a 5% chance of rejecting the Null Hypothesis when it is in fact true. Only when our p-value is lower than our significance level of 0.05 can we reject the Null Hypothesis and determine that the Research Hypothesis is more attractive.

Method

To start our analysis the relevant packages are loaded into R Studio.

#Loading Packages
library(knitr)
library(kableExtra)
library(psych)
library(car)
library(ggplot2)
library(multcomp)
library(ggeffects)
library(phia)
library(effects)

The data is then sourced and downloaded from www.census.gov: Table P-16, Educational Attainment–People 25 Years Old and Over by Median Income and Sex: 1991 to 2022. The data used in this study are in 2022 dollars, so that inflation is accounted for over the different years. Of the multiple tables available in this data set (P-16), the three used are those for both genders with a bachelors degree, a masters degree and a doctorate degree.

As previously stated, the Professional Degree table will not be used as it was missing data for 2007 and 2008. The data is then loaded into Excel and transformed, keeping only the necessary columns along with the values for each year.

As in parts one and two of this study, the rows for 2017 and 2013 are repeated in the data set downloaded from the US Census, so when importing into Excel I have calculated the average of these rows instead of allowing two instances of the same year.
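The averaging of repeated years was done in Excel for this study. As a rough sketch of how the same step could be done in R instead, the toy example below collapses a duplicated year to its mean with `aggregate()`. The data frame name `raw` and all income values are hypothetical placeholders, not the Census figures.

```r
# Toy data: the 2017 row appears twice, as in the Census download.
# All income values here are hypothetical placeholders.
raw <- data.frame(
  Year          = c(2018, 2017, 2017, 2016),
  Gender        = "Female",
  Degree        = "Bachelors",
  Median.Income = c(52000, 50000, 51000, 49000)
)

# Collapse repeated Year/Gender/Degree combinations to their mean.
clean <- aggregate(Median.Income ~ Year + Gender + Degree, data = raw, FUN = mean)

# Sort from the most recent year downwards, as in the source tables.
clean <- clean[order(-clean$Year), ]
clean
```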

Lastly, further data cleaning was performed in R transforming the variables Gender and Degree to factors and Median.Income to integer.
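A minimal sketch of this cleaning step is shown below, using four rows copied from Table 3.1. The data frame name `income` is an assumption made for illustration; the study's own object name is not stated.

```r
# A few rows copied from Table 3.1, entered as plain text and numbers.
income <- data.frame(
  Year          = c(2021, 2020, 1992, 1991),
  Gender        = c("Male", "Male", "Female", "Female"),
  Median.Income = c(127500, 127700, 58310, 58950),
  Degree        = c("Doctorate", "Doctorate", "Masters", "Masters"),
  stringsAsFactors = FALSE
)

# Categorical variables become factors; income becomes integer.
income$Gender        <- factor(income$Gender)
income$Degree        <- factor(income$Degree)
income$Median.Income <- as.integer(income$Median.Income)

str(income)
```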

Summary of the Data:

Table 3.1: Initial look at how our data is laid out, providing Income data for both Genders and 3 different Degree types in the US from 1991 - 2021. The Income in this dataset accounts for inflation and is in 2022 Dollars ($).
      Year  Gender  Median.Income  Degree
1     2021  Male           127500  Doctorate
2     2020  Male           127700  Doctorate
3     2019  Male           121300  Doctorate
4     2018  Male           116300  Doctorate
...    ...  ...               ...  ...
63    2021  Male           101300  Masters
64    2020  Male            98350  Masters
123   1992  Female          58310  Masters
124   1991  Female          58950  Masters
...    ...  ...               ...  ...
183   1994  Female          43200  Bachelors
184   1993  Female          42350  Bachelors
185   1992  Female          43260  Bachelors
186   1991  Female          41550  Bachelors

The data is separated into four columns, providing information for both Genders and their median income in the US from the years 1991-2021 with regards to the three Degree types we have selected for this study: Bachelors, Masters and Doctorate.

Again, as we are analyzing two Genders and three Degree types, the year values have been repeated six times as we are matching the Median Income data for each year to each Gender and each Degree type.

Table 3.2: Summary Statistics of the median income data for both Genders and 3 degree types in the US from 1991-2021.
              n       mean        sd  median    min     max  range        skew    kurtosis         se
F:Bachelors  31   47420.48  2849.075   47590  41550   52830  11280  -0.1154015  -0.4021528   511.7090
F:Doctorate  31   82704.84  6444.466   81910  71800   98010  26210   0.5315128  -0.4469547  1157.4603
F:Masters    31   63353.23  3061.564   63240  58310   69020  10710   0.1741897  -0.8611165   549.8731
M:Bachelors  31   74872.26  3286.385   75470  69610   80190  10580  -0.0434888  -1.5008370   590.2521
M:Doctorate  31  114628.71  7326.512  114500  99890  127700  27810  -0.1151121  -0.7865415  1315.8805
M:Masters    31   93255.48  4456.962   93810  85470  101300  15830  -0.1752641  -0.9594464   800.4940

The n value of each group shows its total number of observations, which is 31, as this is the number of years of data (1991-2021) we have for each pairing. The mean values are highest for the Doctorate Degree type for both males and females, with values of 114628.71 and 82704.84 respectively, and lowest for the Bachelors Degree type for both males and females, with values of 74872.26 and 47420.48 respectively.

By comparing the means of each group pairing, we see not only that Bachelors is the lowest and Doctorate is the highest, but also that Males have a higher mean than Females for each Degree Type. Whether this difference is significant or not is what we will further analyse in this study.

The standard deviation (sd) is higher for Males than for Females within every Degree type, meaning there is more variability around the mean for Males. When comparing the sd and range of both genders who have a Bachelors degree, Males have an sd of 3286.385 and Females an sd of 2849.075, but the ranges are 10580 and 11280 respectively. This shows that the income of Males with a Bachelors Degree is less concentrated around the mean than that of Females with the same degree, even though its overall range is narrower.

The higher range for Females with a Bachelors degree shows that there is a larger gap between the highest and lowest values. Yet, the lower sd indicates that most of the values are closer to the mean than is the case for Males.

When comparing the sd and range for both Genders for the Doctorate and Masters Degrees, we can see that both the sd and the range are higher for Males, showing that the data for Females with Masters and Doctorate Degrees is more concentrated and less spread out.

Females have a positive skew for both Doctorate and Masters Degrees, whereas the rest of the pairings have a negative skew. A skew value from -0.5 to 0.5 indicates approximate symmetry (Gawali, 2023). Most of the pairings are therefore symmetrical, except the Females with a Doctorate Degree group, which is slightly above 0.5 with a skew value of 0.5315. This shows that the more extreme values lie towards the upper end, meaning that a few higher incomes for this pairing have skewed the data towards the right.

The kurtosis values are negative for every pairing in this data set. Since these are excess kurtosis values, measured relative to a normal distribution (which has a kurtosis of 3, or an excess kurtosis of 0), this indicates that each distribution is platykurtic, that is, less peaked than a normal distribution.

Lastly, the standard error (se) is larger in both groupings of the Doctorate Degree, meaning the mean is estimated less precisely here than for the other Degree types.
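As a quick cross-check of the se column, the standard error of the mean is the standard deviation divided by the square root of n. The sketch below recomputes it from the sd values in Table 3.2; the vector names are chosen for illustration only.

```r
# Standard deviations copied from Table 3.2; n = 31 years per group.
n   <- 31
sds <- c(
  "F:Bachelors" = 2849.075, "F:Doctorate" = 6444.466, "F:Masters" = 3061.564,
  "M:Bachelors" = 3286.385, "M:Doctorate" = 7326.512, "M:Masters" = 4456.962
)

# Standard error of the mean for each group.
se <- sds / sqrt(n)
round(se, 3)

# Group with the least precisely estimated mean.
names(which.max(se))
```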

Assumptions Check:

Firstly, the model is created using R, with Median Income as our dependent variable and Gender and Degree as our independent variables. Before continuing with our Two-Way ANOVA model, we must first check that the required assumptions are met.

  • Independent observations (which we have)

  • Homoscedasticity - Residuals have equal variance

  • Normality of Residuals - Residuals follow a normal distribution
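The sketch below shows one way such a model could be fitted and its residuals extracted for these checks. Because the full data set is not reproduced here, the observations are SIMULATED from the group means and standard deviations in Table 3.2, so the output will not match the study's results. The object names `sim`, `model` and `res` are assumptions made for illustration.

```r
set.seed(1)

# Group means and sds from Table 3.2; the individual values are SIMULATED,
# so the output will not reproduce the study's results.
groups <- data.frame(
  Gender = rep(c("Female", "Male"), each = 3),
  Degree = rep(c("Bachelors", "Doctorate", "Masters"), times = 2),
  mean   = c(47420.48, 82704.84, 63353.23, 74872.26, 114628.71, 93255.48),
  sd     = c(2849.075, 6444.466, 3061.564, 3286.385, 7326.512, 4456.962)
)

# 31 simulated yearly values per Gender/Degree pairing.
sim <- do.call(rbind, lapply(seq_len(nrow(groups)), function(i) {
  data.frame(
    Gender        = groups$Gender[i],
    Degree        = groups$Degree[i],
    Median.Income = rnorm(31, groups$mean[i], groups$sd[i])
  )
}))
sim$Gender <- factor(sim$Gender)
sim$Degree <- factor(sim$Degree)

# Two-Way ANOVA with interaction: Median.Income is the response.
model <- lm(Median.Income ~ Gender * Degree, data = sim)
anova_table <- anova(model)
anova_table

# Residuals for the assumption checks (equal variance and normality).
res <- residuals(model)
shapiro.test(res)
```

Calling `plot(model)` on such a fit produces the Residuals vs Fitted and Q-Q plots used for the visual checks.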

Homoscedasticity:

Figure 3.1: Diagnostic plot of our Two-Way ANOVA model. Residuals vs Fitted plot is used to check for the assumption of homoscedasticity. The Residuals should have an equal variance when plotted against the fitted values. The Q-Q Residuals plot is used to check for the normality of residuals, which is indicated by a straight line.

The Residuals vs Fitted plot here shows homoscedasticity of our Two-Way ANOVA model, as the largest spread is less than three times the size of the smallest. The Q-Q Residuals plot also shows that the Residuals are approximately normally distributed.
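The three-times rule of thumb was judged visually from the plot. One hedged way to put a number on it is to compare the group standard deviations from Table 3.2, since in a model with an interaction term each fitted value corresponds to one Gender and Degree pairing. This is a cross-check under that assumption, not the method used in the study.

```r
# Group standard deviations from Table 3.2.
sds <- c(2849.075, 6444.466, 3061.564, 3286.385, 7326.512, 4456.962)

# Ratio of the largest to the smallest spread; the rule of thumb asks for < 3.
spread_ratio <- max(sds) / min(sds)
spread_ratio
spread_ratio < 3
```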

Normality of residuals:

We can further check the normality of residuals with the Shapiro Wilk test, with the residuals taken from our Two-Way ANOVA test. The Null Hypothesis here is that the data follows a normal distribution. Therefore, a P-Value < 0.05 means that we reject the Null Hypothesis and determine that the data is not normally distributed.

Shapiro Wilk test


    Shapiro-Wilk normality test

data:  Res_Epsilon
W = 0.98875, p-value = 0.1488

The P-Value from the Shapiro-Wilk test is 0.1488, which is > 0.05. Therefore we do not reject the Null Hypothesis that the residuals are normally distributed.

Histogram

Figure 3.2: Histogram plot of the Residuals from our Two-Way ANOVA Model 1. A normal distribution shows symmetry around the mean, which is what we can see here.

Since all assumptions have been met, we do not need to transform our data or change our model and can therefore continue with our analysis.

Results

Analysis of Variance Table

Response: Median.Income
               Df      Sum Sq     Mean Sq   F value  Pr(>F)    
Gender          1 41181144023 41181144023 1723.4910 < 2e-16 ***
Degree          2 43747307275 21873653637  915.4444 < 2e-16 ***
Gender:Degree   2   155472437    77736219    3.2534 0.04093 *  
Residuals     180  4300925127    23894028                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report 3: F(Gender:Degree, 2,180) = 3.2534, P-Value < 0.05

Firstly, looking at the interaction, the p-value is 0.04. This means we can reject the 3rd Null Hypothesis that “all six groups created by cross treatment have equal means” and conclude that the group means are not all equal. The interaction being significant does not necessarily mean that the main effects Gender and Degree will also be significant.

Report 1: F(Gender, 1,180) = 1723.4910, P-Value < 0.05

Report 2: F(Degree, 2,180) = 915.4444, P-Value < 0.05

In this case the P-Value < 0.05 for each Main Effect, Gender and Degree. This means we can reject the 1st and the 2nd Null Hypotheses that “Median Income is equal across genders with a bachelors or over” and “Median Income is equal across the degree types”, and determine that the Research Hypothesis is the most attractive for both Main Effects and the Interaction.

A P-Value lower than 0.05 allows us to determine that the differences we have seen in this data are statistically significant: if the Null Hypothesis were true, results at least this extreme would be expected to occur by chance alone less than 5% of the time.
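To make the link between the columns of the ANOVA table explicit, the sketch below recomputes the F values and P-Values from the sums of squares and degrees of freedom reported above. Each mean square is the sum of squares divided by its degrees of freedom, and each F value is the effect's mean square divided by the residual mean square. The variable names are chosen for illustration.

```r
# Sums of squares and degrees of freedom from the ANOVA table above.
ss  <- c(Gender = 41181144023, Degree = 43747307275, "Gender:Degree" = 155472437)
dof <- c(Gender = 1, Degree = 2, "Gender:Degree" = 2)
ss_res  <- 4300925127
dof_res <- 180

ms      <- ss / dof           # mean square of each effect
ms_res  <- ss_res / dof_res   # residual mean square
f_value <- ms / ms_res

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_value, dof, dof_res, lower.tail = FALSE)

round(f_value, 4)
signif(p_value, 4)
```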

Discussion

Post-Hoc Analysis

Table 3.3: Pairwise tests to test for interaction between Genders within Degree type (adjustment = Holms test).
                              Value   Df    Sum of Sq         F    Pr(>F)
Female-Male : Bachelors   -27451.77    1  11680798549  488.8585  3.43e-53
Female-Male : Doctorate   -31923.87    1  15796569832  661.1095  1.10e-61
Female-Male : Masters     -29902.26    1  13859248079  580.0298  6.80e-58
Residuals                        NA  180   4300925127        NA        NA

This Post-Hoc Analysis shows the differences between Genders within each of the Degree Types we have used in this study. We can say that there is a significant difference (P-Value < 0.05) in Median Income in the US between Genders within every Degree Type.

Since the order goes Female - Male, the values in the “Value” column are the Female mean minus the Male mean. The negative value for each comparison therefore shows that Men have a higher average Median Income in the US for all three of our Degree Types.
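As a hedged cross-check of Table 3.3, the sketch below rebuilds the “Value” and F columns from the group means in Table 3.2 and the residual mean square from the ANOVA table. For two groups of n observations each, the sum of squares of the difference is the squared difference multiplied by n/2. Small deviations from the table come from rounding of the published means. The Holm step uses base R's `p.adjust()`; whether the P-Values in the table were produced in exactly this way is an assumption.

```r
# Group means from Table 3.2 and residual mean square from the ANOVA table.
female  <- c(Bachelors = 47420.48, Doctorate = 82704.84, Masters = 63353.23)
male    <- c(Bachelors = 74872.26, Doctorate = 114628.71, Masters = 93255.48)
n       <- 31                 # observations per group
dof_res <- 180
ms_res  <- 4300925127 / dof_res

# Female - Male difference within each Degree type (the "Value" column).
value <- female - male

# Sum of squares and F statistic of each 1-df comparison.
ss      <- value^2 * n / 2
f_value <- ss / ms_res
p_raw   <- pf(f_value, 1, dof_res, lower.tail = FALSE)

# Holm adjustment across the three comparisons.
p_holm <- p.adjust(p_raw, method = "holm")

round(value, 2)
round(f_value, 1)
```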

Table 3.4: Pairwise tests to test for interaction between Degree type within Genders (adjustment = Holms test).
                                   Value   Df    Sum of Sq          F    Pr(>F)
Bachelors-Doctorate : Female   -35284.35    1  19297278294   807.6193  9.49e-68
Bachelors-Masters : Female     -15932.74    1   3934710117   164.6734  3.46e-27
Doctorate-Masters : Female      19351.61    1   5804516290   242.9275  9.55e-35
Bachelors-Doctorate : Male     -39756.45    1  24498919395  1025.3156  1.83e-75
Bachelors-Masters : Male       -18383.23    1   5238116361   219.2228  1.17e-32
Doctorate-Masters : Male        21373.23    1   7080629111   296.3347  2.75e-39
Residuals                             NA  180   4300925127         NA        NA

Furthermore, this Post-Hoc Analysis with regards to the differences between Degrees within Genders shows negative values for Bachelors-Doctorate and Bachelors-Masters within each Gender type. Since the rules regarding the order are the same this shows that for both Genders, the Median Income for those with a Bachelors Degree has a lower mean value than the mean value of Median Income for those with a Masters or Doctorate Degree (for the same Gender).

Doctorate-Masters shows a positive value, meaning that for both Genders, the Median Income for those with a Doctorate Degree has a higher mean value than the mean value of Median Income for those with a Masters Degree (for the same Gender).

As all of these comparisons have a P-Value < 0.05, we determine them all to be statistically significant differences that are unlikely to be due to chance alone.

Effect Plot

Figure 3.3: An Effect plot showing the mean values for each grouping of Median Income by two categorical variables, Gender and Degree, with the mean values for Females by Degree type shown as a red line and those for Males as a blue line, and with points plotted for each value.

As stated by Jawaid, Thariq and Saba (2019), an Effect Plot shows the mean response values at each level of a design parameter or process variable. We can see from our Effect Plot that the mean values are higher for each individual Degree Type if the Gender is male. The difference between the median income of genders (with bachelors or over) is one of our main effects, which we have determined to be statistically significant and not due to chance alone.

For each Gender we can see that the Bachelors Degree has the lowest values and the Doctorate the highest. This would make sense, as the level of difficulty and time required to get one of these Degrees goes in order from Bachelors to Masters and then Doctorate. The difference between the different levels of Degree is our second main effect, which we again determined to be statistically significant and not due to chance alone.

We can also see from our Effect Plot that there is a cross over where Females who achieve a Doctorate Degree have a higher mean value than that of Males who have a Bachelors.

Predicted Values

Figure 3.4: Predicted values plot, showing the mean values for Median Income ($) for each Gender and Degree type. With Bachelors as red, Doctorate as blue and Masters as green.

If there is a significant interaction, the predicted value is the value of the response variable (Median Income) which is equal to the mean of all observations having that combination of factor levels (Owens, 2022). This means that, since we have a significant interaction (P-Value < 0.05), the predicted value for each pairing is equal to the mean of that pairing of independent variables (Gender and Degree).

This plot gives us a clearer picture of how the values only cross over between the lowest mean value for Males and the highest mean value for Females: the highest point on the Female side overtakes the lowest point on the Male side.

We can determine that there is a significant interaction between Gender and Degree Type on Median Income in the US from 1991-2021, yet the results were not quite as expected, as they have only uncovered more that needs to be answered with regards to the “why?”. The significance of the interaction mainly shows that men are earning more when comparing the central tendencies of our pairings of groups, with only one cross over (being that of Female:Doctorate - Male:Bachelors).
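The single cross over can be confirmed directly from the group means in Table 3.2. The sketch below compares every Female group mean with every Male group mean; the object names are chosen for illustration.

```r
# Group means of Median Income from Table 3.2.
female <- c(Bachelors = 47420.48, Doctorate = 82704.84, Masters = 63353.23)
male   <- c(Bachelors = 74872.26, Doctorate = 114628.71, Masters = 93255.48)

# TRUE where a Female group (row) has a higher mean than a Male group (column).
exceeds <- outer(female, male, ">")
exceeds

# Row and column positions of every cross over.
which(exceeds, arr.ind = TRUE)
```

Only the Female Doctorate row and Male Bachelors column is TRUE, which matches the one cross over seen in the plots.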

One would expect that, on average, someone with a Doctorate Degree would earn more than someone with a Bachelors, but the data is interesting in that this is the only case in which the mean of Median Female Income surpasses the mean of Median Male Income.

If this is the case for the mean values for Females with Doctorate degrees compared to Males with Bachelors, then why is the mean of Median Income for Females with a Masters Degree not higher than the mean value for Males with a Bachelors degree? A significant difference between Degree types is one main effect, but there is also a significant difference between Genders and a significant interaction between the two.

On one hand, one may conclude that when looking at the interaction of our two independent variables, there must be more factors at play which have an effect on Median Income in the US and which cause women to earn less despite their degree level.

On the other hand, it may not be the degree level itself that affects the Median Income Level in the US, but rather the number of years in education, as a Doctorate Degree can take a further 5-8 years to achieve after completing a Masters Degree. Yet, this would not explain why Females with a Doctorate Degree have only slightly higher mean values than the mean values of Males with a Bachelors.

Further research and analysis of other factors which have an effect on Median Income levels in the US will help us answer the “why?”, since we have determined that educational attainment does not guarantee equal pay for both Genders: there is a significant difference between and within both combinations of our interactions (between Genders within Degree, between Degree Types within Genders).

Other factors which may affect Median Income in the US include, for example, social pressures in and outside of the workplace, which would be an interesting topic to explore. The unemployment rate for people with a bachelors degree or higher could also lead us in the right direction towards answering why the significant difference in Median Income occurs between Genders in the US.







Part 4 - Employment Status and Gender for those with a Degree in the US from 1991 - 2021.

Introduction

The final part of this project is a continuation of what we have looked at so far. We have discovered that there has been a significant rise in Median Female Income in the US, which has still been shown to be significantly different from that of Males.

During the previous part of this project we found a significant interaction between Degree and Gender, which showed that Females earn less than Males for every Degree Type out of Bachelors, Masters and Doctorate, except for Females with a Doctorate, whose mean of Median Income was higher than that of Males with a Bachelors.

We will look deeper into the numbers regarding Males and Females over the age of 25 in the US with a Bachelors Degree or higher (including professional degrees in this part of the study), and the counts of the population regarding their employment status for each of the years.

As stated by Carey and Hacket (2022), in 2021 the ratio by Gender in the US was 98 males per 100 females. From this ratio, we would expect more Females than Males aged 25 and over with a Bachelors or higher to be in employment.

The Research Question for this study is as follows:

Is there an association between Gender and Employment Status for those who are 25 and over, with a Bachelor’s Degree or higher in the US from 1991-2021?

To answer the question of whether there is any significant association between the Gender of those who have a Bachelors Degree or higher and their Employment Status, a hypothesis will be made and analysed using the relevant test statistic.

Chi-Square Test:

As we are working with population data, we will use the population counts for each pairing of categories. Our data consists of two categorical variables. Gender: Male, Female. Employment Status: Employed, Unemployed.

The Chi-Square Test of Independence is a nonparametric test which determines whether there are associations or differences between two categorical or nominal variables. In the case of our data set, the expected counts are unknown, and therefore a test of association will be done to determine whether the category frequencies of our two variables, Gender and Employment Status, have a significant association with one another.

Hypothesis:

  • H0: There is no association between Gender and Employment Status for those 25 and over, with a Bachelor’s Degree or higher in the US.

  • H1: There is an association between Gender and Employment Status for those 25 and over, with a Bachelor’s Degree or higher in the US.

For this study we set a significance level of 0.05, which means we are willing to accept a 5% chance of rejecting the Null Hypothesis when it is in fact true. Only when our p-value is lower than our significance level of 0.05 can we reject the Null Hypothesis and determine that the Research Hypothesis is more attractive.

Method

To start our analysis the relevant packages are loaded into R Studio.

#Loading Packages
library(kableExtra)
library(dplyr)
library(knitr)
library(ggplot2)
library(corrplot)
library(vcd)

The data has then been sourced from www.beta.bls.gov. Multiple tables were downloaded for both Genders with the key words “25 Years and Over”, “Bachelors Degree”, “Employed” and “Unemployed”, since the tables were separated by Gender, by Employment Status and, for the Female data, by race. Once downloaded, the data was transformed in Excel and loaded into R Studio for analysis.

Data Set

Population Data

Table 4.1: BLS Beta Labs data regarding Total Employed/Unemployed from 2015-2021 (Q1-Q4) by Gender. The count of this data is in thousands
Year  Q    Emp.Status  Gender  Total.Count
2015  Q1   Unemployed  Male            485
2015  Q2   Unemployed  Male             78
2015  Q3   Unemployed  Male             71
2015  Q4   Unemployed  Male             71
2016  Q1   Unemployed  Male            504
2016  Q2   Unemployed  Male             70
...   ...  ...         ...             ...
2020  Q3   Employed    Female        22868
2020  Q4   Employed    Female        22886
2021  Q1   Employed    Female        22932
2021  Q2   Employed    Female        23380
2021  Q3   Employed    Female        23579
2021  Q4   Employed    Female        23746

This is how the data looked once it was transformed and loaded into R Studio. It provides count data of the total population in the US (in thousands) for those 25 Years and Over with a Bachelors Degree or higher, separated into quarters (Q1-Q4) for each year, and by Gender and Employment Status.

In order to keep our Chi-Square Test of Independence fair, and since we are looking at population data throughout the different years, we will take a sample of this data set by looking specifically at one year and one quarter. This will be done by using R to filter the data by a random year and quarter.
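A minimal sketch of how a random year and quarter could be drawn in R is given below. The seed is arbitrary and is not the one used in this study, so the draw shown here need not match the study's sample. The data frame name `employment` in the comment is hypothetical.

```r
set.seed(123)  # arbitrary seed, only so that the draw can be repeated

# Years and quarters covered by Table 4.1.
years    <- 2015:2021
quarters <- paste0("Q", 1:4)

# Draw one year and one quarter at random.
sample_year    <- sample(years, 1)
sample_quarter <- sample(quarters, 1)

# With the full data in a data frame (hypothetically called employment),
# the sample would then be:
# employment[employment$Year == sample_year & employment$Q == sample_quarter, ]
c(sample_year, sample_quarter)
```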

Sample Data

Table 4.2: Random Year and Q selected to generate a Sample from our original data from BLS Beta Labs data. Showing Total Employed/Unemployed by Gender. The count of this data is in thousands (K)
Year  Q   Emp.Status  Gender  Total.Count
2016  Q2  Unemployed  Male             70
2016  Q2  Unemployed  Female          106
2016  Q2  Employed    Male          21386
2016  Q2  Employed    Female        19867

Our sample data is for 2016 in Q2.

Contingency Table

Table 4.3: Contingency Table taken from our Random Sample, for the year 2016 and Q2, showing counts (in thousands) for each pair of our categorical variables.
            Female   Male
Employed     19867  21386
Unemployed     106     70
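One way to get from the long-format sample in Table 4.2 to this contingency table in R is `xtabs()`, sketched below. The name `ContingencyTable` matches the object shown in the test output later on; `sample_data` is a name assumed for illustration.

```r
# The four rows of Table 4.2 (counts in thousands).
sample_data <- data.frame(
  Emp.Status  = c("Unemployed", "Unemployed", "Employed", "Employed"),
  Gender      = c("Male", "Female", "Male", "Female"),
  Total.Count = c(70, 106, 21386, 19867)
)

# Cross-tabulate the counts: Employment Status in rows, Gender in columns.
ContingencyTable <- xtabs(Total.Count ~ Emp.Status + Gender, data = sample_data)
ContingencyTable
```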

From the initial look at our table we can see that for 2016 Q2 there are more Males who are employed than Females: 21,386 Employed Males and 19,867 Employed Females (in thousands). More Females than Males are Unemployed among those 25 years and over with a Bachelors Degree or higher, with values of 106 for Females and 70 for Males (in thousands).

Since this data is in thousands, the difference in Employment status can seem a lot more substantial when converted back to real numbers: 21,386,000 Employed Males and 19,867,000 Employed Females. Whether this difference of over 1,000,000 is a sign of a significant association or not is what we will determine in this study.

Observed Values

Yates Correction

Observed values are the values which are seen in our data within each category pair.

To determine whether we should use Yates Correction, we need to check whether the total N of our sample is lower than 40. As stated by Giannini (2005), Yates correction is used to compensate for deviation from the theoretical probability distribution when the total N for a 2x2 Chi-Square table is less than 40. When the sum of all our observed values is less than 40, Yates correction is used to make the analysis more accurate.

There is no need to calculate this number, as we can clearly see that N is larger than 40, since each cell count on its own is larger than 40. Therefore we can continue with the next stage of our analysis.
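Although the check is obvious by eye here, it can be written out in a couple of lines. The sketch below uses the observed counts from Table 4.3; the names `observed`, `total_n` and `use_yates` are chosen for illustration.

```r
# Observed counts from Table 4.3 (in thousands), filled column by column.
observed <- matrix(
  c(19867, 106, 21386, 70), nrow = 2,
  dimnames = list(Emp.Status = c("Employed", "Unemployed"),
                  Gender     = c("Female", "Male"))
)

# Total N across all four cells.
total_n <- sum(observed)
total_n

# Yates correction is only considered when total N is below 40.
use_yates <- total_n < 40
use_yates
```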

Expected Values

Fishers Exact Test

To determine whether we should use Fishers exact test or not we will check the Expected Values. Expected Values are the frequency that we would expect in a cell on average if the variables are independent (Minitab, 2023).

Fishers exact test is used when 20% or more of the Expected Values have a count of 5 or lower (Nowacki, 2017). Again, it is used to make a more statistically accurate analysis when comparing the Observed and Expected values.

Therefore, we run a Chi-Square Test with our contingency table and plot the expected values to determine whether we should use Fishers Exact Test or not.

Table 4.4: Expected Values of our random sample of the population data. The count of this data is in thousands (K)
            Expected Values
Employment    Female      Male
Employed    19888.15  21364.85
Unemployed     84.85     91.15

None of the Expected Values are lower than 5 and all the assumptions of the Chi-Square Test of Independence are met, so we will continue our analysis with the regular Chi-Square Test of Independence to analyse whether or not there is an association between our two categorical variables.
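The Expected Values in Table 4.4 follow from the usual formula: row total multiplied by column total, divided by the grand total. The sketch below recomputes them from the observed counts and applies the 20% rule quoted above; the variable names are assumptions made for illustration.

```r
# Observed counts from Table 4.3 (in thousands), filled column by column.
observed <- matrix(
  c(19867, 106, 21386, 70), nrow = 2,
  dimnames = list(Emp.Status = c("Employed", "Unemployed"),
                  Gender     = c("Female", "Male"))
)

# Expected count per cell: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Fisher's exact test is considered when 20% or more of the expected
# counts are 5 or lower.
share_small <- mean(expected <= 5)
use_fisher  <- share_small >= 0.2
use_fisher
```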

Results

Report: X-squared(1) = 10.223, P Value < 0.05


    Pearson's Chi-squared test

data:  ContingencyTable
X-squared = 10.223, df = 1, p-value = 0.001387

From Pearson’s Chi-Squared Test of Independence we get a P-Value < 0.05. Therefore we are able to reject the Null Hypothesis that there is no association between Gender and Employment Status for those 25 and over, with a Bachelors Degree or higher in the US.
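For readers who wish to reproduce this output, a sketch of the call is given below. R's `chisq.test()` applies Yates' continuity correction to 2x2 tables by default, so `correct = FALSE` is set here to obtain the uncorrected Pearson statistic. That this setting was used in the study is an assumption, although it is consistent with the reported value of 10.223.

```r
# Observed counts from Table 4.3 (in thousands), filled column by column.
ContingencyTable <- matrix(
  c(19867, 106, 21386, 70), nrow = 2,
  dimnames = list(Emp.Status = c("Employed", "Unemployed"),
                  Gender     = c("Female", "Male"))
)

# correct = FALSE requests Pearson's test without the continuity correction.
result <- chisq.test(ContingencyTable, correct = FALSE)
result

# Reject the Null Hypothesis when the P-Value is below 0.05.
result$p.value < 0.05
```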

Discussion

Observed Values, Expected Values and Chi-Square Components

Table 4.5: Observed Values, Expected Values and the Chi-Square Components of our random sample of the population data. The count of this data is in thousands (K)
            Observed Values   Expected Values      Chi-Squared Components
            Female     Male   Female       Male    Female    Male
Employed     19867    21386   19888.15  21364.85   0.0225  0.0209
Unemployed     106       70      84.85     91.15   5.2720  4.9076

When comparing the Values of our Chi-Square Test it is good to look first at the Chi-Squared Components. Chi Square Components are the residuals of our test squared. The residuals are how much the Observed Values differ from the Expected Values.

For Females a Chi-Squared Component value of 5.2720 in the Unemployed Row, shows that the Observed Values for Unemployed Females are very different from what is Expected Values. When looking at the same row but for Males, we see a Chi-Squared Component of 4.9, which provides the same explanation of a large difference between Observed and Expected Values for Males.

Since the Employed column shows Chi-Squared Values of 0.0225 and 0.0209 for Men and Women respectively, we will not worry about this column as this will not be where the significant association lies.

Although, Chi-Squared Components do not show the direction by which the difference occurs. This can simply be done by comparing the Observed and Expected values of each categorical variable.

Again just looking at Unemployed for both Gender, we can see that for Females the Observed Value is 106,000 (106, Translated to real world numbers) as compared to the Expected Value of 84,950 (or 84.85). Showing a Positive Correlation between Females and Unemployment.Since our P-Value < 0.05 this means that there are significantly more Unemployed Females than we would Expect and are due to factors other than chance alone.

The reverse is true for Males. With an Observed Value of 70,000 (70) and an Expected Value of 91,150 (91.15), there are significantly fewer Unemployed Males than we would Expect, which is a negative association. Again we can conclude that this result is due to factors other than chance alone (P-Value < 0.05).
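The direction of each difference can also be read from the sign of the Pearson residual, (Observed - Expected) / sqrt(Expected), whose square is the Chi-Squared Component. The sketch below is illustrative only; it uses the rounded Observed and Expected Values from Table 4.5, and the name `pearson_residual` is an assumption rather than part of the original analysis.

```python
import math

# (status, gender, observed, expected), in thousands, copied from Table 4.5.
cells = [
    ("Employed", "Female", 19867, 19888.15),
    ("Employed", "Male", 21386, 21364.85),
    ("Unemployed", "Female", 106, 84.85),
    ("Unemployed", "Male", 70, 91.15),
]


def pearson_residual(observed, expected):
    """Signed Pearson residual; its square is the Chi-Squared Component."""
    return (observed - expected) / math.sqrt(expected)


residuals = {
    (status, gender): pearson_residual(o, e) for status, gender, o, e in cells
}

for (status, gender), r in residuals.items():
    direction = "more than expected" if r > 0 else "fewer than expected"
    print(f"{status:<10} {gender:<6} residual={r:+.3f} ({direction})")
```

A positive residual corresponds to more observations than Expected and a negative residual to fewer, matching the comparison of Observed and Expected Values made above.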

Visualization

Figure 4.1: Association Plot showing the Pearson residuals for our Chi-Squared Test of the association between Gender and Employment Status. A positive association between Gender and Employment Status is indicated by a bar above the dotted line. A negative association is indicated by a bar below the dotted line. The strength of the association is indicated by the size of the bar.

This plot shows the Pearson Residuals for our Chi-Square Test for each pair of our two categorical variables Gender and Employment Status. It allows us to get more of a visual understanding as to where the association lies between Gender and Employment Status for our sample data set.

The bar above the dotted line shows that more Females are Observed to be Unemployed than Expected, and the bar below the dotted line shows that fewer Males are Observed to be Unemployed than Expected.

The P-Value of 0.001387 indicates a highly significant association between Gender and Employment Status. As highlighted by the blue and red bars, the significance lies in Unemployment. The grey bars show that there is no significant association in these results for those who are Employed.
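As a rough check on the reported P-Value, the test statistic can be rebuilt by summing the four Chi-Squared Components from Table 4.5. The sketch below assumes a 2 x 2 table (one degree of freedom) and assumes that no continuity correction was applied, which the source does not state. For one degree of freedom the upper-tail probability has the closed form erfc(sqrt(x / 2)), so only the standard library is needed; the function name `chi2_upper_tail_df1` is illustrative.

```python
import math

# Chi-Squared Components copied from Table 4.5 (rounded to 4 decimals).
components = [0.0225, 0.0209, 5.2720, 4.9076]

# The test statistic is the sum of the components over all cells.
chi_square = sum(components)

# Degrees of freedom for a table with 2 rows and 2 columns.
degrees_of_freedom = (2 - 1) * (2 - 1)


def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-squared variable with exactly 1 degree of freedom.

    Uses the identity P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is standard normal. Not valid for other degrees of freedom.
    """
    return math.erfc(math.sqrt(x / 2.0))


p_value = chi2_upper_tail_df1(chi_square)
print(f"chi-square = {chi_square:.4f}, df = {degrees_of_freedom}, p = {p_value:.6f}")
```

Because the components are rounded, the result may differ from the reported 0.001387 in the last digits.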

Conclusion

From our results we conclude that there is a significant association (P-Value < 0.05) between Gender and Employment Status in the US for those who are 25 and over with a Bachelor’s Degree or higher.

However, the significance of the association shows only on the Unemployed side. We therefore do not treat Gender as a determining factor in whether or not someone with a degree gets a job, which is a positive result from a societal point of view. Nevertheless, there is still a significant positive association between Females and Unemployment.

Yet we must recognise that these results alone do not explain all the factors at play. They show an association within our sample of the population data; they do not show that being Female is the cause of unemployment.

We conclude from this analysis that the reason this occurs is unlikely to be Gender alone, and is more likely to involve the social factors and norms that Females face in society. Further analysis should measure these social factors on an index, determine whether they have a significant association with Employment Status, and compare the results by Gender.

One must also consider biological differences, such as childbirth, which may cause more women to leave Employment. This could be studied by comparing Females who are Unemployed because they were fired or quit with those who left for maternal reasons. If such studies show that maternal reasons explain the association, then hopefully, as time goes on and more people can work remotely, the association will no longer be significant.

Also, this study does not conclude that one Gender is more likely to obtain a degree than the other. Our results could possibly have been skewed by the fact that there are fewer Females than Males with a Bachelor’s Degree or higher in our sample. Therefore, further research could test the significance of differences in educational attainment by Gender.

Even though the Educational Equity Act was put in place in 1974 and the Gender Equity Act in 1994 (NWHP, 2023), we still observed more Males than Females with a Bachelor’s Degree or higher in our sample data set (Employed or Unemployed). More research could therefore be done to measure the social factors each gender faces on an index and compare that index with educational attainment.







References

