In this vignette, we explore the association between socioeconomic indicators and longevity for the 1910 cohort of males using the CenSoc-DMF data. Ordinary Least Squares (OLS) Regression on age of death is an easy and effective way to analyze the CenSoc mortality data.
This vignette is written in the style of the “tidyverse.” There are
many other ways to analyze data in R, including base R and
data.table
.
Before starting on the tutorial, please follow the instructions in the Getting Started with CenSoc vignette to extract the 1940 census from IPUMS and match it onto your CenSoc file. Make sure to add the following variables to your cart while making your IPUMS extract:
HISTID
EDUCD
RACE
OWNERSHP
INCWAGE
## Library Packages
library(data.table)
library(ipumsr)
library(tidyverse)
library(gridExtra)
library(cowplot)
library(broom)
## Read in CenSoc-DMF file
## Restrict to conservative matches (fewer false matches)
<- fread("/data/josh/CenSoc/censoc_data/censoc_linked_to_census/v2.1/censoc_dmf_v2.1_linked.csv") %>%
censoc_dmf filter(link_abe_exact_conservative == 1)
## Restrict to cohort of 1910
<- censoc_dmf %>%
censoc_dmf filter(byear == 1910)
IPUMS variables often have value labels, which are text that
corresponds to numeric codes (e.g., the SEX
variable has
value labels: 1 = “Male”, 2 = “Female”).
The IPUMSR
package includes helpful functions for
assigning values labels (see the ipumsr website for
more information and tutorials). In this vignette, for simplicity, we
will manually code our value labels.
Note that incwage
only measures salaried or wage earners
(non-agricultural/business owners) and is topcoded at $5001; all
reported wage and salary incomes over $5000 are coded to the 5001 value.
Values listed as “999998” and “999999” represent missing values.
Additionally, several incwage
values are 0. For this
vignette, we will ignore those values, but the implications are
important to consider for your own research. In addition to the
aforementioned non-wage earners, “0” could also represent those outside
of the labor force due to injury or school.
## Restrict to men with a non-missing value of income and education
<- censoc_dmf %>%
censoc_dmf filter(INCWAGE > 0 & INCWAGE < 5002) %>%
filter((EDUCD > 0 & EDUC < 999)) ## Education is topcoded at 999
## Recode EDUCD (education, detailed) variable to years of education
<- censoc_dmf %>%
censoc_dmf mutate(educ_yrs = case_when(
== 2 ~ 0,
EDUCD == 12 ~ 0,
EDUCD == 14 ~ 1,
EDUCD == 15 ~ 2,
EDUCD == 16 ~ 3,
EDUCD == 17 ~ 4,
EDUCD == 22 ~ 5,
EDUCD == 23 ~ 6,
EDUCD == 25 ~ 7,
EDUCD == 26 ~ 8,
EDUCD == 30 ~ 9,
EDUCD == 40 ~ 10,
EDUCD == 50 ~ 11,
EDUCD == 60 ~ 12,
EDUCD == 70 ~ 13,
EDUCD == 80 ~ 14,
EDUCD == 90 ~ 15,
EDUCD == 100 ~ 16,
EDUCD == 110 ~ 17,
EDUCD == 111 ~ 17,
EDUCD == 112 ~ 17,
EDUCD == 113 ~ 17
EDUCD
))
## Recode RACE var to string
<- censoc_dmf %>%
censoc_dmf mutate(race_string = case_when(
== 1 & HISPAN == 0 ~ "White",
RACE == 2 & HISPAN == 0 ~ "Black",
RACE TRUE ~ "Other"
))
## Recode OWNERSHP variable to string
<- censoc_dmf %>%
censoc_dmf mutate(ownership_string = case_when(
== 1 ~ "Owner",
OWNERSHP == 2 ~ "Renter")) OWNERSHP
Exploratory data analysis is an important part of the modeling process and can help us understand the main characteristics of the data and their relationship with our dependent variable, age at death. Here, we will briefly explore some associations between our independent and dependent variables.
## Additional years of life expectancy by education
<- censoc_dmf %>%
educ_df group_by(educ_yrs) %>%
summarize(death_age_educ = mean(death_age), sd = sd(death_age) * (1/sqrt(n())) ) %>%
mutate(add_yrs_life = death_age_educ - mean(death_age_educ)) %>%
ungroup()
## Education Plot
<- ggplot(data = educ_df, mapping = aes(x = educ_yrs, y = add_yrs_life)) +
educ_plot geom_vline(xintercept = c(12, 16), lwd = .7, lty = 2, color = "grey", alpha = .5) +
geom_pointrange(aes(ymin = add_yrs_life - 1.96*sd, ymax = add_yrs_life + 1.96*sd)) +
theme_cowplot() +
scale_x_continuous(breaks = seq(0, 23, 2)) +
labs(title = "Educational pattern of longevity at age 65",
x = "Years of Education",
y = "Additional years of life")
## Additional years of life expectancy by income decile
<- censoc_dmf %>%
income_df filter(INCWAGE < 5500 & INCWAGE > 0) %>%
filter(!is.na(INCWAGE)) %>%
mutate(wage_decile = ntile(INCWAGE,10)) %>%
group_by(wage_decile) %>%
summarize(death_age_decile = mean(death_age), sd = sd(death_age) * (1/sqrt(n())) ) %>%
mutate(add_yrs_life = death_age_decile - mean(death_age_decile))
## Income Plot
<- ggplot(data = income_df, mapping = aes(x = wage_decile, y = add_yrs_life)) +
income_plot geom_pointrange(aes(ymin = add_yrs_life - 1.96*sd, ymax = add_yrs_life + 1.96*sd)) +
labs(title = "Income pattern of longevity at age 65",
x = "Income Decile",
y = "Additional years of life") +
theme_cowplot() +
scale_x_continuous(breaks = seq(0, 10, 2))
## Display Plots
plot_grid(educ_plot, income_plot, labels = "auto")
There are two specific considerations for using regression on age of death to analyze the CenSoc Mortality data. First, we are only observing deaths for a narrow window. As the left and right truncation ages vary by birth cohort, it is important to include fixed effect terms for each year of birth. Models of the form
\[ Age\_at\_death = birth\_year\_dummy + covariates\_of\_interest \]
provide estimates for the effect of the covariates on the age of death in the sample, controlling for birth cohort truncation effects.
In this case, we are only looking at the cohort of 1910, so we do not need to include a fixed effect term for year of birth.
## Prepare for Regression
<- censoc_dmf %>%
censoc_dmf mutate(race_string = as.factor(race_string)) %>%
mutate(race_string = relevel(race_string, ref = "White"))
## Plot Model
<- lm(death_age ~ INCWAGE + educ_yrs + ownership_string + race_string,
model data = censoc_dmf,
weights = weight)
## View model summary
summary(model)
##
## Call:
## lm(formula = death_age ~ INCWAGE + educ_yrs + ownership_string +
## race_string, data = censoc_dmf, weights = weight)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -32.89 -12.04 0.36 12.23 36.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.591e+01 9.473e-02 801.262 < 2e-16 ***
## INCWAGE 3.850e-04 3.914e-05 9.837 < 2e-16 ***
## educ_yrs 1.945e-01 8.906e-03 21.836 < 2e-16 ***
## ownership_stringRenter -2.365e-01 5.323e-02 -4.444 8.85e-06 ***
## race_stringBlack 5.612e-03 1.371e-01 0.041 0.967
## race_stringOther 1.273e+00 1.841e-01 6.918 4.60e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.05 on 98263 degrees of freedom
## (4730 observations deleted due to missingness)
## Multiple R-squared: 0.009818, Adjusted R-squared: 0.009768
## F-statistic: 194.9 on 5 and 98263 DF, p-value: < 2.2e-16
The broom
package takes the results from the model we
ran in R and conveniently formats it as a data.frame representation.
Using this package, we both extract and plot our coefficients and
associated confidence intervals.
For a more comprehensive review of working with model-based graphics
in R
, we recommend Kieran Healy’s Data Visualization: A Practical
Introduction.
## Construct a data.frame of the results of our statistical model
<- tidy(model, conf.int = T)
tidy.model <- subset(tidy.model, !term %in% "(Intercept)")
tidy.model
## Plot Regression Coefficients
ggplot(tidy.model, mapping = aes(x = term, y = estimate, ymax = conf.high, ymin = conf.low)) +
geom_pointrange() +
coord_flip() +
theme_cowplot() +
labs(y = "OLS Estimate")