Investigating Socioeconomic Disparities in Longevity for the 1910 Cohort

Summary

In this vignette, we explore the association between socioeconomic indicators and longevity for the 1910 cohort of males using the CenSoc-DMF data. Ordinary Least Squares (OLS) Regression on age of death is an easy and effective way to analyze the CenSoc mortality data.

This vignette is written in the style of the “tidyverse.” There are many other ways to analyze data in R, including base R and data.table.

Getting Started

Before starting on the tutorial, please follow the instructions in the Getting Started with CenSoc vignette to extract the 1940 census from IPUMS and match it onto your CenSoc file. Make sure to add the following variables to your cart while making your IPUMS extract:

HISTID
EDUCD
RACE
OWNERSHP
INCWAGE

## Library Packages
library(data.table)
library(ipumsr)
library(tidyverse)
library(gridExtra)
library(cowplot)
library(broom)

## Read in CenSoc-DMF file 
## Restrict to conservative matches (fewer false matches)
censoc_dmf <- fread("/data/josh/CenSoc/censoc_data/censoc_linked_to_census/v2.1/censoc_dmf_v2.1_linked.csv") %>% 
  filter(link_abe_exact_conservative == 1)
  
## Restrict to cohort of 1910
censoc_dmf <- censoc_dmf %>% 
  filter(byear == 1910)

Data Preparation

IPUMS variables often have value labels, which are text that corresponds to numeric codes (e.g., the SEX variable has value labels: 1 = “Male”, 2 = “Female”).

The IPUMSR package includes helpful functions for assigning values labels (see the ipumsr website for more information and tutorials). In this vignette, for simplicity, we will manually code our value labels.

Note that incwage only measures salaried or wage earners (non-agricultural/business owners) and is topcoded at $5001; all reported wage and salary incomes over $5000 are coded to the 5001 value. Values listed as “999998” and “999999” represent missing values. Additionally, several incwage values are 0. For this vignette, we will ignore those values, but the implications are important to consider for your own research. In addition to the aforementioned non-wage earners, “0” could also represent those outside of the labor force due to injury or school.

## Restrict to men with a non-missing value of income and education
censoc_dmf <- censoc_dmf %>%
  filter(INCWAGE > 0 & INCWAGE < 5002) %>%
  filter((EDUCD > 0 & EDUC < 999)) ## Education is topcoded at 999

## Recode EDUCD (education, detailed) variable to years of education
censoc_dmf <- censoc_dmf %>%
  mutate(educ_yrs = case_when(
   EDUCD == 2 ~ 0,
   EDUCD == 12 ~ 0,
   EDUCD == 14 ~ 1,
   EDUCD == 15 ~ 2,
   EDUCD == 16 ~ 3,
   EDUCD == 17 ~ 4,
   EDUCD == 22 ~ 5,
   EDUCD == 23 ~ 6,
   EDUCD == 25 ~ 7,
   EDUCD == 26 ~ 8,
   EDUCD == 30 ~ 9,
   EDUCD == 40 ~ 10,
   EDUCD == 50 ~ 11,
   EDUCD == 60 ~ 12,
   EDUCD == 70 ~ 13,
   EDUCD == 80 ~ 14,
   EDUCD == 90 ~ 15,
   EDUCD == 100 ~ 16,
   EDUCD == 110 ~ 17,
   EDUCD == 111 ~ 17,
   EDUCD == 112 ~ 17,
   EDUCD == 113 ~ 17
)) 

## Recode RACE var to string
censoc_dmf <- censoc_dmf %>% 
  mutate(race_string = case_when(
    RACE == 1 & HISPAN == 0 ~ "White",
    RACE == 2 & HISPAN == 0 ~ "Black", 
    TRUE ~ "Other"
  ))

## Recode OWNERSHP variable to string
censoc_dmf <- censoc_dmf %>% 
  mutate(ownership_string = case_when(
    OWNERSHP == 1 ~ "Owner",
    OWNERSHP == 2 ~ "Renter"))

Exploratory Data Analysis

Exploratory data analysis is an important part of the modeling process and can help us understand the main characteristics of the data and their relationship with our dependent variable, age at death. Here, we will briefly explore some associations between our independent and dependent variables.

## Additional years of life expectancy by education
educ_df <- censoc_dmf %>% 
  group_by(educ_yrs) %>% 
  summarize(death_age_educ = mean(death_age), sd = sd(death_age) * (1/sqrt(n())) ) %>% 
  mutate(add_yrs_life = death_age_educ - mean(death_age_educ)) %>% 
  ungroup()

## Education Plot
educ_plot <- ggplot(data = educ_df, mapping = aes(x = educ_yrs, y = add_yrs_life)) +
  geom_vline(xintercept = c(12, 16), lwd = .7, lty = 2, color = "grey", alpha = .5) + 
  geom_pointrange(aes(ymin = add_yrs_life - 1.96*sd, ymax = add_yrs_life + 1.96*sd)) +
  theme_cowplot() + 
  scale_x_continuous(breaks = seq(0, 23, 2)) +
  labs(title = "Educational pattern of longevity at age 65",
       x = "Years of Education",
       y = "Additional years of life") 

## Additional years of life expectancy by income decile
income_df <- censoc_dmf %>% 
  filter(INCWAGE < 5500 & INCWAGE > 0) %>% 
  filter(!is.na(INCWAGE)) %>% 
  mutate(wage_decile = ntile(INCWAGE,10)) %>% 
  group_by(wage_decile) %>% 
  summarize(death_age_decile = mean(death_age), sd = sd(death_age) * (1/sqrt(n())) ) %>% 
  mutate(add_yrs_life = death_age_decile - mean(death_age_decile)) 

## Income Plot
income_plot <- ggplot(data = income_df, mapping = aes(x = wage_decile, y = add_yrs_life)) +
  geom_pointrange(aes(ymin = add_yrs_life - 1.96*sd, ymax = add_yrs_life + 1.96*sd)) +
  labs(title = "Income pattern of longevity at age 65",
       x = "Income Decile",
       y = "Additional years of life") + 
  theme_cowplot() + 
  scale_x_continuous(breaks = seq(0, 10, 2))

## Display Plots
plot_grid(educ_plot, income_plot, labels = "auto")

OLS Regression on Age of Death

There are two specific considerations for using regression on age of death to analyze the CenSoc Mortality data. First, we are only observing deaths for a narrow window. As the left and right truncation ages vary by birth cohort, it is important to include fixed effect terms for each year of birth. Models of the form

\[ Age\_at\_death = birth\_year\_dummy + covariates\_of\_interest \]

provide estimates for the effect of the covariates on the age of death in the sample, controlling for birth cohort truncation effects.

In this case, we are only looking at the cohort of 1910, so we do not need to include a fixed effect term for year of birth.

## Prepare for Regression
censoc_dmf <- censoc_dmf %>% 
  mutate(race_string = as.factor(race_string)) %>% 
  mutate(race_string = relevel(race_string, ref = "White"))  

## Plot Model
model <- lm(death_age ~ INCWAGE + educ_yrs + ownership_string + race_string,
            data = censoc_dmf,
            weights = weight)

## View model summary 
summary(model)

## 
## Call:
## lm(formula = death_age ~ INCWAGE + educ_yrs + ownership_string + 
##     race_string, data = censoc_dmf, weights = weight)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -32.89 -12.04   0.36  12.23  36.93 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             7.591e+01  9.473e-02 801.262  < 2e-16 ***
## INCWAGE                 3.850e-04  3.914e-05   9.837  < 2e-16 ***
## educ_yrs                1.945e-01  8.906e-03  21.836  < 2e-16 ***
## ownership_stringRenter -2.365e-01  5.323e-02  -4.444 8.85e-06 ***
## race_stringBlack        5.612e-03  1.371e-01   0.041    0.967    
## race_stringOther        1.273e+00  1.841e-01   6.918 4.60e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.05 on 98263 degrees of freedom
##   (4730 observations deleted due to missingness)
## Multiple R-squared:  0.009818,   Adjusted R-squared:  0.009768 
## F-statistic: 194.9 on 5 and 98263 DF,  p-value: < 2.2e-16

Visualizing Model Coefficients

The broom package takes the results from the model we ran in R and conveniently formats it as a data.frame representation. Using this package, we both extract and plot our coefficients and associated confidence intervals.

For a more comprehensive review of working with model-based graphics in R, we recommend Kieran Healy’s Data Visualization: A Practical Introduction.

## Construct a data.frame of the results of our statistical model
tidy.model <- tidy(model, conf.int = T)
tidy.model <- subset(tidy.model, !term %in% "(Intercept)")

## Plot Regression Coefficients
ggplot(tidy.model, mapping = aes(x = term, y = estimate, ymax = conf.high, ymin = conf.low)) +
  geom_pointrange() + 
  coord_flip() + 
  theme_cowplot() + 
  labs(y = "OLS Estimate")