library(haven)
library(survival)
library(dplyr)
library(biostat3)Replicating a Cox Model with Poisson Regression in R
Introduction
In survival analysis, the Cox proportional hazards model is the standard method for modeling time-to-event data. However, it’s also possible to fit an equivalent model using Poisson regression using a time-split dataset (Bendix Carstensen. “Who needs the Cox model anyway?”).
This approach is not only a great way to understand the underlying mechanics of the Cox proportional hazards (PH) model but can also be useful for fitting more complex models (e.g., with time-varying effects or multiple time scales).
This post demonstrates how to replicate the results of a Cox PH model using Poisson regression in R. The analysis is a replication of Paul Dickman’s tutorial, which was originally done in Stata.
We will fit three models:
- A Cox PH model.
- A Poisson GLM on data split into one-year intervals.
- A Poisson GLM on data split at each event time, which should closely replicate the Cox model.
Long story short, the coefficients from the Cox model and the Poisson model with splitting at event times should be virtually identical, and the one-year split model should be a close approximation to the Cox model.
Setup and Data Preparation
This is basically stset the data in Stata. We’ll focus on patients with localised colon carsinoma (stage 1). We create an event indicator dead for cancer-specific death, and censor follow-up time at 10 years (120 months).
# Load the colon dataset
colon <- read_dta("http://pauldickman.com/data/colon.dta")
# Prepare the data for analysis
colon_data <- colon |>
filter(stage==1) |>
mutate(# Event: 1 if Dead: cancer; 0 otherwise
dead = as.numeric(status==1),
# Female as the reference group
female = ifelse(sex == 2, 1, 0),
# Age group as a factor
agegrp_factor = as.factor(agegrp)) |>
mutate(# Right-censored at 10 years (120 months)
surv_mm = pmin(surv_mm, 120),
dead = ifelse(surv_mm >= 120, 0, dead))1. Cox PH Model
Each model below was fitted with three covariates: sex (female), period of diagnosis (1975–1984 vs. 1985–1994), and age group (0–44, 45–59, 60–74, or 75+ years). Because the primary objective is to assess sex differences in cancer-specific survival, we focus on the coefficient for female, and period of diagnosis and age group are considered confounders.
The Cox PH model is defined by the hazard function, which describes the hazard at time
Here:
is the baseline hazard function, which is unspecified. is the relative hazard, which is assumed to be constant over time (the proportional hazards assumption). are the coefficients we estimate.
The model is fit by maximizing the partial likelihood, which compares the risk of the individual who has an event at a given time to the total risk of all individuals still at risk at that time.
We begin by fitting a Cox PH model. An important detail here is specifying ties = "breslow". The default in R’s survival package coxph is "efron", but Stata’s default is "breslow". To replicate the Stata results and the theoretical equivalence with the Poisson model, we must use the Breslow method for handling tied event times.
fit1 <- coxph(Surv(surv_mm, dead) ~ female + year8594 + agegrp_factor,
ties = "breslow",
data = colon_data)
summary(fit1)Call:
coxph(formula = Surv(surv_mm, dead) ~ female + year8594 + agegrp_factor,
data = colon_data, ties = "breslow")
n= 6274, number of events= 1687
coef exp(coef) se(coef) z Pr(>|z|)
female -0.10101 0.90392 0.04995 -2.022 0.0432 *
year8594 -0.28851 0.74938 0.04941 -5.839 5.25e-09 ***
agegrp_factor1 -0.08517 0.91836 0.13969 -0.610 0.5420
agegrp_factor2 0.22279 1.24956 0.12666 1.759 0.0786 .
agegrp_factor3 0.75229 2.12185 0.12664 5.941 2.84e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
female 0.9039 1.1063 0.8196 0.9969
year8594 0.7494 1.3344 0.6802 0.8256
agegrp_factor1 0.9184 1.0889 0.6984 1.2076
agegrp_factor2 1.2496 0.8003 0.9749 1.6017
agegrp_factor3 2.1219 0.4713 1.6555 2.7196
Concordance= 0.609 (se = 0.007 )
Likelihood ratio test= 189.8 on 5 df, p=<2e-16
Wald test = 190.1 on 5 df, p=<2e-16
Score (logrank) test = 195.3 on 5 df, p=<2e-16
The hazard ratio for female is:
0.9039
2. Poisson GLM with One-Year Time-Splitting
The core idea is to model the number of events,
The rate
Combining these, we model the mean of the number of events
Now, I’ll demonstrate how to approximate the Cox model with a piecewise exponential model. First, we split each patient’s follow-up time into one-year intervals. Within each interval, the hazard is assumed to be constant. The baseline hazard is modeled by including the time-split interval (fu) as a factor in the model.
First, we split the data using survSplit().
# This creates a piecewise exponential model with 1-year (12-month) time bands.
max_time <- max(colon_data$surv_mm)
cut_points <- seq(0, max_time, by = 12)
cut_points <- cut_points[cut_points < max_time] # Ensure cut points are within max time
colon_data_splityr <- survSplit(data = colon_data,
cut=cut_points,
start="start",
end="surv_mm",
event="dead")
## Calculate person-time (pt) and the follow-up interval (fu) as a factor
colon_data_splityr2 <- transform(colon_data_splityr,
pt = surv_mm - start,
fu = as.factor(start))A key step here is to collapse the data. Since many individuals will now share the same covariate pattern within a given time interval, we can aggregate the data to get total person-time and total events for each unique combination of covariates and follow-up intervals. This makes the model fitting much more efficient (faster).
## Collapse
## This is the key of using aggregate level data to model non-parametric baseline
## If we can obtain this level of data, we can fit a glm
colon_data_splityr3 <- colon_data_splityr2 |>
group_by(female, fu, year8594, agegrp_factor) |>
summarise(pt=sum(pt),
dead=sum(dead)) |>
data.frame()`summarise()` has grouped output by 'female', 'fu', 'year8594'. You can
override using the `.groups` argument.
Next, we fit a Poisson GLM. The number of events (dead) is the outcome, and we use the log of person-time (log(pt)) as an offset.
# Fit GLM with time band factor to model non-constant baseline
# time_years - start is the exact duration in the split interval
fit2 <- glm(dead ~ female + fu + year8594 + agegrp_factor + offset(log(pt)),
family = poisson,
data = colon_data_splityr3)
summary(fit2)
Call:
glm(formula = dead ~ female + fu + year8594 + agegrp_factor +
offset(log(pt)), family = poisson, data = colon_data_splityr3)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.01335 0.12977 -38.632 < 2e-16 ***
female -0.10496 0.04996 -2.101 0.03565 *
fu12 -0.19276 0.06611 -2.916 0.00355 **
fu24 -0.44877 0.07602 -5.904 3.56e-09 ***
fu36 -0.64656 0.08691 -7.439 1.01e-13 ***
fu48 -0.82848 0.09998 -8.287 < 2e-16 ***
fu60 -0.97550 0.11464 -8.509 < 2e-16 ***
fu72 -1.31371 0.14434 -9.102 < 2e-16 ***
fu84 -1.97217 0.21337 -9.243 < 2e-16 ***
fu96 -2.16954 0.25430 -8.532 < 2e-16 ***
fu108 -2.20205 0.28152 -7.822 5.20e-15 ***
year8594 -0.28641 0.04943 -5.795 6.84e-09 ***
agegrp_factor1 -0.08334 0.13969 -0.597 0.55074
agegrp_factor2 0.22766 0.12665 1.798 0.07225 .
agegrp_factor3 0.77046 0.12663 6.084 1.17e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 912.56 on 159 degrees of freedom
Residual deviance: 219.23 on 145 degrees of freedom
AIC: 734.1
Number of Fisher Scoring iterations: 5
The resulting hazard ratio for female from this model is:
0.9004
This is fairly close to the Cox model result 0.9039. This offers possibility for cross-country survival comparison (introduced in subsequent tutorials) without sharing individual-level data as along as we can have aggregate-level data for each time interval.
3. Poisson GLM Split at Event Times
By splitting the data at each unique event time, we are creating a series of time intervals where the baseline hazard is assumed to be constant within each interval, but allowed to change between them. Because the intervals are defined by the event times themselves, this approach creates a non-parametric step-function for the baseline hazard, which is analogous to the unspecified baseline hazard,
The likelihood for an individual
The sum of log-likelihood is the sum over all individuals and all intervals:
Let’s focus on a specific event time
Substituting this back into the log-likelihood function and summing over all event times gives a “profile likelihood” that only depends on
Because the likelihoods are proportional, maximizing one is equivalent to maximizing the other, which is why the resulting estimates for
In practice, first, we extract all unique event times.
event_times <- colon_data |>
filter(dead == 1) |>
pull(surv_mm) |>
unique() |>
sort()We split the data again, this time using the unique event times as cut-points.
colon_data_splitevent <- survSplit(data = colon_data,
cut = event_times,
start = "start",
end = "surv_mm",
event = "dead")
colon_data_splitevent2 <- transform(colon_data_splitevent,
pt = surv_mm - start,
fu = as.factor(start))Next, we collapse the data to get total person-time and total events for each unique combination of covariates and follow-up intervals.
colon_data_splitevent3 <- colon_data_splitevent2 |>
group_by(female, fu, year8594, agegrp_factor) |>
summarise(pt=sum(pt),
dead=sum(dead),
.groups = 'drop') |>
# Sort by the covariates first, then by fu
arrange(female, year8594, agegrp_factor, fu) |>
# Order the variables as female, year8594, agegrp_factor, fu, pt, dead
dplyr::select(female, year8594, agegrp_factor, fu, pt, dead) |>
data.frame()
head(colon_data_splitevent3) female year8594 agegrp_factor fu pt dead
1 0 0 0 0 37.5 1
2 0 0 0 0.5 74.0 0
3 0 0 0 1.5 74.0 0
4 0 0 0 2.5 74.0 0
5 0 0 0 3.5 74.0 0
6 0 0 0 4.5 74.0 0
Finally, we fit the Poisson GLM on the aggregated data.
fit3 <- glm(dead ~ female + fu + year8594 + agegrp_factor + offset(log(pt)),
family = poisson,
data = colon_data_splitevent3)The hazard ratio for female to male is: 0.9039
Conclusion
Let’s compare the hazard ratios for the female coefficient from all three models.
| Model | HR (Female to male) |
|---|---|
| 1. Cox PH | 0.9039 |
| 2. Poisson GLM with One-Year Time-Splitting | 0.9004 |
| 3. Poisson GLM Split at Event Times | 0.9039 |
The coefficient estimated from the Poisson model with splitting at every event time (fit3) is virtually identical to that from the Cox proportional hazards model fitted with (fit1). The very minor discrepancy arises from numerical rounding and differences in the optimization routines used by the two fitting procedures. The piecewise exponential model with one-year intervals (fit2) also yields a close approximation to the Cox model.