Replicating a Cox Model with Poisson Regression in R

Survival Analysis

Author

Enoch Yi-Tung Chen

Published

October 15, 2025

Introduction

In survival analysis, the Cox proportional hazards model is the standard method for modeling time-to-event data. However, it’s also possible to fit an equivalent model using Poisson regression using a time-split dataset (Bendix Carstensen. “Who needs the Cox model anyway?”).

This approach is not only a great way to understand the underlying mechanics of the Cox proportional hazards (PH) model but can also be useful for fitting more complex models (e.g., with time-varying effects or multiple time scales).

This post demonstrates how to replicate the results of a Cox PH model using Poisson regression in R. The analysis is a replication of Paul Dickman’s tutorial, which was originally done in Stata.

We will fit three models:

A Cox PH model.
A Poisson GLM on data split into one-year intervals.
A Poisson GLM on data split at each event time, which should closely replicate the Cox model.

Long story short, the coefficients from the Cox model and the Poisson model with splitting at event times should be virtually identical, and the one-year split model should be a close approximation to the Cox model.

Setup and Data Preparation

library(haven)
library(survival)
library(dplyr)
library(biostat3)

This is basically stset the data in Stata. We’ll focus on patients with localised colon carsinoma (stage 1). We create an event indicator dead for cancer-specific death, and censor follow-up time at 10 years (120 months).

# Load the colon dataset
colon <- read_dta("http://pauldickman.com/data/colon.dta")

# Prepare the data for analysis
colon_data <- colon |>
              filter(stage==1) |>
              mutate(# Event: 1 if Dead: cancer; 0 otherwise
                     dead = as.numeric(status==1),
                     # Female as the reference group
                     female = ifelse(sex == 2, 1, 0),
                     # Age group as a factor
                     agegrp_factor = as.factor(agegrp)) |>
              mutate(# Right-censored at 10 years (120 months)
                     surv_mm = pmin(surv_mm, 120),
                     dead = ifelse(surv_mm >= 120, 0, dead))

1. Cox PH Model

Each model below was fitted with three covariates: sex (female), period of diagnosis (1975–1984 vs. 1985–1994), and age group (0–44, 45–59, 60–74, or 75+ years). Because the primary objective is to assess sex differences in cancer-specific survival, we focus on the coefficient for female, and period of diagnosis and age group are considered confounders.

The Cox PH model is defined by the hazard function, which describes the hazard at time $t$ , given a set of covariates $X$ :

$h (t | X) = h_{0} (t) \exp (X β)$

Here:

$h_{0} (t)$ is the baseline hazard function, which is unspecified.
$\exp (X β)$ is the relative hazard, which is assumed to be constant over time (the proportional hazards assumption).
$β$ are the coefficients we estimate.

The model is fit by maximizing the partial likelihood, which compares the risk of the individual who has an event at a given time to the total risk of all individuals still at risk at that time.

We begin by fitting a Cox PH model. An important detail here is specifying ties = "breslow". The default in R’s survival package coxph is "efron", but Stata’s default is "breslow". To replicate the Stata results and the theoretical equivalence with the Poisson model, we must use the Breslow method for handling tied event times.

fit1 <- coxph(Surv(surv_mm, dead) ~ female + year8594 + agegrp_factor,
              ties = "breslow", 
              data = colon_data)

summary(fit1)

Call:
coxph(formula = Surv(surv_mm, dead) ~ female + year8594 + agegrp_factor, 
    data = colon_data, ties = "breslow")

  n= 6274, number of events= 1687 

                   coef exp(coef) se(coef)      z Pr(>|z|)    
female         -0.10101   0.90392  0.04995 -2.022   0.0432 *  
year8594       -0.28851   0.74938  0.04941 -5.839 5.25e-09 ***
agegrp_factor1 -0.08517   0.91836  0.13969 -0.610   0.5420    
agegrp_factor2  0.22279   1.24956  0.12666  1.759   0.0786 .  
agegrp_factor3  0.75229   2.12185  0.12664  5.941 2.84e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

               exp(coef) exp(-coef) lower .95 upper .95
female            0.9039     1.1063    0.8196    0.9969
year8594          0.7494     1.3344    0.6802    0.8256
agegrp_factor1    0.9184     1.0889    0.6984    1.2076
agegrp_factor2    1.2496     0.8003    0.9749    1.6017
agegrp_factor3    2.1219     0.4713    1.6555    2.7196

Concordance= 0.609  (se = 0.007 )
Likelihood ratio test= 189.8  on 5 df,   p=<2e-16
Wald test            = 190.1  on 5 df,   p=<2e-16
Score (logrank) test = 195.3  on 5 df,   p=<2e-16

The hazard ratio for female is:

0.9039

2. Poisson GLM with One-Year Time-Splitting

The core idea is to model the number of events, $D$ . For a small time interval from $t$ to $t + y$ , the number of events $D$ for an individual is assumed to follow a Poisson distribution with a mean equal to the rate $λ$ multiplied by the person-time at risk, $y$ :

$D \sim Poisson (λ \times y)$

The rate $λ$ is then modeled using a log-linear model, which includes the covariates $X$ with corresponding time intervals $β$ . This allows the baseline rate to vary across intervals. The model for the rate is:

$\log (λ) = β_{0} + X β$

Combining these, we model the mean of the number of events $D$ as independent Poisson variables with mean values $E [D] = λ \times y$ :

$\begin{aligned} \log (E [D]) & = \log (y) + γ_{interval} \cdot f u + X β \\ = \log (y) + γ_{1} \cdot f u_{1} + γ_{2} \cdot f u_{2} + γ_{3} \cdot f u_{3} + . . . + X β \end{aligned}$ where: - $\log (y)$ is an offset (log person-time) - $γ_{interval}$ is coefficient(s) for time intervals $f u$ - $X β$ is the product of covariates and coefficients

Now, I’ll demonstrate how to approximate the Cox model with a piecewise exponential model. First, we split each patient’s follow-up time into one-year intervals. Within each interval, the hazard is assumed to be constant. The baseline hazard is modeled by including the time-split interval (fu) as a factor in the model.

First, we split the data using survSplit().

# This creates a piecewise exponential model with 1-year (12-month) time bands.
max_time <- max(colon_data$surv_mm)
cut_points <- seq(0, max_time, by = 12)
cut_points <- cut_points[cut_points < max_time] # Ensure cut points are within max time

colon_data_splityr <- survSplit(data = colon_data,
                                cut=cut_points,
                                start="start",
                                end="surv_mm",
                                event="dead")

## Calculate person-time (pt) and the follow-up interval (fu) as a factor
colon_data_splityr2 <- transform(colon_data_splityr,
                                 pt = surv_mm - start,
                                 fu = as.factor(start))

A key step here is to collapse the data. Since many individuals will now share the same covariate pattern within a given time interval, we can aggregate the data to get total person-time and total events for each unique combination of covariates and follow-up intervals. This makes the model fitting much more efficient (faster).

## Collapse
## This is the key of using aggregate level data to model non-parametric baseline
## If we can obtain this level of data, we can fit a glm
colon_data_splityr3 <- colon_data_splityr2 |>
                       group_by(female, fu, year8594, agegrp_factor) |>
                       summarise(pt=sum(pt), 
                                 dead=sum(dead)) |>
                       data.frame()

`summarise()` has grouped output by 'female', 'fu', 'year8594'. You can
override using the `.groups` argument.

Next, we fit a Poisson GLM. The number of events (dead) is the outcome, and we use the log of person-time (log(pt)) as an offset.

# Fit GLM with time band factor to model non-constant baseline
# time_years - start is the exact duration in the split interval 
fit2 <- glm(dead ~ female + fu + year8594 + agegrp_factor + offset(log(pt)),
            family = poisson,
            data = colon_data_splityr3)

summary(fit2)


Call:
glm(formula = dead ~ female + fu + year8594 + agegrp_factor + 
    offset(log(pt)), family = poisson, data = colon_data_splityr3)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -5.01335    0.12977 -38.632  < 2e-16 ***
female         -0.10496    0.04996  -2.101  0.03565 *  
fu12           -0.19276    0.06611  -2.916  0.00355 ** 
fu24           -0.44877    0.07602  -5.904 3.56e-09 ***
fu36           -0.64656    0.08691  -7.439 1.01e-13 ***
fu48           -0.82848    0.09998  -8.287  < 2e-16 ***
fu60           -0.97550    0.11464  -8.509  < 2e-16 ***
fu72           -1.31371    0.14434  -9.102  < 2e-16 ***
fu84           -1.97217    0.21337  -9.243  < 2e-16 ***
fu96           -2.16954    0.25430  -8.532  < 2e-16 ***
fu108          -2.20205    0.28152  -7.822 5.20e-15 ***
year8594       -0.28641    0.04943  -5.795 6.84e-09 ***
agegrp_factor1 -0.08334    0.13969  -0.597  0.55074    
agegrp_factor2  0.22766    0.12665   1.798  0.07225 .  
agegrp_factor3  0.77046    0.12663   6.084 1.17e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 912.56  on 159  degrees of freedom
Residual deviance: 219.23  on 145  degrees of freedom
AIC: 734.1

Number of Fisher Scoring iterations: 5

The resulting hazard ratio for female from this model is:

0.9004

This is fairly close to the Cox model result 0.9039. This offers possibility for cross-country survival comparison (introduced in subsequent tutorials) without sharing individual-level data as along as we can have aggregate-level data for each time interval.

3. Poisson GLM Split at Event Times

By splitting the data at each unique event time, we are creating a series of time intervals where the baseline hazard is assumed to be constant within each interval, but allowed to change between them. Because the intervals are defined by the event times themselves, this approach creates a non-parametric step-function for the baseline hazard, which is analogous to the unspecified baseline hazard, $h_{0} (t)$ , in the Cox model.

The likelihood for an individual $i$ $(d_{i j}, y_{i j})$ in time interval $j$ (from $t_{j - 1}$ to $t_{j}$ ) in a Poisson model is:

$L_{i j} = \frac{(λ_{i j} y_{i j})^{d_{i j}}}{d_{i j}!} \exp (- λ_{i j} y_{i j}),$ where $d_{i j}$ is the event indicator (0 or 1), $y_{i j}$ is the person-time at risk, and $λ_{i j}$ is the hazard rate for that individual in that interval. The rate is modeled as:

$λ_{i j} = λ_{j} \exp (X_{i} β),$ where $λ_{j}$ is the baseline hazard for interval $j$ and $\exp (X_{i} β)$ is the relative hazard for individual $i$ .

The sum of log-likelihood is the sum over all individuals and all intervals:

$ℓ (β, λ) = \sum_{j} \sum_{i} [d_{i j} \log (λ_{j} y_{i j} \exp (X_{i} β)) - λ_{j} y_{i j} \exp (X_{i} β)]$

Let’s focus on a specific event time $τ_{k}$ . Let $D_{k} = \sum_{i} d_{i k}$ be the total number of events at this time, and let $R_{k}$ be the set of all individuals at risk just before $τ_{k}$ . The key insight is that the maximum likelihood estimate for the baseline hazard ${\hat{λ}}_{k}$ in this interval is:

${\hat{λ}}_{k} = \frac{D_{k}}{\sum_{i \in R_{k}} y_{i k} \exp (X_{i} β)}$

Substituting this back into the log-likelihood function and summing over all event times gives a “profile likelihood” that only depends on $β$ . This profile likelihood is mathematically proportional to the log-partial likelihood of the Cox model using the Breslow approximation for tied data:

$ℓ_{C o x - B r e s l o w} (β) = \sum_{k} [(\sum_{i \in Events at τ_{k}} X_{i} β) - D_{k} \log (\sum_{i \in R_{k}} \exp (X_{i} β))]$

Because the likelihoods are proportional, maximizing one is equivalent to maximizing the other, which is why the resulting estimates for $β$ are virtually identical. The small differences observed are due to numerical precision in the different optimization algorithms.

In practice, first, we extract all unique event times.

event_times <- colon_data |>
               filter(dead == 1) |>
               pull(surv_mm) |>
               unique() |>
               sort()

We split the data again, this time using the unique event times as cut-points.

colon_data_splitevent <- survSplit(data = colon_data,
                                   cut = event_times,
                                   start = "start",
                                   end = "surv_mm",
                                   event = "dead")

colon_data_splitevent2 <- transform(colon_data_splitevent,
                                    pt = surv_mm - start,
                                    fu = as.factor(start))

Next, we collapse the data to get total person-time and total events for each unique combination of covariates and follow-up intervals.

colon_data_splitevent3 <- colon_data_splitevent2 |>
                          group_by(female, fu, year8594, agegrp_factor) |>
                          summarise(pt=sum(pt), 
                                    dead=sum(dead),
                                    .groups = 'drop') |>
                          # Sort by the covariates first, then by fu
                          arrange(female, year8594, agegrp_factor, fu) |>
                          # Order the variables as female, year8594, agegrp_factor, fu, pt, dead
                          dplyr::select(female, year8594, agegrp_factor, fu, pt, dead) |>
                          data.frame()
                          
head(colon_data_splitevent3)

  female year8594 agegrp_factor  fu   pt dead
1      0        0             0   0 37.5    1
2      0        0             0 0.5 74.0    0
3      0        0             0 1.5 74.0    0
4      0        0             0 2.5 74.0    0
5      0        0             0 3.5 74.0    0
6      0        0             0 4.5 74.0    0

Finally, we fit the Poisson GLM on the aggregated data.

fit3 <- glm(dead ~ female + fu + year8594 + agegrp_factor + offset(log(pt)),
            family = poisson,
            data = colon_data_splitevent3)

The hazard ratio for female to male is: 0.9039

Conclusion

Let’s compare the hazard ratios for the female coefficient from all three models.

Model	HR (Female to male)
1. Cox PH	0.9039
2. Poisson GLM with One-Year Time-Splitting	0.9004
3. Poisson GLM Split at Event Times	0.9039

The coefficient estimated from the Poisson model with splitting at every event time (fit3) is virtually identical to that from the Cox proportional hazards model fitted with (fit1). The very minor discrepancy arises from numerical rounding and differences in the optimization routines used by the two fitting procedures. The piecewise exponential model with one-year intervals (fit2) also yields a close approximation to the Cox model.