Estimating survival when individual-participant data can’t be pooled: potential solutions

Background

Analyzing data from multiple institutions/countries lets us study disease patterns, treatment effectiveness, and survival outcomes on a larger scale, with more statistical power. In an ideal world, we would pool individual-participant data (IPD) from all participating institutions/countries into a single dataset.

However, life is usually not so easy. Due to privacy regulations (such as GDPR) and ethical considerations, sharing individual-level data across borders has become more difficult, and in some cases is no longer feasible. :(

Question

How can we perform a survival analysis without access to pooled individual-participant data?

Potential solutions

1. Federated analysis

Federated analysis is an approach where a statistical model is trained on data from multiple sources without the data ever leaving its source.

The basic idea is to combine locally computed results on a central server, in the following steps:

Local Analysis: Each country fits a survival model (e.g., a Cox PH model) on its own local data.
Share Summaries: Each country shares summary information, like model coefficients or aggregate statistics.
Central Aggregation: A central coordinator combines these summaries to build a single, global model.
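The three steps above can be sketched in code. The example below is a minimal illustration, not the WebDISCO protocol: it assumes a Cox model stratified by site (each site keeps its own baseline hazard), so that the global partial log-likelihood is simply the sum of the site-level ones. Each site then only needs to return the gradient and information matrix of its local partial log-likelihood, and the coordinator runs Newton-Raphson on the sums. The function names, the simulated data, and the true coefficients are all hypothetical.

```python
import numpy as np

def local_summaries(beta, times, events, X):
    """Runs at a site: gradient and information matrix of the local
    Cox partial log-likelihood (Breslow handling of ties) at `beta`."""
    p = X.shape[1]
    grad = np.zeros(p)
    info = np.zeros((p, p))
    risk_score = np.exp(X @ beta)
    for i in np.flatnonzero(events):
        at_risk = times >= times[i]           # local risk set at this event time
        w = risk_score[at_risk]
        Xr = X[at_risk]
        s0 = w.sum()
        mean_x = (w @ Xr) / s0                # risk-weighted covariate mean
        second = (Xr.T * w) @ Xr / s0         # risk-weighted second moment
        grad += X[i] - mean_x
        info += second - np.outer(mean_x, mean_x)
    return grad, info

def federated_cox(sites, n_covariates, n_iter=25, tol=1e-8):
    """Runs at the coordinator: Newton-Raphson on summed site summaries.
    Only gradients and information matrices cross the site boundary."""
    beta = np.zeros(n_covariates)
    for _ in range(n_iter):
        summaries = [local_summaries(beta, *site) for site in sites]
        grad = sum(g for g, _ in summaries)
        info = sum(h for _, h in summaries)
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the information matrix of the last iteration
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se

def simulate_site(rng, n, baseline_rate, beta):
    """Hypothetical local dataset: exponential event times with censoring."""
    X = rng.normal(size=(n, len(beta)))
    event_time = rng.exponential(1.0 / (baseline_rate * np.exp(X @ beta)))
    censor_time = rng.exponential(3.0 / baseline_rate, size=n)
    times = np.minimum(event_time, censor_time)
    events = event_time <= censor_time
    return times, events, X

rng = np.random.default_rng(0)
true_beta = np.array([0.5, -0.3])             # assumed values, for illustration only
sites = [simulate_site(rng, 300, rate, true_beta) for rate in (0.5, 1.0, 2.0)]

beta_hat, se_hat = federated_cox(sites, n_covariates=2)
print("pooled log hazard ratios:", beta_hat)
print("standard errors:", se_hat)
```

In a real deployment the calls to `local_summaries` would be remote requests to each site, and the shared gradients would still need disclosure checks (for example, when a risk set is very small).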

References:

Lu, C.-L., et al. (2015). “WebDISCO: a web service for distributed cox model learning without patient-level data sharing.” Journal of the American Medical Informatics Association. https://doi.org/10.1093/jamia/ocv083
NORDCAN: Data were analyzed locally in each country, and aggregated data were sent to the International Agency for Research on Cancer (IARC) for comparison.
Lambert, P. C. (2024). “A practical approach to fitting cancer survival models when data can’t move across borders.” Presentation Slides and Codes.

Note

Lambert et al. have outlined a workflow for this topic:

Researcher A (Country A) fits a survival model on their local data. They then share the model output.
Researcher B (Country B) takes the model output from Country A and compares it with Country B’s own data.

This allows Researcher B to carry out survival standardisation to compare survival across Countries A and B.
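As a rough illustration of what standardisation with a shared model could look like, the sketch below assumes Country A shares the parameters of a Weibull proportional hazards model, S(t | x) = exp(-scale * t^shape * exp(x'beta)). This is a simple stand-in chosen for brevity, not necessarily the model class used in the presentation. Researcher B averages the predicted survival over Country B's own covariate distribution. All parameter values, covariates, and names are hypothetical.

```python
import numpy as np

def standardised_survival(times, scale, shape, beta, X):
    """Average the model-predicted survival curves over the rows of X.
    Assumes a Weibull PH model: S(t | x) = exp(-scale * t**shape * exp(x @ beta))."""
    times = np.asarray(times, dtype=float)
    relative_hazard = np.exp(X @ beta)                          # one value per person
    cum_hazard = scale * times[:, None] ** shape * relative_hazard[None, :]
    return np.exp(-cum_hazard).mean(axis=1)                     # mean over people

# Model output shared by Researcher A (hypothetical values)
model_a = {"scale": 0.08, "shape": 1.1, "beta": np.array([0.04, 0.6])}

# Country B's local covariates: centred age and a stage indicator (simulated here)
rng = np.random.default_rng(1)
X_b = np.column_stack([rng.normal(0, 10, 500), rng.integers(0, 2, 500)])

# Survival that Country B's patients would have under Country A's model
surv_b = standardised_survival([0, 1, 5, 10], X=X_b, **model_a)
print(surv_b)
```

Only the fitted parameters travel from A to B; Country B's covariates never leave Country B.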

2. Meta-analysis

Two-stage meta-analysis is an alternative statistical method for combining results from multiple studies. Here, we can treat the analysis from each institution/country as a separate “study”:

Stage 1: Each country calculates summary statistics (e.g., hazard ratios and their standard errors) from its local survival data.
Stage 2: These summary statistics are then pooled using meta-analysis techniques to estimate an overall result.
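Stage 2 can be illustrated with a fixed-effect, inverse-variance pooling of log hazard ratios. This is a minimal sketch: the hazard ratios and standard errors are made-up numbers, the standard errors are assumed to be on the log scale, and a random-effects model may be more appropriate when countries are heterogeneous.

```python
import math

def pool_fixed_effect(hazard_ratios, log_hr_standard_errors):
    """Inverse-variance fixed-effect pooling on the log hazard ratio scale."""
    log_hrs = [math.log(hr) for hr in hazard_ratios]
    weights = [1.0 / se ** 2 for se in log_hr_standard_errors]
    pooled_log_hr = sum(w * b for w, b in zip(weights, log_hrs)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    ci = (math.exp(pooled_log_hr - 1.96 * pooled_se),
          math.exp(pooled_log_hr + 1.96 * pooled_se))
    return math.exp(pooled_log_hr), pooled_se, ci

# Stage 1 output from three countries (hypothetical values)
country_hrs = [0.80, 0.90, 0.70]
country_ses = [0.10, 0.15, 0.20]   # standard errors of the log hazard ratios

pooled_hr, pooled_se, pooled_ci = pool_fixed_effect(country_hrs, country_ses)
print(f"pooled HR = {pooled_hr:.3f}, 95% CI = ({pooled_ci[0]:.3f}, {pooled_ci[1]:.3f})")
```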

Note

In contrast to two-stage meta-analysis, one-stage meta-analysis on pooled individual-level data is not feasible here, as it still requires all the data to be in one place.

References:

One-stage meta-analysis: Riley, R. D., et al. (2010). “Meta-analysis of individual participant data: rationale, conduct, and reporting.” BMJ. https://doi.org/10.1136/bmj.c221
Two-stage meta-analysis: FIXME

3. Synthetic data generation

An alternative to sharing model results is to share synthetic data. In this approach, each institution generates an artificial dataset that mimics the statistical properties of the real patient data without containing any actual patient information.

References:

Colleagues at the Cancer Registry of Norway were also developing this approach.
Rollo, A., et al. (2024). “SYNDSURV: A simple framework for survival analysis with data distributed across multiple institutions.” Computers in Biology and Medicine. https://doi.org/10.1016/j.compbiomed.2024.108288

Note

A generative Bayesian Network can be used to create synthetic time-to-event data. These synthetic data can then be safely shared and pooled centrally for analysis, providing a privacy-preserving alternative to federated analysis.

4. Aggregate data analysis

When individual-level data cannot be shared, each institution/country can instead share aggregate data (e.g., event counts and person-time), and the survival analysis is then performed centrally on these pooled aggregates.
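One simple way to estimate survival from such aggregates is a piecewise-constant hazard: within each follow-up interval, the pooled hazard is the total number of events divided by the total person-time, and survival is the exponential of minus the cumulative hazard. The sketch below assumes each site reports (events, person-time) per interval on a common interval grid. The counts are invented for illustration, and covariates are ignored.

```python
import math

def pooled_piecewise_survival(site_tables, interval_widths):
    """Survival at the end of each interval from pooled event and person-time counts.
    site_tables: one list per site of (events, person_time) tuples, one per interval."""
    n_intervals = len(interval_widths)
    events = [0.0] * n_intervals
    person_time = [0.0] * n_intervals
    for table in site_tables:
        for j, (d, pt) in enumerate(table):
            events[j] += d
            person_time[j] += pt
    survival = []
    cum_hazard = 0.0
    for j in range(n_intervals):
        hazard = events[j] / person_time[j]        # pooled rate in interval j
        cum_hazard += hazard * interval_widths[j]
        survival.append(math.exp(-cum_hazard))
    return survival

# Hypothetical yearly counts from two sites: (events, person-years)
site_a = [(30, 950.0), (20, 880.0), (15, 820.0)]
site_b = [(12, 480.0), (9, 450.0), (6, 430.0)]

survival = pooled_piecewise_survival([site_a, site_b], interval_widths=[1.0, 1.0, 1.0])
print(survival)
```

Finer intervals reduce the information lost to grouping but also produce smaller cell counts, which may conflict with disclosure rules.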

References FIXME

Summary table

The table below compares the solutions described above for situations where individual-participant data (IPD) cannot be shared across institutions/countries.

Characteristics	Pooled IPD (ideal world)	1. Federated analysis	2. Two-stage meta-analysis	3. Synthetic data generation	4. Aggregate data analysis
Requires all IPD in one place	Yes	No	No	No (only synthetic data shared)	No (only aggregate counts shared)
Analysis level	Centralized, on combined IPD	Local model fitting; centralized aggregation of model summaries	Local calculation of summary statistics; central pooling of the results	Centralized, on combined synthetic data	Centralized, on pooled event and person-time counts
Facility requirement	Low (statistical software)	High (central coordination server)	Low (statistical software)	Medium (data-generating process + statistical software)	Low (statistical software)
Advantage	Maximum statistical power and flexibility; no information loss	Preserves data privacy by sharing only model parameters/gradients	Simple, well-established approach	Synthetic data can be pooled freely, preserving power	Preserves data privacy; may approximate results from IPD
Disadvantage	Often legally infeasible due to data privacy regulations	Requires specialized software and complex coordination	Less flexible: hazard ratios are straightforward to pool, but Kaplan-Meier curves are more complicated	Requires validated data-generation methods and each country’s approval	Information loss from grouping; small counts may still not be allowed to be reported