5  Pooled Kaplan-Meier Estimates

Pooled Kaplan-Meier estimates after multiple imputation and propensity score analysis

In this vignette, we demonstrate how to use the km_pooling() function from the encore.analytics package to compute and visualize pooled Kaplan-Meier estimates across multiply imputed and propensity score matched/weighted datasets.

5.1 Overview

The km_pooling() function addresses a common challenge in survival analysis with missing data: how to properly combine Kaplan-Meier estimates across multiple imputed datasets after propensity score matching or weighting. The function implements the methodology recommended by1 for pooling survival probabilities using Rubin’s rules.

5.2 Key Features

The function provides three main outputs:

  1. Pooled median survival estimates with confidence intervals
  2. A detailed survival probability table with pooled estimates
  3. A visualization of the pooled Kaplan-Meier curve

5.3 Methodology

The function follows these key steps:

  1. Fits Kaplan-Meier survival functions to each imputed and matched/weighted dataset
  2. Transforms survival probabilities using complementary log-log transformation
  3. Pools transformed survival probabilities and computes total variance using Rubin’s rule
  4. Back-transforms pooled survival probabilities and computes 95% confidence intervals
  5. Extracts median survival time and corresponding 95% confidence intervals
  6. Plots the Kaplan-Meier curve with pooled survival probabilities and confidence intervals

5.4 Mathematical Details

The pooling process involves several mathematical transformations to ensure proper combination of survival estimates across imputed datasets:

5.4.1 1. Complementary Log-Log Transformation

For each time point \(t\) and imputed dataset \(m\), the survival probability \(S_m(t)\) is transformed using the complementary log-log transformation:

\[g_m(t) = \log(-\log(1-S_m(t)))\]

This transformation helps approximate normality, which is required for proper application of Rubin’s rules.

5.4.2 2. Pooling Transformed Estimates

The pooled estimate \(\bar{Q}\) (following Rubin’s notation) at each time point is computed as:

\[\bar{Q} = \frac{1}{M}\sum_{m=1}^M g_m(t)\]

where \(M\) is the number of imputed datasets.

5.4.3 3. Variance Estimation

The total variance \(T\) of the pooled estimate combines within-imputation variance \(\bar{U}\) and between-imputation variance \(B\):

\[\bar{U} = \frac{1}{M}\sum_{m=1}^M U_m\]

\[B = \frac{1}{M-1}\sum_{m=1}^M (g_m(t) - \bar{Q})^2\]

\[T = \bar{U} + (1 + \frac{1}{M})B\]

where \(U_m\) is the variance of \(g_m(t)\) in the \(m\)th imputed dataset.

5.4.4 4. Back-Transformation

The pooled survival probability and its confidence intervals are obtained by back-transforming:

\[S(t) = 1 - \exp(-\exp(\bar{Q}))\]

The 95% confidence intervals are computed as:

\[S_{lower}(t) = 1 - \exp(-\exp(\bar{Q} - 1.96\sqrt{T}))\] \[S_{upper}(t) = 1 - \exp(-\exp(\bar{Q} + 1.96\sqrt{T}))\]

5.5 Example Application

Let’s walk through a complete example using simulated data:

library(here)
library(dplyr)
library(survival)
library(mice)
library(MatchThem)
library(encore.analytics)

source(here("functions", "covariate_vectors.R"))

5.5.1 Data Generation

First, we’ll simulate a dataset with some missing values:

# load example dataset with missing observations
data_miss <- simulate_data(
  n_total = 3500, 
  seed = 42, 
  include_id = FALSE, 
  imposeNA = TRUE,
  propNA = .33
  )

5.5.2 Multiple Imputation

We’ll create 10 imputed datasets:

# impute data
data_imp <- futuremice(
  parallelseed = 42,
  n.core = parallel::detectCores()-1,
  data = data_miss,
  method = "rf",
  m = 10,
  print = FALSE
  )

5.5.3 Propensity Score Weighting

We’ll apply propensity score matching to each imputed dataset:

# apply propensity score matching on mids object
ps_form <- as.formula(paste("treat ~", paste(covariates_for_ps, collapse = " + ")))
ps_form
treat ~ dem_age_index_cont + dem_sex_cont + c_smoking_history + 
    c_number_met_sites + c_hemoglobin_g_dl_cont + c_urea_nitrogen_mg_dl_cont + 
    c_platelets_10_9_l_cont + c_calcium_mg_dl_cont + c_glucose_mg_dl_cont + 
    c_lymphocyte_leukocyte_ratio_cont + c_alp_u_l_cont + c_protein_g_l_cont + 
    c_alt_u_l_cont + c_albumin_g_l_cont + c_bilirubin_mg_dl_cont + 
    c_chloride_mmol_l_cont + c_monocytes_10_9_l_cont + c_eosinophils_leukocytes_ratio_cont + 
    c_ldh_u_l_cont + c_hr_cont + c_sbp_cont + c_oxygen_cont + 
    c_ecog_cont + c_neutrophil_lymphocyte_ratio_cont + c_bmi_cont + 
    c_ast_alt_ratio_cont + c_stage_initial_dx_cont + dem_race + 
    dem_region + dem_ses + c_time_dx_to_index
# matching
mimids_data <- matchthem(
  formula = ps_form,
  datasets = data_imp,
  approach = 'within',
  method = 'nearest',
  distance = "glm",
  link = "logit",
  caliper = 0.01,
  ratio = 1,
  replace = F
  )

# print summary for matched dataset #1
mimids_data
A `matchit` object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]

             - estimated with logistic regression
 - caliper: <distance> (0.001)
 - number of obs.: 3500 (original), 2712 (matched)
 - target estimand: ATT
 - covariates: dem_age_index_cont, dem_sex_cont, c_smoking_history, c_number_met_sites, c_hemoglobin_g_dl_cont, c_urea_nitrogen_mg_dl_cont, c_platelets_10_9_l_cont, c_calcium_mg_dl_cont, c_glucose_mg_dl_cont, c_lymphocyte_leukocyte_ratio_cont, c_alp_u_l_cont, c_protein_g_l_cont, c_alt_u_l_cont, c_albumin_g_l_cont, c_bilirubin_mg_dl_cont, c_chloride_mmol_l_cont, c_monocytes_10_9_l_cont, c_eosinophils_leukocytes_ratio_cont, c_ldh_u_l_cont, c_hr_cont, c_sbp_cont, c_oxygen_cont, c_ecog_cont, c_neutrophil_lymphocyte_ratio_cont, c_bmi_cont, c_ast_alt_ratio_cont, c_stage_initial_dx_cont, dem_race, dem_region, dem_ses, c_time_dx_to_index

5.5.4 Computing Pooled Kaplan-Meier Estimates

Now we can use km_pooling() to compute and visualize the pooled survival curves:

# specify the survival model
km_fit <- as.formula(Surv(fu_itt_months, death_itt) ~ treat)

# estimate and pool median survival times and Kaplan-Meier curve
km_out <- km_pooling(
  x = mimids_data,
  surv_formula = km_fit
  )

# View median survival time estimates
km_out$km_median_survival
# A tibble: 2 × 4
  strata t_median t_lower t_upper
  <fct>     <dbl>   <dbl>   <dbl>
1 0          15.5    14.4    16.4
2 1          21.7    20.2    23.2
# Plot the Kaplan-Meier curve
km_out$km_plot
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_ribbon()`).
Figure 5.1: Pooled Kaplan-Meier curve after multiple imputation and propensity score matching.

5.5.5 Interpretation

The output provides:

  1. Median Survival Times (km_median_survival): Pooled median survival estimates and 95% confidence intervals for each treatment group.
  • strata = stratum
  • t_median = median survival time
  • t_lower = lower 95% CI of median survival time
  • t_upper = upper 95% CI of median survival time
  1. Kaplan-Meier survival table (km_survival_table):
  • strata = stratum

  • time = observed time point

  • m = number of imputed datasets

  • qbar = pooled univariate estimate of the complementary log-log transformed survival probabilities, see formula (3.1.2) Rubin (1987)

  • t = total variance of the pooled univariate estimate of the complementary log-log transformed survival probabilities, formula (3.1.5) Rubin (1987)

  • se = total standard error of the pooled estimate (derived as sqrt(t))

  • surv = back-transformed pooled survival probability

  • lower = Wald-type lower 95% confidence interval of back-transformed pooled survival probability

  • upper = Wald-type upper 95% confidence interval of back-transformed pooled survival probability

  1. Kaplan-Meier Curve (km_plot): A visualization showing the pooled survival probabilities over time with confidence bands (ggplot2 object)

5.5.6 Technical Details

The function handles several nuances in survival analysis:

  • Transforms survival probabilities to approximate normality using complementary log-log transformation
  • Applies Rubin’s rules for pooling estimates and computing variance
  • Handles edge cases in median survival time estimation
  • Incorporates weights and cluster membership for matched/weighted datasets

5.5.7 Computing and pooling survival probabilities at specific time points

Another option is to specify time points at which survival probabilities should be estimated and pooled. This can be done by providing a vector of time points to the times argument in the km_pooling() function. Here, we estimate and pool survival probabilities for each stratum at each month from 0 to 120 months:

# estimate and pool median survival times and Kaplan-Meier curve
km_out <- km_pooling(
  x = mimids_data,
  surv_formula = km_fit,
  times = seq(0, 120, by = 1)
  )

# survival probabilities at specific time points
head(km_out$km_survival_table)
# A tibble: 6 × 9
  strata  time     m    qbar         t  surv       se lower upper
  <fct>  <dbl> <int>   <dbl>     <dbl> <dbl>    <dbl> <dbl> <dbl>
1 0          0    10 Inf     NaN       1     NaN      1     1    
2 0          1    10   1.29    0.00223 0.974   0.0473 0.964 0.982
3 0          2    10   1.02    0.00156 0.938   0.0394 0.924 0.951
4 0          3    10   0.809   0.00132 0.894   0.0363 0.876 0.910
5 0          4    10   0.660   0.00127 0.856   0.0356 0.836 0.874
6 0          5    10   0.549   0.00123 0.823   0.0351 0.801 0.843

It should be noted that median survival times and Kaplan-Meier plots highly depend on the specification of the times argument. If only few time points are specified, the median survival time estimates may be less precise and the Kaplan-Meier plot may appear less smooth.

# median survival times based on time grid
km_out$km_median_survival
# A tibble: 2 × 4
  strata t_median t_lower t_upper
  <fct>     <dbl>   <dbl>   <dbl>
1 0            16      15      17
2 1            22      21      24
# Kaplan-Meier curve
km_out$km_plot
Figure 5.2: Kaplan-Meier curve with survival probabilities estimated at specific time points (0 to 120 months)

5.6 References

1.
Marshall A, Altman DG, Holder RL, Royston P. Methods of handling missing data in medical research in the UK: A systematic review. Journal of the Royal Statistical Society: Series A (Statistics in Society) 2009; 172: 427–448. doi:10.1111/j.1467-985X.2008.00578.x.

5.7 Session info

Script runtime: minutes.

pander::pander(subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion)))
  package loadedversion
dplyr dplyr 1.1.4
encore.analytics encore.analytics 0.2.1
here here 1.0.2
MatchThem MatchThem 1.2.1
mice mice 3.19.0
survival survival 3.8-3
pander::pander(sessionInfo())

R version 4.5.1 (2025-06-13)

Platform: x86_64-pc-linux-gnu

locale: LC_CTYPE=C.UTF-8, LC_NUMERIC=C, LC_TIME=C.UTF-8, LC_COLLATE=C.UTF-8, LC_MONETARY=C.UTF-8, LC_MESSAGES=C.UTF-8, LC_PAPER=C.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=C.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, datasets, utils, methods and base

other attached packages: encore.analytics(v.0.2.1), MatchThem(v.1.2.1), mice(v.3.19.0), survival(v.3.8-3), dplyr(v.1.1.4) and here(v.1.0.2)

loaded via a namespace (and not attached): tidyselect(v.1.2.1), farver(v.2.1.2), S7(v.0.2.1), smd(v.0.8.0), fastmap(v.1.2.0), digest(v.0.6.39), rpart(v.4.1.24), lifecycle(v.1.0.4), magrittr(v.2.0.4), compiler(v.4.5.1), rlang(v.1.1.6), tools(v.4.5.1), utf8(v.1.2.6), yaml(v.2.3.12), knitr(v.1.51), labeling(v.0.4.3), htmlwidgets(v.1.6.4), RColorBrewer(v.1.1-3), withr(v.3.0.2), purrr(v.1.2.0), nnet(v.7.3-20), grid(v.4.5.1), jomo(v.2.7-6), future(v.1.68.0), ggplot2(v.4.0.1), globals(v.0.18.0), scales(v.1.4.0), iterators(v.1.0.14), MASS(v.7.3-65), cli(v.3.6.5), survey(v.4.4-8), rmarkdown(v.2.30), reformulas(v.0.4.3), generics(v.0.1.4), sessioninfo(v.1.2.3), minqa(v.1.2.8), DBI(v.1.2.3), pander(v.0.6.6), splines(v.4.5.1), assertthat(v.0.2.1), parallel(v.4.5.1), mitools(v.2.4), vctrs(v.0.6.5), WeightIt(v.1.5.1), boot(v.1.3-31), glmnet(v.4.1-10), Matrix(v.1.7-3), sandwich(v.3.1-1), jsonlite(v.2.0.0), ggsurvfit(v.1.2.0), mitml(v.0.4-5), listenv(v.0.10.0), locfit(v.1.5-9.12), foreach(v.1.5.2), tidyr(v.1.3.2), glue(v.1.8.0), parallelly(v.1.46.0), nloptr(v.2.2.1), pan(v.1.9), chk(v.0.10.0), codetools(v.0.2-20), shape(v.1.4.6.1), gtable(v.0.3.6), lme4(v.1.1-38), tibble(v.3.3.0), pillar(v.1.11.1), furrr(v.0.3.1), htmltools(v.0.5.9), R6(v.2.6.1), Rdpack(v.2.6.4), rprojroot(v.2.1.1), evaluate(v.1.0.5), lattice(v.0.22-7), rbibutils(v.2.4), backports(v.1.5.0), tictoc(v.1.2.1), MatchIt(v.4.7.2), broom(v.1.0.11), renv(v.1.1.5), simsurv(v.1.0.0), Rcpp(v.1.1.0), nlme(v.3.1-168), xfun(v.0.55), zoo(v.1.8-15) and pkgconfig(v.2.0.3)

pander::pander(options('repos'))
  • repos:

    Table continues below
    CRAN
    https://cran.rstudio.com
    PositPackageManager
    https://packagemanager.posit.co/cran/linux/noble/latest