1 Data generation

Minimal example dataset

Author

Janick Weberpals

Published

November 1, 2024

2 Objective

The objective of this repository is to simulate a dataset with few missingness scenarios and evaluate bias for an unadjusted analysis, complete case analysis and analysis including imputed data after propensity score matching.

The first step involves the generation of a complete dataset with realistic distributions and associations between covariates, which is showcased here.

3 Data generation

The desired distributions as observed in the dataset.
Desired overall sample size: 10^{4}

3.1 Generate data using `generate_data()` function

We first create simulated data that approximates the covariate distributions in the observed empirical cohort.

Show the code

# data generating step
source(here::here("functions", "generate_data.R"))

cohort <- generate_data(
  seed = 42, 
  n_total = n_total
  )

cohort |> 
  glimpse()

Rows: 10,000
Columns: 17
$ X                      <int> 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
$ Z2                     <dbl> 0.6153516, 0.6765416, 0.8913081, 0.6161119, 0.7…
$ U                      <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
$ c_ed                   <dbl> 0.546994461, -0.255923198, 2.759562171, 0.40503…
$ c_gnrc_cnt             <dbl> 13.692274, 5.448430, 7.034793, 16.137476, 6.140…
$ c_flu_vaccine          <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,…
$ c_foot_ulcer           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ c_glaucoma_or_cataract <int> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
$ c_ischemic_stroke      <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ c_h2ra                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ c_acei                 <int> 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,…
$ c_arb                  <int> 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ c_statin               <int> 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1,…
$ c_spironolocatone      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ dem_age                <dbl> 75.73247, 73.64546, 63.07144, 71.97991, 86.9776…
$ eventtime              <dbl> 10.00000000, 10.00000000, 10.00000000, 10.00000…
$ status                 <int> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…

3.1.1 Table 1 by exposure

Display covariate distributions by exposure.

Show the code

cohort |> 
  tbl_summary(by = X) |> 
  add_difference()

Table 3.1: Covariate distributions by exposure.

Characteristic	0, N = 3,213¹	1, N = 6,787¹	Difference²	95% CI^2,3	p-value²
Z2	0.85 (0.69, 1.00)	0.82 (0.66, 0.98)	0.03	0.02, 0.04	<0.001
U	366 (11%)	1,381 (20%)	-9.0%	-10%, -7.5%	<0.001
c_ed	1.01 (-1.02, 2.94)	1.00 (-0.37, 2.33)	-0.03	-0.15, 0.08	0.6
c_gnrc_cnt	9.8 (6.5, 13.1)	11.0 (7.6, 14.3)	-1.2	-1.4, -0.95	<0.001
c_flu_vaccine	2,068 (64%)	4,340 (64%)	0.42%	-1.6%, 2.5%	0.7
c_foot_ulcer	76 (2.4%)	337 (5.0%)	-2.6%	-3.4%, -1.8%	<0.001
c_glaucoma_or_cataract	1,768 (55%)	3,573 (53%)	2.4%	0.27%, 4.5%	0.027
c_ischemic_stroke	297 (9.2%)	929 (14%)	-4.4%	-5.8%, -3.1%	<0.001
c_h2ra	235 (7.3%)	450 (6.6%)	0.68%	-0.42%, 1.8%	0.2
c_acei	998 (31%)	2,198 (32%)	-1.3%	-3.3%, 0.65%	0.2
c_arb	527 (16%)	1,258 (19%)	-2.1%	-3.7%, -0.53%	0.010
c_statin	1,964 (61%)	3,884 (57%)	3.9%	1.8%, 6.0%	<0.001
c_spironolocatone	53 (1.6%)	179 (2.6%)	-0.99%	-1.6%, -0.38%	0.003
dem_age	75 (70, 80)	76 (71, 81)	-0.99	-1.3, -0.68	<0.001
eventtime	10.00 (8.49, 10.00)	10.00 (6.61, 10.00)	0.35	0.23, 0.48	<0.001
status	913 (28%)	2,316 (34%)	-5.7%	-7.7%, -3.8%	<0.001
¹ Median (IQR); n (%)
² Welch Two Sample t-test; Two sample test for equality of proportions
³ CI = Confidence Interval

3.1.2 Odds ratios

Odds rations for the strength of association between covariates and exposure.

Show the code

odds <- stats::glm(
  formula = paste0("X ~ ", paste0(names(cohort |> select(-c(X, eventtime, status))), collapse = " + ")),
  data = cohort,
  family = "binomial"
  )

odds |> 
  gtsummary::tbl_regression(exponentiate = T)

Table 3.2: Odds rations for the strength of association between covariates and exposure.

Characteristic	OR¹	95% CI¹	p-value
Z2	0.54	0.45, 0.65	<0.001
U	2.01	1.77, 2.28	<0.001
c_ed	1.01	0.99, 1.03	0.5
c_gnrc_cnt	1.05	1.04, 1.06	<0.001
c_flu_vaccine	0.97	0.89, 1.06	0.5
c_foot_ulcer	2.18	1.70, 2.84	<0.001
c_glaucoma_or_cataract	0.89	0.82, 0.97	0.010
c_ischemic_stroke	1.57	1.36, 1.81	<0.001
c_h2ra	0.91	0.77, 1.07	0.2
c_acei	1.05	0.95, 1.15	0.3
c_arb	1.16	1.04, 1.30	0.010
c_statin	0.85	0.78, 0.93	<0.001
c_spironolocatone	1.54	1.13, 2.12	0.007
dem_age	1.02	1.01, 1.02	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

3.2 Generate outcome

Generate the outcome with a desired event rate of ~40 events/1,000 person-years of follow-up.

Show the code

# event-rate per 1000 person-years of follow-up
# in original data: ~40 events per 1000 person-years
event_rate_per_1000py <- round(sum(cohort$status)/sum(cohort$eventtime)*1000, 2)

Resulting/simulated event rate per 1,000 person-years: 39.26

3.3 Survival

Show the code

km_exposure <- survfit2(formula = survival::Surv(eventtime, status) ~ X, data = cohort)

ggsurvfit(
  x = km_exposure, 
  type = "survival",
  linewidth = 1
  ) +
  add_confidence_interval() +
  add_pvalue(caption = "Log-rank {p.value}")

Figure 3.1: Kaplan-Meier plot (unadjusted) by exposure.

Next, let’s compute the univariate and multivariate beta coefficients (hazard ratios [HR]).

Beta unadjusted:

Show the code

cox_fit_unadjusted <- survival::coxph(formula = Surv(eventtime, status) ~ X, data = cohort)

cox_fit_unadjusted |>  
  gtsummary::tbl_regression(exponentiate = T)

Table 3.3: Univariate hazard ratio.

Characteristic	HR¹	95% CI¹	p-value
X	1.25	1.16, 1.35	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

Beta adjusted:

Show the code

lhs <- "Surv(eventtime, status)"
rhs <- paste0(colnames(cohort |> select(-c(eventtime, status))), collapse = " + ")
cox_form <- as.formula(paste0(lhs, "~", rhs))

cox_fit_adjusted <- survival::coxph(formula = cox_form, data = cohort)

cox_fit_adjusted |>  
  gtsummary::tbl_regression(exponentiate = T)

Table 3.4: Multivariate hazard ratio.

Characteristic	HR¹	95% CI¹	p-value
X	1.03	0.96, 1.12	0.4
Z2	5.42	4.66, 6.31	<0.001
U	1.43	1.32, 1.56	<0.001
c_ed	1.04	1.02, 1.05	<0.001
c_gnrc_cnt	1.08	1.07, 1.09	<0.001
c_flu_vaccine	0.83	0.77, 0.89	<0.001
c_foot_ulcer	2.07	1.80, 2.38	<0.001
c_glaucoma_or_cataract	0.77	0.72, 0.82	<0.001
c_ischemic_stroke	1.51	1.37, 1.66	<0.001
c_h2ra	0.67	0.57, 0.78	<0.001
c_acei	1.36	1.26, 1.46	<0.001
c_arb	1.40	1.29, 1.52	<0.001
c_statin	0.85	0.79, 0.91	<0.001
c_spironolocatone	1.67	1.38, 2.03	<0.001
dem_age	1.05	1.04, 1.05	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

3.4 Session info

Script runtime: minutes.

Show the code

pander::pander(subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion)))

	package	loadedversion
arrow	arrow	14.0.0.2
dplyr	dplyr	1.1.4
ggplot2	ggplot2	3.4.4
ggsurvfit	ggsurvfit	1.0.0
gtsummary	gtsummary	1.7.2
here	here	1.0.1
simsurv	simsurv	1.0.0
survival	survival	3.5-8
tibble	tibble	3.2.1

Show the code

pander::pander(sessionInfo())

R version 4.4.0 (2024-04-24)

Platform: x86_64-pc-linux-gnu

locale: LC_CTYPE=C.UTF-8, LC_NUMERIC=C, LC_TIME=C.UTF-8, LC_COLLATE=C.UTF-8, LC_MONETARY=C.UTF-8, LC_MESSAGES=C.UTF-8, LC_PAPER=C.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=C.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, datasets, utils, methods and base

other attached packages: arrow(v.14.0.0.2), here(v.1.0.1), gtsummary(v.1.7.2), ggplot2(v.3.4.4), ggsurvfit(v.1.0.0), survival(v.3.5-8), dplyr(v.1.1.4), simsurv(v.1.0.0) and tibble(v.3.2.1)

loaded via a namespace (and not attached): gtable(v.0.3.4), xfun(v.0.41), htmlwidgets(v.1.6.4), lattice(v.0.22-5), vctrs(v.0.6.5), tools(v.4.4.0), generics(v.0.1.3), fansi(v.1.0.6), pkgconfig(v.2.0.3), Matrix(v.1.6-4), assertthat(v.0.2.1), gt(v.0.10.1), lifecycle(v.1.0.4), compiler(v.4.4.0), farver(v.2.1.1), stringr(v.1.5.1), tictoc(v.1.2), munsell(v.0.5.0), htmltools(v.0.5.7), sass(v.0.4.8), yaml(v.2.3.8), pillar(v.1.9.0), tidyr(v.1.3.1), ellipsis(v.0.3.2), broom.helpers(v.1.14.0), sessioninfo(v.1.2.2), commonmark(v.1.9.1), tidyselect(v.1.2.0), digest(v.0.6.33), stringi(v.1.8.3), purrr(v.1.0.2), pander(v.0.6.5), labeling(v.0.4.3), forcats(v.1.0.0), splines(v.4.4.0), labelled(v.2.12.0), rprojroot(v.2.0.4), fastmap(v.1.1.1), grid(v.4.4.0), colorspace(v.2.1-0), cli(v.3.6.2), magrittr(v.2.0.3), utf8(v.1.2.4), broom(v.1.0.5), withr(v.3.0.0), scales(v.1.3.0), backports(v.1.4.1), bit64(v.4.0.5), rmarkdown(v.2.28), bit(v.4.0.5), hms(v.1.1.3), evaluate(v.0.23), knitr(v.1.45), haven(v.2.5.4), markdown(v.1.12), rlang(v.1.1.2), Rcpp(v.1.0.12), glue(v.1.6.2), xml2(v.1.3.6), renv(v.1.0.3), jsonlite(v.1.8.8) and R6(v.2.5.1)

Show the code

pander::pander(options('repos'))

repos:

REPO_NAME

https://packagemanager.posit.co/cran/latest

2 Objective

3 Data generation

3.1 Generate data using generate_data() function

3.1.1 Table 1 by exposure

3.1.2 Odds ratios

3.2 Generate outcome

3.3 Survival

3.4 Session info

3.1 Generate data using `generate_data()` function