I have a dataset with both missing-at-random (MAR) and censored data. The variables are correlated, and I am trying to impute the missing values conditionally so that I can estimate the parameters of a correlated multivariate normal distribution. I would like to use Gibbs MCMC but am having difficulty implementing the procedure. My dataframe has 5 variables (denoted x1:x5) and 1099 samples, each containing some combination of MAR, censored, and observed values. This is what I have tried so far:
# packages
library(msm)       # rtnorm()
library(tmvtnorm)  # rtmvnorm()
library(MCMCpack)  # rwish()
# priors
theta0<-c(rep(0, 5))
Sigma0<-S0<-diag(5)
nu0<-4
# initialize parameters
theta<-c(rep(0, 5))
Tau<-diag(5)
# initialize output matrix
n_samples <- 1000
mu_MCMC <- matrix(0, nrow = n_samples, ncol = 5)
mu_MCMC[1,] <- theta
cov_MCMC <- matrix(0, nrow = n_samples, ncol = 25)
cov_MCMC[1,] <- c(diag(5))
# detection limits
det_lim <- matrix(c(-1.7, 0, 0, 0, 0), nrow = 1, ncol = 5)
# function to detect NaN (i.e., below detection data)
is.nan.data.frame <- function(x)
do.call(cbind, lapply(x, is.nan))
for(i in 2:n_samples){
imputedDF <- data.frame()
for(r in 1:nrow(originalDF)){
# variables that are MAR or censored
mis <- r[, is.na(r) & is.nan(r)]
# variables that are observed
obs <- r[, !is.na(r)]
# subset mu for missing, observed
mu1 <- mu[, names(r) %in% names(mis)]
mu2 <- mu[, names(r) %in% names(obs)]
# calculate sigmas for MVN partitions of mis, obs
sigma11 <- sigma[names(r) %in% names(mis), names(r) %in% names(mis)]
sigma22 <- sigma[names(r) %in% names(obs), names(r) %in% names(obs)]
sigma12 <- sigma[names(r) %in% names(obs), names(r) %in% names(mis)]
sigma21 <- t(sigma12)
# create matrix for detection limits based on missing values
## if NaN, use detection limit; if NA use Inf
dl <- c(ifelse("x1" %in% names(is.nan(r)), det_lim[1, "x1"], Inf),
ifelse("x2" %in% names(is.nan(r)), det_lim[1, "x2"], Inf),
ifelse("x3" %in% names(is.nan(r)), det_lim[1, "x3"], Inf),
ifelse("x4" %in% names(is.nan(r)), det_lim[1, "x4"], Inf),
ifelse("x5" %in% names(is.nan(r)), det_lim[1, "x5"], Inf))
# compute mu, sigma to use for conditional MVN
## if all values are missing
if(length(names(obs)) == 0) {
mu_mis <- mu1
sigma_mis <- sigma11
## otherwise
} else {
mu_mis <- mu1 + sigma12 %*% solve(sigma22) * (obs - t(mu2))
sigma_mis <- sigma11 - sigma12 %*% solve(sigma22) %*% sigma21
}
# imputation
## if all data are observed, missing is empty
if(length(obs) == 0) {
mis_impute <- data.frame()
## only need to impute a single value
} else if(length(names(mis)) == 1) {
mis_impute <- rtnorm(1, mean = mu_mis, sd = sigma_mis, lower = -Inf, upper = dl)
## have more than one missing value
} else {
mis_impute <- rtmvnorm(1, mean = mu_mis, sigma = sigma_mis, lower = rep(-Inf, length = length(names(mis))), upper = dl)
}
# merge observed values with simulated
## if all values observed
if(length(names(mis)) == 0) {
sim_result <- obs
} else {
sim_result <- cbind(mis_impute, obs)
}
imputedDF <- rbind(imputedDF, sim_result)
}
# update theta
v <- solve(solve(Sigma0) + nrow(sim_result)*Tau)
m <- v %*% (solve(Sigma0) %*% theta0 + Tau %*% apply(sim_result,2,sum))
mu <- as.data.frame(rmvnorm(1,m,v))
mu_MCMC[i,] <- mu
# update Sigma
tmp <- t(sim_result) - mu
Tau <- rwish(nu0 + nrow(sim_result), solve(S0 + t(tmp) %*% tmp))
sigma <- matrix(c(solve(Tau)), nrow = 5, ncol = 5, byrow = TRUE)
cov_MCMC[i,] <- c(solve(Tau))
}
I keep running into errors because the imputation returns NaN and NA values, but I can't figure out what is going wrong: when I test just the inner loop that imputes the data, it seems to work. The issue therefore seems to be in the parameter updating, but I can't figure it out!
Preamble:
My sense is that part of the problem here is that we do not have a good example dataset to work from.
My feeling is we can address this by creating an example dataset to frame the solution discussion. A useful package to this end is wakefield, which allows for the creation of simulated datasets.
We might, for example, create a dataset of 2000 people, where some of the age, gender, employment status, education, and marital status information is missing.
Imputation
The core question is: can we impute the age or gender from other data in the data set?
For example, if we do not know someone's age, can we impute it from their marital status, employment type and/or education level? At a very simplistic level, we might simply search for entries with NA for age and look at marital status. If the marital status is "married", we might assume the data set is for Americans, look up the average age at marriage, and replace the NA with an estimated age for a married person.
We can expand on this and make our estimates more accurate by taking more variables into account. For example, we might look at marital status, education level and employment status together to further improve our age estimate. If a person is married, has a Ph.D., and is retired, we push the age estimate upwards; if a person is single and a student, we push it lower. Further to this, we can use the distribution of the ages already present in the data set to inform the imputed values.
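As a toy illustration of this rule-based idea (a minimal sketch, assuming a data frame df with Age and Marital columns; the worked example below uses mice instead), one could replace a missing age with the mean age of people who share the same marital status:
library(dplyr)
# naive group-mean imputation: fill a missing Age with the average Age
# of the rows that have the same Marital status
df_naive <- df %>%
  group_by(Marital) %>%
  mutate(Age = ifelse(is.na(Age), round(mean(Age, na.rm = TRUE)), Age)) %>%
  ungroup()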
Generate an Example Data Set.
# packages
requiredPackages <- c("wakefield", "dplyr", "BaylorEdPsych", "mice", "MCMCpack")
ipak <- function(pkg) {
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg)) {
install.packages(new.pkg, dependencies = TRUE)
}
sapply(pkg, require, character.only = TRUE)
}
ipak(requiredPackages)
# generate some data for Females with ~5% missing values
# in Age, Employment, Education and Marital
set.seed(10)
f.df <- r_data_frame(n = 1000, age,
gender(x = c("M", "F"), prob = c(0, 1), name = "Gender"),
employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"),
education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma",
"Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree",
"Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"),
prob = c(0.013, 0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072, 0.019, 0.012), name = "Education"),
marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.05)
# str(f.df)
summary(f.df)
set.seed(20)
# generate some data for Males with ~3% missing values
# in Age, Employment, Education and Marital
m.df <- r_data_frame(n = 1000, age,
gender(x = c("M", "F"), prob = c(1, 0), name = "Gender"),
employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"),
education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma",
"Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree",
"Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"),
prob = c(0.013,0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072,0.019, 0.012), name = "Education"),
marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.03)
summary(m.df)
all.df = rbind.data.frame(m.df, f.df)
summary(all.df)
Data Summary
> summary(all.df)
Age Gender Employment Education Marital
Min. :18.00 M:1000 Full Time :1142 Regular High School Diploma :459 Married :394
1st Qu.:35.00 F:1000 Part Time : 207 Bachelor's Degree :356 Divorced :378
Median :54.00 Unemployed: 193 Some College, 1 or More Years, No Degree:284 Widowed :411
Mean :53.76 Retired : 182 9th Grade to 12th Grade, No Diploma :156 Separated :379
3rd Qu.:72.00 Student : 196 Associate's Degree :145 Never Married:358
Max. :89.00 NA's : 80 (Other) :520 NA's : 80
NA's :80 NA's : 80
>
Is the Data Missing Completely at Random (MCAR)?
# Test for MCAR - Missing Completely at Random...
test_mcar <- LittleMCAR(all.df)
print(test_mcar$amount.missing)
print(test_mcar$p.value)
Console Output
> # Test for MCAR - Missing Completely at Random...
> test_mcar <- LittleMCAR(all.df)
this could take a while
> print(test_mcar$amount.missing)
Age Gender Employment Education Marital
Number Missing 80.00 0 80.00 80.00 80.00
Percent Missing 0.04 0 0.04 0.04 0.04
> print(test_mcar$p.value)
[1] 0.02661428
Imputation of Data
Little's test returns a p-value of about 0.027; by the usual 0.05 threshold we would reject the hypothesis that the data are missing completely at random. OK, let us now look at the pattern of missing values. The mice::md.pattern() function shows how the missing values are distributed over the columns of the dataframe, and its output is useful for suggesting which variables might be good candidates to use for imputing the missing values:
> md.pattern(all.df)
Gender Age Employment Education Marital
1696 1 1 1 1 1 0
73 1 1 1 1 0 1
73 1 1 1 0 1 1
2 1 1 1 0 0 2
71 1 1 0 1 1 1
3 1 1 0 1 0 2
2 1 1 0 0 1 2
71 1 0 1 1 1 1
2 1 0 1 1 0 2
3 1 0 1 0 1 2
4 1 0 0 1 1 2
0 80 80 80 80 320
Ok, from this we can now move to impute the missing values:
imp <- mice(all.df, m = 5, maxit = 50, seed = 1234, printFlag = FALSE)
The m = 5 argument specifies that five plausible imputed datasets are produced.
The maxit = 50 argument allows up to 50 iterations of the chained-equations algorithm; this can be adjusted upward or downward as needed.
The mice() call may take a while, depending on the number of iterations we specify. Upon completion we can inspect some of the imputed values for Age using the head() function:
head(imp$imp$Age)
1 2 3 4 5
7 28 49 37 70 89
33 55 54 52 88 24
56 89 83 68 71 61
84 43 43 24 30 31
96 28 64 89 41 50
120 47 34 36 22 77
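Before using the imputations, it is worth checking that the algorithm has converged and that the imputed values look plausible; a quick sketch using mice's built-in lattice plots:
# trace plots of the mean and SD of the imputed values across iterations
plot(imp)
# density of imputed Age values (red) versus observed Age (blue)
densityplot(imp, ~ Age)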
To actually fill in the data, we run the complete() function and assign the result to a new dataframe. With action = "long" (and include = TRUE), complete() stacks the original data and all five imputed datasets in the returned dataframe:
all_imputed_df <- complete(imp, "long", include = TRUE)
table(all_imputed_df$.imp, is.na(all_imputed_df$Age))
Console:
> all_imputed_df <- complete(imp, "long", include = TRUE)
> table(all_imputed_df$.imp, is.na(all_imputed_df$Age))
FALSE TRUE
0 1920 80
1 2000 0
2 2000 0
3 2000 0
4 2000 0
5 2000 0
Now we have a dataframe of 12,000 rows: the original data (.imp == 0, still containing the 80 missing Age values) plus the 5 completed imputations of 2,000 rows each.
Let's try a regression with imputation #3.
First, extract imputation #3:
impute.3 <- subset(all_imputed_df,.imp=='3')
summary(impute.3)
Console:
> impute.3 <- subset(all_imputed_df, .imp == "3")
> summary(impute.3)
.imp .id Age Gender Employment
Min. :3 Min. : 1.0 Min. :18.00 M:1000 Full Time :1192
1st Qu.:3 1st Qu.: 500.8 1st Qu.:35.00 F:1000 Part Time : 211
Median :3 Median :1000.5 Median :54.00 Unemployed: 202
Mean :3 Mean :1000.5 Mean :53.89 Retired : 191
3rd Qu.:3 3rd Qu.:1500.2 3rd Qu.:72.00 Student : 204
Max. :3 Max. :2000.0 Max. :89.00
Education Marital
Regular High School Diploma :478 Married :416
Bachelor's Degree :376 Divorced :390
Some College, 1 or More Years, No Degree:295 Widowed :425
9th Grade to 12th Grade, No Diploma :168 Separated :393
Associate's Degree :150 Never Married:376
Master's Degree :141
(Other) :392
Now we can run a linear regression:
> lm(Age ~ Education + Gender + Employment + Marital, data = impute.3)
Call:
lm(formula = Age ~ Education + Gender + Employment + Marital,
data = impute.3)
Coefficients:
(Intercept) EducationNursery School to 8th Grade
51.6733 1.4100
Education9th Grade to 12th Grade, No Diploma EducationRegular High School Diploma
1.3675 0.7611
EducationGED or Alternative Credential EducationSome College, Less than 1 Year
1.0365 -2.6069
EducationSome College, 1 or More Years, No Degree EducationAssociate's Degree
0.3563 0.9506
EducationBachelor's Degree EducationMaster's Degree
1.2505 -1.6372
EducationProfessional School Degree EducationDoctorate Degree
1.1774 0.4936
GenderF EmploymentPart Time
-0.3190 1.1316
EmploymentUnemployed EmploymentRetired
3.1622 -0.6855
EmploymentStudent MaritalDivorced
3.0850 0.2934
MaritalWidowed MaritalSeparated
2.3162 1.6833
MaritalNever Married
1.6169
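Analysing a single completed dataset (here imputation #3) is fine for illustration, but mice can also fit the model to each of the five imputations and pool the results with Rubin's rules via with() and pool(); a brief sketch:
# fit the same regression to each imputed dataset and pool the estimates
fit_mi <- with(imp, lm(Age ~ Education + Gender + Employment + Marital))
summary(pool(fit_mi))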
MCMCregress
library(MCMCpack) # b0 = prior mean, B0 = prior precision = 1/variance
fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE)
summary(fitBayes)
Console Output
> fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE)
> summary(fitBayes)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) 48.67377 2.5337 0.025337 0.025337
EducationNursery School to 8th Grade 3.77088 3.0514 0.030514 0.030514
Education9th Grade to 12th Grade, No Diploma 3.81009 2.7794 0.027794 0.027794
EducationRegular High School Diploma 3.24531 2.4933 0.024933 0.025412
EducationGED or Alternative Credential 3.38733 3.2155 0.032155 0.032155
EducationSome College, Less than 1 Year -0.08419 2.9104 0.029104 0.029577
EducationSome College, 1 or More Years, No Degree 2.82889 2.6092 0.026092 0.026092
EducationAssociate's Degree 3.32932 2.8410 0.028410 0.028410
EducationBachelor's Degree 3.72272 2.5228 0.025228 0.025659
EducationMaster's Degree 0.87738 2.8611 0.028611 0.028611
EducationProfessional School Degree 3.27542 4.0199 0.040199 0.040199
EducationDoctorate Degree 2.43794 4.5996 0.045996 0.045996
GenderF -0.11321 0.9327 0.009327 0.009327
EmploymentPart Time 1.25556 1.5756 0.015756 0.016170
EmploymentUnemployed 3.27395 1.6213 0.016213 0.015708
EmploymentRetired -0.52614 1.6394 0.016394 0.016394
EmploymentStudent 3.17027 1.6058 0.016058 0.016889
MaritalDivorced 0.72379 1.4715 0.014715 0.014715
MaritalWidowed 2.73130 1.4394 0.014394 0.014706
MaritalSeparated 2.10423 1.4608 0.014608 0.014608
MaritalNever Married 2.00781 1.4960 0.014960 0.014960
sigma2 448.01488 14.0715 0.140715 0.140715
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) 43.75477 46.9556 48.6619 50.3967 53.609
EducationNursery School to 8th Grade -2.19290 1.7079 3.7701 5.8216 9.718
Education9th Grade to 12th Grade, No Diploma -1.59323 1.9586 3.8326 5.6676 9.349
EducationRegular High School Diploma -1.61001 1.5641 3.2474 4.9296 8.155
EducationGED or Alternative Credential -2.88523 1.2095 3.4173 5.5405 9.691
EducationSome College, Less than 1 Year -5.75364 -2.0617 -0.1009 1.8986 5.614
EducationSome College, 1 or More Years, No Degree -2.28754 1.0853 2.8608 4.5718 7.895
EducationAssociate's Degree -2.27611 1.4311 3.3285 5.2330 8.978
EducationBachelor's Degree -1.21780 2.0258 3.7275 5.4203 8.655
EducationMaster's Degree -4.61270 -1.0872 0.8601 2.8484 6.456
EducationProfessional School Degree -4.63027 0.5900 3.2767 5.9475 11.059
EducationDoctorate Degree -6.47767 -0.6371 2.4553 5.4188 11.705
GenderF -1.95673 -0.7298 -0.1067 0.4903 1.727
EmploymentPart Time -1.82784 0.1849 1.2597 2.3160 4.354
EmploymentUnemployed 0.09335 2.1988 3.2674 4.3557 6.433
EmploymentRetired -3.80162 -1.6316 -0.5147 0.5953 2.706
EmploymentStudent 0.03387 2.0713 3.1502 4.2227 6.342
MaritalDivorced -2.15073 -0.2732 0.7249 1.7266 3.602
MaritalWidowed -0.13488 1.7817 2.7367 3.6961 5.567
MaritalSeparated -0.76396 1.1177 2.1118 3.0700 5.001
MaritalNever Married -0.92230 0.9950 1.9976 3.0248 4.898
sigma2 420.98019 438.4621 447.7222 457.2730 476.481
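Since MCMCregress() returns an mcmc object, the usual coda diagnostics (coda is loaded along with MCMCpack) can be used to check mixing before trusting these summaries; for example:
# trace and density plots of the posterior draws
plot(fitBayes)
# effective sample size for each coefficient
effectiveSize(fitBayes)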
Hopefully, the above observations point in the right direction.
Citations:
R package mice: Multivariate Imputation by Chained Equations, reference manual, by Stef van Buuren
Flexible Imputation of Missing Data, by Stef van Buuren (online book)
Practical Predictive Analytics, by Ralph Winters
Simulation for Data Science with R, by Matthias Templ
Bayesian Data Analysis, Third Edition, by Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin
Related
Suppose I have a monthly income of 1200.
The interest rate is 1%, so after each year the income will increase by 1%.
I would like to find out how many years it will take until each investment breaks even.
Suppose an investment costs 200,000 with a monthly income of 1200.
My first-year and subsequent years' annual income will be:
firstYear = 1200 * 12
additionalYears = (1200*12)*(1+0.01)^c(1:5)
c(firstYear, additionalYears)
14400.00 14544.00 14689.44 14836.33 14984.70 15134.54
I would like to make the "5" in the above example dynamic until it finds the break-even point.
In this example I have:
sum(c(firstYear, additionalYears))
198854.3
So the investment has not broken even yet. Adjusting it to "12" gives me the break-even point:
firstYear = 1200 * 12
additionalYears = (1200*12)*(1+0.01)^c(1:12)
sum(c(firstYear, additionalYears))
If possible, I would also like to determine the month of the year in which it breaks even (so in this example it breaks even in month 12 of year 12, but others might break even in month 8 of year 6, etc.).
Instead of using a loop, you can use vectors. Set the maximum to 100 years, create a vector of 101 annual incomes to grow (years 0 to 100) and a vector of matching growth factors. Multiply the two and take the cumulative sum of the cost (negative) and the incomes. Count the number of times the sum is negative; that is your break-even point.
cost = -200000 # negative cost
income = 1200*12 # annual income
i = 0.01 # interest rate to grow income after year 0.
# repeat 14400 101 times, multiply it by (1+r)^n - R is vectorised
income100 = rep(income, 101) * ((1+i) ^ seq(0,100))
# subtract the cost from the cumulative sum of income
cumincome = cost + cumsum(income100)
# how many are negative?
yrs = sum(cumincome < 0)
# how much left to recover in next year
yrs + (-cumincome[yrs] / income100[yrs+1])
[1] 13.06991
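The question also asks for the month within the break-even year; one way to read the fractional result is to split it into whole years plus a month of the following year (a sketch reusing the vectors above; the month convention is an assumption):
# split the fractional break-even time into whole years and a month within the next year
break_even <- yrs + (-cumincome[yrs] / income100[yrs + 1])   # 13.06991 in this example
whole_years <- floor(break_even)
month_in_next_year <- ceiling((break_even - whole_years) * 12)
c(years = whole_years, month = month_in_next_year)
# years month
#    13     1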
Putting this into a function
break_even_years <- function(cost, income, interest=0, period = "monthly"){
if(cost >= 0) cost = -cost
if(period == "monthly") income = income * 12
income100 = rep(income, 101) * ((1+interest) ^ seq(0,100))
cumincome = cost + cumsum(income100)
# how many are negative?
yrs = sum(cumincome < 0)
# how much left to recover in next year
yrs + (-cumincome[yrs] / income100[yrs+1])
}
Using the function
purrr::map2(
.x = cost,
.y = investment,
~ break_even_years(.x, .y, interest = 0.03, "annual"))
[[1]]
[1] 28.90435
[[2]]
[1] 23.75858
[[3]]
[1] 6.391264
[[4]]
[1] 3.505453
Benchmarking
library(microbenchmark)
microbenchmark(break_even_years(200000,1000,0.01),
find_break_even_year(1000, 200000, 0.01, 100), times = 1000)
Unit: microseconds
                                          expr   min      lq      mean median      uq      max neval
         break_even_years(200000, 1000, 0.01)  50.9   87.10  257.4185  119.0  159.05  36938.0  1000
 find_break_even_year(1000, 200000, 0.01, 100) 853.5 1247.05 3432.5157 1556.2 2391.35 145980.6  1000
I think this answers my question. If anybody can help with avoiding the for loop, that would be very helpful.
library(tidyverse)
investment = c(1000, 2000, 5000, 27000)
interest_rate = 0.03
cost = c(45000, 67900, 34678, 98367)
max_years = 100
future_value = list()
find_break_even_year <- function(CF, investment, interest_rate, max_years){
for (year in 1:max_years){
#print(year)
future_value[[year]] <- CF * (1 + interest_rate)^year
future_value_sums = sum(unlist(future_value))
if(future_value_sums >= investment)
return(year)
}
}
purrr::map2(
.x = investment,
.y = cost,
~ find_break_even_year(.x, .y, interest_rate = 0.03, max_years = 100)
)
I ask the community for help on the following problem.
Suppose we estimate the following regression model, with Age (a factor variable) as a fixed effect and Year (a factor variable) as a random effect.
set.seed(123)
# Packages
library(brms)
#install.packages("rstan")
library(rstan)
# Data simulation
age <- 10:20
year <- 1990:2000
n1 <- length(age)
n2 <- length(year)
n <- n1*n2
y <- rnorm(n = n,mean = 50,sd = 30 )
beta_0 <- 0.25
beta_age <- 0.25
beta_year <- - 0.85
y <-beta_0+beta_age*age+beta_year*year
# Final Dataframe
d <- data.frame(y=y,
Age=as.factor(rep(age,n2)),
Year=as.factor(rep(year,n1))
)
fit <- brm(y ~ 0+Age+(1|Year), d, iter = 50, warmup = 10, chains = 1)
fixef(fit)
> fixef(fit)
Estimate Est.Error Q2.5 Q97.5
Age10 -1.5333147 1.3878804 -3.839673 1.0539617
Age11 -0.1285437 0.9321314 -1.221775 1.9525542
Age12 -2.1895797 1.2388774 -4.829345 -1.0598727
Age13 3.7247485 2.4157734 1.408428 8.4137437
Age14 -3.0134557 2.1435395 -8.148383 -1.1134809
Age15 -0.6631218 1.2442192 -2.122722 1.8919087
Age16 -1.2382832 1.6596973 -4.145329 0.3973843
Age17 1.7116087 2.4962776 -1.389026 5.3579521
Age18 -1.1881552 1.3884096 -4.155563 0.2962675
Age19 -0.6924640 1.0659378 -1.558155 1.3804710
Age20 -0.4457888 0.6615983 -1.763389 0.8225215
ranef(fit)
$Year
, , Intercept
Estimate Est.Error Q2.5 Q97.5
1990 -1.3702433 1.7680814 -3.2623355 0.885315
1991 -2.2592311 2.3086167 -4.6791703 1.036105
1992 1.7228745 2.5290462 -0.4616315 4.225590
1993 -0.8166271 1.2671842 -2.1690639 1.809690
1994 1.7928488 2.2368656 -0.6837609 4.464356
1995 0.7169620 1.8096562 -0.5032051 3.002480
1996 -0.1112231 0.7797995 -0.7787910 1.614591
1997 2.1406982 2.2829444 -0.2656490 4.370186
1998 -0.3399042 1.0476426 -1.3453246 2.147565
1999 1.7134314 1.9736618 -0.5384686 4.063977
2000 -0.5178907 1.0331019 -1.5312055 1.854528
Having obtained the coefficients, suppose we produce a 5-year forecast of the Year random effects and obtain Year_F, say c(1.0342, 0.9514, 0.9234, 0.8345, 0.7863), where length(Year_F) < length(Year).
How can I reconstruct the equation: y_F ~ 0 + Age + (1 | Year_F)
Thanks for your support
Here is the outline of my data. There are 500 students. Each student has a final grade for math, physics, chemistry, music, and history. The range of the final grade for each subject is 0 to 100. For each subject, if a student's grade is below a cutoff, the student fails that subject. However, the teacher of each subject may change a few students' (less than 5%) assessments from fail to pass due to their good performance in class activities. If a student fails any subject, the overall assessment is supposed to be fail; if a student passes all 5 subjects, the overall assessment is pass.
Now suppose the cutoffs for math, physics, chemistry, music, and history are 45, 45, 45, 60, 60, respectively. Then we will have the demo table below. The second student passed history because the history teacher was satisfied with his class performance.
ID math physics chemistry music history overall_assessment
1 95 96 70 65 75 pass
2 46 61 72 86 59 pass
3 55 32 21 95 96 fail
Now my question is: if I have the table above, how can I recover the cutoff for each subject? I have the data below in R.
set.seed(1)
math <- sample(30:100, 500, replace=T)
physics <- sample(30:100, 500, replace=T)
chemistry<- sample(30:100, 500, replace=T)
music<- sample(30:100, 500, replace=T)
history<- sample(60:100, 500, replace=T)
grade <- as.data.frame(cbind(math,physics,chemistry,music,history))
grade$assess <- ifelse(grade$math > 45 & grade$physics >55 & grade$chemistry > 60 & grade$music > 50 & grade$history > 80, "pass","fail")
grade$ID <- seq(1,500,1)
change_grade <- sample(1:500, 25, replace=F)
grade$assess[grade$ID %in% change_grade] <- "pass"
Because there is randomness in who is selected to pass for good class activity, it is not possible to find the exact cutoff values, but we can find upper and lower bounds for each cutoff. Note that I slightly adjusted the data generation; you can change it and confirm this method gives correct bounds no matter the true cutoffs.
library(tidyverse)
n <- 500
prop <- 0.05
set.seed(1)
math <- sample(30:100, n, replace = T)
physics <- sample(30:100, n, replace = T)
chemistry <- sample(30:100, n, replace = T)
music <- sample(30:100, n, replace = T)
history <- sample(30:100, n, replace = T)
grade <-
as.data.frame(cbind(math, physics, chemistry, music, history))
grade$assess <- ifelse(
grade$math >= 45 &
grade$physics >= 45 &
grade$chemistry >= 45 &
grade$music >= 60 &
grade$history >= 60,
"pass", "fail")
grade$ID <- seq(1, n, 1)
change_grade <- sample(1:n, n * prop, replace = F)
grade$assess[grade$ID %in% change_grade] <- "pass"
grade$assess <- factor(grade$assess)
To find the upper bound for a subject, we will consider all individuals who passed the assessment, and look at their grades in that subject. We know that at most 25 individuals were granted an exception for that subject (n * proportion of exceptions), so the grade of the 26th worst individual is an upper bound for the cutoff score.
# upper bound
get_upper_bound <- function(var, n, prop) {
var <- var[order(var)]
var[ceiling(n * prop) + 1]
}
upper_bound <- grade %>%
subset(assess == "pass") %>%
summarise(
math = get_upper_bound(math, n = n, prop = prop),
physics = get_upper_bound(physics, n = n, prop = prop),
chemistry = get_upper_bound(chemistry, n = n, prop = prop),
music = get_upper_bound(music, n = n, prop = prop),
history = get_upper_bound(history, n = n, prop = prop))
upper_bound
#> math physics chemistry music history
#> 1 57 53 58 68 67
Having now found an upper bound, we can look at the lower bounds. Consider all individuals who passed Math, Physics, Chemistry, and Music by achieving at least the upper bound in those subjects, but who also failed the assessment. Then we know that they must have failed the History subject. Looking at the maximum History grade in those students gives us a lower bound for the cutoff score for History. We can apply this for all different subjects.
This code is inelegant, but I believe it works.
# lower bound
get_lower_bound <- function(varnum, data, upper_bound) {
varnames = c("math", "physics", "chemistry", "music", "history")
vars_using <- c(1:5)
vars_using <- vars_using[-varnum]
indexes <- rep(TRUE, nrow(data))
for (i in vars_using) {
indexes <-
indexes & (data[, varnames[i]] >= as.numeric(upper_bound[i]))
}
indexes <- indexes & (data$assess == "fail")
ifelse(is.finite(max(data[indexes, varnum])),
max(data[indexes, varnum]) + 1,
min(data[, varnum]))
}
lower_bound <- data.frame(
"math" = get_lower_bound(1, grade, upper_bound),
"physics" = get_lower_bound(2, grade, upper_bound),
"chemistry" = get_lower_bound(3, grade, upper_bound),
"music" = get_lower_bound(4, grade, upper_bound),
"history" = get_lower_bound(5, grade, upper_bound))
lower_bound
#> math physics chemistry music history
#> 1 45 44 45 58 60
Then the final bounds for the cutoff scores are:
rbind("lower" = lower_bound,
"upper" = upper_bound)
#> math physics chemistry music history
#> lower 45 44 45 58 60
#> upper 57 53 58 68 67
Created on 2022-08-30 by the reprex package (v2.0.1)
Note that by increasing n and decreasing prop, eventually the lower bound and upper bound are equal, and we have found the cutoff score exactly.
I want to test for an interaction (in a Cox proportional hazards model) between type of transplant and disease type, using main effects and interaction terms, on data from the bone marrow transplant study at Ohio State University.
Here is the used code for the data:
time_Allo_NHL<- c(28,32,49,84,357,933,1078,1183,1560,2114,2144)
censor_Allo_NHL<- c(rep(1,5), rep(0,6))
df_Allo_NHL <- data.frame(group = "Allo NHL",
time = time_Allo_NHL,
censor = censor_Allo_NHL,
Z1 = c(90,30,40,60,70,90,100,90,80,80,90),
Z2 = c(24,7,8,10,42,9,16,16,20,27,5))
time_Auto_NHL<- c(42,53,57,63,81,140,176,210,252,476,524,1037)
censor_Auto_NHL<- c(rep(1,7), rep(0,1), rep(1,1), rep(0,1), rep(1,1), rep(0,1))
df_Auto_NHL <- data.frame(group = "Auto NHL",
time = time_Auto_NHL,
censor = censor_Auto_NHL,
Z1 = c(80,90,30,60,50,100,80,90,90,90,90,90),
Z2 = c(19,17,9,13,12,11,38,16,21,24,39,84))
time_Allo_HOD<- c(2,4,72,77,79)
censor_Allo_HOD<- c(rep(1,5))
df_Allo_HOD <- data.frame(group = "Allo HOD",
time = time_Allo_HOD,
censor = censor_Allo_HOD,
Z1 = c(20,50,80,60,70),
Z2 = c(34,28,59,102,71))
time_Auto_HOD<- c(30,36,41,52,62,108,132,180,307,406,446,484,748,1290,1345)
censor_Auto_HOD<- c(rep(1,7), rep(0,8))
df_Auto_HOD <- data.frame(group = "Auto HOD",
time = time_Auto_HOD,
censor = censor_Auto_HOD,
Z1 = c(90,80,70,60,90,70,60,100,100,100,100,90,90,90,80),
Z2 = c(73,61,34,18,40,65,17,61,24,48,52,84,171,20,98))
myData <- Reduce(rbind, list(df_Allo_NHL, df_Auto_NHL, df_Allo_HOD, df_Auto_HOD))
Here is the code for the interaction, but I'm not sure what should be written in place of myData$(here?) in the following code to be able to run it.
n<-length(myData$time)
n
for (i in 1:n){
if (myData$(here?)[i]==2)
myData$W1[i] <-1
else myData$W1[i]<-0
}
for (i in 1:n){
if (myData$(here?)[i]==2)
myData$W2[i] <-1
else myData$W2[i]<-0
}
myData
Coxfit.W<-coxph(Surv(time,censor)~W1+W2+W1*W2, data = myData)
summary(Coxfit.W)
An easy way is to split the group variable into its two components using the separate() function from the tidyr package.
library(tidyr)
library(survival)  # for Surv() and coxph() below
myData <- separate(myData, col=group, into=c("disease","transpl"))
head(myData)
disease transpl time censor Z1 Z2
1 Allo NHL 28 1 90 24
2 Allo NHL 32 1 30 7
3 Allo NHL 49 1 40 8
4 Allo NHL 84 1 60 10
5 Allo NHL 357 1 70 42
6 Allo NHL 933 0 90 9
Then you can put these two new variables (disease and transpl) into the Cox model, with an interaction term.
Coxfit.W<-coxph(Surv(time,censor)~transpl*disease, data = myData)
summary(Coxfit.W)
Call:
coxph(formula = Surv(time, censor) ~ transpl * disease, data = myData)
n= 43, number of events= 26
coef exp(coef) se(coef) z Pr(>|z|)
transplNHL -1.8212 0.1618 0.6747 -2.699 0.00695 **
diseaseAuto -1.6628 0.1896 0.6188 -2.687 0.00721 **
transplNHL:diseaseAuto 2.3050 10.0244 0.8494 2.714 0.00665 **
exp(coef) exp(-coef) lower .95 upper .95
transplNHL 0.1618 6.17946 0.04312 0.6073
diseaseAuto 0.1896 5.27387 0.05638 0.6377
transplNHL:diseaseAuto 10.0244 0.09976 1.89700 52.9720
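If you specifically want the 0/1 dummy variables W1 and W2 from your original code, they can be built from the two new columns (with group labels like "Allo NHL", the column named disease holds Allo/Auto and transpl holds NHL/HOD) and give a fit equivalent to the factor interaction above; a sketch:
# explicit 0/1 dummies: W1 = 1 for NHL disease, W2 = 1 for autologous transplant
myData$W1 <- as.numeric(myData$transpl == "NHL")
myData$W2 <- as.numeric(myData$disease == "Auto")
Coxfit.dummy <- coxph(Surv(time, censor) ~ W1 * W2, data = myData)
summary(Coxfit.dummy)  # coefficients should match the factor-based model above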
Let's consider this data set:
df <- data.frame(age= sample(c(20:90), 20, rep=T),
sex = sample(c('m', 'f'), 20, rep=T),
smoker=sample(c("never", "former", "active"), 20, rep=T),
size= sample (c(8:40), 20, rep=T),
fac = as.factor(sample(c("neg","lo","med","hi"), 20, rep=T)),
outcome = sample(c(0,1), 20, rep=T)
)
# let's introduce some missing data
for (i in (1:3)) {df[sample(c(1:20),1), sample(c(1:6),1)] <- NA}
In a medical manuscript the first table summarizes the population (or its subgroups as appropriate); here the rows would be age, sex, smoking status, etc., and the two outcomes would be listed in separate columns. The continuous variables are reported as means; the categorical variables as counts.
I was wondering if there is a function that I am missing that creates such contingency tables. I can do this manually, but would love for it to update automatically if the data set changes. Ultimately I need the output in LaTeX.
The function would need to ignore missing data, but not delete those rows.
Asking too much?!
In medical articles, 'Table 1' summarizes the demographics of the study population, usually broken down by subgroup.
Generate data set
n <- 100
df <- data.frame(
age = sample(c(20:90), n, rep = T),
sex = sample(c("m", "f"), 20, rep = T, prob = c(0.55, 0.45)),
smoker = sample(c("never", "former", "active"), n, rep = T, prob = c(0.4, 0.45, 0.15)),
size = abs(rnorm(n, 20, 8)),
logitest = sample(c(TRUE, FALSE), n, rep = T, prob = c(0.1, 0.9)),
labtest = as.factor(sample(c("neg", "lo", "med", "hi"), n, rep = T, prob = c(0.4, 0.3, 0.2, 0.1))),
outcome = sample(c(0, 1), n, rep = T, prob = c(0.8, 0.2))
)
# let's introduce some missing data
for (i in (1:floor(n/6))) {
df[sample(c(1:n), 1), sample(c(1:ncol(df)), 1)] <- NA
}
head(df)
## age sex smoker size logitest labtest outcome
## 1 70 m former 39.17 NA med NA
## 2 51 f former 33.64 FALSE hi 1
## 3 58 f former 10.10 FALSE neg 1
## 4 30 m former 43.24 FALSE med 0
## 5 54 m former 22.78 FALSE lo 0
## 6 86 f former 8.20 FALSE neg 0
If working with a real data set, use it instead:
# df <- read.csv()
#you may need to eliminate some columns
#colnames(df)
#df0<-df #backup
#df <- df[,-c(1,...,27:38)]
Change this as needed: the column with the diagnosis has to be removed from the variables list!
dx <- 7 #index of outcome/diagnosis
####################################
summary(df[, -dx])
## age sex smoker size logitest
## Min. :20.0 f :44 active:19 Min. : 0.91 Mode :logical
## 1st Qu.:42.5 m :54 former:49 1st Qu.:15.00 FALSE:85
## Median :58.0 NA's: 2 never :30 Median :20.12 TRUE :12
## Mean :57.3 NA's : 2 Mean :20.44 NA's :3
## 3rd Qu.:74.0 3rd Qu.:27.10
## Max. :88.0 Max. :43.24
## NA's :1 NA's :2
## labtest
## hi : 4
## lo :29
## med :20
## neg :45
## NA's: 2
##
##
attach(df)
Build list of vars
vars <- colnames(df)
vars
## [1] "age" "sex" "smoker" "size" "logitest" "labtest"
## [7] "outcome"
catvars <- NULL #categorical variables
contvars <- NULL #continuous variables
logivars <- NULL #logic variables
vars <- vars[-dx]
vars
## [1] "age" "sex" "smoker" "size" "logitest" "labtest"
for (i in 1:length(vars)) {
ifelse(is.factor(df[, i]), catvars <- c(catvars, vars[i]), ifelse(is.logical(df[,
i]), logivars <- c(logivars, vars[i]), contvars <- c(contvars, vars[i])))
}
contvars
## [1] "age" "size"
catvars
## [1] "sex" "smoker" "labtest"
logivars
## [1] "logitest"
Create the subgroups
bg <- df[df[, dx] == 0 & !is.na(df[, dx]), ]
nrow(bg) #; bg
## [1] 73
mg <- df[df[, dx] == 1 & !is.na(df[, dx]), ]
nrow(mg) #; mg
## [1] 23
indet <- df[is.na(df[, dx]), ]
nrow(indet)
## [1] 4
indet
## age sex smoker size logitest labtest outcome
## 1 70 m former 39.173 NA med NA
## 9 87 m former 23.621 FALSE lo NA
## 18 65 m former 2.466 FALSE <NA> NA
## 67 88 f former 17.575 FALSE med NA
For continuous variables
normality testing
normality <- NULL
for (i in 1:length(contvars)) {
j <- which(vars == contvars[i]) #find position of variable in the original data frame and its subsets
st <- shapiro.test(df[, j]) #normality testing on all patients, bg and mg alike
normality <- c(normality, st$p.value) #normality testing on all patients, bg and mg alike
}
normality
## [1] 0.00125 0.73602
comparing the means of two samples; if normal, use t-test, otherwise wilcoxon
ttpvalue <- NULL
for (i in 1:length(contvars)) {
j <- which(vars == contvars[i]) #find position of variable in the original data frame and its subsets
## if normal, use t-test, otherwise wilcoxon if shapiro p<.05 then pop
## likely NOT normally dist
ifelse(normality[i] < 0.05, tt <- wilcox.test(bg[, j], mg[, j]), tt <- t.test(bg[,
j], mg[, j]))
ttpvalue <- c(ttpvalue, tt$p.value) ##if t-test p<.05 then pop likely have different means
}
ttpvalue
## [1] 0.6358 0.3673
contvarlist <- list(variables = contvars, normality = normality, ttest.by.subgroup = ttpvalue)
For categorical variables
chisqpvalue <- NULL
for (i in 1:length(catvars)) {
j <- which(vars == catvars[i]) #find position of variable in the original data frame and its subsets
tbl <- table(df[, j], df[, dx])
chisqtest <- summary(tbl)
chisqpvalue <- c(chisqpvalue, chisqtest$p.value)
}
chisqpvalue
## [1] 0.01579 0.77116 0.39484
catvarlist <- list(variables = catvars, chisq.by.subgroup = chisqpvalue)
For logic variables
proppvalue <- NULL
for (i in 1:length(logivars)) {
j <- which(vars == logivars[i]) #find position of variable in the original data frame and its subsets
tbl <- table(df[, j], df[, dx])
chisqtest <- summary(tbl)
proppvalue <- c(proppvalue, chisqtest$p.value)
}
proppvalue
## [1] 0.5551
logivarlist = list(variables = logivars, chisq.by.subgroup = proppvalue)
And now, the results!
str(contvarlist) #if shapiro p<.05 then pop likely NOT normally dist; if t-test p<.05 then pop likely have different means
## List of 3
## $ variables : chr [1:2] "age" "size"
## $ normality : num [1:2] 0.00125 0.73602
## $ ttest.by.subgroup: num [1:2] 0.636 0.367
str(catvarlist) #if chisq p<.05 then variables are likely NOT independent
## List of 2
## $ variables : chr [1:3] "sex" "smoker" "labtest"
## $ chisq.by.subgroup: num [1:3] 0.0158 0.7712 0.3948
str(logivarlist) #if chisq p<.05 then variables are likely NOT independent
## List of 2
## $ variables : chr "logitest"
## $ chisq.by.subgroup: num 0.555
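The question also asks for LaTeX output; one simple option (a sketch, assuming the xtable package is installed) is to collect these results into a data frame and print it as a LaTeX table:
library(xtable)
# assemble a small 'Table 1'-style block for the continuous variables
tab_cont <- data.frame(variable = contvarlist$variables,
                       shapiro.p = round(contvarlist$normality, 4),
                       comparison.p = round(contvarlist$ttest.by.subgroup, 4))
print(xtable(tab_cont, caption = "Continuous variables by outcome group"),
      include.rownames = FALSE)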