In advance I'd like to apologize for the length of this post :S
[EDITS BELOW]
I would like to do a sample size calculation for a cluster randomized trial with a binary outcome.
Outcome is whether patients make use of a help service or not; Y= use help service (Y/N)
Patients are nested within clinics (five types) which are nested within hospitals.
We have an education program directed at hospital staff to promote the help service, and the main question to be answered by the trial is ‘Does the education program increase the proportion of patients making use of the help line in the first 3 months after the completion of the education program?’. The education program will be rolled out in ‘experimental hospitals’; nothing will be done in control hospitals.
Not all patients will be using the service, as not all of them will have unmet needs after talking to their clinician/nurses. The level of unmet needs is believed to differ between clinics (clinics providing more severe treatments have higher unmet needs, and thus a larger group of ‘potential helpline users’).
Expected percentage of patients using the help service in control sites = 10%
Expected percentage of patients using the help service in experimental sites = 20%
The intraclass correlation coefficient (ICC) is unknown, but other studies have reported ICCs of 0.09 and 0.15. I am interested in assessing the sample size over the ICC range of 0 to 0.20.
For now, I assume that the education program can be rolled out instantaneously in all hospitals randomized to the experimental treatment and that patients only visit one clinic. The data available to assess the effectiveness of the program are:
• Number of patients seen per clinic in the three month period
• Number of patients using the help service in the three month period
We do not have individual patient data (as some might like to stay anonymous when using the help service), but aggregate data at the clinic level.
I have found the R package ‘CRTSize’, which can calculate the number of clusters needed for two-level data (patients within hospitals):
library(CRTSize)

pc. <- 0.10   # expected proportion using the help service in control hospitals
pe. <- 0.20   # expected proportion in experimental hospitals
ICCsteps <- c(0.0000001, 0.05, 0.10, 0.15)   # ~0 used instead of exactly 0
Msteps <- c(25, 50, 100, 200, 300)           # patients per cluster

results <- data.frame(pe=c(), pc=c(), m=c(), ICC=c(), Clusters_contr=c(), Clusters_exp=c())
for(i in seq_along(ICCsteps)){
  for(m in seq_along(Msteps)){
    print(paste("ICC=", ICCsteps[i], "m=", Msteps[m]))
    mm <- n4props(pe=pe., pc=pc., m=Msteps[m], ICC=ICCsteps[i], alpha=0.05, power=0.80,
                  AR=1, two.tailed=TRUE, digits=3)
    results_temp <- data.frame(pe=pe., pc=pc., m=mm$m, ICC=mm$ICC,
                               Clusters_contr=ceiling(mm$nC), Clusters_exp=ceiling(mm$nE))
    results <- rbind(results, results_temp)
  }
}
results
results$TotalClustersNeeded <- results$Clusters_contr + results$Clusters_exp

par(mar=c(4,4,3,4))
plot(results$m, results$TotalClustersNeeded, type="n", xlab="Number of samples within cluster",
     ylab="Total number of clusters needed (half control/half intervention)", las=1); grid()
title(paste("Total # clusters needed to have 80% power with \n alpha 0.05 to show a difference between",
            pc.*100, "% vs. ", pe.*100, "%"))
for(i in seq_along(ICCsteps)){
  points(results$m[results$ICC == ICCsteps[i]], results$TotalClustersNeeded[results$ICC == ICCsteps[i]], type="l")
  text(max(Msteps)*0.90, min(results$TotalClustersNeeded[results$ICC == ICCsteps[i]])+3,
       paste("ICC=", ICCsteps[i]))
}
So with around 200 patients per hospital and an ICC of 0.15, I'd need about 65 hospitals (half randomized to control, half to intervention).
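As a rough cross-check of the CRTSize numbers, the standard design-effect formula gives about the same answer (a quick sketch, assuming equal cluster sizes and a simple two-proportion z-test):
pc <- 0.10; pe <- 0.20; m <- 200; ICC <- 0.15
z <- qnorm(0.975) + qnorm(0.80)                          # alpha = 0.05 two-sided, power = 0.80
n_indiv <- z^2 * (pc*(1-pc) + pe*(1-pe)) / (pe - pc)^2   # n per arm if individually randomized (~196)
deff <- 1 + (m - 1) * ICC                                # design effect (variance inflation factor)
ceiling(n_indiv * deff / m)                              # ~31 clusters per arm, i.e. ~62 hospitals in total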
However, I don't think CRTSize can model the three levels I will have in my data (patients/clinics/hospitals). Additional complications are that hospitals will vary in the number of patients seen and that not all hospitals will have all types of clinics, as some hospitals are more specialized.
So I am trying to simulate some data.
I don't know how to simulate data with specific ICCs, but I think I found a way around it (see below).
Patients in each control hospital have a 10% probability of calling the help line (20% in intervention hospitals). I have added a hospital-specific error around these percentages. Each hospital has between 1 and 4 clinics. The proportion used to draw x samples from the binomial distribution is the same for all clinics within a hospital (x is the number of patients in a clinic).
I have tweaked the hospital error so that the ICC comes out around 0.15. However, the ICC varies from simulation to simulation. I thought that when I run many simulations I'd get enough simulations within, say, 0.145-0.155 to say something about power at an ICC of 0.15.
This is what I have done so far:
library(reshape2)
library(lme4)
library(plyr)
# Parameters
hospitals <- 30; h <- 1 # set up this way to be able to loop over several sample sizes; not used at the moment
clinics <- data.frame(clinic         = c("CT", "RT", "Surg", "Pal"),
                      prop.clin.hosp = c(0.70, 0.80, 0.50, 0.40),
                      Unmet.needs    = c("45", "30", "20", "10"),   # unmet needs currently not included in the simulation; assumed equal between clinics
                      prop.pts       = c(0.35, 0.35, 0.20, 0.10))
clinics # most pts go through CT and RT, fewest through palliative care
prop.Control <- 0.5 # Proportion of hospitals allocated to control
Min.Pts.Treated.Per.Clinic <- 50 # No pts for full 6 month period for comparison (actually, its per hospital)
Max.Pts.Treated.Per.Clinic <- 300 # No pts for full 6 month period for comparison (actually, its per hospital)
Prpo_pts_calls_CONTROL <- 0.10
Prpo_pts_calls_INTERVENTION <- 0.20
N_simulations <- 100 #
Hosp_error_min <- -0.12
Hosp_error_max <- abs(Hosp_error_min)
# Empty dataframe to fill
results <- data.frame(N=c(),
mean.pt.per.hosp=c(),
pts_contr=c(),
pts_exp=c(),
total.var=c(),
Sf=c(),
Sl=c(),
Se=c(),
Sd=c(),
ICC=c(),
p=c())
for(i in 1:N_simulations){
cat("\r", "Run =", i, "of", N_simulations)
# Data generation
data <- data.frame(Hosp = c(1:hospitals[h]),
                   CT   = rbinom(hospitals[h], 1, clinics$prop.clin.hosp[clinics$clinic == "CT"]),
                   RT   = rbinom(hospitals[h], 1, clinics$prop.clin.hosp[clinics$clinic == "RT"]),
                   Surg = rbinom(hospitals[h], 1, clinics$prop.clin.hosp[clinics$clinic == "Surg"]),
                   Pal  = rbinom(hospitals[h], 1, clinics$prop.clin.hosp[clinics$clinic == "Pal"]),
                   hosp_err = runif(hospitals[h], Hosp_error_min, Hosp_error_max))
data$Arm <- c(rep("Control", round(hospitals[h]*prop.Control, 0)), rep("Intervention", hospitals[h] - round(hospitals[h]*prop.Control, 0)))
data$Pts=round(runif(hospitals[h], Min.Pts.Treated.Per.Clinic, Max.Pts.Treated.Per.Clinic), 0)
data$CT[(data$CT+data$RT+data$Surg+data$Pal)==0] <- 1 # If no clinics assigned, assign CT
data <- melt(data, id=c("Hosp","Arm", "Pts", "hosp_err"), variable.name="clinic")
data <- subset(data, value==1); data <- data[-dim(data)[2]] # keep only the clinics present in each hospital and drop the indicator column
data <- merge(data, clinics[c("clinic", "Unmet.needs", "prop.pts")], by="clinic")
data$Pt.per.clin <- ceiling(data$Pts * data$prop.pts)
data$simpropcalls[data$Arm == "Control"] <- Prpo_pts_calls_CONTROL
data$simpropcalls[data$Arm == "Intervention"] <- Prpo_pts_calls_INTERVENTION
data$simpropcalls2 <- data$simpropcalls + data$hosp_err; data$simpropcalls2[data$simpropcalls2 <0] <- 0 # Add error to introduce between hospital variation
for(row in c(1: dim(data)[1])){data$Pts.called[row]<- rbinom(1, data$Pt.per.clin[row], data$simpropcalls2[row])}
# for(row in c(1: dim(data)[1])){data$Pts.called2[row]<- rbinom(1, data$Pt.per.clin[row], data$simpropcalls[row])}
data$propcalls <- round(data$Pts.called/data$Pt.per.clin , 2)
data <- data[with(data, order(data$Hosp)), ]
head(data, 20)
# Analysis and extraction of results
summary(m1 <- glmer(cbind(Pts.called, Pt.per.clin-Pts.called) ~ Arm + (1|Hosp), family=binomial, data=data))
# Few lines to estimate ICC
X <- model.matrix(m1)
n <- nrow(X)
Beta <- fixef(m1)
Sf <- var(X %*% Beta) # Variance explained by fixed effect
(Sigma.list <- VarCorr(m1))
Sl <- as.numeric(VarCorr(m1)) # Between-hospital (random intercept) variance
Se <- 1
Sd <- pi^2/3
total.var <- Sf + Sl + Se + Sd
(ICCtemp <- Sl/total.var)
#Store results
results.temp <- data.frame(N=hospitals[h],
mean.pt.per.hosp=mean(ddply(data, .(Hosp), summarize, mean.hosp.size = sum(Pt.per.clin))[[2]]),
pts_contr=sum(data$Pt.per.clin[data$Arm == "Control"]),
pts_exp=sum(data$Pt.per.clin[data$Arm == "Intervention"]),
total.var=total.var,
Sf=Sf,
Sl=Sl,
Se=Se,
Sd=Sd,
ICC=as.numeric(ICCtemp),
p=summary(m1)$coefficients[2,4])
results <- rbind(results, results.temp)
} # END LOOP
# After running the code above 100s of times,
hist(results$ICC)
results$ICC.class <- cut(results$ICC, seq(0, max(results$ICC), 0.01))
results$sign <- ifelse(results$p <0.05, 1, 0)
res <- cbind(data.frame(round(table(results$ICC.class[results$sign == 1])/table(results$ICC.class)*100, 2)),
             data.frame(table(results$ICC.class))[2])
names(res) <- c("ICC", "Power", "simulations"); res # Obviously this only works with a high number of simulations
# write.csv(res, paste("_Power calculation_", Prpo_pts_calls_CONTROL*100, "% vs", Prpo_pts_calls_INTERVENTION*100, "%_",hospitals[h],"hosp_hosp_err",Hosp_error_max, ".csv"))
# write.csv(results, paste("_RAW DATA_Power calculation_", Prpo_pts_calls_CONTROL*100, "% vs", Prpo_pts_calls_INTERVENTION*100, "%_",hospitals[h],"hosp_hosp_err",Hosp_error_max, ".csv"))
When I run 10k simulations, I get the following (unexpected) result:
(This is for 10% vs. 20%, N = 30 hospitals (15 + 15), with the hospital-specific deviation from the target proportion sampled from a uniform distribution between -0.15 and 0.15.)
ICC Power simulations
(0,0.01] NA 0
(0.01,0.02] 100 1
(0.02,0.03] 100 3
(0.03,0.04] 62.5 16
(0.04,0.05] 66.67 45
(0.05,0.06] 79.63 108
(0.06,0.07] 81.37 161
(0.07,0.08] 75.46 269
(0.08,0.09] 82.02 356
(0.09,0.1] 79.31 464
(0.1,0.11] 80.47 558
(0.11,0.12] 82.58 706
(0.12,0.13] 82.13 705
(0.13,0.14] 83.44 779
(0.14,0.15] 83.25 794
(0.15,0.16] 85.37 752
(0.16,0.17] 85.79 760
(0.17,0.18] 88.64 678
(0.18,0.19] 92.78 609
(0.19,0.2] 88.81 536
(0.2,0.21] 88.31 402
(0.21,0.22] 87.98 341
(0.22,0.23] 92.94 255
(0.23,0.24] 93.07 202
(0.24,0.25] 86.96 138
(0.25,0.26] 92.44 119
(0.26,0.27] 87.36 87
(0.27,0.28] 96.77 62
(0.28,0.29] 97.78 45
(0.29,0.3] 90 20
(0.3,0.31] 100 9
(0.31,0.32] 100 6
(0.32,0.33] 88.89 9
(0.33,0.34] 100 1
Obviously, the ‘power’ is only valid over the range for which many simulations were run. For example, we see that for an ICC of around 0.15 we would have about 83-85% power to detect a difference between control and experimental hospitals when using 30 hospitals in total (15 control + 15 experimental). One thing that is clear is that the power does NOT decrease with increasing ICC, which is what we would expect. So something is wrong in the code…
Could it have to do with the way the ICC is calculated? An increase in ICC means either an increase in between-hospital variance, a decrease in within-hospital variance, or both.
Not sure how to proceed…
Comments welcome!
EDIT 1: I guess it's not really fair to look at the power conditional on the observed ICC; I should be looking at the power under the true ICC. My biggest problem is then how to simulate data sets with a fixed underlying ICC...
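One way (just a sketch, with new variable names rather than the ones from the simulation above) to get a fixed underlying ICC is to generate the hospital effects on the logit scale, where the latent-scale ICC is sigma^2 / (sigma^2 + pi^2/3):
ICC_target <- 0.15
sigma2  <- ICC_target * (pi^2/3) / (1 - ICC_target)  # between-hospital variance on the logit scale
n_hosp  <- 30
m       <- 200                                       # patients per hospital (could be made variable)
arm     <- rep(c(0, 1), each = n_hosp/2)             # 0 = control, 1 = intervention
b0      <- qlogis(0.10)                              # control log-odds
b1      <- qlogis(0.20) - qlogis(0.10)               # intervention effect on the logit scale
u       <- rnorm(n_hosp, 0, sqrt(sigma2))            # hospital random intercepts
p       <- plogis(b0 + b1*arm + u)                   # hospital-level probability of calling
calls   <- rbinom(n_hosp, size = m, prob = p)        # aggregate outcome per hospital
On this scale the ICC is fixed by construction (and it is the scale on which glmer estimates it); the 10% and 20% proportions then hold only approximately on the marginal scale.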
I have installed a bunch of CO2 loggers in water that log CO2 every hour for the open water season. I have characterized the loggers at 3 different concentrations of CO2 before and after installing them.
• I assume that the seasonal drift in the error is linear
• I assume that the error between my characterization points is linear
My script is based on a for loop that goes through each timestamp and corrects the value. This works but is unfortunately not fast enough. I know that this can be done within a second, but I am not sure how. I would be grateful if someone could show me how.
Reproducible example using base R:
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#generate dummy dataframe
dummy <- data.frame(dt,co2)
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
diff.pre <- measured.pre-actual#diff at precharacterization
diff.post <- measured.post-actual#diff at post
#linear interpolation of how deviance from actual values change throughout the season
#I assume that the temporal drift is linear
diff.0 <- seq(diff.pre[1],diff.post[1],length.out=length(dummy$dt))
diff.400 <- seq(diff.pre[2],diff.post[2],length.out = length(dummy$dt))
diff.1000 <- seq(diff.pre[3],diff.post[3],length.out = length(dummy$dt))
#creates a data frame with the assumed drift at each increment throughout the season
dummy <- data.frame(dummy,diff.0,diff.400,diff.1000)
#this loop makes a 3-point calibration at each day in the dummy data set
co2.corrected <- vector()
for(i in 1:nrow(dummy)){
print(paste0("row: ",i))#to show the progress of the loop
diff.0 <- dummy$diff.0[i]#get the differences at characterization increments
diff.400 <- dummy$diff.400[i]
diff.1000 <- dummy$diff.1000[i]
#values below are only used for encompassing the range of measured values in the characterization
#this is based on the interpolated difference at the given time point and the known concentrations used
measured.0 <- diff.0+0
measured.400 <- diff.400+400
measured.1000 <- diff.1000+1000
#linear difference between calibration at 0 and 400
seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
#linear difference between calibration at 400 and 1000
seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
#bind them together to get one vector
correction.ppm <- c(seg1,seg2)
#the complete range of measured co2 in the characterization.
#in reality it can not be below 0 and thus it can not be below the minimum measured in the range
measured.co2.range <- round(seq(measured.0,measured.1000,length.out=length(correction.ppm)))
#generate a table from which we can characterize the measured values from
correction.table <- data.frame(measured.co2.range,correction.ppm)
co2 <- dummy$co2[i] #measured co2 at the current row
#find the measured value in the table and extract the difference
diff <- correction.table$correction.ppm[match(co2,correction.table$measured.co2.range)]
#correct the value and save it to vector
co2.corrected[i] <- co2-diff
}
#generate column with calibrated values
dummy$co2.corrected <- co2.corrected
This is what I understand after reviewing the code. You have a series of CO2 concentration readings, but they need to be corrected based on characterization measurements taken at the beginning of the timeseries and at the end of the timeseries. Both sets of characterization measurements were made using three known concentrations: 0, 400, and 1000.
Your code appears to be applying bilinear interpolation (over time and concentration) to compute the needed correction. This is easy to vectorize:
set.seed(1)
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
# interpolate the reference concentrations over time
cref <- mapply(seq, measured.pre, measured.post, length.out = length(dt))
#generate dummy dataframe with corrected values
dummy <- data.frame(
dt,
co2,
co2.corrected = ifelse(
co2 < cref[,2],
actual[1] + (co2 - cref[,1])*(actual[2] - actual[1])/(cref[,2] - cref[,1]),
actual[2] + (co2 - cref[,2])*(actual[3] - actual[2])/(cref[,3] - cref[,2])
)
)
head(dummy)
#> dt co2 co2.corrected
#> 1 2022-08-01 00:00:00 537 416.1905
#> 2 2022-08-01 01:00:00 618 493.2432
#> 3 2022-08-01 02:00:00 516 395.9776
#> 4 2022-08-01 03:00:00 760 628.2707
#> 5 2022-08-01 04:00:00 633 507.2542
#> 6 2022-08-01 05:00:00 518 397.6533
I do not know exactly what you are calculating (I feel that this could be done differently), but you can increase the speed by:
• removing print(), which takes a lot of time inside a loop
• removing the data.frame creation in each iteration, which is slow and not needed here
This loop should be faster:
for(i in 1:nrow(dummy)){
diff.0 <- dummy$diff.0[i]
diff.400 <- dummy$diff.400[i]
diff.1000 <- dummy$diff.1000[i]
measured.0 <- diff.0+0
measured.400 <- diff.400+400
measured.1000 <- diff.1000+1000
seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
correction.ppm <- c(seg1,seg2)
s <- seq(measured.0,measured.1000,length.out=length(correction.ppm))
measured.co2.range <- round(s)
co2 <- dummy$co2[i]
diff <- correction.ppm[match(co2, measured.co2.range)]
co2.corrected[i] <- co2-diff
}
P.S. From my testing, the slowest part is now round(s). Maybe that can be removed or rewritten...
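For example, the table build, round() and match() could be dropped entirely by interpolating the correction directly with approx() (just a sketch; results may differ by a fraction of a ppm from the rounded-table version, and rule = 2 clamps readings outside the characterized range instead of returning NA):
for(i in 1:nrow(dummy)){
  co2 <- dummy$co2[i]
  # known points: (measured value at characterization, correction) for the 0, 400 and 1000 ppm standards
  diff <- approx(x = c(dummy$diff.0[i] + 0, dummy$diff.400[i] + 400, dummy$diff.1000[i] + 1000),
                 y = c(dummy$diff.0[i], dummy$diff.400[i], dummy$diff.1000[i]),
                 xout = co2, rule = 2)$y
  co2.corrected[i] <- co2 - diff
}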
The question:
A screening test for a disease that affects 0.05% of the male population is able to identify the disease in 90% of cases where an individual actually has the disease. The test, however, generates 1% false positives (it gives a positive reading when the individual does not have the disease). Find the probability that a man has the disease given that he has tested positive. Then find the probability that a man has the disease given that he has a negative test.
My wrong attempt:
I first started by letting:
• T be the event that a man has a positive test
• Tc be the event that a man has a negative test
• D be the event that a man actually has the disease
• Dc be the event that a man does not have the disease
Therefore we need to find P(D|T) and P(D|Tc)
Then I wrote this code:
set.seed(110)
sims = 1000
D = rep(0, sims)
Dc = rep(0, sims)
T = rep(0, sims)
Tc = rep(0, sims)
# run the loop
for(i in 1:sims){
# flip to see if we have the disease
flip = runif(1)
# if we got the disease, mark it
if(flip <= .0005){
D[i] = 1
}
# if we have the disease, we need to flip for T and Tc,
if(D[i] == 1){
# flip for S1
flip1 = runif(1)
# see if we got S1
if(flip1 < 1/9){
T[i] = 1
}
# flip for S2
flip2 = runif(1)
# see if we got S1
if(flip2 < 1/10){
Tc[i] = 1
}
}
}
# P(D|T)
mean(D[T == 1])
# P(D|Tc)
mean(D[Tc == 1])
I'm really struggling so any help would be appreciated!
Perhaps the best way to think through a conditional probability question like this is with a concrete example.
Say we tested one million individuals in the population. Then 500 individuals (0.05% of one million) would be expected to have the disease, of whom 450 would be expected to test positive and 50 to test negative (since the false negative rate is 10%).
Conversely, 999,500 would be expected to not have the disease (one million minus the 500 who do have the disease), but since 1% of them would test positive, then we would expect 9,995 people (1% of 999,500) with false positive results.
So, given a positive test result taken at random, it either belongs to one of the 450 people with the disease who tested positive, or one of the 9,995 people without the disease who tested positive - we don't know which.
This is the situation in the first question, since we have a positive test result but don't know whether it is a true positive or a false positive. The probability of our subject having the disease given their positive test is the probability that they are one of the 450 true positives out of the 10,445 people with positive results (9995 false positives + 450 true positives). This boils down to the simple calculation 450/10,445 or 0.043, which is 4.3%.
Similarly, a negative test taken at random belongs either to one of the 989,505 (999,500 - 9,995) people without the disease who tested negative, or to one of the 50 people with the disease who tested negative, so the probability of having the disease is 50/989,555, or about 0.005%.
I think this question is meant to demonstrate the importance of taking disease prevalence into account when interpreting test results, and has very little to do with programming or R. It requires only a calculator (at most).
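If you do want to check those counts in R rather than on a calculator, they can be computed directly (a quick sketch):
p_disease <- 0.0005; sens <- 0.9; fpr <- 0.01; N <- 1e6
TP <- N * p_disease * sens            # 450 true positives
FN <- N * p_disease * (1 - sens)      # 50 false negatives
FP <- N * (1 - p_disease) * fpr       # 9,995 false positives
TN <- N * (1 - p_disease) * (1 - fpr) # 989,505 true negatives
TP / (TP + FP)                        # P(disease | positive test) ~ 0.043
FN / (FN + TN)                        # P(disease | negative test) ~ 0.00005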
If you really wanted to run a simulation in R, you could do:
set.seed(1) # This makes the sample reproducible
sample_size <- 1000000 # This can be changed to get a larger or smaller sample
# Create a large sample of 1 million "people", using a 1 to denote disease and
# a 0 to denote no disease, with probabilities of 0.0005 (which is 0.05%) and
# 0.9995 (which is 99.95%) respectively.
disease <- sample(x = c(0, 1),
size = sample_size,
replace = TRUE,
prob = c(0.9995, 0.0005))
# Create an empty vector to hold the test results for each person
test <- numeric(sample_size)
# Simulate the test results of people with the disease, using a 1 to denote
# a positive test and 0 to denote a negative test. This uses a probability of
# 0.9 (which is 90%) of having a positive test and 0.1 (which is 10%) of having
# a negative test. We draw as many samples as we have people with the disease
# and put them into the "test" vector at the locations corresponding to the
# people with the disease.
test[disease == 1] <- sample(x = c(0, 1),
size = sum(disease),
replace = TRUE,
prob = c(0.1, 0.9))
# Now we do the same for people without the disease, simulating their test
# results, with a 1% probability of a positive test.
test[disease == 0] <- sample(x = c(0, 1),
size = sample_size - sum(disease),
replace = TRUE,
prob = c(0.99, 0.01))
Now that we have run our simulation, we can simply count the true positives, false positives, true negatives and false negatives by creating a contingency table:
contingency_table <- table(disease, test)
contingency_table
#> test
#> disease 0 1
#> 0 989566 9976
#> 1 38 420
and get the approximate probability of having the disease given a positive test like this:
contingency_table[2, 2] / sum(contingency_table[,2])
#> [1] 0.04040015
and the probability of having the disease given a negative test like this:
contingency_table[2, 1] / sum(contingency_table[,1])
#> [1] 3.83992e-05
You'll notice that the probability estimates from sampling are not that accurate because of how small some of the sampling probabilities are. You could simulate a larger sample, but it might take a while for your computer to run it.
Created on 2021-08-19 by the reprex package (v2.0.0)
To expand on Allan's answer, but relating it back to Bayes' theorem, if you prefer:
From the question, you know (converting percentages to probabilities):
P(D) = 0.0005, P(Dc) = 0.9995, P(T|D) = 0.9, P(Tc|D) = 0.1, P(T|Dc) = 0.01 and P(Tc|Dc) = 0.99.
Bayes' theorem gives P(D|T) = P(T|D)P(D) / (P(T|D)P(D) + P(T|Dc)P(Dc)), and similarly for P(D|Tc).
Plugging in:
P(D|T) = (0.9 × 0.0005) / (0.9 × 0.0005 + 0.01 × 0.9995) = 0.00045 / 0.010445 ≈ 0.043
P(D|Tc) = (0.1 × 0.0005) / (0.1 × 0.0005 + 0.99 × 0.9995) = 0.00005 / 0.989555 ≈ 0.00005
I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
data.frame(
id = 1:nPerson,
BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
)
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 observations, but hours for thousands of observations). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > -error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > -error. I use xb = xbsmall + 4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but you don't need to know that to solve the computation-speed problem).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
# Determine the range of monthly ages to loop over for this person
AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
# Loop over the monthly ages for this person and determine the outcome
for(t in (AgeMonthMin+1):AgeMonthMax){
# Indicator for whether Outcome was 1 last month
panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
# xb = xbsmall + 4.47 if Outcome was 1 last month
# Otherwise, xb = xbsmall
panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
# Outcome == 1 if xb > 0
panel$Outcome[panel$id==i & panel$AgeMonths==t] =
ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
}
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computation time:
• Something with cumsum()
• Some wonderful panel-data function that I do not know about
• Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::gather_by()
• An iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
• Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates, not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
}
x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
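If you also want to avoid the inner for loop, the same recursion can be expressed with Reduce(..., accumulate = TRUE). This is only a sketch of an alternative (generate_outcome2 is a new name, same logic as the function above):
generate_outcome2 <- function(x) {
  n <- nrow(x)
  # Outcome[t] = 1 if xbsmall[t] + 4.47*Outcome[t-1] > -error[t]; Outcome[1] kept as is
  x$Outcome <- Reduce(
    function(prev, i) as.integer(x$xbsmall[i] + 4.47 * prev > -x$error[i]),
    x = 2:n, init = x$Outcome[1], accumulate = TRUE
  )
  # xb for months 2..n, as in the original loop (month 1 left at its placeholder value)
  x$xb[-1] <- x$xbsmall[-1] + 4.47 * head(x$Outcome, -1)
  x
}
It can be dropped into the same split/lapply/rbind pipeline as generate_outcome.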
I am trying to analyse whether car company performance values differ significantly across countries. I am using ANOVA to do this.
Running ANOVA on my real dataset (30 countries, 1000 car companies and 90,000 measurement scores) gave every car company a p-value of zero.
Confused by this, I created a reproducible example (below) with 30 groups, 3 car companies and 90,000 random scores. Deliberately, I kept a constant score of 1 for the Benz company, where you shouldn't see any difference between countries. After running the ANOVA, I see a p-value of 0.46 instead of 1.
Does anyone know why this is?
Reproducible example
set.seed(100000)
qqq <- 90000
df = data.frame(id = c(1:90000),
                country = rep(c("usa","usb","usc","usd","use","usf","usg","ush","usi","usj",
                                "usk","usl","usm","usn","uso","usp","usq","usr","uss","ust",
                                "usu","usv","usw","usx","usy","usz","usaa","usab","usac","usad"),
                              each = 3000),
                tesla = runif(90000),
                bmw = runif(90000),
                benz = rep(1, each = qqq))
str(df)
out<-data.frame()
for(j in 3:ncol(df)){
  amod2 <- aov(df[,j] ~ df$country)
  out[(j-2),1] <- colnames(df)[j]
  out[(j-2),2] <- summary(amod2, test = adjusted("bonferroni"))[[1]][[1,"Pr(>F)"]]
}
colnames(out)<-c("cars","pvalue")
write.table(out,"df.output")
df.output
"cars" "pvalue"
"1" "tesla" 0.245931589754359
"2" "bmw" 0.382730335188437
"3" "benz" 0.465083026215268
With respect to the "benz" p-value in your reproducible example: an ANOVA analysis requires positive variance (i.e., non-constant data). If you violate this assumption, the model is degenerate. Technically, the p-value is based on an F-statistic, which is a normalized ratio of the variance attributable to the "country" effect (for "benz" in your example, zero) to the residual variance (for "benz" in your example, also zero), so your F-statistic has the "value" 0/0, or NaN.
Because of the approach R takes to calculating the F-statistic (using a QR matrix decomposition to improve numerical stability in "nearly" degenerate cases), it calculates an F-statistic equal to 1 (with 29 and 89,970 degrees of freedom). This gives a p-value of:
> pf(1, 29, 89970, lower=FALSE)
[1] 0.465083
>
but it is, of course, largely meaningless.
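You can see the degenerate variance decomposition directly (a quick sketch; exactly what gets printed for the F ratio depends on numerical noise):
var(df$benz)                         # exactly 0: the response has no variance at all
anova(lm(benz ~ country, data = df)) # both sums of squares are ~0, so the F ratio is essentially 0/0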
With respect to your original problem, with large datasets relatively small effects will yield very small p-values. For example, if you add the following after your df definition above to introduce a difference in country usa:
df = within(df, {
o = country=="usa"
tesla[o] = tesla[o] + .1
bmw[o] = bmw[o] + .1
benz[o] = benz[o] + .1
rm(o)
})
you will find that out looks like this:
> out
cars pvalue
1 tesla 9.922166e-74
2 bmw 5.143542e-74
3 benz 0.000000e+00
>
Is this what you're seeing, or are you seeing all of them exactly zero?
I have survival data from an experiment in flies which examines rates of aging in various genotypes. The data are available to me in several layouts, so the choice of which to use is up to you, whichever suits the answer best.
One dataframe (wide.df) looks like this, where each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98, with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2<-data.frame(c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2),c(4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36))
colnames(df2)<-c("Exp","Deaths","Day")
What I would like to do is perform a Gompertz Analysis (See second paragraph of "the life table" here). The equation is:
μx = α*e^(β*x)
where μx is the probability of death at a given time, α is the initial mortality rate, and β is the rate of aging.
I would like to be able to get a dataframe which has α and β estimates for each of my ~640 genotypes for further analysis later.
I need help going from the above dataframes to an output of these values for each of my genotypes in R.
I have looked through the package flexsurv which may house the answer but I have failed in attempts to find and implement it.
This should get you started...
Firstly, for the flexsurvreg function to work, you need to specify your input data as a Surv object (from package:survival). This means one row per observation.
The first thing is to re-create the 'raw' data from the summary tables you provide.
(I know rbind is not efficient, but you can always switch to data.table for large sets).
### get rows with >1 death
df3 <- df2[df2$Deaths>1, 2:3]
### expand to give one row per death per time
df3 <- sapply(df3, FUN=function(x) rep(df3[, 2], df3[, 1]))
### each death is 1 (occurs once)
df3[, 1] <- 1
### add this to the rows with <=1 death
df3 <- rbind(df3, df2[!df2$Deaths>1, 2:3])
### convert to Surv object
library(survival)
s1 <- with(df3, Surv(Day, Deaths))
### get parameters for Gompertz distribution
library(flexsurv)
f1 <- flexsurvreg(s1 ~ 1, dist="gompertz")
giving
> f1$res
est L95% U95%
shape 0.165351912 0.1281016481 0.202602176
rate 0.001767956 0.0006902161 0.004528537
Note that this is an intercept-only model as all your genotypes are A.
You can loop this over multiple survival objects once you have re-created the per-observation data as above.
From the flexsurv docs:
The Gompertz distribution with shape parameter a and rate parameter b has hazard function
h(x | a, b) = b*e^(a*x)
So it appears your alpha is b, the rate, and beta is a, the shape.
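For the ~640 genotypes, once you have rebuilt per-observation data for all of them (here assumed to be in a data frame called raw with columns Exp, Day and Deaths, one row per fly, constructed as above for genotype A), a loop over genotypes could look like this (just a sketch):
library(survival)
library(flexsurv)
fit_one <- function(d) {
  f <- flexsurvreg(Surv(Day, Deaths) ~ 1, data = d, dist = "gompertz")
  data.frame(Exp   = d$Exp[1],
             beta  = f$res["shape", "est"],  # rate of aging (flexsurv's shape a)
             alpha = f$res["rate",  "est"])  # initial mortality rate (flexsurv's rate b)
}
gompertz.params <- do.call(rbind, lapply(split(raw, raw$Exp), fit_one))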