Sum of random Poisson numbers across a data frame - r

In a simulation I am programming, each iteration needs to compute, for every row of a data frame, the sum of a set of random Poisson numbers whose parameters (the number of draws and the Poisson mean) are stored in other columns of that row.
Here's a sample of the data (called 'studies' in the code below):
phase Sites enroll_rate rec_months stud_months enrolled m_enroll
51 2 1 2.95920139 2.0000000 5.000000 6 0
52 2 24 0.20784867 2.0000000 5.000000 10 0
53 2 3 0.46501736 3.0000000 6.000000 2 0
54 2 2 1.40480769 3.0000000 6.000000 7 0
55 2 1 1.31299020 5.0000000 7.000000 3 0
64 2 29 0.04373204 0.9712526 1.971253 2 0
And here's the code I've been using to achieve this:
for (j in 1:nrow(studies)) {
  studies$m_enroll[j] <- sum(rpois(studies$Sites[j], studies$enroll_rate[j]))
}
This does the job, but given that the data frame is hundreds of rows and I'm doing this simulation tens of thousands of times, it is quite inefficient.
I feel like there's a way to do this using one of the apply functions, but my experience with them is limited. Any ideas?

An sapply over the row indices does the same work far faster, mainly because it avoids repeatedly assigning into the data frame. Benchmarking on your sample data replicated to 18,000 rows:
studies <- studies[rep(1:6, 3000), ]
system.time(for (j in 1:nrow(studies)) {
  studies$m_enroll[j] <- sum(rpois(studies$Sites[j], studies$enroll_rate[j]))
})
user system elapsed
105.74 0.00 106.30
system.time(test <- sapply(1:nrow(studies), function(x)
  sum(rpois(studies$Sites[x], studies$enroll_rate[x]))))
user system elapsed
0.36 0.00 0.36
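Two further sketches, offered as alternatives rather than as part of the original answer: mapply() iterates over the two parameter columns directly, and, if only the distribution of the total matters, the fact that a sum of independent Poisson draws is itself Poisson lets you vectorise completely.
# Sketch 1: mapply() walks the two columns in lockstep (same result as the sapply version).
studies$m_enroll <- mapply(function(n, rate) sum(rpois(n, rate)),
                           studies$Sites, studies$enroll_rate)
# Sketch 2: the sum of n independent Poisson(rate) draws is Poisson(n * rate),
# so a single vectorised rpois() call is distributionally equivalent to summing draws.
studies$m_enroll <- rpois(nrow(studies), studies$Sites * studies$enroll_rate)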

Related

How can I make an object created within ddply available outside the function (in the global environment)?

I am trying to run glm on a number of individual subjects within a data set to return specific coefficients. I found an example here (Multiple glm in for loop) that works really well, except that the result prints to the screen but is not available afterwards. So I can't save the result or rename the variables unless I highlight the screen in RStudio and copy/paste into Excel.
Data:
Subject SNRs Prop_Correct Ntrials
1 3 0.65 100
1 0 0.40 100
1 -3 0.15 100
1 -6 0.00 100
1 -9 0.00 100
1 -12 0.00 100
2 3 0.65 100
2 0 0.40 100
2 -3 0.15 100
2 -6 0.00 100
2 -9 0.00 100
2 -12 0.00 100
3 3 0.65 100
3 0 0.40 100
3 -3 0.15 100
3 -6 0.00 100
3 -9 0.00 100
3 -12 0.00 100
My script:
ddply(Data, .(Subject), function(x) {
  intercept <- coef(summary(glm(Prop_Correct ~ SNRs, weights = Ntrials, family = quasibinomial(link = 'logit'), data = x)))[1]
  slope <- coef(summary(glm(Prop_Correct ~ SNRs, weights = Ntrials, family = quasibinomial(link = 'logit'), data = x)))[2]
  SNR50 <- (log(0.5/(1 - 0.5))/slope) - (intercept/slope)
  Data_Summary <- c(SNR50, slope)
})
Which gives me this output:
Subject V1 V2
1 1 1.266165 0.4834356
2 2 1.266165 0.4834356
3 3 1.266165 0.4834356
However:
Data_Summary
Error: object 'Data_Summary' not found
How can I make the result of the ddply function available to the main/global environment?
Assign the results to a variable:
Data_Summary_df <- ddply(Data, .(Subject), function(x) {
  # fit the model once instead of twice
  fit <- glm(Prop_Correct ~ SNRs, weights = Ntrials, family = quasibinomial(link = 'logit'), data = x)
  intercept <- coef(summary(fit))[1]
  slope <- coef(summary(fit))[2]
  SNR50 <- (log(0.5/(1 - 0.5))/slope) - (intercept/slope)
  c(SNR50, slope)
})
Data_Summary_df
Though there is a lot going on inside that call, treat it as you would any other function call: if the result is not assigned to a variable, it is simply printed; if you want to retain the result, assign it to a variable, e.g.
mean(1:25)
calc_mean <- mean(1:25)
calc_mean
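As an optional tidy-up (my suggestion, not part of the original answer): returning a named vector from the function makes ddply() label the output columns instead of falling back to V1 and V2.
# Inside the ddply() function body, return a named vector:
#   c(SNR50 = SNR50, slope = slope)
# or rename the finished data frame afterwards:
names(Data_Summary_df) <- c("Subject", "SNR50", "slope")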

Bootstrapping multiple columns with R

I'm relatively new to R and I'm trying to build a function that loops through the columns of an imported table and produces, for each column, the mean and a 95% confidence interval. Ideally it should be possible to bootstrap columns with different sample sizes, but first I would like to get the iteration working. I have something that sort of works, but I can't get it all the way there. This is what the code looks like, with the sample data and output included:
#cdata<-read.csv(file.choose(),header=T)#read data from selected file, works, commented out because data is provided below
#cdata #check imported data
#Sample Data
# WALL NRPK CISC WHSC LKWH YLPR
#1 21 8 1 2 2 5
#2 57 9 3 1 0 1
#3 45 6 9 1 2 0
#4 17 10 2 0 3 0
#5 33 2 4 0 0 0
#6 41 4 13 1 0 0
#7 21 4 7 1 0 0
#8 32 7 1 7 6 0
#9 9 7 0 5 1 0
#10 9 4 1 0 0 0
x<-cdata[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant species
i<-nrow(x) #count number of rows for bootstrapping
g<-ncol(x) #count number of columns for iteration
#build bootstrapping function, this works for the first column but doesn't iterate
bootfun <- function(bootdata, reps) {
  boot <- function(bootdata) {
    s1 = sample(bootdata, size = i, replace = TRUE)
    ms1 = mean(s1)
    return(ms1)
  } # a single bootstrap
  bootrep <- replicate(n = reps, boot(bootdata))
  return(bootrep)
} # replicates the bootstrap of "bootdata" "reps" times and outputs a vector of results
cvr1 <- bootfun(x$YLPR,50000) #have unsuccessfully tried iterating the location various ways (i.e. x[i])
cvrquantile<-quantile(cvr1,c(0.025,0.975))
cvrmean<-mean(cvr1)
vec<-c(cvrmean,cvrquantile) #puts results into a suitable form for output
vecr<-sapply(vec,round,1) #rounds results
vecr
2.5% 97.5%
28.5 19.4 38.1
#apply(x[1:g],2,bootfun) ##doesn't work in this case
#desired output:
#Species Mean LowerCI UpperCI
#WALL 28.5 19.4 38.1
#NRPK 6.1 4.6 7.6
#YLPR 0.6 0.0 1.6
I've also tried this using the boot package, and it works beautifully for iterating through the means, but I can't get it to do the same with the confidence intervals. The "ordinary" code above also has the advantage that you can easily retrieve the bootstrapping results, which might be used for other calculations. For the sake of completeness, here is the boot code:
#Bootstrapping using boot package
library(boot)
#data<-read.csv(file.choose(),header=TRUE) #read data from selected file
#x<-data[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant columns
#x #check data
#Sample Data
# WALL NRPK LKWH YLPR
#1 21 8 2 5
#2 57 9 0 1
#3 45 6 2 0
#4 17 10 3 0
#5 33 2 0 0
#6 41 4 0 0
#7 21 4 0 0
#8 32 7 6 0
#9 9 7 1 0
#10 9 4 0 0
i<-nrow(x) #count number of rows for resampling
g<-ncol(x) #count number of columns to step through with bootstrapping
boot.mean <- function(x, i) { mean(x[i]) } # bootstrap statistic: the mean of the resampled values
z<-boot(x, boot.mean,R=50000) #bootstrapping function, uses mean and number of reps
boot.ci(z,type="perc") #derive 95% confidence intervals
apply(x[1:g],2, boot.mean) #bootstrap all columns
#output:
#WALL NRPK LKWH YLPR
#28.5 6.1 1.4 0.6
I've gone through all of the resources I can find and can't seem to get things working. What I would like as output is the bootstrapped mean with the associated confidence interval for each column. Thanks!
Note: apply(x[1:g], 2, boot.mean) doesn't do any bootstrapping; it simply calculates the mean of each column.
For bootstrap mean and confidence interval, try this:
apply(x, 2, function(y) {
  b <- boot(y, boot.mean, R = 50000)
  c(mean(b$t), boot.ci(b, type = "perc", conf = 0.95)$percent[4:5])
})
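To reach the desired table layout, the matrix returned by apply() (one column per species; rows are the mean, lower CI, and upper CI) can be reshaped. A minimal sketch, assuming the result above is stored in res:
res <- apply(x, 2, function(y) {
  b <- boot(y, boot.mean, R = 50000)
  c(mean(b$t), boot.ci(b, type = "perc", conf = 0.95)$percent[4:5])
})
data.frame(Species = colnames(res),
           Mean    = round(res[1, ], 1),
           LowerCI = round(res[2, ], 1),
           UpperCI = round(res[3, ], 1),
           row.names = NULL)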

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance but for ecology (a cross-section of the population of any kind of wild fauna), so essentially censoring variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass = c(1, 2, 3, 4, 5, 6)
SampleSize = c(100, 99, 87, 46, 32, 19)
for (i in 1:6) {
  PropSurv = SampleSize/100
}
LifeTab1 = data.frame(AgeClass, Sample = SampleSize, PropSurv)
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each interval (DeathInt) by taking the number that survived in each row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so forth), so that it looks like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a numeric vector x, you can calculate the differences between successive elements with the diff function.
In your case it would be
LifeTab1$DeathInt <- c(-diff(LifeTab1$Sample), NA)
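As for the side notes: the for() loop isn't needed at all, since arithmetic in R is vectorised. A minimal sketch of building the whole table in one step (my addition, assuming survival proportions are taken relative to the first age class):
AgeClass   <- 1:6
SampleSize <- c(100, 99, 87, 46, 32, 19)
LifeTab1 <- data.frame(AgeClass,
                       Sample   = SampleSize,
                       PropSurv = SampleSize/SampleSize[1],  # survival relative to the first class
                       DeathInt = c(-diff(SampleSize), NA))  # deaths in each interval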

R while loop with vector condition

I want to vectorize a function that uses a while-loop.
The original function is
getParamsLeadtime <- function(leadtimeMean, in_tolerance, tolerance) {
  searchShape <- 0
  quantil <- 0
  # iterate the parameters until the required percentage of values lies within the tolerance interval
  while (quantil < in_tolerance) {
    searchShape <- searchShape + 1
    quantil <- pgamma(leadtimeMean + tolerance, shape = searchShape, rate = searchShape/leadtimeMean) -
      pgamma(leadtimeMean - tolerance, shape = searchShape, rate = searchShape/leadtimeMean)
  }
  leadtimeShape <- searchShape
  leadtimeRate <- searchShape/leadtimeMean
  return(c(leadtimeShape, leadtimeRate))
}
I would like to have a vectorized call to this function to apply it to a data frame. Currently I am looping through it:
leadtimes <- data.frame()
for (a in seq(92:103)) {
  leadtimes <- rbind(leadtimes, getParamsLeadtime(a, .85, 2))
}
When I tried to vectorize the function, the while did not seem to accept a vector as its condition. The following warning occurred:
Warning message:
In while (input["U"] < rep(tolerance, dim(input)[1])) { :
the condition has length > 1 and only the first element will be used
This leads me to suppose that while does not like vectors. Can you tell me how to vectorize the function?
On a side note, I wonder why the column names of the resulting leadtimes data frame appear to be values:
> leadtimes
X1 X1.1
1 1 1.000000
2 1 0.500000
3 4 1.333333
4 8 2.000000
5 13 2.600000
6 19 3.166667
7 25 3.571429
8 33 4.125000
9 42 4.666667
10 52 5.200000
11 63 5.727273
12 74 6.166667
Here's an option that is pretty performant.
We vectorize the calculation of pgamma for a given mean lead time, for both the +tol and the -tol case, over a sufficiently large sequence of candidate shape values shp. We take the (vectorized) difference and compare it to in_tol. The index of the first element of the vector that is greater than in_tol, minus 1 since our sequence starts at 0, is the lowest value of shp for which the pgamma difference exceeds in_tol.
f <- function(lead, in_tol, tol) {
  shp <- which(!(pgamma(lead + tol, 0:10000, (0:10000)/lead) -
                 pgamma(lead - tol, 0:10000, (0:10000)/lead)) < in_tol)[1] - 1
  rate <- shp/lead
  c(shp, rate)
}
We can then sapply this over a range of mean lead times. (Note that seq(92:103) in your loop evaluates to 1:12, not 92:103, because seq applied to a vector of length greater than one returns a sequence along it; that is why sapplying over 1:12 reproduces your output.)
t(sapply(1:12, f, 0.85, 2))
## [,1] [,2]
## [1,] 1 1.000000
## [2,] 1 0.500000
## [3,] 4 1.333333
## [4,] 8 2.000000
## [5,] 13 2.600000
## [6,] 19 3.166667
## [7,] 25 3.571429
## [8,] 33 4.125000
## [9,] 42 4.666667
## [10,] 52 5.200000
## [11,] 63 5.727273
## [12,] 74 6.166667
system.time(leadtimes <- sapply(1:103, f, 0.85, 2))
## user system elapsed
## 1.28 0.00 1.30
You just need to make sure you choose a sensible upper ceiling for the shape parameter (here I've chosen 10000, which was more than generous). Note that if you don't choose an upper limit that is high enough, some return values will be NA.
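As for the side note about the column names (my explanation, not part of the original answer): the vectors passed to rbind() were unnamed, so the data frame manufactured syntactic names from the values of the first row via make.names(); the first call returned c(1, 1), hence X1 and X1.1. Returning a named vector from getParamsLeadtime() avoids this.
# The names invented for an unnamed first row of c(1, 1):
make.names(c(1, 1), unique = TRUE)
## [1] "X1"   "X1.1"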

Is there a faster way to get percent change?

I have a data frame with around 25,000 records and 10 columns. I am using code to determine the change relative to the previous value in the same column (NewVal), based on another column (y) that already holds a percent change.
x=c(1:25000)
y=rpois(25000,2)
z=data.frame(x,y)
z[1,'NewVal']=z[1,'x']
So I ran this:
for (i in 2:nrow(z)) {
  z$NewVal[i] <- z$NewVal[i-1] + (z$NewVal[i-1] * (z$y[i]/100))
}
This takes considerably longer than I expected it to. Granted, I may be an impatient person - as a scathing letter drafted to me once said - but I am trying to escape the world of Excel (after I read http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html, which is causing me more problems, as I have begun to mistrust data - that letter also mentioned my trust issues).
I would like to do this without using any functions from packages, as I would like to know the formula that creates the values - or, if you will, I am a demanding control freak according to that friendly missive.
I would also like to know how to get a moving average, just like rollmean in caTools. Either that, or how do I figure out what its formula is? I tried entering rollmean and I think it refers to another function (I am new to R). This should probably be another question - but as that letter said, I don't ever make the right decisions in my life.
The secret in R is to vectorise. In your example you can use cumprod to do the heavy lifting:
z$NewVal2 <- z$x[1] * cumprod(with(z, 1 + c(0, y[-1]/100)))
all.equal(z$NewVal, z$NewVal2)
[1] TRUE
head(z, 10)
x y NewVal NewVal2
1 25 4 25.00000 25.00000
2 24 3 25.75000 25.75000
3 23 0 25.75000 25.75000
4 22 1 26.00750 26.00750
5 21 3 26.78773 26.78773
6 20 2 27.32348 27.32348
7 19 2 27.86995 27.86995
8 18 3 28.70605 28.70605
9 17 4 29.85429 29.85429
10 16 2 30.45138 30.45138
On my machine, the loop takes just less than 3 minutes to run, while the cumprod statement is virtually instantaneous.
I got about an 800-fold improvement with Reduce:
system.time(z[, "NewVal"] <- Reduce("*", c(1, 1 + z$y[-1]/100), accumulate = TRUE))
user system elapsed
0.139 0.008 0.148
> head(z, 10)
x y NewVal
1 1 1 1.000
2 2 1 1.010
3 3 1 1.020
4 4 5 1.071
5 5 1 1.082
6 6 2 1.103
7 7 2 1.126
8 8 3 1.159
9 9 0 1.159
10 10 1 1.171
> system.time(for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+
(z$NewVal[i-1]*(z$y[i]/100))})
user system elapsed
37.29 106.38 143.16
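On the moving-average side question: rollmean() actually comes from the zoo package (caTools offers the similar runmean()), but a plain moving average needs no package at all. A minimal sketch using stats::filter() from base R, with the window width k chosen purely for illustration:
k <- 5  # window width (an assumption for the example)
# Centred moving average: each point becomes the mean of the k values around it.
z$ma <- as.numeric(stats::filter(z$NewVal, rep(1/k, k), sides = 2))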
