Bootstrapping multiple columns with R

I'm relatively new to R and I'm trying to build a function that loops through the columns of an imported table and outputs the mean and 95% confidence interval for each. Ideally it should be possible to bootstrap columns with different sample sizes, but first I would like to get the iteration working. I have something that sort of works, but I can't get it all the way there. This is what the code looks like, with the sample data and output included:
#cdata <- read.csv(file.choose(), header = TRUE) # read data from selected file; commented out because sample data is provided below
#cdata # check imported data
#Sample Data
# WALL NRPK CISC WHSC LKWH YLPR
#1 21 8 1 2 2 5
#2 57 9 3 1 0 1
#3 45 6 9 1 2 0
#4 17 10 2 0 3 0
#5 33 2 4 0 0 0
#6 41 4 13 1 0 0
#7 21 4 7 1 0 0
#8 32 7 1 7 6 0
#9 9 7 0 5 1 0
#10 9 4 1 0 0 0
x <- cdata[, c("WALL", "NRPK", "LKWH", "YLPR")] # only select relevant species
i <- nrow(x) # count number of rows for bootstrapping
g <- ncol(x) # count number of columns for iteration
# build bootstrapping function; this works for a single column but doesn't iterate
bootfun <- function(bootdata, reps) {
  boot <- function(bootdata) {
    s1 <- sample(bootdata, size = i, replace = TRUE)
    ms1 <- mean(s1)
    return(ms1)
  } # a single bootstrap: resample with replacement and take the mean
  bootrep <- replicate(n = reps, boot(bootdata))
  return(bootrep)
} # repeats the bootstrap of "bootdata" "reps" times and returns a vector of results
cvr1 <- bootfun(x$WALL, 50000) # have unsuccessfully tried iterating the location various ways (i.e. x[i])
cvrquantile <- quantile(cvr1, c(0.025, 0.975))
cvrmean <- mean(cvr1)
vec <- c(cvrmean, cvrquantile) # puts results into a suitable form for output
vecr <- sapply(vec, round, 1) # rounds results
vecr
      2.5% 97.5%
28.5  19.4  38.1
#apply(x[1:g],2,bootfun) ## doesn't work here: apply() supplies only the column, so bootfun's reps argument is missing
#desired output:
#Species Mean LowerCI UpperCI
#WALL 28.5 19.4 38.1
#NRPK 6.1 4.6 7.6
#YLPR 0.6 0.0 1.6
I've also tried this using the boot package, and it works beautifully to iterate through the means, but I can't get it to do the same for the confidence intervals. The "ordinary" code above also has the advantage that you can easily retrieve the bootstrap results, which might be used for other calculations. For the sake of completeness, here is the boot code:
#Bootstrapping using boot package
library(boot)
#data<-read.csv(file.choose(),header=TRUE) #read data from selected file
#x<-data[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant columns
#x #check data
#Sample Data
# WALL NRPK LKWH YLPR
#1 21 8 2 5
#2 57 9 0 1
#3 45 6 2 0
#4 17 10 3 0
#5 33 2 0 0
#6 41 4 0 0
#7 21 4 0 0
#8 32 7 6 0
#9 9 7 1 0
#10 9 4 0 0
i <- nrow(x) # count number of rows for resampling
g <- ncol(x) # count number of columns to step through with bootstrapping
boot.mean <- function(x, i) mean(x[i]) # bootstrapping function to get the mean
z <- boot(x, boot.mean, R = 50000) # run the bootstrap with the chosen statistic and number of reps
boot.ci(z, type = "perc") # derive 95% confidence intervals
apply(x[1:g], 2, boot.mean) # bootstrap all columns
#output:
#WALL NRPK LKWH YLPR
#28.5 6.1 1.4 0.6
I've gone through all of the resources I can find and can't seem to get things working. What I would like as output is the bootstrapped mean with the associated confidence interval for each column. Thanks!

Note: apply(x[1:g], 2, boot.mean) doesn't do any bootstrapping. Because apply() passes only the column, the index argument i is missing, x[i] evaluates to the whole column, and you are simply calculating the mean of each column.
For bootstrap mean and confidence interval, try this:
apply(x, 2, function(y) {
  b <- boot(y, boot.mean, R = 50000)
  c(mean(b$t), boot.ci(b, type = "perc", conf = 0.95)$percent[4:5])
})
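For completeness, here is a base-R sketch that iterates the bootfun from the question over every column and assembles the desired table. It assumes bootfun and x are defined as in the question; the helper name summarize_boot is made up for illustration.
# a sketch: run bootfun on each column of x and collect the mean plus percentile CI
summarize_boot <- function(colname, reps = 50000) {
  b <- bootfun(x[[colname]], reps) # full vector of bootstrapped means
  c(Mean = mean(b), quantile(b, c(0.025, 0.975)))
}
res <- t(sapply(names(x), summarize_boot)) # one row per species
out <- data.frame(Species = rownames(res), round(res, 1), row.names = NULL)
colnames(out) <- c("Species", "Mean", "LowerCI", "UpperCI")
out
This keeps the advantage noted above: each call to bootfun still returns the raw bootstrap replicates if you need them for other calculations.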

Related

Episode splitting in survival analysis by the timing of an event in R

Is it possible to split episodes by a given variable in survival analysis in R, similar to Stata's stsplit used in the following way: stsplit var, at(0) after(time=time)?
I am aware that the survival package allows one to split episodes at given cut points such as c(0,5,10,15) in survSplit, but if a variable, say time of divorce, differs for each individual, then providing cut points for every individual would be impossible, and the split would have to be based on the value of a variable (say graduation, divorce, or job termination).
Is anyone aware of a package or know a resource I might be able to tap into?
Perhaps the Epi package is what you are looking for. It offers multiple ways to cut/split follow-up time using Lexis objects. Here is the documentation of cutLexis().
After some poking around, I think tmerge() in the survival package can achieve what stsplit var does, which is to split episodes not just at given cut points (the same for all observations) but at the time an event occurs for each individual.
This is the only way I knew how to split data
id<-c(1,2,3)
age<-c(19,20,29)
job<-c(1,1,0)
time<-age-16 ## create time since age 16 ##
data<-data.frame(id,age,job,time)
id age job time
1 1 19 1 3
2 2 20 1 4
3 3 29 0 13
## simple split by time ##
## 0 up to 2 years, 2-5 years, 5+ years ##
data2 <- survSplit(data, cut = c(0, 2, 5), end = "time", start = "start", event = "job")
id age start time job
1 1 19 0 2 0
2 1 19 2 3 1
3 2 20 0 2 0
4 2 20 2 4 1
5 3 29 0 2 0
6 3 29 2 5 0
7 3 29 5 13 0
However, if I want to split by a certain variable, such as when each individual finished school, each person might have a different cut point (they finished school at different ages).
## split by time dependent variable (age finished school) ##
d1<-data.frame(id,age,time,job)
scend<-c(17,21,24)-16
d2<-data.frame(id,scend)
## create start/stop time ##
base <- tmerge(d1, d1, id = id, tstop = time)
## create time-dependent covariate ##
s1 <- tmerge(base, d2, id = id, finish = tdc(scend))
id age time job tstart tstop finish
1 1 19 3 1 0 1 0
2 1 19 3 1 1 3 1
3 2 20 4 1 0 4 0
4 3 29 13 0 0 8 0
5 3 29 13 0 8 13 1
I think tmerge() is more or less comparable to the stsplit function in Stata.

R: Explorative linear regression, setting up a simple model with multiple dependent and independent variables

I have a study with several cases, each containing data from multiple ordinal factor variables (genotypes) and multiple numeric variables (various blood sample concentrations). I am trying to set up an explorative model to test linearity between any of the numeric variables (dependent in the model) and any of the ordinal factor variables (independent in the model).
Dataset structure example (independent variables): genotypes
case_id genotype_1 genotype_2 ... genotype_n
1 0 0 1
2 1 0 2
... ... ... ...
n 2 1 0
and the dependent variables (with matching case_ids): samples
case_id sample_1 sample_2 ... sample_n
1 0.3 0.12 6.12
2 0.25 0.15 5.66
... ... ... ...
n 0.44 0.26 6.62
Found one similar example in the forum which doesn't solve the problem:
model <- apply(samples, 2, function(xl) lm(xl ~ ., data = genotypes))
I can't figure out how to set up simple linear regressions that run through every combination of a given set of dependent and independent variables. If using the apply family, I guess the varying (x) term should be the dependent variable, since every dependent variable should be tested for linearity against the same set of independent variables (individually).
Extract from true data:
> genotypes
case_id genotype_1 genotype_2 genotype_3 genotype_4 genotype_5
1 1 2 2 1 1 0
2 2 NaN 1 NaN 0 0
3 3 1 0 0 0 NaN
4 4 2 2 1 1 0
5 5 0 0 0 1 NaN
6 6 2 2 1 0 0
7 9 0 0 0 0 1
8 10 0 0 0 NaN 0
9 13 0 0 0 NaN 0
10 15 NaN 1 NaN 0 1
> samples
case_id sample_1 sample_2 sample_3 sample_4 sample_5
1 1 0.16092019 0.08814160 -0.087733372 0.1966070 0.09085343
2 2 -0.21089678 -0.13289427 0.056583528 -0.9077926 -0.27928376
3 3 0.05102400 0.07724300 -0.212567535 0.2485348 0.52406368
4 4 0.04823619 0.12697286 0.010063683 0.2265085 -0.20257192
5 5 -0.04841221 -0.10780329 0.005759269 -0.4092782 0.06212171
6 6 -0.08926734 -0.19925538 0.202887833 -0.1536070 -0.05889369
7 9 -0.03652588 -0.18442457 0.204140717 0.1176950 -0.65290133
8 10 0.07038933 0.05797007 0.082702589 0.2927817 0.01149564
9 13 -0.14082554 0.26783539 -0.316528107 -0.7226103 -0.16165326
10 15 -0.16650266 -0.35291579 0.010063683 0.5210507 0.04404433
SUMMARY: Since I have a lot of data, I want to create a simple model to help me select which possible correlations to look into further. Any ideas out there?
NOTE: I am not trying to fit a multiple linear regression model!
I feel like there must be a statistical test for linearity, but I can't recall it. Visual inspection is typically how I do it. A quick and dirty way to test linearity for a large number of variables would be to test the corr() of each pair of dependent/independent variables. Small multiples would be a handy way to do it.
Alternately, for each ordinal variable, run a corrplot vs. each numeric variable, a logged version of the numeric variable, and an exponentiated version of the numeric variable. If the correlation for the logged or exponentiated version comes out stronger than for the regular version, it seems likely you have some linearity issues.
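Here is a minimal sketch of that pairwise screen, fitting one simple regression per dependent/independent pair. It assumes the genotypes and samples data frames shown above and treats the ordinal genotypes as numeric scores to screen for a linear trend; that coding choice, and the helper names, are assumptions for illustration.
# every sample column regressed on every genotype column, one pair at a time
dep <- names(samples)[-1] # drop case_id
ind <- names(genotypes)[-1]
pairs <- expand.grid(dependent = dep, independent = ind, stringsAsFactors = FALSE)
pairs$p.value <- mapply(function(d, g) {
  fit <- lm(samples[[d]] ~ genotypes[[g]]) # rows with NaN are dropped by the default na.omit
  summary(fit)$coefficients[2, 4] # p-value of the slope
}, pairs$dependent, pairs$independent)
head(pairs[order(pairs$p.value), ]) # most promising pairs first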

Imputation for longitudinal data using observation before and after missing data

I’m in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects, however, the solutions I keep coming to force me to use conditional formatting based on rows immediately above and below the a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you think you know of a good technique I can use, experiment with, or if you know of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
The goal here is to find a way to get the mean of the values before (3) and after (0) the NA value for ID #1 (variable ss), so that the data look like this: 1,3,2,3,1.5,0,0
ID #2 (variable ss) should look like this: 2,4,0,0,0,0,0
ID #3 (variable ss) should use a last-observation-carried-forward approach, so it would need to look like this: 4,1,2,4,2,3,3
ID #4 (variable ss) has two consecutive NA values and should not be changed; it will be flagged for a different analysis later in my project. So it should look like this: 2,1,0,NA,NA,0,0 (no change).
I use the smwrBase package; the syntax for filling in only one missing value is below, but it doesn't address id.
smwrBase::fillMissing(ss, max.fill=1)
The zoo package might be more standard, though it has the same issue:
zoo::na.approx(ss, maxgap=1)
Below is an approach that accounts for the variable id. Current interpolation approaches don't like to fill in the last value, so I added a manual if statement for that. A bit brute force, as there might be a tapply approach out there.
> id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
> time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
> ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
> mydat <- data.frame(id, time, ss, ss2=NA_real_)
> for (i in unique(id)) {
+ # interpolate for gaps
+ mydat$ss2[mydat$id==i] <- zoo::na.approx(ss[mydat$id==i], maxgap=1, na.rm=FALSE)
+ # extension for gap as last value
+ if(is.na(mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])])) {
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])] <-
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])-1]
+ }
+ }
> mydat
id time ss ss2
1 1 0 1 1.0
2 1 1 3 3.0
3 1 2 2 2.0
4 1 3 3 3.0
5 1 4 NA 1.5
6 1 5 0 0.0
7 1 6 0 0.0
8 2 0 2 2.0
9 2 1 4 4.0
10 2 2 0 0.0
11 2 3 NA 0.0
12 2 4 0 0.0
13 2 5 0 0.0
14 2 6 0 0.0
15 3 0 4 4.0
16 3 1 1 1.0
17 3 2 2 2.0
18 3 3 4 4.0
19 3 4 2 2.0
20 3 5 3 3.0
21 3 6 NA 3.0
22 4 0 2 2.0
23 4 1 1 1.0
24 4 2 0 0.0
25 4 3 NA NA
26 4 4 NA NA
27 4 5 0 0.0
28 4 6 0 0.0
The interpolated value for id=1 is 1.5 (the average of 3 and 0), for id=2 it is 0 (the average of 0 and 0), and for id=3 it is 3 (the preceding value carried forward, since there is no following value).
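If you prefer to avoid the explicit loop, here is a compact sketch of the same per-id logic using ave(); it assumes zoo's na.locf() honours its maxgap argument the same way na.approx() does, so runs of two or more NAs (id 4) are left alone.
# interior single gaps: linear interpolation; a trailing single gap: carry the last value forward
mydat$ss3 <- ave(mydat$ss, mydat$id, FUN = function(v) {
  v <- zoo::na.approx(v, maxgap = 1, na.rm = FALSE)
  zoo::na.locf(v, maxgap = 1, na.rm = FALSE)
})
This should reproduce ss2 above: 1.5 for id=1, 0 for id=2, a trailing 3 for id=3, and untouched NAs for id=4.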

Sample/Species Transformation code assistance

I'm trying to simplify a task in R. I have a community matrix as such:
row.name species1 species2 species3 species4 .... species50
sample 1 1 6 156 4 1
sample 2 0 20 34 5 1
sample 3 3 7 23 0 7
....
sample 10 3 15 9 7 6
These are raw count figures.
I'm trying (and so far failing) to code a way to cap any species count that makes up more than 10% of its sample/row at 9%. I.e., in this (made-up) example, sample 1 / species3 would seem to need capping.
I would like the data kept as, or reverted back to, raw counts. Is this even possible within R?
I'm aware of the ecology transformations in vegan or equivalent to normalise/standardise data, but they are not what I am after here.
I hope that makes sense. If not, I can try to explain again. Any help greatly appreciated; I'm still fairly new with R.
I would use sweep(), but specify pmin as the function, so that it takes the smaller of 10% of the row total and the actual value:
M <- read.table(header = TRUE, row.names = 'row.name',
                text = 'row.name species1 species2 species3 species4 species50
                        sample_1   1  6 156 4 1
                        sample_2   0 20  34 5 1
                        sample_3   3  7  23 0 7
                        sample_10  3 15   9 7 6')
M <- as.matrix(M)
sweep(M, 1, rowSums(M) %/% 10, pmin)
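Note that this caps every count at (floored) 10% of its row total. If you literally want only the counts exceeding 10% reduced to 9%, a sketch along these lines would do it; flooring the 9% cap to keep integer counts is an assumption:
cap  <- floor(rowSums(M) * 0.09)            # 9% of each row total, floored
over <- sweep(M, 1, rowSums(M) * 0.10, `>`) # TRUE where a count exceeds 10% of its row
M2 <- M
M2[over] <- cap[row(M2)[over]]              # replace only the offenders with their row's cap
M2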

R: generate possible permutation tables by one column

I have a table that looks like this (the column names are Estonian for indicator name, subcriterion, criterion, and score):
Indikaatori nimi Alamkriteerium Kriteerium Skoor
1 Indikaator 1 1.1 1 100
2 Indikaator 2 1.2 1 100
3 Indikaator 3 1.3 1 100
4 Indikaator 4 1.1 1 0
5 Indikaator 5 2.1 2 0
6 Indikaator 6 2.1 2 0
... and so on...
I need to create all possible permutations of the table by the first column.
There are 50 indicators in total, from which I want to pick 49 and get all the possible combinations along with the chosen elements' other data columns.
With 49 elements out of 50, I will get a total of 50 combinations, but I want to create all of these tables automatically instead of doing it manually (later on, 48 elements will also be necessary).
Is there any way to generate these 50 tables automatically with the respective data for the chosen elements?
All help and pointers are appreciated!
# The following will give you a list of fifty data frames,
# Each data frame has a 49 row subset of the original
listoftables <- apply(combn(1:50, 49), 2, FUN = function(x) df[x,])
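Since dropping one of n rows gives exactly n tables, an equivalent and arguably simpler sketch (df again stands for your full table, as above) is:
# one table per dropped row; for 48 elements, use combn(nrow(df), 48) as above instead
listoftables <- lapply(1:nrow(df), function(i) df[-i, ])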
This solution uses loops, which are rather slow compared to vectorized operations in R, but it builds every non-empty row subset (not just the 49-element ones) and returns them as a list of data.frames.
datatable = read.table(textConnection(
  "2 Indikaator 2 1.2 1 100
   3 Indikaator 3 1.3 1 100
   4 Indikaator 4 1.1 1 0
   5 Indikaator 5 2.1 2 0
   6 Indikaator 6 2.1 2 0"))
x = rep(list(data.frame(NULL)), times = 2^nrow(datatable)) # preallocate one slot per subset
a = 1
for (i in 1:nrow(datatable)) {
  sets = combn(nrow(datatable), i) # all row subsets of size i
  for (j in 1:ncol(sets)) {
    x[[a]] = datatable[sets[, j], ]
    a = a + 1
  }
}
View(x[[10]])
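A quick sanity check on the sizes involved: the preallocation above reserves 2^n list slots, one more than the 2^n - 1 non-empty subsets the loop actually fills.
n <- nrow(datatable)
sum(choose(n, 1:n)) # 31 for n = 5, i.e. 2^n - 1 non-empty subsets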
